OpenSAFELY: Onboarding new users to OpenSAFELY


This technical document describes our approach to onboarding the first wave of additional pilot users for OpenSAFELY, a new open source platform for EHR research with unprecedented security and support for open science. OpenSAFELY can currently execute analysis code across primary care EHR data for over 58 million patients in TPP and EMIS, linked to SUS, ONS, ICNARC, ECDS, SGSS and other datasets. You can read more about the technical and security model in our papers and in our technical manuals, which are shared online here.

Onboarding new users to OpenSAFELY is more complex than granting a simple data download, or a login and password, because, for security and privacy reasons, OpenSAFELY is very different to other approaches for EHR data analysis. The platform does not give researchers unconstrained access to view large volumes of pseudonymised and disclosive patient data, either via download or via a remote desktop. Instead we have produced a series of open source tools that enable researchers to use flexible, pragmatic, but standardised approaches to process raw electronic health records data into “research ready” datasets, and to check that this has been done correctly, without needing to access the patient data directly. Using this data management framework we also generate bespoke dummy datasets, which researchers use to develop analysis code in the open, using GitHub. Once their data management and data analysis scripts run to completion and pass all tests in the OpenSAFELY framework, the scripts are sent through to be executed against the real data inside the secure environment, using the OpenSAFELY jobs runner, inside a Docker container. At no point do researchers need access to the raw, potentially disclosive, pseudonymised data themselves. The non-disclosive summary results, output tables, logs, and graphs are then manually reviewed, as in other systems, before release.
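The workflow described above can be illustrated with a minimal sketch. This is not OpenSAFELY code: every name here (`DUMMY_SCHEMA`, `make_dummy_patients`, `count_by_age_band`, `suppress_small_counts`) and the threshold of 5 are hypothetical stand-ins, and the real platform uses its own tools for dummy data, job execution and output review. The sketch only shows the general pattern, assuming a very simple two-column dataset: analysis code is developed against dummy rows, the same code would later run unchanged against the real rows, and only aggregate output is produced for review.

```python
import random
from collections import Counter

# Hypothetical schema: column name -> function producing one dummy value.
DUMMY_SCHEMA = {
    "age": lambda rng: rng.randint(0, 100),
    "has_diabetes": lambda rng: rng.random() < 0.1,
}


def make_dummy_patients(n, seed=0):
    """Generate n dummy rows that match the schema but describe no real person."""
    rng = random.Random(seed)
    return [{col: gen(rng) for col, gen in DUMMY_SCHEMA.items()} for _ in range(n)]


def count_by_age_band(patients):
    """Analysis step: aggregate counts per 10-year age band (no row-level output)."""
    bands = Counter()
    for p in patients:
        low = (p["age"] // 10) * 10
        bands[f"{low}-{low + 9}"] += 1
    return dict(bands)


def suppress_small_counts(table, threshold=5):
    """Illustrative disclosure control: hide counts below an assumed threshold."""
    return {k: (v if v >= threshold else None) for k, v in table.items()}


# Develop and test against dummy data only; the same functions would later be
# executed, unchanged, against the real data inside the secure environment.
dummy = make_dummy_patients(1000)
summary = suppress_small_counts(count_by_age_band(dummy))
```

Because the analysis functions take plain rows as input and return only aggregates, they can be tested end to end without the researcher ever seeing real patient records. The automated suppression step is an assumption added for illustration and does not replace the manual review of outputs described above.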

This new and highly secure approach afforded by the OpenSAFELY platform has meant that NHS England (the data controller) has been able to provide access, cautiously, to an unprecedented scale of data for Covid-19 related analyses. It does mean that new users currently need strong computational data science skills, beyond those needed to write statistical analysis scripts against local data in Python, R or Stata. However, our approach also means that every analysis executed on OpenSAFELY automatically contributes to a growing library of re-usable codelists, variables, and code, rather than unpredictable folders of arbitrary data management and analysis scripts in Python, R and SQL. Furthermore, it allows us to efficiently address important strategic and analytic challenges such as minimally disclosive linkage, federated analysis, automated monitoring of disclosiveness, and transparent reporting on all analyses. Lastly, alongside this development work, we are also beginning to implement OpenSAFELY in other environments and against new datasets, so that others can take advantage of the benefits of “curation as you go”, our privacy augmentation, and our shared open source codebase.

Next steps

We have set out to create an open source platform for EHR research where researchers can work independently, using the tools and data (with appropriate permissions) without necessarily needing substantial engagement with our team. Increasingly we will also produce tools that do not require advanced data science skills, to support a wider range of users.

As part of building that resource for the community, we are currently working with NHS England to cautiously on-board a small number of external pilot users to develop their analyses on OpenSAFELY. This first wave of pilot users will be collaborators, working closely alongside us to co-develop the platform. Because OpenSAFELY is more than a simple data download or remote desktop service, this first wave of external pilot users need to have substantial existing computational data science skills, and strong experience of working with primary care electronic health record data. They must also be keen to work closely with us to co-develop the OpenSAFELY platform for the community, as described in our Principles of OpenSAFELY document. In turn, as with all those using and contributing to the platform, they will share the credit as the tools, documentation, codebase and codelists are increasingly widely used. During this pilot phase we will be developing our software and approach for external users.

Below we have posted a brief summary of what we can offer to our initial wave of pilot users, and what we would like back from them to support the growth of this open platform. This list is intended to facilitate discussions as we move cautiously forward to identify new users and collaborators to deliver a thriving open source ecosystem for computational data science on electronic health records, alongside high quality research during the pandemic. Because of limited financial resources, and the need for caution in the pilot phase of this work, we can only accommodate a small number of users until we expand our team. Please note that all potential new users and their analysis proposals will also need to be discussed with NHS England, who are the data controller.

What we can offer to pilot OpenSAFELY users

Access to the platform and, more specifically: help in working with NHSX and NHS England to determine what level of access you will need to the data, and how this can best be facilitated.
Guidance and support on using the OpenSAFELY cohort extractor, OpenCodelists and OpenSAFELY jobs runner to deliver analyses. To help us manage workload with a small but growing team, this will ideally be through a series of prearranged, intensive, one-week support windows.
Help using your existing skills around code sharing, annotation, version control and git in an OpenSAFELY context; and additional software skills development from working with our developers and developer-researchers on the platform.
Recognition as part of the OpenSAFELY platform team.

What we would like from our work with new pilot users

Close energetic collaboration:
- A solid period of near-full-time commitment from each person where we are investing effort to train, on-board, and support them.
- Commitment to finish the planned research projects within a specific time, to ensure pace and delivery.
Energetic contribution to open code and the platform including:
- Contribution to Codelists.
- Contribution to documentation, especially on new features created to support your work, or in collaboration with you.
- Detailed feedback on the platform.
- Feedback on the on-boarding process.
- Contribution to blog posts and similar on relevant aspects of the platform or your work.
Working in line with the Principles of OpenSAFELY.
High-quality research.

What we will look for in potential new pilot users

High quality research proposal and track record and, more specifically: alignment with COVID-19 research priorities, in line with the COPI notice [archived here] and our own team priorities.
Strong existing EHR research skills, and a track record of delivery with NHS EHR data.
Strong computational data science skills including version control, git, and GitHub.
A proven track record on reproducible open science (this is extremely important) including: GitHub repositories demonstrating a track record of sharing EHR analysis code openly; adequate documentation for this prior code; and ideally evidence of helping others to re-use your code or data.
A strong existing understanding of how OpenSAFELY works, from reading our documentation and codebase.
A project whose additional resource requirements from the platform are realistic.
A strong understanding of Information Governance.

We have already identified a range of pilot users, and are now in close discussion with NHS England on specific analyses; however, we will post updates on progress as our resources and the work develop further.