This technical document describes our approach to the first wave of additional pilot users for OpenSAFELY, a new open source platform for EHR research with unprecedented security and support for open science. OpenSAFELY can currently execute analysis code across primary care EHR data for over 58 million patients in TPP and EMIS, linked to SUS, ONS, ICNARC, ECDS, SGSS and other datasets. You can read more about the technical and security model in our papers, and in our technical manuals which are shared online here, along with a demonstration notebook here.
Onboarding new users to OpenSAFELY is more complex than granting a simple data download, or a login and password, because for security and privacy reasons OpenSAFELY is very different to other approaches for EHR data analysis. The platform does not give researchers unconstrained access to view large volumes of pseudonymised but potentially disclosive patient data, either via download or via a remote desktop. Instead we have produced a series of open source tools that enable researchers to use flexible, pragmatic, but standardised approaches to process raw electronic health records data into “research ready” datasets, and to check that this has been done correctly, without needing to access the patient data directly. Using this data management framework we also generate bespoke dummy datasets, which researchers use to develop analysis code in the open, using GitHub. Once their data management and data analysis scripts run to completion and pass all tests in the OpenSAFELY framework, the scripts are sent through to be executed against the real data inside the secure environment, using the OpenSAFELY jobs runner, inside a Docker container, without the researcher needing access to the raw, potentially disclosive pseudonymised data themselves. The non-disclosive summary outputs (tables, logs, and graphs) are then manually reviewed, as in other systems, before release.
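The workflow described above can be illustrated with a minimal sketch. This is not the OpenSAFELY codebase or its API: the schema, function names, and the small-count threshold of 5 are all hypothetical, chosen only to show the idea of developing and checking analysis code against dummy data, so that only aggregate outputs would ever need to leave a secure environment.

```python
import random
from collections import Counter

# Hypothetical schema (column name -> allowed values); not an OpenSAFELY format.
SCHEMA = {
    "age_band": ["0-39", "40-64", "65+"],
    "has_diabetes": [0, 1],
}


def make_dummy_rows(n, seed=0):
    """Generate n synthetic rows that match SCHEMA and describe no real patient."""
    rng = random.Random(seed)
    return [
        {col: rng.choice(values) for col, values in SCHEMA.items()}
        for _ in range(n)
    ]


def check_schema(rows):
    """Raise ValueError if any row has unexpected columns or values."""
    for row in rows:
        if set(row) != set(SCHEMA):
            raise ValueError(f"unexpected columns: {sorted(row)}")
        for col, value in row.items():
            if value not in SCHEMA[col]:
                raise ValueError(f"unexpected value {value!r} in {col}")


def analyse(rows):
    """Count patients flagged with diabetes per age band (aggregate output only)."""
    counts = Counter(r["age_band"] for r in rows if r["has_diabetes"] == 1)
    return {band: counts.get(band, 0) for band in SCHEMA["age_band"]}


def redact_small_counts(table, threshold=5):
    """Replace counts below the (assumed) threshold with None before review."""
    return {k: (v if v >= threshold else None) for k, v in table.items()}


if __name__ == "__main__":
    rows = make_dummy_rows(200)
    check_schema(rows)  # the same check could run unchanged on real data
    print(redact_small_counts(analyse(rows)))
```

Because `analyse` only ever sees rows that conform to the schema, the same script could in principle be developed on dummy rows and later run, unchanged, where the real rows live. Any real disclosure control would still rely on the manual output review described above.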
This new and highly secure approach afforded by the OpenSAFELY platform has meant NHS England (the data controller) has been able to provide access, cautiously, to an unprecedented scale of data for Covid-19 related analyses. It does mean that new users currently need strong computational data science skills, beyond those needed to write statistical analysis scripts against local data in R or Stata. However, our approach also means that every analysis executed on OpenSAFELY automatically contributes to a growing library of re-usable codelists, variables, and code, rather than unpredictable folders of arbitrary data management and analysis scripts in Python, R and SQL. Furthermore, it allows us to efficiently address important strategic and analytic challenges such as minimally disclosive linkage, federated analysis, automated monitoring of disclosiveness, and transparent reporting on all analyses. Lastly, alongside this development work, we are also beginning to implement OpenSAFELY in other environments and against new datasets, so that others can benefit from “curation as you go”, our privacy augmentation, and our shared open source codebase.
Next steps
We have set out to create an open source platform for EHR research where researchers can work independently, using the tools and data (with appropriate permissions) without necessarily needing substantial engagement with our team. Increasingly we will also produce tools that do not require advanced data science skills, to support a wider range of users.
As part of building that resource for the community, over the next six months we are working with NHS England to cautiously onboard a small number of external pilot users to develop their analyses on OpenSAFELY. This first wave of pilot users will be collaborators, working closely alongside us to co-develop the platform. Because OpenSAFELY is more than a simple data download or remote desktop service, this first wave of external pilot users needs to have substantial existing computational data science skills, and strong experience of working with primary care electronic health record data. They must also be keen to work closely with us to co-develop the OpenSAFELY platform for the community, as described in our Principles of OpenSAFELY document. In turn, as with all those using and contributing to the platform, they will share the credit as the tools, documentation, codebase and codelists become more widely used. During this pilot phase we will be developing our software and approach for external users.
Below we have posted a brief summary of what we can offer to our initial wave of pilot users, and what we would like back from them to support the growth of this open platform. This list is intended to facilitate discussions as we move cautiously forward to identify new users and collaborators to deliver a thriving open source ecosystem for computational data science on electronic health records, alongside high quality research during the pandemic. Because of limited financial resources, and the need for caution in the pilot phase of this work, we can only accommodate a small number of users until we expand our team. Please note that all potential new users and their analysis proposals will also need to be discussed with NHS England, who are the Data Controller.
What we can offer to pilot OpenSAFELY users:
What we would like from our work with new pilot users:
What we will look for in potential new pilot users:
We have already identified a range of pilot users, and are now in close discussion with NHS England on specific analyses; we will post updates on progress as our resource and the work develop further.