OpenSAFELY

Summary

OpenSAFELY is a new secure analytics platform for electronic health records in the NHS, created to deliver urgent results during the global COVID-19 emergency. It is now successfully delivering analyses across more than 24 million patients’ full pseudonymised primary care NHS records, with more to follow shortly. All our analytic software is open for security review, scientific review, and re-use. OpenSAFELY uses a new model for enhanced security and timely access to data: we don’t transport large volumes of potentially disclosive pseudonymised patient data outside of the secure environments managed by the electronic health record software company; instead, trusted analysts can run large scale computation across near real-time pseudonymised patient records inside the data centre of the electronic health records software company. This pragmatic and secure approach has allowed us to deliver our first analyses in just five weeks from project start.

Team

OpenSAFELY is a collaboration between the DataLab at the University of Oxford, the EHR group at London School of Hygiene and Tropical Medicine, TPP and other electronic health record software companies (who already manage NHS patients’ records), working on behalf of NHS England and NHSX, with a growing list of broader collaborations including ICNARC. We are a well-established team of software developers, clinicians, and epidemiologists, all pooling diverse skills and knowledge to deliver high performance, highly secure and accurate health data analytics, using modern open software development techniques. The project is led by two NHS doctors, Ben Goldacre and Liam Smeeth, with lead software developer Seb Bacon. We have delivered because of our mixed skillset: we have software developers, and “developer-epidemiologists”, who can speak the same language as the technical teams within EHR system suppliers.

Security

All data that carries any privacy risk (even a theoretical risk, and even when pseudonymised) remains within the secure data centre of the electronic health record vendor, where it already resides. All processing takes place in that same data centre, and all activity is logged for independent review. The only information ever to leave the data centre is summary tables from statistical models, with low numbers suppressed. Within the data centre, pseudonymised data is held in a tiered system of data stores, each tier less disclosive than the last and tailored to each analysis. All underlying software and research code is open to review for security profiling and scientific evaluation, and for re-use as open source tools that improve science across the community. Overall this approach is therefore highly secure, and supports high quality science: in contrast to working on intermittent “data extracts”, our approach also ensures that the statistical models run across up-to-date records, which is vital during a global health emergency. Further details on security and governance are given below.
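As an illustration only, the idea of releasing summary tables "with low numbers suppressed" can be sketched in Python as follows. This is a minimal sketch, not the OpenSAFELY implementation: the threshold of 5, the function name, the redaction marker, the decision to leave true zeros visible, and the example counts are all assumptions made for the example.

```python
from typing import Dict, Union

# Assumption for this sketch: non-zero counts of 5 or fewer are "low numbers".
SUPPRESSION_THRESHOLD = 5
REDACTED = "[REDACTED]"


def suppress_low_counts(
    table: Dict[str, int], threshold: int = SUPPRESSION_THRESHOLD
) -> Dict[str, Union[int, str]]:
    """Return a copy of a summary table with small non-zero counts masked.

    Zero counts are left visible here; whether zeros should also be masked
    is a disclosure-control policy decision and is an assumption of this sketch.
    """
    safe: Dict[str, Union[int, str]] = {}
    for group, count in table.items():
        if 0 < count <= threshold:
            safe[group] = REDACTED
        else:
            safe[group] = count
    return safe


if __name__ == "__main__":
    # Made-up counts, purely to show the behaviour.
    example_counts = {"18-39": 3, "40-59": 42, "60-79": 310, "80+": 0}
    print(suppress_low_counts(example_counts))
```

Only the masked table would be a candidate for release; the unmasked counts would stay inside the secure environment.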

Analyses

We have rapidly deployed this new statistical analysis platform during the COVID-19 emergency to deliver urgent answers on key clinical and public health questions. Our first analysis identifies which patients are most at risk of death in hospital from COVID-19, an order of magnitude more accurately than any previous analysis. We have extensive ongoing collaborations across the scientific community. We are now running analyses to identify which treatments increase or decrease risk (more detail below). We are also supporting modellers to understand and predict the spread of the disease, and pressure on NHS services, using hyperlocal real-world data. The answers provided by OpenSAFELY analyses are of crucial importance to all countries in the world. A longer list of analysis projects is provided below.

Strategic and Clinical Importance for the UK, and the World

This fully functional project proves the value of large NHS datasets, which is often discussed in theory. The questions we are answering are of global importance for clinicians, policymakers and patients around the world. These answers can only be delivered by analysing large datasets, handled securely, on the scale that we have assembled. We have built this project because we believe the UK has a responsibility to the global community to make good use of this data, securely, and to the highest scientific standards. The UK, with the NHS, is the only country on the planet with the scale of data needed to deliver these analyses.

Data

We have currently linked: the full coded primary care record containing all previous medical history, test results, diagnoses, medications, treatments, and more; A&E attendance data; hospital death from Covid-19; ITU data; ONS death data including cause of death. We are able to rapidly map and link new datasets where required. This big data approach, with an unusually large volume of primary care data, is necessary to provide sufficient statistical power to detect associations with specific medications and medical conditions as early as possible during the pandemic, and thereby to save lives by modifying patient, clinician, and population behaviour.
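As a hedged illustration of what linking an outcomes dataset to primary care records involves, the following Python sketch joins two small in-memory datasets on a shared pseudonymised identifier. The field names (`pseudo_id`, `age`, `died`) and the function name are hypothetical, and the sketch assumes both datasets already carry the same pseudonymised key; it does not describe how linkage is actually performed on the platform.

```python
from typing import Dict, List


def link_records(primary_care: List[Dict], outcomes: List[Dict]) -> List[Dict]:
    """Left-join outcome rows onto primary care rows by pseudonymised ID.

    Patients with no matching outcome row are kept, with died=False.
    """
    # Index the outcomes once so each patient lookup is constant time.
    outcomes_by_id = {row["pseudo_id"]: row for row in outcomes}

    linked = []
    for patient in primary_care:
        outcome = outcomes_by_id.get(patient["pseudo_id"], {})
        linked.append({**patient, "died": outcome.get("died", False)})
    return linked


if __name__ == "__main__":
    # Made-up rows, purely for illustration.
    gp_rows = [{"pseudo_id": "p1", "age": 70}, {"pseudo_id": "p2", "age": 45}]
    death_rows = [{"pseudo_id": "p1", "died": True}]
    for row in link_records(gp_rows, death_rows):
        print(row)
```

A left join is used so that every primary care patient remains in the analysis population, whether or not they appear in the outcomes data.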

Portability

All code for our platform is compliant with open standards and designed to be portable, so that it can run against any platform produced by the NHS in the future to securely store rich and linked primary and secondary care patient data.

Open Working Methods

While preserving patient privacy, and keeping all patient data secure, we are using modern open working methods: we are openly sharing all analytic code and development insights, in order to accelerate development of analyses and other tools by other groups with other datasets. Our team has extensive experience with open working methods: we have already shared over 45,000 lines of code on GitHub for our previous NHS data science projects. Our tools are built in Python, SQL, and Docker, with additional statistical analyses called from Stata and R; all our code and analyses are managed through GitHub for efficiency, collaboration, transparency and reproducibility. Our initial focus is on Covid-19, but we are delivering computational data science tools for epidemiology, and mixed teams of software developers working alongside epidemiologists, that will rapidly accelerate delivery for electronic health records analyses in general. You can see examples from our rapidly growing codebase here and here.

Funding

We have developed and deployed a fully functional platform in five weeks with no funding. Given modest financial resources we will sustain, accelerate, and expand our work. We have funding applications under review with NIHR and UKRI. We have delivered, at speed, in a space where delivery has historically been slow: we have done it by taking a new approach, and building a new kind of team. If you would like to support OpenSAFELY please get in touch.

Appendix 1: Key Analytic Questions

Our work allows us to rapidly deliver answers on the following important clinical and public health questions, which are currently outstanding, and which cannot be delivered by other means:

  1. Determine which people are at highest risk of hospital admission, ventilation, or death, to inform 111 advice, management choices, seclusion advice, and service planning. For example, there may be certain pre-existing medical problems that put people at much higher risk of Covid-related admission or death, that have not yet been identified, and which mean new categories of people need to be in the high-risk group for self-seclusion during the pandemic.

  2. Rapidly assess specific hypotheses around treatment or prevention as they arise including: the possible benefits of chloroquine, or of antiretroviral medication used for HIV; the possible hazards of ibuprofen; the possible benefits of inhaled corticosteroids; the benefits or hazards of drugs that up-regulate ACE2 receptors (such as ACE inhibitors and angiotensin receptor blockers); possible beneficial effects from the JAK inhibitor baricitinib. These can all be rapidly evaluated by comparing rates of admission and death among those who have, and have not, been routinely taking such medications in primary care.

  3. Combine disease dynamics modelling with near-real-time hyperlocal clinical data on prevalence and population at risk, to predict local spread and service need, and (for example) to design and evaluate exit strategies from lockdown.

  4. Measure and mitigate the indirect health impacts of Covid-19: subject to approval we can monitor the data to identify “Covid Aftershocks” and give early warning on clinical work displaced, such as cancer referrals, cardiovascular management, and vaccinations. We can also help identify NHS organisations in need of additional support around delivering good care as the pandemic continues; and rapidly identify success stories from new best practice that others can learn from.

  5. Rapidly evaluate the impact of national interventions (and collect outcomes data for pragmatic cluster randomised trials of preventive or treatment interventions), especially on specific patient groups.
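The comparison described in question 2 (rates of an outcome among those who have, and have not, been taking a medication) can be sketched in Python as a crude rate ratio. This is a deliberately simplified illustration under stated assumptions: the `Patient` fields and function names are hypothetical, the cohort is made up, and the calculation is unadjusted, whereas a real analysis would use statistical models that account for differences between the two groups.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Patient:
    on_medication: bool  # routinely prescribed the drug in primary care
    had_outcome: bool    # e.g. admitted or died (hypothetical flag)


def outcome_rate(patients: List[Patient]) -> Optional[float]:
    """Proportion of patients with the outcome, or None for an empty group."""
    if not patients:
        return None
    return sum(p.had_outcome for p in patients) / len(patients)


def crude_rate_ratio(cohort: List[Patient]) -> Optional[float]:
    """Outcome rate in the exposed group divided by the rate in the unexposed group.

    Returns None when either group is empty or the unexposed rate is zero.
    """
    exposed = [p for p in cohort if p.on_medication]
    unexposed = [p for p in cohort if not p.on_medication]
    exposed_rate = outcome_rate(exposed)
    unexposed_rate = outcome_rate(unexposed)
    if exposed_rate is None or not unexposed_rate:
        return None
    return exposed_rate / unexposed_rate


if __name__ == "__main__":
    # Made-up cohort: 2 of 4 exposed and 1 of 4 unexposed have the outcome.
    cohort = (
        [Patient(True, True)] * 2 + [Patient(True, False)] * 2
        + [Patient(False, True)] * 1 + [Patient(False, False)] * 3
    )
    print(crude_rate_ratio(cohort))
```

A ratio above 1 would suggest a higher crude outcome rate among those taking the medication, and a ratio below 1 a lower one; neither on its own establishes that the medication is the cause.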

Appendix 2: Further Detail

Security and governance

We are working on behalf of NHS England, who are acting as Data Controller for the purposes of this urgent project; each EHR vendor acts as Data Processor. The Secretary of State for Health issued NHS England/Improvement a notice under Regulation 3(4) of the Health Service (Control of Patient Information) Regulations 2002, which enabled NHS England to collect the data required from GP practices directly from their EHR vendor. All information governance for this urgent project is handled by NHS England. The DPIA covering data flows and access approves the linkage of GP data to outcomes data from the new NHS England and NHSX data store and other sources, including: CPNS deaths data; ICNARC ITU admissions data; SGSS PHE test data; ECDS A&E patient-level data; ONS death data.

Our approach to privacy and security exceeds standards for many other current EHR analysis projects. We severely restrict SQL query access to the “event-level” data, which would otherwise present the highest theoretical privacy risk. We then abstract the key clinical features of each patient for each analysis into a “feature store” for statistical analysis: this summary data is perfectly matched to the needs of each project, but substantially less vulnerable to re-identification attacks; it is nonetheless still managed to the highest privacy standards, as if it were security-critical event-level data. All access to the secure platform is over highly secure VPN from specific IP addresses and MAC addresses for a very small number of highly trusted, named and experienced analysts whose activity is all fully logged. By building our analytics platform inside the originating EHR vendors’ data centre we completely avoid transporting large raw primary care datasets which would otherwise present a substantial privacy risk, even when pseudonymised.
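To illustrate the idea of abstracting event-level data into a "feature store", the following Python sketch collapses a list of coded events into one row of yes/no features per patient. It is a sketch under assumptions rather than a description of the platform's code: the codelists contain placeholder strings (not real clinical codes), and the event layout, feature names, and function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Placeholder codelists; real analyses would use curated clinical codelists.
DIABETES_CODES: Set[str] = {"EXAMPLE_DIABETES_1", "EXAMPLE_DIABETES_2"}
ASTHMA_CODES: Set[str] = {"EXAMPLE_ASTHMA_1"}

# Hypothetical event layout: (pseudonymised patient ID, clinical code, ISO date).
Event = Tuple[str, str, str]


def build_feature_rows(events: Iterable[Event]) -> Dict[str, Dict[str, bool]]:
    """Collapse event-level rows into one row of boolean features per patient.

    The individual events, codes, and dates are discarded, so the output
    carries far less detail than the event-level input.
    """
    codes_by_patient: Dict[str, Set[str]] = defaultdict(set)
    for pseudo_id, code, _date in events:
        codes_by_patient[pseudo_id].add(code)

    return {
        pseudo_id: {
            "has_diabetes": bool(codes & DIABETES_CODES),
            "has_asthma": bool(codes & ASTHMA_CODES),
        }
        for pseudo_id, codes in codes_by_patient.items()
    }


if __name__ == "__main__":
    # Made-up events, purely for illustration.
    example_events = [
        ("p1", "EXAMPLE_DIABETES_1", "2019-03-01"),
        ("p1", "EXAMPLE_ASTHMA_1", "2018-01-15"),
        ("p2", "EXAMPLE_OTHER", "2020-01-20"),
    ]
    for pseudo_id, features in build_feature_rows(example_events).items():
        print(pseudo_id, features)
```

Because each analysis needs only the features it defines, the statistical code can work from these per-patient rows without querying the event-level records directly.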

Project Plan

There are two phases to this work. Phase one is rapidly delivering a secure analytics platform, and then urgent analyses, from an experienced team of EHR analysts using open tools and working methods. Phase two is delivering a more generalisable secure analytics service for work on Covid-19 in NHS data, using open source tools and working methods, and supporting analysts by sharing working methods and data analysis code that can run against any NHS database service produced in the future.

Phase One: Urgent Analyses

We have already delivered a fully live secure analytics platform inside the data centre of TPP, an electronic health record provider that covers 40% of practices in the country and processes data on over 24 million patients, including their previous medical history, investigations, and current or past medications. This was an exciting, productive and efficient collaboration at phenomenal scale and pace. Subject to approval we will shortly deliver a new secure analytics platform across more patients' data. We are now running urgent analyses with a wide range of collaborative analytic partners to help identify which patients are most at risk, and why; which treatments increase or decrease risk; and to evaluate spread of the disease, and pressure on NHS services, using hyperlocal real-world data.

Phase Two: open source tools for EHR analysis

Phase two will use our experience of rapidly delivering this Covid-19 analytics platform to explore the best technical and governance mechanisms for a secure analytics platform on NHS data that is open to bona fide analysts acting in the public interest, in support of all other work in this domain. We will rapidly expand our open library of analytic code to facilitate and accelerate analysis in NHS electronic health records data in general: this document contains only a brief summary; we will share our larger roadmap shortly. Our group has a longstanding proven track record of delivering high quality open source outputs, with over 45,000 lines of code on GitHub, and all analytic code and codelists shared openly to improve the quality of analytics and make it more efficient to deliver for all. There are ongoing discussions that pre-date Covid-19 around possible long-term plans to move full primary care records into NHS Digital, alongside ongoing work to assemble larger amounts of NHS data into a warehouse at NHSX and NHS England during the Covid-19 pandemic. Our work fully supports all these endeavours. All code for our platform is compliant with open standards and designed to be portable, so that it can run against any new platform containing NHS data.