Box 1. Case Example: AlzEye—Linking Ophthalmic Imaging and Systemic Disease Labels at Scale to Provide New Insights into Dementia (and Cardiovascular Disease)
When trying to achieve the scale of data needed for machine learning approaches, routinely collected data are an attractive alternative to the high-cost, researcher-led data sets compiled through epidemiologic studies or biobanks. One aim of such an approach is to create virtual biobanks far more cheaply than would otherwise be possible (arguably a “biobank-on-a-shoestring”) that may also better reflect the population of interest (vs. the somewhat skewed populations observed in some biobank programs).
An example of this kind of approach is AlzEye, the United Kingdom's first and largest linkage of complex three-dimensional imaging data (fundus photographs and retinal OCT) to systemic health diagnostic codes for the purposes of exploring retinal ultrastructural associations and predictors of dementia and its subtypes. AlzEye depends on the combination of both local and nationally held data sets within the United Kingdom's National Health Service (NHS). Specifically, AlzEye is a pseudonymized data set linking retinal photographs and OCT scans of all patients older than 40 years attending Moorfields Eye Hospital NHSFT with Hospital Episode Statistics (HES), a national database consisting of all admissions, emergency attendances, and outpatient appointments in England. The appropriate use and linkage of such data depend on satisfying many criteria, including ethical approval, data security, and governance. Engagement with the public has been pivotal to the approach. We surveyed 483 participants to canvass public opinion on the use of eye scans for research and the acceptability of large data sets to identify patterns of systemic disease. Two members of the public sit on the AlzEye working group, and information regarding the study is outlined on the funding charity's website.
This kind of study is complex, and the approval process that AlzEye underwent was appropriately robust, with several separate approvals required before the data set could be established. Although the exact process will vary from country to country, the underlying principles are likely to be similar, and we therefore highlight them here. The first stage required us to secure a research sponsor, which necessitated institutional approval covering research and development, information governance, and information technology at both the NHS data custodian (Moorfields Eye Hospital NHSFT) and the research institute (University College London). Important conditions involving linkage by a “trusted third party,” robust data privacy measures, and sufficient computing infrastructure were outlined at this stage. In AlzEye, the linkage process is as follows: (1) Images from Moorfields Eye Hospital are pseudonymized through the removal of all identifiers and replacement with a unique study ID. These are then transferred to University College London. (2) Simultaneously, a spreadsheet of the identifiers associated with the images (date of birth, unique NHS number, sex) is securely sent to NHS Digital, the national body overseeing the HES data warehouse. (3) NHS Digital strips the identifiers and returns the relevant HES data with pseudonymized study IDs to University College London, where they are linked with the corresponding images. Thus, HES data never enter the source of imaging data (Moorfields Eye Hospital), and conversely, identifiers never enter University College London (Fig. 1).
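The three-step linkage flow described above can be illustrated with a minimal Python sketch. It is not the AlzEye implementation: every function name, field name, and record is hypothetical toy data, and it assumes that the identifier spreadsheet also carries the study ID so that the trusted third party can return records keyed by it. It is intended only to show which party sees which fields.

```python
import secrets

# Hypothetical field names; the source lists date of birth, NHS number, and sex.
IDENTIFIER_FIELDS = ("nhs_number", "date_of_birth", "sex")


def assign_study_ids(patients):
    """Data custodian: give each patient a random, meaningless study ID."""
    return {p["nhs_number"]: secrets.token_hex(8) for p in patients}


def pseudonymize_images(images, study_ids):
    """Data custodian: replace identifiers on each image record with the study ID."""
    return [
        {"study_id": study_ids[img["nhs_number"]], "scan": img["scan"]}
        for img in images
    ]


def build_identifier_sheet(patients, study_ids):
    """Data custodian: identifiers plus study ID (assumed), for the third party only."""
    return [
        {**{f: p[f] for f in IDENTIFIER_FIELDS}, "study_id": study_ids[p["nhs_number"]]}
        for p in patients
    ]


def third_party_link(identifier_sheet, admissions):
    """Trusted third party: match records on NHS number, then strip identifiers."""
    lookup = {row["nhs_number"]: row["study_id"] for row in identifier_sheet}
    return [
        {"study_id": lookup[a["nhs_number"]], "diagnosis_code": a["diagnosis_code"]}
        for a in admissions
        if a["nhs_number"] in lookup
    ]


def research_join(pseudo_images, pseudo_admissions):
    """Research institute: join images and hospital records on study ID alone."""
    linked = {}
    for img in pseudo_images:
        entry = linked.setdefault(img["study_id"], {"scans": [], "diagnosis_codes": []})
        entry["scans"].append(img["scan"])
    for adm in pseudo_admissions:
        if adm["study_id"] in linked:
            linked[adm["study_id"]]["diagnosis_codes"].append(adm["diagnosis_code"])
    return linked


# Toy records, invented purely for illustration.
patients = [
    {"nhs_number": "0000000001", "date_of_birth": "1950-01-01", "sex": "F"},
    {"nhs_number": "0000000002", "date_of_birth": "1945-06-15", "sex": "M"},
]
images = [
    {"nhs_number": "0000000001", "scan": "oct_001"},
    {"nhs_number": "0000000001", "scan": "fundus_001"},
    {"nhs_number": "0000000002", "scan": "oct_002"},
]
admissions = [
    {"nhs_number": "0000000001", "diagnosis_code": "CODE_A"},
    {"nhs_number": "0000000009", "diagnosis_code": "CODE_B"},  # not an imaged patient
]

study_ids = assign_study_ids(patients)                     # stays with the custodian
pseudo_images = pseudonymize_images(images, study_ids)     # step 1: to the research site
sheet = build_identifier_sheet(patients, study_ids)        # step 2: to the third party
pseudo_admissions = third_party_link(sheet, admissions)    # step 3: to the research site
linked = research_join(pseudo_images, pseudo_admissions)
```

In this sketch the research site receives only `pseudo_images` and `pseudo_admissions`, neither of which contains an identifier field, while the custodian never receives the admissions records. This mirrors the separation described in the text.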