December 2022 | Volume 11, Issue 12 | Open Access | Artificial Intelligence
A Deep Learning Framework for the Detection and Quantification of Reticular Pseudodrusen and Drusen on Optical Coherence Tomography
Author Affiliations & Notes
  • Roy Schwartz
    Moorfields Eye Hospital NHS Foundation Trust, London, UK
    Institute of Health Informatics, University College London, London, UK
    Quantitative Healthcare Analysis (qurAI) Group, Informatics Institute, University of Amsterdam, Amsterdam, The Netherlands
  • Hagar Khalid
    Moorfields Eye Hospital NHS Foundation Trust, London, UK
    Tanta University Hospital, Tanta, Egypt
  • Sandra Liakopoulos
    Cologne Image Reading Center, Department of Ophthalmology, Faculty of Medicine and University Hospital Cologne, University of Cologne, Cologne, Germany
    Department of Ophthalmology, Goethe University, Frankfurt, Germany
  • Yanling Ouyang
    Moorfields Eye Hospital NHS Foundation Trust, London, UK
  • Coen de Vente
    Quantitative Healthcare Analysis (qurAI) Group, Informatics Institute, University of Amsterdam, Amsterdam, The Netherlands
    Amsterdam UMC location University of Amsterdam, Biomedical Engineering and Physics, Amsterdam, The Netherlands
    Diagnostic Image Analysis Group (DIAG), Department of Radiology and Nuclear Medicine, Radboud UMC, Nijmegen, The Netherlands
  • Cristina González-Gonzalo
    Quantitative Healthcare Analysis (qurAI) Group, Informatics Institute, University of Amsterdam, Amsterdam, The Netherlands
    Diagnostic Image Analysis Group (DIAG), Department of Radiology and Nuclear Medicine, Radboud UMC, Nijmegen, The Netherlands
  • Aaron Y. Lee
    Roger and Angie Karalis Johnson Retina Center, University of Washington, Seattle, WA, USA
    Department of Ophthalmology, University of Washington, Seattle, WA, USA
  • Robyn Guymer
    Centre for Eye Research Australia, Royal Victorian Eye and Ear Hospital, East Melbourne, Australia
  • Emily Y. Chew
    National Eye Institute (NEI), National Institutes of Health (NIH), Bethesda, MD, USA
  • Catherine Egan
    Moorfields Eye Hospital NHS Foundation Trust, London, UK
  • Zhichao Wu
    Centre for Eye Research Australia, Royal Victorian Eye and Ear Hospital, East Melbourne, Australia
  • Himeesh Kumar
    Centre for Eye Research Australia, Royal Victorian Eye and Ear Hospital, East Melbourne, Australia
    Ophthalmology, Department of Surgery, The University of Melbourne, Melbourne, Australia
  • Joseph Farrington
    Institute of Health Informatics, University College London, London, UK
  • Philipp L. Müller
    Moorfields Eye Hospital NHS Foundation Trust, London, UK
    Makula Center, Südblick Eye Centers, Augsburg, Germany
    Department of Ophthalmology, University of Bonn, Bonn, Germany
  • Clara I. Sánchez
    Quantitative Healthcare Analysis (qurAI) Group, Informatics Institute, University of Amsterdam, Amsterdam, The Netherlands
    Amsterdam UMC location University of Amsterdam, Biomedical Engineering and Physics, Amsterdam, The Netherlands
  • Adnan Tufail
    Moorfields Eye Hospital NHS Foundation Trust, London, UK
  • Correspondence: Roy Schwartz, 162 City Road, EC1V 2PD, London, UK. e-mail: royschwartz@gmail.com 
  • Footnotes
    *  CIS and AT contributed equally as co-last authors.
Translational Vision Science & Technology December 2022, Vol. 11, 3. doi: https://doi.org/10.1167/tvst.11.12.3
Abstract

Purpose: The purpose of this study was to develop and validate a deep learning (DL) framework for the detection and quantification of reticular pseudodrusen (RPD) and drusen on optical coherence tomography (OCT) scans.

Methods: A DL framework was developed consisting of a classification model and an out-of-distribution (OOD) detection model for the identification of ungradable scans; a classification model to identify scans with drusen or RPD; and an image segmentation model to independently segment lesions as RPD or drusen. Data were obtained from 1284 participants in the UK Biobank (UKBB) with a self-reported diagnosis of age-related macular degeneration (AMD) and 250 UKBB controls. Drusen and RPD were manually delineated by five retina specialists. The main outcome measures were sensitivity, specificity, area under the receiver operating characteristic (ROC) curve (AUC), kappa, accuracy, intraclass correlation coefficient (ICC), and free-response receiver operating characteristic (FROC) curves.

Results: The classification models performed strongly at their respective tasks (AUC of 0.95, 0.93, and 0.99 for the ungradable scans classifier, the OOD model, and the drusen and RPD classification model, respectively). The mean ICC for the drusen and RPD area versus graders was 0.74 and 0.61, respectively, compared with 0.69 and 0.68 for intergrader agreement. FROC curves showed that the model's sensitivity was close to human performance.

Conclusions: The models achieved high classification and segmentation performance, similar to human performance.

Translational Relevance: Application of this robust framework will further our understanding of RPD as a separate entity from drusen in both research and clinical settings.

Introduction
Age-related macular degeneration (AMD) is defined by the presence of drusen, deposits found under the retinal pigment epithelium (RPE), which are key to the diagnosis of AMD.1 Recent advances in multimodal imaging have, however, substantially improved our ability to characterize the AMD phenotype, revealing information about a variety of deposits that occur in AMD, such as reticular pseudodrusen (RPD).2 RPD have been associated with late AMD and are considered a critical AMD phenotype to understand.3–12 To date, most studies associating AMD risk with RPD have relied on a binary measure (i.e. the presence or absence of RPD), with no clear understanding of how the quantity of RPD plays into the risks posed by their presence. Understanding the associations and risk of RPD is confounded by the fact that eyes with RPD often also have drusen, which impose their own risks. To help improve our understanding of RPD and their associations, large datasets are essential, but to date most available large datasets are based on cohorts collected for their AMD status, and few have eyes with only RPD. This leads to confounding when trying to understand the contribution that RPD make to any increased risk of vision loss in eyes with AMD.
Spectral-domain optical coherence tomography (SD-OCT) has been shown to have a much higher sensitivity and specificity for both detecting RPD and separating these lesions from drusen compared with the blue channel of color fundus photographs (CFPs), infrared reflectance, fundus autofluorescence, near-infrared fundus autofluorescence, confocal blue reflectance, and indocyanine green angiography.13,14 In addition, OCT is the only imaging modality that allows confirmation of the subretinal localization of RPD, which cannot be ascertained by other imaging modalities.2 Given the subtlety of RPD lesions on OCT, even on the latest generation devices, and more so on the early generation OCT devices utilized in existing large population studies, human detection and quantification remain a challenge.15 Given the importance of detecting and quantifying RPD and separating them from drusen, in terms of both our understanding of the pathogenesis of RPD and the potential implication of their presence for current and future therapies,16 an automated approach to classification and quantification is needed.
Machine learning (ML) algorithms have been shown to be powerful tools in the automatic quantification of retinal biomarkers identified on OCT,17–19 making them ideal for the detection of RPD and drusen. To date, there is a large volume of published studies describing the detection of drusen on OCT using ML, the majority of which deploy classification models that do not allow for the quantification of the lesion area.20–24 Thus far, only two studies have explored ML techniques for the automatic detection of RPD on OCT. The first was a classifier, thus not allowing for image quantification,25 and the other was based on the identification of drusen and RPD by interpolating retinal layer undulations. The latter approach was only internally validated on a small number of eyes and has not been shown to perform on the more challenging images generated by the older SD-OCT devices used in a number of large population studies, or to distinguish between RPD stages.26
We herein present a deep learning (DL) framework for the detection and quantification of drusen and RPD in the UK Biobank (UKBB), a large-scale biomedical database and research resource containing genetic, lifestyle, and health information from half a million UK participants.
Methods
Study Population
The UKBB study is a large, multisite, community-based cohort study with the aim of improving the prevention, detection, and treatment of a wide range of serious and life-threatening diseases. The UKBB database includes data on 500,000 volunteer participants aged between 40 and 69 years, recruited in 2006 to 2010 from across the United Kingdom. All UK residents aged 40 to 69 years who were registered with the National Health Service and living within 25 miles of one of 22 study assessment centers were invited to participate. The North West Multi-centre Research Ethics Committee approved the study (REC reference number: 06/MRE08/65), in accordance with the principles of the Declaration of Helsinki. Detailed information about the study is available at the UKBB website (www.ukbiobank.ac.uk).
Of all UKBB participants, 67,687 underwent OCT and CFP imaging at six UKBB centers (Sheffield, Liverpool, Hounslow, Croydon, Birmingham, and Swansea), acquired using the Topcon 3D OCT 1000 Mark II (Topcon, Japan). Image acquisition was performed under mesopic conditions, without pupillary dilation, using the 3-dimensional macular volume scan (512 horizontal A-scans/B-scan; 128 B-scans in a 6 × 6-mm raster pattern). Of 2622 participants with a self-reported diagnosis of AMD identified in the database, 1284 had OCT volume scans and CFPs and were used in the study. The UKBB project ID associated with this paper is 60078. Participants were excluded from the study if they had withdrawn their consent.
Deep Learning Framework
Upon visual inspection, a significant number of OCT scans were found to be of insufficient quality for this study. To mitigate this, and to improve the accuracy of DL, a framework consisting of several separate DL models was developed (Fig. 1): (a) A classification model to detect ungradable scans (Ungradable Classification Model), based on the difference in signal-to-noise ratio between gradable and ungradable scans. (b) An out-of-distribution detection model to further identify ungradable scans (see Model Development below), targeting gradable-ungradable differences that arise from outliers caused by optical artifacts (Outlier Detection Model). (c) A classification model to identify scans with drusen or RPD versus controls (those without these lesions; Drusen/RPD Classification Model). (d) An image segmentation model to independently segment lesions as RPD or drusen, allowing their quantification (Drusen/RPD Segmentation Model).
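The four models operate as a filtering cascade, with each stage removing volumes before the next runs. The sketch below illustrates this flow in Python; the model objects, method names, and the OCT volume type are hypothetical stand-ins for the trained networks described under Model Development, not the authors' implementation.

```python
from typing import Dict, Iterable, List

def run_framework(volumes: Iterable,
                  ungradable_clf, outlier_detector,
                  drusen_rpd_clf, segmenter) -> Dict[str, List]:
    """Hypothetical cascade: each stage filters volumes before the next runs."""
    results = {}
    for vol in volumes:
        # (a) Drop volumes the quality classifier deems ungradable.
        if ungradable_clf.predict(vol) == "ungradable":
            continue
        # (b) Drop out-of-distribution volumes (e.g. optical artifacts).
        if outlier_detector.is_outlier(vol):
            continue
        # (c) Keep only volumes containing drusen and/or RPD; drop controls.
        if drusen_rpd_clf.predict(vol) == "control":
            continue
        # (d) Segment drusen and RPD on each B-scan for quantification.
        results[vol.id] = [segmenter.predict(b_scan) for b_scan in vol.b_scans]
    return results
```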
Figure 1. Deep learning framework for the detection and quantification of conventional drusen and reticular pseudodrusen. A classification algorithm classifies OCT volumes on a volumetric (i.e. eye) level into gradable or ungradable, and ungradable volumes are removed (Ungradable Classification Model). A deep ensemble model for out-of-distribution detection identifies volumes with out-of-distribution scans, which are then removed (Outlier Detection Model). Another classification model, the Drusen/RPD classifier, classifies the remaining volumes into those with either drusen or RPD versus controls. Controls are removed (Drusen/RPD Classification Model). Finally, an image segmentation algorithm segments RPD and drusen separately on a B-scan level (Drusen/RPD Segmentation Model). RPD, reticular pseudodrusen.
Data Selection
Classification Models
To train the classification models, each OCT volume (eye) was labeled by a single grader (author R.S.), a retina specialist, as ungradable; containing drusen, RPD, or both; or control (not containing drusen or RPD).
Volumes were deemed ungradable if the outer retina was not seen clearly enough in a scan to confirm or exclude the presence of RPD and drusen (e.g. due to image noise, shadowing, or clipping of the outer retina) or in cases where vertically flipped scans existed in the volume.
Drusen were defined as discrete areas of RPE elevation with low to medium reflectivity, similar to the reflectivity of the inner plexiform and ganglion cell layers. RPD were defined as lesions above the RPE with medium reflectivity, similar to or slightly less than the reflectivity of the retinal nerve fiber layer. Each of the previously described stages was also considered when labeling eyes as RPD2,27: stage 1 - diffuse deposition of granular hyperreflective material between the RPE and the ellipsoid zone (EZ); stage 2 - similar to stage 1, but the mounds of accumulated material are sufficient to alter the contour of the EZ, resulting in EZ undulations; stage 3 - the material is thicker, adopts a conical appearance, and breaks through the EZ; stage 4 - defined by fading of the material because of re-absorption and, eventually, migration within the inner retinal layers. Although the above grading was based on OCT findings, a multimodal approach was used, when possible (i.e. when image quality was sufficient), to confirm the OCT findings using CFP. Of note, the presence of RPD due to other pathologies was considered in each case, but no evidence of other pathologies was seen in any of the cases.
Each eye was graded according to the following scale: (1) no drusen/RPD; (2) one drusen/RPD; (3) more than one drusen/RPD; (4) questionable drusen/RPD; and (5) ungradable. Categories 1 and 3 were used to train the drusen detection classification model, and category 5 was used to train the ungradable detection classification model. Category 2 was not used because the identification of RPD is challenging, and the presence of a pattern helps to distinguish cases with genuine RPD from grading variability. Therefore, to reduce the risk of including false-positive cases, it was decided to include only cases with more than a single RPD lesion. For uniformity, the same was applied to drusen. Category 4 was not used because the inclusion of questionable lesions might degrade the model's performance.
Of 2622 participants (5199 eyes) with self-reported AMD, 1284 (2523 eyes) had OCT scans. Four hundred eighty-nine eyes of 287 participants were classified as having more than one druse; 57 eyes of 38 participants were classified as having more than one RPD; 343 eyes of 232 participants were classified as ungradable; and 1182 eyes of 591 participants were identified as having no drusen/RPD (controls). In addition, to avoid the selection bias that may result from selecting controls out of a population with self-reported AMD, 250 control eyes were randomly identified from the general cohort. In total, 500 control eyes, 468 eyes with drusen or RPD, and 308 eyes with ungradable scans were included. They were divided into training, validation, and test sets in a 60:20:20 ratio. Eyes of a given participant were not allowed to exist in more than one set (Fig. 2A).
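Enforcing the rule that a participant's eyes never straddle sets amounts to a grouped split. The sketch below, on synthetic labels, shows one way to do this with scikit-learn's GroupShuffleSplit; the paper does not state the tooling actually used, so this is an illustrative assumption.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Synthetic stand-ins: eye indices and the participant each eye belongs to.
rng = np.random.default_rng(0)
n_eyes = 1276                                    # e.g. 500 + 468 + 308 eyes
participants = rng.integers(0, 800, size=n_eyes)
eyes = np.arange(n_eyes)

# Roughly 60% of participants (and all their eyes) go to training...
gss = GroupShuffleSplit(n_splits=1, train_size=0.6, random_state=0)
train_idx, rest_idx = next(gss.split(eyes, groups=participants))

# ...and the remaining participants are halved into validation and test.
gss2 = GroupShuffleSplit(n_splits=1, train_size=0.5, random_state=0)
val_rel, test_rel = next(gss2.split(rest_idx, groups=participants[rest_idx]))
val_idx, test_idx = rest_idx[val_rel], rest_idx[test_rel]

# No participant may appear in more than one set.
assert not set(participants[train_idx]) & set(participants[test_idx])
assert not set(participants[val_idx]) & set(participants[test_idx])
```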
Figure 2. Eye and B-scan selection flowchart for model training, validation, and testing. (A) Selection of eyes to train, validate, and test the classifiers. (B) Selection of B-scans to train, validate, and test the semantic segmentation model. AMD, age-related macular degeneration; OCT, optical coherence tomography; RPD, reticular pseudodrusen.
Semantic Segmentation Model
To train the semantic segmentation model, additional cases were identified in the UKBB dataset beyond the cohort of participants with a self-reported diagnosis of AMD that was used to train the classifiers. To identify additional cases, we used an in-house deep learning approach developed to detect AMD features in CFPs. It does so in a hierarchical manner: it first detects drusen; of those, it detects large drusen; and of those, it detects RPD.28 Using this approach and manually removing low-quality images, an additional 22 eyes were found on visual inspection to have more than one RPD (using the same OCT grading methodology described previously for the self-reported AMD cases) and were included in the training set for the segmentation model.
As the model was trained on B-scans rather than OCT volumes, B-scans were classified by a single grader (author R.S.) into the three groups described previously: 2834 scans with RPD, 2338 with drusen, and 4946 controls. Of those, B-scans with RPD were selected manually for training if they contained at least one RPD, with or without drusen, from different areas of the macula, to reflect the variability in RPD appearance. Overall, 334 B-scans (from 37 participants) with RPD were included. The same number of B-scans with drusen (from 38 participants) and control B-scans were randomly selected for training. These were divided into training, validation, and test sets (using a ratio of 60:20:20). B-scans of a given participant were not allowed to exist in more than one set (Fig. 2B).
Annotation
Manual delineation of features (drusen and each stage of RPD) to train the image segmentation model was performed by five experienced graders. The training and validation sets were independently annotated by two retina specialists (authors R.S. and H.K.), and the second grader (author H.K.) was used as the ground truth for training the model. An additional three retina specialists (authors A.T., S.L., and Y.O.) independently annotated all the scans in the test set. Annotation was done using Label Studio version 1.2.29 The graders were provided with a list of B-scans, shuffled to avoid priming bias (i.e. the tendency to annotate lesions based on previously seen lesions in the same eye). They had access to the complete OCT volume and could zoom in for accurate delineation. A document containing instructions and examples of the correct annotation of labels of interest was provided to graders and discussed with them. It included the definitions given previously for drusen and the different stages of RPD. Each of the lesion types was assigned a label and a different color. Graders were asked to grade a standard set of 6 B-scans containing examples of each label prior to annotating their respective sets, and an adjudication process (led by author R.S.) took place to ascertain uniformity among graders.
Model Development
All models were trained on a single server with an Intel 18 core 4.6 GHz Xeon processor, 256 GB of RAM, and an Nvidia Quadro RTX8000 card with 48 GB of RAM. 
Classification Models
The architecture for the Ungradable Classification Model and the Drusen/RPD Classification Model was a 3D Inception-V1.30 The 2D convolutions in the original Inception-V1 model were replaced with 3D convolutions. Except for the last convolution, a batch normalization layer31 and rectified linear unit (ReLU) activation function32 followed each convolution. The last convolution was followed by a softmax layer. We used Adam33 with a learning rate of 10^-4, β1 = 0.9, and β2 = 0.999 as the optimizer. During training, batches were randomly sampled in a balanced manner such that samples from each class were chosen equally often. Cross-entropy was used as the loss function. We used early stopping with a patience of 10,000 iterations based on the kappa score on the validation set. Data augmentation was applied to the training set, consisting of random rotations between -20 and +20 degrees; shearing between -10% and +10%; zooming between -10% and +10%; translations between -10 and +10 pixels in the B-scan plane; translations between -2 and +2 pixels in the z-direction; horizontal B-scan flipping with a probability of 15%; Gaussian noise with a mean of 0 and a standard deviation of 0.1, with a probability of 15%; and gamma corrections with γ between 0.75 and 3.0, with a probability of 15%.
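As an illustration of the optimizer and balanced-sampling choices above, the snippet below configures Adam with the stated hyperparameters and class-balanced batch sampling in PyTorch; the framework choice and helper names are assumptions, as the paper does not state the training library used.

```python
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

def make_loader_and_optimizer(model, dataset, labels, batch_size=4):
    # Adam with the hyperparameters stated above: lr = 10^-4, β1 = 0.9, β2 = 0.999.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))

    # Weight each sample inversely to its class frequency so that batches
    # draw from each class equally often, mirroring the balanced sampling above.
    labels_t = torch.as_tensor(labels)
    class_counts = torch.bincount(labels_t)
    sample_weights = (1.0 / class_counts.float())[labels_t]
    sampler = WeightedRandomSampler(sample_weights,
                                    num_samples=len(labels),
                                    replacement=True)

    loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)
    return loader, optimizer
```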
The Ungradable Classification Model was trained on a specific dataset, as mentioned above, which involved specific types of image aberrations. Because the data used to train the model represent only roughly 1.5% of the total UKBB dataset, a model trained to identify specific types of aberrations might not generalize well to the whole dataset (or to other datasets). Therefore, as part of the ungradable detection algorithm, we used deep ensembles34 for out-of-distribution (OOD) detection in addition to the previously mentioned classification model. This is a commonly used technique for uncertainty estimation and OOD detection that approximates Bayesian neural networks. It should therefore detect any deviation from normal scans, which should in theory also flag aberrations the previous model was not trained on. In this work, the deep ensemble consisted of 10 individual models, each trained on the entire training set with different weight initializations and different seeds for random sampling. During inference, we used the mean variance for each class among the models in the ensemble as a measure of the uncertainty of a sample. Ungradable cases were then differentiated from gradable ones based on this uncertainty measure. Of note, both models were tested on the entire test set.
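The uncertainty score described above can be computed directly from the stacked softmax outputs of the ensemble. A minimal sketch follows, assuming the outputs are collected into a (models × samples × classes) array; the decision threshold is illustrative, not the value used in the paper.

```python
import numpy as np

def ood_uncertainty(probs: np.ndarray) -> np.ndarray:
    """probs: softmax outputs of shape (n_models, n_samples, n_classes)."""
    per_class_var = probs.var(axis=0)   # variance across ensemble members
    return per_class_var.mean(axis=-1)  # mean variance over classes

# Example: a 10-member ensemble scoring 5 volumes on 2 classes.
rng = np.random.default_rng(0)
probs = rng.dirichlet(alpha=[2.0, 2.0], size=(10, 5))  # shape (10, 5, 2)
scores = ood_uncertainty(probs)
flag_ungradable = scores > 0.05   # hypothetical decision threshold
```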
Semantic Segmentation Model
A 2D U-Net architecture35 was trained using the nnU-Net framework, which has achieved high performance on various medical segmentation tasks and has the advantage of automatically adapting to different biomedical datasets.36 For training, five-fold cross-validation was used, and testing was performed with an ensemble of the cross-validated models.
Due to the limited number of B-scans for stage 3 and stage 4 RPD, stages 2, 3, and 4 RPD were grouped together as a single class. Thus, the model was trained to distinguish among three classes: drusen; stage 1 RPD; and stages 2, 3, and 4 RPD.
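In practice this grouping is a simple relabeling of the annotation masks before training. A minimal sketch, assuming an integer label scheme (which is hypothetical; the paper does not specify the encoding):

```python
import numpy as np

# Hypothetical label encoding: 0 = background, 1 = drusen,
# 2 = stage 1 RPD, 3/4/5 = stage 2/3/4 RPD.
LABEL_MAP = {0: 0, 1: 1, 2: 2, 3: 3, 4: 3, 5: 3}  # stages 2-4 share class 3

def collapse_rpd_stages(mask: np.ndarray) -> np.ndarray:
    out = np.zeros_like(mask)
    for src, dst in LABEL_MAP.items():
        out[mask == src] = dst
    return out

example = np.array([[0, 1, 2], [3, 4, 5]])
print(collapse_rpd_stages(example))  # [[0 1 2]
                                     #  [3 3 3]]
```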
Statistical Analysis
We evaluated the performance of the classification models using six metrics, defined as follows: (a) area under the receiver operating characteristic (ROC) curve (AUC) - an ROC curve37 displays the trade-off between the true-positive rate and the true-negative rate of a classification model at different threshold levels, and the AUC represents the model's capability to separate the negative and positive classes; (b) accuracy - the percentage of correctly classified images; (c) Cohen's kappa38 - compares the observed accuracy with an expected accuracy (random chance); (d) sensitivity; (e) specificity; and (f) area under the precision-recall curve - a precision-recall curve displays the trade-off between the positive predictive value and the sensitivity of a classification model at different threshold levels.
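For concreteness, all six metrics can be computed from predicted probabilities and a decision threshold; the sketch below uses scikit-learn on toy labels (an assumption, as the paper does not name its evaluation tooling).

```python
import numpy as np
from sklearn.metrics import (accuracy_score, average_precision_score,
                             cohen_kappa_score, confusion_matrix,
                             roc_auc_score)

# Toy ground-truth labels and predicted probabilities for the positive class.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.4, 0.8, 0.7, 0.9, 0.2, 0.6, 0.3])
y_pred = (y_prob >= 0.5).astype(int)   # hypothetical operating threshold

auc = roc_auc_score(y_true, y_prob)               # (a) AUC
acc = accuracy_score(y_true, y_pred)              # (b) accuracy
kappa = cohen_kappa_score(y_true, y_pred)         # (c) Cohen's kappa
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)                      # (d) sensitivity
specificity = tn / (tn + fp)                      # (e) specificity
auprc = average_precision_score(y_true, y_prob)   # (f) area under PR curve
```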
We evaluated the performance of the segmentation model using the following measures. To measure detection performance, we identified the number of individual features that were properly detected (i.e. overlapped with the ground-truth segmentation of the feature) within each B-scan by using the label function of the scikit-image Python library, which finds connected components in a binary image. We analyzed the overlap using free-response receiver operating characteristic (FROC) curves. Similar to ROC curves, FROC curves compare performance by highlighting sensitivity when operating at varying false-positive rates. Unlike ROC curves, the sensitivity is plotted against the average number of false-positive lesions per instance (in this case, per B-scan). We also report the Dice similarity metric, which is defined as the size of the intersection of two areas divided by their average individual size. A Dice score of 1 indicates perfect agreement, and a score of 0 indicates disjoint areas.39
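A minimal sketch of the lesion-level counting and the Dice metric on toy binary masks, using scikit-image's label and regionprops; the overlap criterion (any shared pixel) follows the description above, while the helper names are our own.

```python
import numpy as np
from skimage.measure import label, regionprops

def lesion_counts(pred: np.ndarray, gt: np.ndarray):
    """Count ground-truth lesions hit by the prediction and false-positive
    predicted components (a hit = any pixel overlap)."""
    gt_cc, pred_cc = label(gt), label(pred)
    hits = sum(1 for r in regionprops(gt_cc) if pred[tuple(r.coords.T)].any())
    false_pos = sum(1 for r in regionprops(pred_cc)
                    if not gt[tuple(r.coords.T)].any())
    return hits, int(gt_cc.max()), false_pos

def dice(a: np.ndarray, b: np.ndarray) -> float:
    # 2|A∩B| / (|A| + |B|): intersection divided by the average individual size.
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

gt = np.zeros((8, 8), bool); gt[2:4, 2:4] = True           # one true lesion
pred = np.zeros((8, 8), bool); pred[3, 3] = pred[6, 6] = True
print(lesion_counts(pred, gt), dice(pred, gt))             # (1, 1, 1) 0.333...
```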
In addition, standard repeatability metrics, including the intraclass correlation coefficient (ICC) for absolute agreement and the Bland-Altman repeatability coefficient (RC), were used to measure agreement in the area of the different lesions between the model and graders and for interrater reliability analysis. For model-grader agreement, the mean of the ICC and the RC between the model's segmented areas and those segmented by each grader is presented. Cases with no segmentation were included as an area of zero. The ICC was calculated using the Pingouin library for Python.40 ICC values were interpreted as follows41: a value below 0.50 was considered poor; a value between 0.50 and 0.75, moderate; a value between 0.75 and 0.90, good; and a value above 0.90, excellent. The RC was calculated using the technique described by Bland and Altman.42
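A minimal sketch of these agreement statistics on synthetic areas, in the long format Pingouin expects; the RC here is computed as 1.96 × the SD of the paired differences, one common formulation of the Bland-Altman coefficient, which may differ in detail from the authors' calculation.

```python
import numpy as np
import pandas as pd
import pingouin as pg

# Synthetic per-scan lesion areas measured by the model and by one grader.
rng = np.random.default_rng(0)
true_area = rng.gamma(2.0, 50.0, size=30)
model_area = true_area + rng.normal(0, 10, size=30)
grader_area = true_area + rng.normal(0, 10, size=30)

# Long-format frame: one row per (scan, rater) measurement.
df = pd.DataFrame({
    "scan": np.tile(np.arange(30), 2),
    "rater": np.repeat(["model", "grader"], 30),
    "area": np.concatenate([model_area, grader_area]),
})

icc = pg.intraclass_corr(data=df, targets="scan", raters="rater", ratings="area")
print(icc[icc["Type"] == "ICC2"])   # two-way random effects, absolute agreement

# Bland-Altman repeatability coefficient: 1.96 × SD of paired differences.
rc = 1.96 * np.std(model_area - grader_area, ddof=1)
```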
Results
Classification Models
Metrics obtained by the different models are presented in Table 1, and ROC curves are presented in Figure 3. All models achieved a high AUC, ranging from 0.93 to 0.99, and high accuracy, ranging from 81.6% to 98.4%. Of the two models aimed at image quality assessment, the Ungradable Classification Model achieved higher sensitivity, while the Outlier Detection Model achieved higher specificity. Kappa scores ranged from 0.59 to 0.97. Precision-recall curves for the different models are presented in Supplementary Figure S1. The area under the precision-recall curve was 0.88, 0.76, and 0.99 for the Ungradable Classification Model, the Outlier Detection Model, and the Drusen/RPD Classification Model, respectively.
Table 1. Metrics Obtained by the Different Classification Models on the Test Set
Figure 3. Receiver operating characteristic (ROC) curves for the three classification models. From left to right, the curves apply to the RPD/drusen vs. controls model, the Ungradable Classification Model, and the Outlier Detection Model. The orange line represents the models' sensitivity at different thresholds, with the shaded area representing the 95% confidence interval.
Segmentation Model
Quantitative results for the ICC for each feature are presented in Table 2. The ICC for the model's performance against all graders was averaged and is presented alongside the model-grader performances. In addition, the intergrader agreement among all three graders is presented. On the test set, for drusen, the model and graders both achieved moderate agreement, with higher agreement achieved by the model compared with the intergrader agreement. For stage 1 RPD, both the model and the human graders achieved poor agreement, again with the model exceeding human agreement. The agreement of both humans and the model was again poor for stages 2, 3, and 4 RPD, this time with intergrader agreement exceeding the model's agreement. When the RPD area of all RPD stages combined was considered, both the intergrader agreement and the model's agreement with the ground truth were moderate, with the intergrader agreement exceeding the model's agreement. RC results are presented in Table 3.
Table 2. ICC Scores on the Test Set for the Model and Graders
Table 3. Bland-Altman Repeatability Coefficient Scores (in µm²) on the Test Set for the Model and Graders
FROC curves comparing the model's performance against each grader are presented in Figure 4. The most experienced grader for this task was chosen as the reference standard against which both model performance and the other graders were compared. The figure highlights the sensitivity of the graders and the model when operating at varying false-positive rates, with confidence intervals (CIs) obtained by bootstrapping (1000 bootstrap samples). For drusen, stage 1 RPD, stages 2, 3, and 4 RPD, and all RPD stages combined, the 95% CI of the model overlaps with the CI of the grader marked in blue; only for drusen does it overlap with the CIs of both graders. For drusen, stages 2, 3, and 4 RPD, and all RPD stages combined, the model obtained a sensitivity that was lower than that of both graders when operating at the same false-positive rate, whereas for stage 1 RPD it was higher than one grader and lower than the other.
Figure 4. Free-response receiver operating characteristic (FROC) curves for drusen, stage 1 reticular pseudodrusen (RPD), stages 2, 3, and 4 RPD, and all RPD stages combined, comparing the model to the ground truth graders. The line represents model sensitivity at different thresholds, with the shaded area representing the 95% confidence interval, obtained by bootstrapping. The dots represent the two other graders, one in blue and the other in yellow, with error bars representing 95% confidence intervals.
The Dice scores between the model and graders and between grader pairs are presented in Supplementary Table S1. Qualitative results of the output of the segmentation model are shown in Figures 5, 6, and 7.
Figure 5. Comparison of model and grader output, reticular pseudodrusen (RPD) stage 2, 3, or 4. The green color represents stage 2, 3, or 4 RPD. Grader 1 was used as the reference standard against model performance and against other graders, as presented in the FROC curves.
Figure 6. Comparison of model and grader output, reticular pseudodrusen (RPD) stage 1 and RPD stages 2, 3, or 4. Stage 1 RPD is represented in white, and stages 2, 3, and 4 are represented in green. Grader 1 was used as the reference standard against model performance and against other graders, as presented in the FROC curves.
Figure 7. Comparison of model and grader output, conventional drusen, reticular pseudodrusen (RPD) stage 1, and RPD stages 2, 3, or 4. Drusen are represented in blue, stage 1 RPD in white, and stages 2, 3, and 4 in green. Grader 1 was used as the reference standard against model performance and against other graders, as presented in the FROC curves.
Discussion
We present a DL framework that is robust in the presence of ungradable scans and accurately classifies and segments drusen and RPD. To the best of our knowledge, this is the first framework to handle multiple aspects of lesion analysis in AMD, including automated image quality assessment and lesion detection, and the first DL model to allow accurate quantification of these lesions.
Our two classifiers for image quality assessment were designed to perform two different tasks: the first, quality assessment, achieved by detecting the difference in signal-to-noise ratio between gradable and ungradable scans; and the second, the detection of outliers caused by optical artifacts, achieved by OOD detection. Both classifiers achieved high performance in detecting poor quality scans (AUC of 0.95 and 0.93 for the Ungradable Classification Model and the Outlier Detection Model, respectively). They both serve as steps in automated data curation. Image quality control is essential to ensure optimal performance by a DL algorithm designed to be deployed on real-world data.43 Unlike research and development environments, where such models are often trained on carefully curated datasets, real-world data may be more challenging, as evidenced by a recent attempt by Google Health to deploy a diabetic retinopathy model in a clinic setting, where its performance was worse than in the laboratory setting.44 To date, only a small number of publications have described the use of ML for image quality assessment on OCT scans. For example, Kauer et al. developed an ML classifier (AQuA), trained on OCT images acquired on the Spectralis SD-OCT device (Heidelberg Engineering), to identify poor quality scans.45 Later, another neural network, termed AQuANet, was developed to allow AQuA to be adapted to OCT devices from other vendors; it was shown to transfer well to the Cirrus HD-OCT device (Carl Zeiss Meditec AG).46 However, both devices are characterized by high-quality scans, which are often lacking in the datasets of existing large population studies acquired on older devices. To the best of our knowledge, our framework is the first published to classify poor quality scans on such devices, making it useful for research involving similar large datasets. It is also the first to utilize image quality control as part of a detection and quantification framework, which should increase its accuracy when deployed on target datasets. The use of out-of-distribution detection alongside a classifier trained on specific examples of ungradable volumes allows the model to generalize better to previously unseen image artifacts.
The Drusen/RPD Classification Model achieved an AUC of 0.99. However, it is important to note that it was tested on a subset of the dataset that did not contain ungradable scans, as mentioned in the Methods section, and these results do not reflect possible degradation of performance under such circumstances. Despite this limitation, and to the best of our knowledge, this is the first classifier that can detect both drusen and RPD. Numerous studies have previously reported classification algorithms for the detection of drusen only.20–22,47 The ability to detect RPD as well as drusen can be used both for screening high-risk patients and for research into RPD, in addition to its role in the current framework.
For the Drusen/RPD Segmentation Model, the model's agreement with human graders, as reflected by ICC scores, was better than the intergrader agreement for drusen. For RPD, it was better for stage 1. Stage 1 RPD, as reflected by the very low intergrader agreement for segmentation of this lesion, is an exceptionally challenging lesion to grade, because it presents only as a medium reflectivity change between the RPE and EZ, with the additional loss of the normal anatomy between these layers. With older devices, the loss of anatomy is harder to appreciate, making annotation a more challenging task. In addition, distinguishing between stage 1 and other stages of RPD appears to have presented a challenge for both humans and the model, as reflected by the better agreement (moderate vs. poor) when RPD lesions of all stages are considered, and as reflected in the qualitative examples (see Figs. 5, 6). Despite the difficulty this dataset presents, the model achieved performance that is either beyond human performance (drusen, stage 1 RPD) or close to human performance (RPD stages 2, 3, and 4; all RPD stages combined).
Of note, it is possible that the test set, chosen randomly, was challenging to annotate. For comparison, the intergrader agreement between the two graders who segmented the training and validation sets was also calculated and was higher than that achieved for the test set. It was 0.94 (95% CI = 0.91 to 0.95) for the drusen area, 0.67 (95% CI = 0.6 to 0.73) for the RPD area when all stages were considered, 0.34 (95% CI = 0.2 to 0.45) for stage 1 RPD, and 0.72 (95% CI = 0.67 to 0.76) for stages 2, 3, and 4 RPD. If the test set was more challenging for humans, it may be inferred that it was more challenging for a DL model, and better performance can be expected on less challenging datasets. However, the difference in ICC scores may also have resulted from higher consistency between the two graders who annotated the training and validation sets compared with the three who graded the test set.
Similar findings were seen in the FROC curves. The model achieved sensitivity close to human performance and, in fact, similar to that of a senior retina expert, as evidenced by the overlapping CIs seen in the plot. Given the complexity of grading these lesions, it is possible that more graders, especially those with less experience, would have fared worse than the model. Of note is the improved sensitivity of the model with an increased number of average false-positive lesions per B-scan, the importance of which depends on the settings in which the model is used. For example, we intend to use it to identify participants in the UKBB with RPD for further research, including genetic analysis, where quantification is key. To that end, a small number of false-positive results (i.e. high specificity) is required. Other settings might emphasize sensitivity over specificity, for example, when the identification of patients with RPD is required for screening purposes. In such cases, a higher number of false-positive results might be allowed (especially if human validation is involved), enabling higher sensitivities for the model (almost 100% for drusen and 80% for RPD).
Another point to consider is the fact that the segmentation model was tested on a B-scan level. The FROC curves present an average number of false-positive results per scan. As with any average, some B-scans will fare better than others. When the model is deployed to whole volumes (eyes) to quantify RPD and drusen on a volumetric level, such inaccuracies may be less prominent. 
The Dice scores (presented in Supplementary Table S1) reflect slightly better intergrader performance than model performance for all features (differences ranging from 0.06 for stage 1 RPD to 0.16 for all-stage RPD). Of note, both human and model performance as reflected in the Dice score were poor. As was shown in a recent publication by an international consortium of medical image analysis experts, the Dice score is not an appropriate metric for small structures in images, because a single-pixel difference between two predictions can have a large impact on the metric.48 Given the small size of the lesions graded in our work, we utilized FROC as the primary metric to assess the performance of the segmentation model.
Thus far, only two studies have described the use of ML solutions for the automatic detection of RPD on OCT. In the first, by Saha et al.,25 the authors trained several DL models to detect RPD, intraretinal hyperreflective foci, and hyporeflective foci within drusen. Although the models' performance was good for the detection of RPD (sensitivity = 79–96%, specificity = 65–92%, AUC = 0.91–0.94, and accuracy = 80–86%), all models were classifiers. Therefore, the output was binary, and quantification of the lesion area was not possible.
In the second study, by Mishra et al.,26 the authors chose a different approach, whereby retinal layers associated with drusen and RPD were automatically segmented in SD-OCT images along with other retinal layers. The methodology combined a graph-based approach with the Deep Learning–Shortest Path (DL-SP) algorithm on 2D OCT B-scan images. In that regard, drusen and RPD were treated as types of layers – the former where the RPE layer is undulating, and the latter with undulation of the EZ. This technique presents several problems. First, undulations of these layers are not specific to drusen and RPD and may result from other pathologies. The extent to which this model can handle other pathologies and differentiate them from the aforementioned lesions is unclear, since the model was trained on only 16 eyes with AMD, and it is not clear whether pathologies other than drusen (such as choroidal neovascularization) were included. In addition, although quantification of the lesions may be possible by calculating the inter-layer area for each of the lesions, this was not done in the study, and it is unclear how accurate such a methodology would be.
Our group previously published a DL model for the segmentation of 13 features associated with neovascular and atrophic AMD. It achieved a lower ICC for drusen than the current study (0.381 ± 0.055, compared with 0.74 [95% CI = 0.65 to 0.82]) and a lower sensitivity of under 40% at one average false-positive RPD per B-scan (compared with 52% for the current model).17 That is despite the former model being trained on a newer, higher resolution device (Topcon 3D OCT-2000; Topcon, Tokyo, Japan). This difference can be explained by the higher number of B-scans for both drusen and RPD used in the development of the current model, different DL architectures, and/or the quality of ground-truth annotations.
Previously published models are not quantitative, have limited accuracy, or cannot perform on the more challenging images of early generation SD-OCT devices. All three issues need to be addressed to accelerate our understanding of how RPD influence the pathogenesis of non-neovascular AMD, for which no licensed treatment exists. We are currently using transfer learning techniques to develop models for newer generation devices. Development of such models is less challenging given the higher resolution and easier delineation of small lesions, for both humans and algorithms. These models can then be utilized in large clinical trials where different, newer devices may be used for data acquisition.
In addition, this framework includes the only model that can accurately differentiate between drusen and RPD. Detection of both lesion types is needed to achieve an understanding of risk factors for RPD and drusen load by separating patients with RPD, RPD and drusen, drusen, and normal controls in future studies. 
Our framework can also be used in future treatment trials. For example, in the Laser Intervention in Early Stages of Age-Related Macular Degeneration (LEAD) study, which aimed to evaluate the safety and efficacy of subthreshold nanosecond laser in intermediate AMD, it was found that such treatment may be inappropriate in patients with RPD compared with those without.16 This suggests that treatments for those with RPD might need to differ from those for patients with drusen, and as such it will be important to be able to accurately and quickly identify patients with or without RPD.
Our study has several limitations. (a) The framework was trained on data from the UKBB. In the UKBB cohort, 94.6% of participants were of White ethnicity. This is similar to the national population of the same age range in the 2001 UK Census (94.5%) but slightly higher than in the 2011 Census (91.3%).49 While these figures point to generalizability of the model to the UK population, it might not generalize to other populations with different ethnic and sociodemographic compositions. (b) The performance of the different models was not evaluated on an external dataset and might not generalize to datasets other than the UKBB. (c) We used different reference standards for the training and the test sets. (d) We evaluated each model within the suggested framework as a separate step and not as part of a continuous pipeline. (e) We did not use end-to-end training for this project, meaning that each model was optimized independently. (f) The framework steps were trained on data excluding questionable lesions. The ill definition of this category, as well as the associated large intra- and intergrader variability, prevents the definition of a reliable reference standard for training. This, however, might result in unpredictable behavior during inference if questionable lesions are present. Contingency strategies, such as uncertainty estimation or runtime failure detection, might help mitigate a performance drop. (g) We selected an experienced grader to serve as ground truth. However, as the results show, ground truth is difficult to determine, as human agreement on this problem is low. (h) The framework can only be used on OCT scans. Although OCT has been shown to have high sensitivity and specificity for the detection of RPD, a multimodal imaging approach to their diagnosis may be preferred in difficult cases or in cases where the lesions extend beyond the scanning area.
In conclusion, we present the first DL framework encompassing image quality assessment, differentiation of drusen and RPD from controls, and individual segmentation of these two phenotypes, with near-human performance. The application of this model in research settings and possibly in clinical settings will help further our understanding of RPD as a separate entity from drusen. 
Acknowledgments
Sponsored by the EURETINA Retinal Medicine Clinical Research Grant. 
R.S., C.E., and A.T. received a proportion of their financial support from the UK Department of Health through an award made by the National Institute for Health Research to Moorfields Eye Hospital NHS Foundation Trust and UCL Institute of Ophthalmology for a Biomedical Research Centre for Ophthalmology. 
A.Y.L. has received an unrestricted and career development award from RPB, Latham Vision Science Awards, NEI/NIH K23EY029246 and NIA/NIH U19AG066567. 
C.I.S. received funding from the Innovative Medicines Initiative 2 Joint Undertaking under grant agreement No. 116076. This Joint Undertaking receives support from the European Union's Horizon 2020 research and innovation program and EFPIA and Carl Zeiss Meditec AG. J.F. is supported by a UKRI training grant (EP/S021612/1). 
This article reflects the views of the authors; it does not reflect the views of the US Food and Drug Administration.
Disclosure: R. Schwartz, Apellis (E); H. Khalid, None; S. Liakopoulos, Allergan (C, R), Apellis (C, R), Alcon (R), Bayer (C, R), Novartis (F, C, R), Zeiss (R), Heidelberg Engineering (R), Boehringer-Ingelheim (C), Medscape (C); Y. Ouyang, None; C. de Vente, None; C. González-Gonzalo, None; A.Y. Lee, Santen (F), Genentech (R), US Food and Drug Administration (F), Johnson & Johnson (F), Carl Zeiss Meditec (F), Topcon (R), Gyroscope (R), Microsoft (N), Regeneron (F); R. Guymer, Novartis (C), Bayer (C), Apellis (C), Roche (C), Genentech (C); E.Y. Chew, None; C. Egan, Heidelberg Engineering (C), Inozyme (C); Z. Wu, None; H. Kumar, None; J. Farrington, None; P.L. Müller, None; C.I. Sánchez, None; A. Tufail, Annexon (C), Allergan (C), Apellis (C), Bayer (C), Genentech (C), Heidelberg Engineering (C), Iveric Bio (C), Novartis (C), Oxurion (C), Roche (C)
References
Coleman HR, Chan CC, Ferris FL, Chew EY. Age-related macular degeneration. Lancet. 2008; 372(9652): 1835–1845.
Zweifel SA, Spaide RF, Curcio CA, Malek G, Imamura Y. Reticular Pseudodrusen Are Subretinal Drusenoid Deposits. Ophthalmology. 2010; 117(2): 303–312.e1. [CrossRef] [PubMed]
Sarks J, Arnold J, Ho IV, Sarks S, Killingsworth M. Evolution of reticular pseudodrusen. Br J Ophthalmol. 2011; 95(7): 979–985. [CrossRef] [PubMed]
Finger RP, Wu Z, Luu CD, et al. Reticular pseudodrusen: a risk factor for geographic atrophy in fellow eyes of individuals with unilateral choroidal neovascularization. Ophthalmology. 2014; 121(6): 1252–1256. [CrossRef] [PubMed]
Hogg RE, Silva R, Staurenghi G, et al. Clinical Characteristics of Reticular Pseudodrusen in the Fellow Eye of Patients with Unilateral Neovascular Age-Related Macular Degeneration. Ophthalmology. 2014; 121(9): 1748–1755. [CrossRef] [PubMed]
Pumariega NM, Smith RT, Sohrab MA, Letien V, Souied EH. A prospective study of reticular macular disease. Ophthalmology. 2011; 118(8): 1619–1625. [CrossRef] [PubMed]
Kim KL, Joo K, Park SJ, Park KH, Woo SJ. Progression from intermediate to neovascular age-related macular degeneration according to drusen subtypes: Bundang AMD cohort study report 3. Acta Ophthalmologica. 2022; 100(3): e710–e718. [PubMed]
Bui PTA, Reiter GS, Fabianska M, et al. Fundus autofluorescence and optical coherence tomography biomarkers associated with the progression of geographic atrophy secondary to age-related macular degeneration. Eye. 2022; 36(10): 2013–2019. [CrossRef] [PubMed]
Kovach JL, Schwartz SG, Agarwal A, et al. The Relationship Between Reticular Pseudodrusen and Severity of AMD. Ophthalmology. 2016; 123(4): 921–923, doi:10.1016/j.ophtha.2015.10.036. [CrossRef] [PubMed]
Cleland SC, Domalpally A, Liu Z, et al. Reticular Pseudodrusen Characteristics and Associations in the Carotenoids in Age-Related Eye Disease Study 2 (CAREDS2). Ophthalmology Retina. 2021; 5(8): 721–729. [CrossRef] [PubMed]
Domalpally A, Agrón E, Pak JW, et al. Prevalence, Risk, and Genetic Association of Reticular Pseudodrusen in Age-related Macular Degeneration. Ophthalmology. 2019; 126(12): 1659–1666. [CrossRef] [PubMed]
Wu Z, Fletcher EL, Kumar H, Greferath U, Guymer RH. Reticular pseudodrusen: A critical phenotype in age-related macular degeneration. Prog Retin Eye Res. 2022; 88: 101017. [CrossRef] [PubMed]
Zweifel SA, Imamura Y, Spaide TC, Fujiwara T, Spaide RF. Prevalence and significance of subretinal drusenoid deposits (reticular pseudodrusen) in age-related macular degeneration. Ophthalmology. 2010; 117(9): 1775–1781. [CrossRef] [PubMed]
Ueda-Arakawa N, Ooto S, Tsujikawa A, Yamashiro K, Oishi A, Yoshimura N. Sensitivity and specificity of detecting reticular pseudodrusen in multimodal imaging in Japanese patients. Retina (Philadelphia, Pa). 2013; 33(3): 490–497. [CrossRef] [PubMed]
Müller PL, Liefers B, Treis T, et al. Reliability of Retinal Pathology Quantification in Age-Related Macular Degeneration: Implications for Clinical Trials and Machine Learning Applications. Trans Vis Sci Tech. 2021; 10(3): 4. [CrossRef]
Guymer RH, Wu Z, Hodgson LAB, et al. Subthreshold nanosecond laser intervention in age-related macular degeneration the LEAD randomized controlled clinical trial. Ophthalmology. 2019; 126(6): 829–838. [CrossRef] [PubMed]
Liefers B, Taylor P, Alsaedi A, et al. Quantification of key retinal features in early and late age-related macular degeneration using deep learning. Am J Ophthalmol. 2021; 226: 1–12. [CrossRef] [PubMed]
Lee CS, Tyring AJ, Deruyter NP, Wu Y, Rokem A, Lee AY. Deep-learning based, automated segmentation of macular edema in optical coherence tomography. Biomed Opt Express. 2017; 8(7): 3440–3448. [CrossRef] [PubMed]
Schmidt-Erfurth U, Reiter GS, Riedl S, et al. AI-based monitoring of retinal fluid in disease activity and under therapy. Prog Retin Eye Res. 2022; 86: 100972. [CrossRef] [PubMed]
Venhuizen FG, van Ginneken B, Bloemen B, et al. Automated age-related macular degeneration classification in OCT using unsupervised feature learning. Proceedings of the SPIE, Volume 9414, ID 94141. 2015. Available at: https://ui.adsabs.harvard.edu/abs/2015SPIE.9414E.1IV/abstract.
Srinivasan PP, Kim LA, Mettu PS, et al. Fully automated detection of diabetic macular edema and dry age-related macular degeneration from optical coherence tomography images. Biomedical Optics Express. 2014; 5(10): 3568. [CrossRef] [PubMed]
Liu YY, Ishikawa H, Chen M, et al. Computerized Macular Pathology Diagnosis in Spectral Domain Optical Coherence Tomography Scans Based on Multiscale Texture and Shape Features. Invest Ophthalmol Vis Sci. 2011; 52(11): 8316–8322. [CrossRef] [PubMed]
Sun Y, Li S, Sun Z. Fully automated macular pathology detection in retina optical coherence tomography images using sparse coding and dictionary learning. J Biomed Optics. 2017; 22(1): 016012–016012. [CrossRef]
Huang L, He X, Fang L, Rabbani H, Chen X. Automatic Classification of Retinal Optical Coherence Tomography Images With Layer Guided Convolutional Neural Network. IEEE Signal Processing Letters. 2019; 26(7): 1026–1030. [CrossRef]
Saha S, Nassisi M, Wang M, et al. Automated detection and classification of early AMD biomarkers using deep learning. Sci Rep. 2019; 9(1): 10990. [CrossRef] [PubMed]
Mishra Z, Ganegoda A, Selicha J, Wang Z, Sadda SR, Hu Z. Automated Retinal Layer Segmentation Using Graph-based Algorithm Incorporating Deep-learning-derived Information. Sci Rep. 2020; 10(1): 9541. [CrossRef] [PubMed]
Querques G, Canouï-Poitrine F, Coscas F, et al. Analysis of Progression of Reticular Pseudodrusen by Spectral Domain–Optical Coherence Tomography. Invest Opthalmol Vis Sci. 2012; 53(3): 1264. [CrossRef]
González-Gonzalo C, Thee EF, Liefers B, Klaver CCW, Sánchez CI. Deep learning for automated stratification of ophthalmic images: Application to age-related macular degeneration and color fundus images. Presented at: EURETINA 2021; September 10, 2021. Available at: https://euretina.org/resource/abstract_2021_deep-learning-for-automated-stratification-of-ophthalmic-images-application-to-age-related-macular-degeneration-and-color-fundus-images/.
Tkachenko M, Malyuk M, Shevchenko N, Holmanyuk A, Liubimov N. Label Studio: Data labeling software. Available at: https://github.com/heartexlabs/label-studio.
Szegedy C, Liu W, Jia Y, et al. Going Deeper with Convolutions. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Published online 2015: 1–9.
Ioffe S, Szegedy C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv:1502.03167 [cs]. Published online March 2, 2015. Accessed September 16, 2020. Available at: http://arxiv.org/abs/1502.03167.
Xu B, Wang N, Chen T, Li M. Empirical Evaluation of Rectified Activations in Convolutional Network. arXiv:1505.00853 [cs.LG]. Published online 2015. Available at: https://arxiv.org/abs/1505.00853.
Kingma DP, Ba J. Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs]. Published online January 29, 2017. Accessed September 13, 2020. Available at: http://arxiv.org/abs/1412.6980.
Lakshminarayanan B, Pritzel A, Blundell C. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. arXiv:1612.01474 [cs, stat]. Published online November 3, 2017. Accessed April 18, 2021. Available at: http://arxiv.org/abs/1612.01474.
Ronneberger O, Fischer P, Brox T. U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv:1505.04597 [cs]. Published online May 18, 2015. Accessed September 10, 2020. Available at: http://arxiv.org/abs/1505.04597.
Isensee F, Jaeger PF, Kohl SAA, Petersen J, Maier-Hein KH. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nat Methods. 2021; 18(2): 203–211. [CrossRef] [PubMed]
Fawcett T. An introduction to ROC analysis. Pattern Recognit Lett. 2006; 27(8): 861–874. Accessed April 4, 2022. Available at: https://dl.acm.org/doi/10.1016/j.patrec.2005.10.010.
Cohen J. Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychol Bull. 1968; 70(4): 213–220. [CrossRef] [PubMed]
Dice LR. Measures of the amount of ecologic association between species. Ecology. 1945; 26(3): 297–302. [CrossRef]
Vallat R. Pingouin: statistics in Python. J Open Source Softw. 2018; 3(31): 1026. [CrossRef]
Koo TK, Li MY. A guideline of selecting and reporting intraclass correlation coefficients for reliability research. J Chiropr Med. 2016; 15(2): 155–163. [CrossRef]
Bland JM, Altman DG. Measurement error. BMJ. 1996; 313(7059): 744. [CrossRef] [PubMed]
González-Gonzalo C, Thee EF, Klaver CCW, et al. Trustworthy AI: Closing the gap between development and integration of AI systems in ophthalmic practice. Prog Retin Eye Res. 2022; 90: 101034. [CrossRef] [PubMed]
Beede E, Baylor E, Hersch F, et al. A human-centered evaluation of a deep learning system deployed in clinics for the detection of diabetic retinopathy. Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems. 2020: 1–12. Available at: https://doi.org/10.1145/3313831.3376718.
Kauer J, Gawlik K, Beckers I. Automatic quality analysis of retinal optical coherence tomography. Presented at: ECTRIMS; November 1, 2018.
Kauer J, Gawlik K, Zimmermann HG, et al. Automatic quality evaluation as assessment standard for optical coherence tomography. Advanced Biomedical and Clinical Diagnostic and Surgical Guidance Systems XVII. Proc SPIE. 2019; 10868: 1086814. Available at: https://www.researchgate.net/publication/331363523_Automatic_quality_evaluation_as_assessment_standard_for_optical_coherence_tomography.
Sunija AP, Kar S, Gayathri S, Gopi VP, Palanisamy P. OctNET: a lightweight CNN for retinal disease classification from optical coherence tomography images. Comput Methods Programs Biomed. 2021; 200: 105877.
Reinke A, Eisenmann M, Tizabi MD, et al. Common Limitations of Image Processing Metrics: A Picture Story. arXiv:2104.05642. Published online 2021. Available at: https://arxiv.org/abs/2104.05642.
Fry A, Littlejohns TJ, Sudlow C, et al. Comparison of sociodemographic and health-related characteristics of UK Biobank participants with those of the general population. Am J Epidemiol. 2017; 186(9): 1026–1034. [CrossRef] [PubMed]
Figure 1.
 
Deep learning framework for the detection and quantification of conventional drusen and reticular pseudodrusen. A classification algorithm classifies OCT volumes, at the volume (i.e., eye) level, as gradable or ungradable, and ungradable volumes are removed (Ungradable Classification Model). A deep ensemble model for out-of-distribution detection identifies volumes with out-of-distribution scans, which are then removed (Outlier Detection Model). Another classification model, the Drusen/RPD classifier, classifies the remaining volumes into those with either drusen or RPD versus controls, and controls are removed (Drusen/RPD Classification Model). Finally, an image segmentation algorithm segments RPD and drusen separately at the B-scan level (Drusen/RPD Segmentation Model). RPD, reticular pseudodrusen.
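To make the flow of Figure 1 concrete, the sketch below chains the four stages in order. It is a minimal sketch only: every object, method name, and threshold is a hypothetical placeholder, not the authors' published implementation.

```python
# Minimal sketch of the four-stage cascade described in Figure 1.
# All model objects and their interfaces are hypothetical placeholders.

def run_framework(volume, gradability_model, outlier_model,
                  drusen_rpd_classifier, segmentation_model,
                  ood_threshold=0.5):
    """Return per-B-scan drusen/RPD masks, or None if the volume is rejected."""
    # Stage 1: discard ungradable volumes (Ungradable Classification Model).
    if gradability_model.predict(volume) == "ungradable":
        return None
    # Stage 2: discard out-of-distribution volumes (Outlier Detection Model,
    # a deep ensemble whose predictive uncertainty flags outliers).
    if outlier_model.uncertainty(volume) > ood_threshold:
        return None
    # Stage 3: keep only volumes with drusen or RPD; controls are removed.
    if drusen_rpd_classifier.predict(volume) == "control":
        return None
    # Stage 4: segment drusen and RPD separately on each B-scan.
    return [segmentation_model.segment(b_scan) for b_scan in volume]
```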
Figure 2.
 
Eye and B-scan selection flowchart for model training, validation and testing. (A) Selection of eyes to train, validate, and test the classifiers. (B) Selection of B-scans to train, validate, and test the semantic segmentation model. AMD, age-related macular degeneration; OCT, optical coherence tomography; RPD, reticular pseudodrusen.
Figure 3.
 
Receiver operating characteristic (ROC) curves for the three classification models. From left to right, the curves apply to the RPD/drusen vs. controls model, the Ungradable Classification Model, and the Outlier Detection Model. The orange line represents the models’ sensitivity at different thresholds with the shaded area representing the 95% confidence interval.
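A shaded 95% confidence band of the kind shown in Figure 3 can be produced by bootstrapping the test cases. The following is a minimal sketch, assuming case-level resampling on a fixed false-positive-rate grid; the paper's exact resampling scheme is not restated here.

```python
# Sketch of a bootstrapped ROC curve with a 95% confidence band.
import numpy as np
from sklearn.metrics import roc_curve

def bootstrap_roc(y_true, y_score, n_boot=1000, seed=0):
    """Return an FPR grid and the 2.5/97.5 percentile TPR envelope."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(0.0, 1.0, 101)        # common FPR grid
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    tprs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        if len(np.unique(y_true[idx])) < 2:  # resample must contain both classes
            continue
        fpr, tpr, _ = roc_curve(y_true[idx], y_score[idx])
        tprs.append(np.interp(grid, fpr, tpr))
    lo, hi = np.percentile(np.vstack(tprs), [2.5, 97.5], axis=0)
    return grid, lo, hi
```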
Figure 4.
 
Free response receiver operating characteristic (FROC) curves for drusen, stage 1 reticular pseudodrusen (RPD), stages 2, 3, and 4 RPD, and all RPD stages combined, comparing the model to the ground truth graders. The line represents model sensitivity at different thresholds, with the shaded area representing the 95% confidence interval, obtained by bootstrapping. The dots represent the two other graders, one represented in blue and the other in yellow, with error bars representing 95% confidence intervals.
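Unlike an ROC curve, an FROC curve plots lesion-level sensitivity against the mean number of false-positive detections per scan as the operating threshold varies, which is why individual graders appear as single points rather than curves. A minimal sketch, assuming each candidate detection has already been matched to the ground truth:

```python
# Sketch of FROC operating points: lesion-level sensitivity vs. mean
# false positives per B-scan. The detection-to-lesion matching (one
# detection per lesion at most) is a simplifying assumption.
import numpy as np

def froc_points(detections, n_lesions_total, n_scans, thresholds):
    """detections: list of (score, is_true_positive) pooled over all scans."""
    scores = np.array([s for s, _ in detections])
    is_tp = np.array([tp for _, tp in detections])
    sens, fps_per_scan = [], []
    for t in thresholds:
        keep = scores >= t
        sens.append(is_tp[keep].sum() / n_lesions_total)
        fps_per_scan.append((~is_tp[keep]).sum() / n_scans)
    return np.array(fps_per_scan), np.array(sens)
```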
Figure 5.
 
Comparison of model and grader output, reticular pseudodrusen (RPD) stage 2, 3, or 4. The green color represents stage 2, 3, or 4 RPD. Grader 1 served as the reference standard against which model performance and the other graders were compared, as presented in the FROC curves.
Figure 6.
 
Comparison of model and grader output, reticular pseudodrusen (RPD) stage 1 and RPD stages 2, 3, or 4. Stage 1 RPD is represented in white, and stages 2, 3, and 4 in green. Grader 1 served as the reference standard against which model performance and the other graders were compared, as presented in the FROC curves.
Figure 7.
 
Comparison of model and grader output, conventional drusen, reticular pseudodrusen (RPD) stage 1, and RPD stages 2, 3, or 4. Drusen are represented in blue, stage 1 RPD in white, and stages 2, 3, and 4 in green. Grader 1 served as the reference standard against which model performance and the other graders were compared, as presented in the FROC curves.
Table 1.
 
Metrics Obtained by the Different Classification Models on the Test Set
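Classification metrics of this kind, such as area under the ROC curve and Cohen's weighted kappa (both cited in the reference list), can be computed as in the sketch below. The metric selection and the numbers are illustrative assumptions, not a restatement of the table's contents.

```python
# Illustrative computation of common classification metrics; the
# labels, scores, and 0.5 operating point are hypothetical.
from sklearn.metrics import roc_auc_score, cohen_kappa_score, confusion_matrix

y_true = [0, 0, 1, 1, 1, 0, 1, 0]                    # hypothetical labels
y_score = [0.1, 0.4, 0.8, 0.7, 0.9, 0.3, 0.6, 0.2]   # hypothetical scores
y_pred = [int(s >= 0.5) for s in y_score]

auc = roc_auc_score(y_true, y_score)
kappa = cohen_kappa_score(y_true, y_pred, weights="quadratic")
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
print(auc, kappa, sensitivity, specificity)
```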
Table 2.
 
ICC Scores on the Test Set for the Model and Graders
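ICC scores of this kind can be computed with the Pingouin package cited above (Vallat, 2018). A minimal sketch of that call follows; the long-format layout, column names, and values are illustrative assumptions.

```python
# Sketch of an ICC computation with pingouin; data are hypothetical.
import pandas as pd
import pingouin as pg

# One row per (eye, rater) measurement; 'area' is a quantified lesion area.
df = pd.DataFrame({
    "eye":   ["e1"] * 3 + ["e2"] * 3 + ["e3"] * 3,
    "rater": ["model", "grader1", "grader2"] * 3,
    "area":  [0.82, 0.80, 0.78, 1.51, 1.49, 1.55, 0.33, 0.35, 0.30],
})

icc = pg.intraclass_corr(data=df, targets="eye", raters="rater",
                         ratings="area")
print(icc[["Type", "ICC", "CI95%"]])  # ICC1..ICC3k with 95% CIs
```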
Table 3.
 
Bland-Altman Repeatability Coefficient Scores (in µm²) on the Test Set for the Model and Graders
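For reference, Bland and Altman (BMJ, 1996) define the repeatability coefficient as RC = 1.96 · √2 · s_w (approximately 2.77 · s_w), where s_w is the within-subject standard deviation; for duplicate measurements this reduces to 1.96 · √(mean d²) over the paired differences d. A minimal sketch, assuming two repeat measurements per eye:

```python
# Sketch of the Bland-Altman repeatability coefficient.
# For duplicates, s_w^2 = sum(d^2) / (2n), so
# RC = 1.96 * sqrt(2) * s_w = 1.96 * sqrt(mean(d^2)).
import numpy as np

def repeatability_coefficient(x1, x2):
    """x1, x2: repeat measurements (e.g., areas in µm²) on the same eyes."""
    d = np.asarray(x1) - np.asarray(x2)
    return 1.96 * np.sqrt(np.mean(d ** 2))
```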