Translational Vision Science & Technology
June 2025, Volume 14, Issue 6
Open Access | Artificial Intelligence
Optic Cup and Disc Segmentation of Fundus Images Using Artificial Intelligence Externally Validated With Optical Coherence Tomography Measurements
Author Affiliations & Notes
  • Scott Kinder
    Department of Ophthalmology, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
  • Steve McNamara
    Department of Ophthalmology, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
  • Christopher Clark
    Department of Ophthalmology, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
  • Benjamin Bearce
    Department of Ophthalmology, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
    Athinoula A. Martinos Center for Biomedical Imaging, Charlestown, MA, USA
  • Upasana Thakuria
    Department of Ophthalmology, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
  • Yoga Advaith Veturi
    Department of Ophthalmology, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
  • Galia Deitz
    Department of Ophthalmology, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
  • Talisa E. de Carlo Forest
    Department of Ophthalmology, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
  • Naresh Mandava
    Department of Ophthalmology, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
  • Malik Y. Kahook
    Department of Ophthalmology, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
  • Praveer Singh
    Department of Ophthalmology, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
  • Jayashree Kalpathy-Cramer
    Department of Ophthalmology, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
  • Correspondence: Jayashree Kalpathy-Cramer, Department of Ophthalmology, University of Colorado Anschutz Medical Campus, 1675 Aurora Court, Mail Stop F731, Aurora, CO 80045, USA. e-mail: [email protected] 
Translational Vision Science & Technology June 2025, Vol. 14, 30. doi: https://doi.org/10.1167/tvst.14.6.30
Abstract

Purpose: To develop an artificial intelligence (AI) optic cup and disc segmentation pipeline for obtaining optic nerve head (ONH) measurements such as vertical cup-to-disc ratio (VCDR) from fundus images and externally validate performance against optical coherence tomography (OCT) measurements.

Methods: This diagnostic study used a retrospectively collected dataset of 27,252 fundus images associated with 12,477 OCT reports and 21,714 expert assessments of VCDR from electronic health records (EHRs) for 4289 patients, inclusive of glaucoma suspects and primary and secondary glaucoma. The AI pipeline was trained on nine public glaucoma datasets and externally validated on a private hospital dataset and a publicly available dataset.

Results: AI VCDR predictions against OCT yielded mean absolute error (MAE), Pearson’s R, and concordance correlation coefficient (CCC) values of 0.097 (95% confidence interval [CI], 0.095–0.099), 0.80 (95% CI, 0.79–0.81), and 0.66 (95% CI, 0.64–0.67), respectively. EHR VCDRs against OCT had MAE, Pearson’s R, and CCC values of 0.086 (95% CI, 0.084–0.087), 0.77 (95% CI, 0.76–0.78), and 0.74 (95% CI, 0.73–0.75), respectively. The coefficient of variation (CV) of the AI pipeline on same-day images was 2.79%.

Conclusions: The proposed AI pipeline had strong correlation with OCT measurements and performed comparably to EHR assessments, with high repeatability. Increased diversity and cardinality of training data improved performance and generalizability to unseen datasets.

Translational Relevance: AI pipelines for fundus images can provide ONH measurements such as VCDR at near-expert level in new patient populations without the need for additional model training.

Introduction
Glaucoma is a leading cause of irreversible blindness worldwide and is expected to affect 111.8 million people by 2040.1,2 These numbers highlight the insidious nature of the disease, the need for early detection and monitoring, and the importance of timely intervention to prevent permanent visual impairment.1,3 Thus, diagnosis and precise monitoring of glaucomatous change is a vital task. Clinically, assessment of the optic disc and cup and visualization of structural changes therein are central to the management of glaucoma.4 Furthermore, the evaluation of optic nerve head (ONH) morphology includes assessment of the proportions of optic disc to optic cup and optic disc to neuroretinal rim (NRR). Cup-to-disc ratio (CDR) and, more specifically, vertical cup-to-disc ratio (VCDR) are common clinical measures for assessing ONH morphology4 and are indicators for the presence of glaucoma,5–12 with CDRs of greater than 0.5 to 0.6 associated with glaucomatous optic neuropathy.13,14 Methods for tracking the rim-to-disc ratio (RDR) include the minimum rim-to-disc ratio (MRDR), the inferior > superior > nasal > temporal NRR thickness rule (ISNT rule), and, more recently, the disc damage likelihood scale (DDLS), which takes localized NRR loss into account.15 Studies on the DDLS have shown better assessment of visual field loss,15–17 but its lack of widespread adoption as a clinical measure remains a barrier to usage.
Regardless of methodology, ONH assessments of VCDR, RDR, and other similar measurements can be implemented using fundus photograph evaluations for glaucoma management. This is highly relevant in low-resource and remote settings, owing to the relative cost-effectiveness of cameras, the mobility of some fundus imaging devices, and the reduced availability or accessibility of advanced clinical expertise.18–20 Because manual assessment of fundus photographs can be laborious, time consuming, and subjective,18,21 automated and machine learning methods have been developed to determine both CDR22–26 and RDR16,27–29 to better classify glaucoma. However, deep learning (DL) methods often fail to achieve satisfactory performance across different patient populations on which they have not been specifically trained.30 Many studies show effective segmentation techniques on specific datasets but typically use data from a limited number of patient ethnicities or disease severities16,21–29 and perform significantly worse when applied to datasets the algorithms have not previously encountered.30–32
Through our artificial intelligence (AI) pipeline trained on numerous public datasets with a novel cropping approach, we have demonstrated the ability to acquire ONH measurements with fundus images on a large, retrospective dataset completely held out from training. This pipeline represents a step forward in optic disc and cup segmentation that can be immediately applied to new fundus datasets, helping clinicians better manage glaucoma and monitor progression. 
Methods
Datasets
We used three sources of data in this analysis: a collection of nine publicly available glaucoma datasets used to both train and test the DL models in the pipeline, a retrospectively collected private hospital dataset used exclusively as a test set and held out entirely from pipeline training, and the PAPILA33 dataset as public external validation. The nine public datasets are referred to here as "public data," the retrospectively collected private hospital dataset is referred to as "private data," and the PAPILA dataset is referred to as simply PAPILA.
The public data consist of fundus images with labeled segmentations and contain both glaucomatous (964) and non-glaucomatous (3837) eyes. The datasets used were Chákṣu, Drishti-GS, G1020, ORIGA, REFUGE, RIM-ONE DL, and the RIGA collection, which comprises Bin Rushed, Magrabi, and Messidor34–40 (Supplementary Table S1, Supplementary Fig. S1). The datasets vary greatly in terms of their geography, demographics, camera imaging specifications, and expert labeling. For datasets that contain multiple expert-annotated segmentations for a single image (specifically, the Chákṣu, Drishti-GS, and RIGA datasets), a simultaneous truth and performance level estimation (STAPLE)41 average segmentation label was used. Label VCDRs were provided by several datasets and were otherwise calculated from the ground-truth segmentation label.
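For readers unfamiliar with STAPLE, the sketch below shows a minimal expectation–maximization implementation for fusing binary masks from multiple raters, following Warfield et al.41; the function name and defaults are illustrative, not the exact implementation used here.

```python
import numpy as np

def staple(masks, max_iters=50, tol=1e-6):
    """Minimal EM sketch of STAPLE (Warfield et al., ref. 41) for binary masks.
    masks: array of shape (raters, H, W) with values in {0, 1}.
    Returns a per-pixel consensus probability map of shape (H, W)."""
    masks = np.asarray(masks, dtype=float)
    D = masks.reshape(masks.shape[0], -1)        # raters x pixels
    p = np.full(D.shape[0], 0.9)                 # per-rater sensitivity
    q = np.full(D.shape[0], 0.9)                 # per-rater specificity
    prior = D.mean()                             # global foreground prior
    W = np.full(D.shape[1], prior)               # consensus probability
    for _ in range(max_iters):
        # E-step: posterior that each pixel is truly foreground
        a = prior * np.prod(np.where(D == 1, p[:, None], 1 - p[:, None]), axis=0)
        b = (1 - prior) * np.prod(np.where(D == 1, 1 - q[:, None], q[:, None]), axis=0)
        W_new = a / np.clip(a + b, 1e-12, None)
        # M-step: re-estimate each rater's sensitivity and specificity
        p = (D @ W_new) / np.clip(W_new.sum(), 1e-12, None)
        q = ((1 - D) @ (1 - W_new)) / np.clip((1 - W_new).sum(), 1e-12, None)
        converged = np.max(np.abs(W_new - W)) < tol
        W = W_new
        if converged:
            break
    return W.reshape(masks.shape[1:])

# Consensus label: threshold the probability map at 0.5, e.g.
# label = (staple(np.stack([m1, m2, m3])) > 0.5).astype(np.uint8)
```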
The private data were retrospectively collected from the UCHealth Sue Anschutz-Rodgers Eye Center and used in accordance with institutional review board guidelines. The dataset consisted of 27,252 fundus images from 4289 patients whose assessment and plan contained the keyword "glaucoma," inclusive of glaucoma suspects, primary glaucoma, and secondary glaucoma. Images and medical records were obtained from 2012 to 2022 and included longitudinal data for 5323 eyes, for which imaging and clinical data were available across multiple time points (Supplementary Table S2). Demographically, the dataset contained multiple ethnicities and spanned a very wide age range, with a roughly even gender split (Supplementary Table S3). Two different fundus camera models were used, both providing full fundus and stereoscopic views (Supplementary Fig. S2). VCDRs from retinal nerve fiber layer optical coherence tomography (OCT) reports from a CIRRUS HD-OCT system (ZEISS, Oberkochen, Germany) were used as the gold-standard comparison when available within 6 months of fundus imaging, as the OCT VCDR provides a more analogous and objective comparison with the segmentation model. In instances where multiple OCT reports were available before and after imaging, VCDRs were interpolated relative to the number of days from imaging and were averaged when multiple reports fell on the same day (Supplementary Material). After this process, a total of 12,477 OCT VCDRs were associated with fundus images.
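The temporal matching is detailed in the Supplementary Material; the sketch below illustrates one plausible reading of it, linearly weighting the nearest OCT VCDRs before and after the fundus image by their distance in days. The function name and the 183-day (6-month) window handling are assumptions.

```python
from datetime import date

def match_oct_vcdr(fundus_date: date, oct_reports: list[tuple[date, float]],
                   window_days: int = 183) -> float | None:
    """oct_reports: (date, VCDR) pairs; same-day reports are averaged first."""
    by_day: dict[date, list[float]] = {}
    for d, v in oct_reports:
        by_day.setdefault(d, []).append(v)
    pts = sorted((d, sum(vs) / len(vs)) for d, vs in by_day.items())
    # keep only reports within the allowed window of the fundus image
    pts = [(d, v) for d, v in pts if abs((d - fundus_date).days) <= window_days]
    if not pts:
        return None
    before = [(d, v) for d, v in pts if d <= fundus_date]
    after = [(d, v) for d, v in pts if d >= fundus_date]
    if before and after:
        (d0, v0), (d1, v1) = before[-1], after[0]
        if d0 == d1:                      # a report on the imaging day itself
            return v0
        t = (fundus_date - d0).days / (d1 - d0).days
        return v0 + t * (v1 - v0)         # linear interpolation by days
    # only one side available: fall back to the nearest report
    return before[-1][1] if before else after[0][1]
```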
The PAPILA dataset contains 488 full fundus images from 244 patients (488 eyes) from a single institution in Spain. Images were captured on a TRC-NW400 non-mydriatic retinal camera (Topcon Healthcare, Tokyo, Japan), which was not among the cameras used for any of the training images. The patient cohort included individuals with chronic glaucoma as well as those without any ocular pathology. Optic cup/disc segmentations by two expert ophthalmologists were available and were STAPLE averaged to create the label for reporting. The PAPILA dataset also provides clinical data, including glaucoma diagnosis, refractive error, and axial length. We inferred myopic and hyperopic status from the spherical equivalent (SE): myopia was defined as low (−3.00 < SE ≤ −0.50), moderate (−6.00 < SE ≤ −3.00), or high (SE ≤ −6.00); hyperopia as low to moderate (0.50 ≤ SE < 3.00) or high (SE ≥ 3.00); and emmetropia as −0.50 < SE < 0.50. This is a relatively common delineation,42 although exact values and definitions vary.
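As a concrete reading of these cutoffs, a small helper (hypothetical; the name is ours, and the boundary handling at exactly ±0.50 and ±3.00 follows the definitions above):

```python
def refractive_category(se: float) -> str:
    """Map spherical equivalent (diopters) to the categories defined above."""
    if -0.50 < se < 0.50:
        return "emmetropia"
    if se <= -6.00:
        return "high myopia"
    if se <= -3.00:
        return "moderate myopia"
    if se <= -0.50:
        return "low myopia"
    if se >= 3.00:
        return "high hyperopia"
    return "low-to-moderate hyperopia"   # 0.50 <= se < 3.00
```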
AI Pipeline Architecture
The AI pipeline consists of two main steps: detecting the ONH and cropping the image with square resizing and padding, then segmenting the cup/disc regions and recovering the segmentation on the original uncropped image. Both steps are supported by DL models, which process the input to produce the segmentation from which the ONH measurements are calculated (Fig. 1).
Figure 1. Architecture for the proposed AI pipeline to handle both full fundus and stereoscopic images. Images are cropped tightly to the ONH before being segmented by the segmentation model and are then recovered onto the original image.
The first step, detection and cropping to the ONH, uses a YOLOv843 object detection model, referred to here as the "detection model." By proposing bounding-box regions with a probability score, the detection model localizes the ONH in the image. To ensure high-quality input for downstream tasks, a minimum detection threshold (probability > 0.9) was employed to filter out poor-quality images. This threshold was chosen because it is commonly used for reporting44 and is often cited as necessary for high-quality detection.45,46 Once the bounding box is obtained, the fundus image is cropped and resized into a square with a fixed amount of padding (25 pixels) around the ONH to provide consistent input. This novel procedure standardizes the input for the next step by providing even padding from the ONH extent to the image border (Supplementary Fig. S3).
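A minimal sketch of this first step, assuming the ultralytics YOLOv8 API; the weights filename and the 512-pixel output size are illustrative (the authors' actual weights and crop geometry are in the GitHub repository cited under Methods):

```python
from ultralytics import YOLO  # YOLOv8
from PIL import Image

PAD = 25          # pixels of padding around the detected ONH (per the text)
DET_THRESH = 0.9  # minimum detection probability (per the text)
SIZE = 512        # square side for the segmentation input (assumed value)

def crop_onh(image_path: str, weights: str = "onh_detector.pt"):
    """Detect the ONH, crop with padding, and resize to a square.
    Returns (square_crop, crop_box) or None if the image is filtered out."""
    model = YOLO(weights)                           # hypothetical weights file
    res = model(image_path, verbose=False)[0]
    if len(res.boxes) == 0 or float(res.boxes.conf.max()) < DET_THRESH:
        return None                                 # low-quality image: discard
    i = int(res.boxes.conf.argmax())
    x1, y1, x2, y2 = map(int, res.boxes.xyxy[i].tolist())
    img = Image.open(image_path)
    # expand the box by PAD pixels on every side, clamped to the image
    box = (max(x1 - PAD, 0), max(y1 - PAD, 0),
           min(x2 + PAD, img.width), min(y2 + PAD, img.height))
    crop = img.crop(box)
    # resize to a square so the segmentation model sees a consistent input;
    # box is kept so the predicted mask can be recovered onto the original image
    return crop.resize((SIZE, SIZE)), box
```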
The second step, segmenting the cup and disc, uses a Mask2Former (Swin Backbone) model,47,48 which is referred to here as the “segmentation model.” Given a cropped color fundus photograph, a segmentation mask for the cup, disc, and background is produced. The cropping procedure is then reversed to recover the segmentation mask back onto the original fundus image to calculate ONH measurements (Supplementary Fig. S3). 
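The second step can be sketched with the Hugging Face transformers implementation of Mask2Former; the public ADE20k checkpoint below is a stand-in for the fine-tuned cup/disc model, whose actual weights are in the repository linked under Methods:

```python
import torch
from transformers import AutoImageProcessor, Mask2FormerForUniversalSegmentation

# Illustrative checkpoint only: after fine-tuning, the classes would be
# background / disc / cup rather than ADE20k semantic classes.
ckpt = "facebook/mask2former-swin-small-ade-semantic"
processor = AutoImageProcessor.from_pretrained(ckpt)
model = Mask2FormerForUniversalSegmentation.from_pretrained(ckpt)

def segment(cropped_image):
    """cropped_image: a PIL image from the cropping step.
    Returns a per-pixel class-id map at the crop's original resolution."""
    inputs = processor(images=cropped_image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return processor.post_process_semantic_segmentation(
        outputs, target_sizes=[cropped_image.size[::-1]])[0]
```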
Evaluation Metrics
We assess both the individual DL models and the end-to-end pipeline. To evaluate the segmentation model, the Sørensen–Dice (Dice) coefficient and Jaccard index were used; both measure the overlap between the predicted and ground-truth regions. The mean value across all images in the dataset was used for reporting (Supplementary Material). The detection model was evaluated with average precision, the area under the precision–recall curve, where a prediction counts as a true positive when its Jaccard index with the ground truth exceeds a given threshold. We report thresholds of 50% and the series from 50% to 95% in steps of 5%, referred to here as AP50 and mAP50:95 (Supplementary Material).
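In code, these overlap metrics reduce to a few lines; a sketch (the paper's exact averaging is described in the Supplementary Material):

```python
import numpy as np

def dice_jaccard(pred: np.ndarray, gt: np.ndarray) -> tuple[float, float]:
    """Overlap between two binary masks; both arrays contain {0, 1}."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    eps = 1e-12                                   # guards empty-mask division
    dice = 2 * inter / (pred.sum() + gt.sum() + eps)
    jaccard = inter / (np.logical_or(pred, gt).sum() + eps)
    return float(dice), float(jaccard)
```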
The combined end-to-end pipeline uses mean absolute error (MAE), Pearson’s R, and concordance correlation coefficient (CCC) to determine performance and ability to produce ONH measurements. Both AI predictions and electronic health record (EHR) assessments were compared to the OCT VCDRs. To measure the intrarater variability of the AI pipeline on same-day images, the coefficient of variation (CV) was used. AI VCDR predictions were obtained by extracting both the cup and disc segmentation masks, finding the number of pixels of the longest vertical line in each mask, and then dividing (Supplementary Fig. S4). 
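The VCDR extraction described above and Lin's concordance correlation coefficient both reduce to a few lines; a sketch, assuming binary numpy masks with rows as image rows:

```python
import numpy as np

def vcdr(cup_mask: np.ndarray, disc_mask: np.ndarray) -> float:
    """VCDR per the text: pixel count of the longest vertical line in the cup
    mask divided by that of the disc mask (column-wise counts, then max)."""
    return float(cup_mask.sum(axis=0).max()) / float(disc_mask.sum(axis=0).max())

def ccc(x: np.ndarray, y: np.ndarray) -> float:
    """Lin's concordance correlation coefficient between paired measurements."""
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    return float(2 * cov / (x.var() + y.var() + (mx - my) ** 2))
```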
Pipeline Training and Validation
The two models that the pipeline utilizes, detection and segmentation, were trained on the same subsets from the public data. Each public dataset was split into randomly sampled 80/10/10 train, validation, and test sets to form the combined public train, validation, and test sets. The detection model was trained with bounding box labels that came from the segmentation labels by setting the bounding box to be the extent of the disc segmentation (Supplementary Fig. S3a). The segmentation model used augmentations for images in the training set, and normalization was applied to all images (Supplementary Material). For both the detection and segmentation models, the optimal model was selected based on performance on the public test set. 
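The per-dataset 80/10/10 split can be reproduced with a few lines (a sketch; the seed and exact sampling are assumptions):

```python
import random

def split_80_10_10(items, seed=42):
    """Randomly split one public dataset; the nine per-dataset splits are then
    concatenated into the combined train/validation/test sets."""
    rng = random.Random(seed)
    items = list(items)
    rng.shuffle(items)
    n_train = int(0.8 * len(items))
    n_val = int(0.1 * len(items))
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])
```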
To validate the performance of the trained AI pipeline on ONH measurements, it was applied to all fundus images in the private data. Images that were confidently classified as containing an ONH by the detection model (detection threshold > 0.9) were further processed, and the rest were discarded. Evaluations for fixed time points were reported at the per-image level (Supplementary Material). All code, model weights, and technical implementation details are publicly available on GitHub (https://github.com/QTIM-Lab/fundus_detection_segmentation_pipeline). 
Results
Across all public data test sets, the MAE, Pearson's R, and CCC values between ground-truth labeled and AI-predicted VCDRs were 0.051 (95% confidence interval [CI], 0.047–0.055), 0.86 (95% CI, 0.83–0.88), and 0.84 (95% CI, 0.81–0.86), respectively (Table, Fig. 2). The optimal AI pipeline yielded the highest benchmark Jaccard and Dice indices for cup segmentation on six of the nine datasets (Bin Rushed, Chákṣu, G1020, Magrabi, Messidor, and RIM-ONE DL).22,30,49–55 For disc segmentation, the highest benchmark results were obtained on seven of the nine datasets (all but REFUGE and Drishti-GS).22,30,49–55 To assess the impact of combining the nine public datasets, individual segmentation models were trained and evaluated on each dataset (Supplementary Table S5). In all public datasets, we observed worse cup segmentation performance from the individually trained models than from the combined model. A comparison of the resizing-with-padding techniques for cropping is shown for the public data test sets (Supplementary Table S6). For disc Dice score, adding some padding led to improvement in all datasets, and 25 pixels of padding was marginally better than 50 pixels in five of nine datasets. Resizing to square improved results in four of nine datasets, especially the difficult G1020 dataset, but was worse on Messidor and Magrabi. The AP50 and mAP50:95 values for all test full fundus images in the public data were 0.995 and 0.922, respectively, with all ONHs being successfully detected (Supplementary Figs. S5, S6).
Table. VCDR Evaluation Results for Private and Public Data
Figure 2. AI VCDR predictions versus public data label VCDR. Public test set VCDR scatterplot; G1020 contained many samples without an optic cup segmentation.
On the private data, of the 27,252 fundus images, 18,109 ONHs were successfully processed with a detection threshold of 0.9 (Supplementary Table S7). For fixed time point AI VCDR predictions against OCT VCDR measurements, the MAE, Pearson's R, and CCC were 0.097 (95% CI, 0.095–0.099), 0.80 (95% CI, 0.79–0.81), and 0.66 (95% CI, 0.64–0.67), respectively (Table, Fig. 3). For comparison, the EHR VCDR assessments against OCT VCDR measures had MAE, Pearson's R, and CCC values of 0.086 (95% CI, 0.084–0.087), 0.77 (95% CI, 0.76–0.78), and 0.74 (95% CI, 0.73–0.75), respectively. The AI pipeline thus correlated with OCT more strongly than the EHR assessments did but had lower concordance and higher absolute error. A variant of a Bland–Altman plot showed that the AI predictions typically underestimated OCT VCDRs but otherwise had low variance, except for very small VCDRs (Supplementary Fig. S7). The error distribution of the AI pipeline VCDRs versus the OCT VCDRs was approximately bimodal, with larger errors at the tails (especially for low VCDRs) and small errors in the middle (Supplementary Fig. S8). In general, as the detection threshold increased and more images were filtered out, the performance of the AI pipeline also increased (Supplementary Table S8, Supplementary Fig. S9). Raising the detection threshold from 0.0 to 0.9 reduced the MAE by 11%, from 0.109 to 0.097, but filtered out 33.55% of images.
Figure 3. AI and EHR VCDR predictions versus OCT. (A) Scatterplot of AI VCDR predictions versus OCT VCDR measurements with a detection threshold above 0.9 on the retrospectively collected private data. (B) EHR VCDR assessments against OCT VCDR measurements with a detection threshold above 0.9 on the retrospectively collected private data.
Longitudinal results were evaluated based on the delta change between time points, comparing the difference between modalities (Supplementary Material). The AI pipeline VCDR delta change predictions had an MAE of 0.039 (95% CI, 0.037–0.040) when compared with the OCT VCDR delta change measurements. EHR VCDR delta change assessments compared with OCT VCDR delta changes had an MAE of 0.037 (95% CI, 0.035–0.038). Examples of longitudinal monitoring are shown in Figure 4, which demonstrate the ability of the AI pipeline to capture VCDR increases over time in progressive cases, in addition to providing consistent, repeatable results in stable cases (Supplementary Fig. S10). Furthermore, the AI pipeline's VCDR prediction variability on same-day images (3922 unique images of 1363 eyes) had a mean CV of 2.79%, which is comparable to or below that of OCT devices.56,57
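For reference, the same-day CV reported here can be computed as below (a sketch with assumed column names):

```python
import pandas as pd

def mean_same_day_cv(df: pd.DataFrame) -> float:
    """df: one row per image with columns eye_id, date, vcdr (assumed names).
    CV per (eye, day) = std / mean over that day's repeat images; report the
    mean CV across all eye-days, as a percentage."""
    g = df.groupby(["eye_id", "date"])["vcdr"]
    cv = g.std() / g.mean()
    cv = cv[g.count() > 1]            # only eyes imaged more than once that day
    return float(100 * cv.mean())
```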
Figure 4. Longitudinal example of a patient eye. AI segmentation masks and fundus images are shown above and below the AI prediction values.
Across the entire PAPILA dataset, the AI pipeline had average cup and disc segmentation Dice indices of 0.707 and 0.957, respectively, indicating better performance than the agreement between the expert annotators for disc segmentation but worse for cup segmentation.33 AI pipeline performance on PAPILA gradually improved as the detection threshold increased (Supplementary Table S9). The AI pipeline struggled with cup segmentations for myopic and hyperopic patients: segmentation performance decreased as the absolute value of spherical equivalent increased, except for high myopia, although with only eight samples that group may be an outlier (Supplementary Fig. S11). This is consistent with the reduced performance for peripapillary atrophy (PPA) and tilted disc in the private data (Supplementary Table S8, Supplementary Fig. S12).
Discussion
Accurately and precisely obtaining ONH measurements such as VCDR on diverse fundus imaging is feasible and could aid in glaucoma assessment and longitudinal monitoring worldwide. For clinical applications or tests where identifying gradual progression is important, our AI pipeline provides repeatable measurements with performance comparable to that of experts. Our results on the retrospective private data show applicability with VCDR assessment, which could also translate to other segmentation-based ONH measurements, although only VCDR is evaluated using the AI pipeline. This provides avenues for care and research to be performed in underserved communities where OCT and expert assessments may be scarce. Moreover, the AI pipeline was trained on numerous public datasets from around the world and yielded near-expert performance on completely new private and public datasets. Thus, our pipeline is adaptable to different patient populations and image characteristics without requiring additional training, although it can still have relatively small performance differences across varying demographics (Supplementary Table S10). 
Typically, a major hurdle in deploying AI-based solutions in real-world tasks is getting DL models to adapt to new, heterogeneous data. Because the private data were demographically heterogeneous, three main techniques were used to combat this issue. First, it was vital to train on diverse datasets. Individually trained segmentation models performed worse compared to the segmentation model trained on all of the datasets combined (Supplementary Table S5). Simply by adding more datasets, the DL models were able to become more agnostic to the spurious features of demographics and imaging devices and instead focus on the morphological signal. Second, pixel-level augmentations were applied during training to further increase the diversity of the data. Third, and our novel contribution in this space, was the cropping and square resizing with padding strategy used to optimally standardize images for the segmentation model. 
We compared our cropping and square resizing with padding against cropping without square resizing and with varying amounts of padding on the public datasets (Supplementary Table S6). Resizing the cropped image to a square outperformed not resizing on G1020, potentially because G1020 is a difficult dataset; consistent inputs may help with difficult segmentations. Samples from the G1020 dataset showed that some fundus images can be very off-center, with ONH/optic disc margins less distinct from the peripapillary retina, making segmentation difficult without standardizing the input (Supplementary Fig. S6). Adding no padding was equivalent or marginally worse in all datasets compared with resized and padded models, in part because padding allows the detection model some inaccuracy without excluding any disc pixels. Across the entire test set, the best performance was achieved with 25 pixels of padding in the proposed method. Although the improvement was very marginal, or even slightly negative, in some datasets, we showed that in general there is an optimal amount of padding, though the exact value may be architecture or dataset dependent. Past researchers have used various cropping techniques22,24,29 but, to our knowledge, have not explored square resizing with a specific amount of padding to optimally standardize inputs for the segmentation step. Although this pipeline relies on the high-quality detection necessary to provide consistent crops, ONH detection is a relatively easy task, as evidenced by the high mAP50:95 score on public test data (0.922). Additionally, a high detection threshold can be employed if accuracy and precision are paramount, such as when measuring incremental change. The performance gain from increasing the detection threshold is roughly proportional to the number of images that will be discarded, until very few images remain and the statistics become unstable (Supplementary Fig. S9).
Progression monitoring through longitudinal analysis of VCDR is a key component of glaucoma care for many reasons, including treatment response, but requires precise measurements to capture incremental change. Because the AI pipeline showed MAE performance nearly equivalent to that of EHR assessments when predicting delta change compared to OCT, progression monitoring is feasible near the expert level. Therefore, AI-based VCDR measurements could serve as a reliable tool for disease progression monitoring. The low CV of the AI pipeline on same-day predictions demonstrates good repeatability, which provides confidence that predicted progression is due to morphological change and not image or environmental artifacts. Although same-day images can vary greatly in terms of image characteristics, the AI pipeline is relatively agnostic to these differences. Examples of progression monitoring are shown in Supplementary Figure S10, which demonstrates the ability to capture incremental VCDR changes on longitudinal cases. 
There are several key limitations to the study and AI pipeline. First, with regard to the study, the private data included only glaucoma suspects and those diagnosed with primary and secondary glaucoma. However, the PAPILA dataset does include both glaucomatous and non-glaucomatous eyes. Although the training data were mostly non-glaucomatous (Supplementary Table S1), the AI pipeline surprisingly performed better on glaucomatous eyes (Supplementary Table S9). Second, only two external validation datasets were used to collect results. Although the private data represented a relatively large and heterogeneous dataset, and PAPILA contained glaucomatous and non-glaucomatous samples, more validation on external datasets would greatly increase confidence in the robustness of the pipeline. 
The AI pipeline itself also has some limitations. First, to perform adequately it relies on high-quality input and may filter out many samples that do not pass the detection threshold. The detection threshold of 0.9 used to ensure high-quality input can exclude a significant number of images, depending on the quality of the dataset: a third of fundus images (33.55%) were filtered out of the private data, and 33.61% were filtered out of PAPILA. However, evaluation of performance as the detection threshold varied showed only marginal performance improvement as the threshold increased (Supplementary Tables S8, S9).
Second, the AI pipeline struggles with myopic and hyperopic eyes (Supplementary Table S9, Supplementary Fig. S11), and specifically with anatomical changes such as PPA and tilted discs in the private dataset (Supplementary Fig. S12). Future work should focus on improving performance for these patients. Third, the AI pipeline struggles with very low and very high VCDR values, producing a bimodal error distribution concentrated at the tails (especially very low VCDRs); the EHR-versus-OCT errors, in contrast, are approximately normal or right-skewed (Supplementary Fig. S8). Because optic cup and disc segmentation datasets often have relatively high interrater variability, especially for cup segmentations,33,35 a model that favors more conservative predictions may generalize better.
Fourth, the AI pipeline can make cropping errors and slightly alter anatomical features during the resizing step before segmentation, which could lead to predictions based on distorted features, even though the segmentations are recovered onto the original images and measurements are calculated from there. This is shown in Supplementary Figure S13, where PPA confused the detection model, leading to segmentation of the PPA as disc, and a tilted disc received a circular segmentation when resized even though the actual disc was ellipsoidal. Fifth, and possibly related, the AI pipeline generally underpredicted VCDR values on the private data, except for very low values. Although this could be due to anatomical distortion, other factors could contribute, such as the tendency for fundus disc segmentations to be enlarged relative to OCT measurements while cup segmentations remain roughly similar.58 Sixth, performance can differ somewhat across demographic groups. The largest difference was between Caucasian patients, with a predicted VCDR MAE of 0.119 against OCT, and African American patients, with a predicted VCDR MAE of 0.083 against OCT (Supplementary Table S10).
In summary, the findings of this retrospective study demonstrate that AI fundus cup and disc segmentation pipelines can obtain ONH measurements such as VCDR comparable to or above expert level when evaluated against OCT. The use of diverse training data with our homogenizing cropping technique showed enhanced segmentation results on established datasets and matched expert performance on the private data and PAPILA dataset. The low error for both fixed time points and longitudinal changes, along with high repeatability, demonstrates the ability to measure true cupping progression and could aid in longitudinal studies. With the ability to obtain these metrics, AI pipelines for fundus imaging could provide new opportunities to improve care worldwide, without large performance degradation or need for additional model training. 
Acknowledgments
Supported by unrestricted departmental funding from Research to Prevent Blindness (New York, NY). 
Disclosure: S. Kinder, None; S. McNamara, Evolution Optiks (C); C. Clark, None; B. Bearce, None; U. Thakuria, None; Y.A. Veturi, None; G. Deitz, None; T.E. de Carlo Forest, Genentech (C); N. Mandava, Soma Logic (C, R), ONL Therapeutics (C), Alcon (P), 2C Tech (P, I, O), Aurea Medical (I, O); M.Y. Kahook, New World Medical (R, C), Alcon (R), SpyGlass Pharma (O, C, R); P. Singh, None; J. Kalpathy-Cramer, Boston AI Lab (R), Genentech (F), GE Healthcare (F), Siloam Vision (C) 
References
Stein JD, Khawaja AP, Weizer JS. Glaucoma in adults—screening, diagnosis, and management: a review. JAMA. 2021; 325(2): 164–174. [CrossRef] [PubMed]
Tham Y-C, Li X, Wong TY, Quigley HA, Aung T, Cheng C-Y. Global prevalence of glaucoma and projections of glaucoma burden through 2040: a systematic review and meta-analysis. Ophthalmology. 2014; 121(11): 2081–2090. [CrossRef] [PubMed]
Tatham AJ, Weinreb RN, Medeiros FA. Strategies for improving early detection of glaucoma: the combined structure–function index. Clin Ophthalmol. 2014; 8: 611–621. [PubMed]
Fleming C, Whitlock EP, Beil T, Smit B, Harris RP. Screening for primary open-angle glaucoma in the primary care setting: an update for the US preventive services task force. Ann Fam Med. 2005; 3(2): 167–170. [CrossRef] [PubMed]
Hart WM, Yablonski M, Kass MA, Becker B. Multivariate analysis of the risk of glaucomatous visual field loss. Arch Ophthalmol. 1979; 97(8): 1455–1458. [CrossRef] [PubMed]
Drance SM, Schulzer M, Thomas B, Douglas GR. Multivariate analysis in glaucoma: use of discriminant analysis in predicting glaucomatous visual field damage. Arch Ophthalmol. 1981; 99(6): 1019–1022. [CrossRef] [PubMed]
Leske MC, Connell AM, Wu S-Y, et al. Incidence of open-angle glaucoma: the Barbados Eye Studies. Arch Ophthalmol. 2001; 119(1): 89–95. [PubMed]
Gordon MO, Beiser JA, Brandt JD, et al. The Ocular Hypertension Treatment Study: baseline factors that predict the onset of primary open-angle glaucoma. Arch Ophthalmol. 2002; 120(6): 714–720. [CrossRef] [PubMed]
Le A, Mukesh BN, McCarty CA, Taylor HR. Risk factors associated with the incidence of open-angle glaucoma: the visual impairment project. Invest Ophthalmol Vis Sci. 2003; 44(9): 3783–3789. [CrossRef] [PubMed]
Crowston J, Hopley C, Healey P, Lee A, Mitchell P. The effect of optic disc diameter on vertical cup to disc ratio percentiles in a population based cohort: the Blue Mountains Eye Study. Br J Ophthalmol. 2004; 88(6): 766. [CrossRef] [PubMed]
Bengtsson B, Heijl A. A long-term prospective study of risk factors for glaucomatous visual field loss in patients with ocular hypertension. J Glaucoma. 2005; 14(2): 135–138. [CrossRef] [PubMed]
Tsutsumi T, Tomidokoro A, Araie M, Iwase A, Sakai H, Sawaguchi S. Planimetrically determined vertical cup/disc and rim width/disc diameter ratios and related factors. Invest Ophthalmol Vis Sci. 2012; 53(3): 1332–1340. [CrossRef] [PubMed]
Varma R, Steinmann WC, Scott IU. Expert agreement in evaluating the optic disc for glaucoma. Ophthalmology. 1992; 99(2): 215–221. [CrossRef] [PubMed]
Harizman N, Oliveira C, Chiang A, et al. The ISNT rule and differentiation of normal from glaucomatous eyes. Arch Ophthalmol. 2006; 124(11): 1579–1583. [CrossRef] [PubMed]
Spaeth GL, Henderer J, Liu C, et al. The disc damage likelihood scale: reproducibility of a new method of estimating the amount of optic nerve damage caused by glaucoma. Trans Am Ophthalmol Soc. 2002; 100: 181. [PubMed]
Kumar JH, Seelamantula CS, Kamath YS, Jampala R. Rim-to-disc ratio outperforms cup-to-disc ratio for glaucoma prescreening. Sci Rep. 2019; 9(1): 7099. [CrossRef] [PubMed]
Formichella P, Annoh R, Zeri F, Tatham AJ. The role of the disc damage likelihood scale in glaucoma detection by community optometrists. Ophthalmic Physiol Opt. 2020; 40(6): 752–759. [CrossRef] [PubMed]
Gomez-Ulla F, Alonso F, Aibar B, Gonzalez F. A comparative cost analysis of digital fundus imaging and direct fundus examination for assessment of diabetic retinopathy. Telemed J E Health. 2008; 14(9): 912–918. [CrossRef] [PubMed]
Shanmugam M, Mishra D, Madhukumar R, Ramanjulu R, Reddy S, Rodrigues G. Fundus imaging with a mobile phone: a review of techniques. Indian J Ophthalmol. 2014; 62(9): 960–962. [CrossRef] [PubMed]
Chan WH, Shilling JS, Michaelides M. Optical coherence tomography: an assessment of current training across all levels of seniority in 8 ophthalmic units in the United Kingdom. BMC Ophthalmol. 2006; 6(1): 33. [CrossRef] [PubMed]
Coan LJ, Williams BM, Krishna Adithya V, et al. Automatic detection of glaucoma via fundus imaging and artificial intelligence: a review. Surv Ophthalmol. 2023; 68(1): 17–41. [CrossRef] [PubMed]
Kim J, Tran L, Chew EY, Antani S. Optic Disc and Cup Segmentation for Glaucoma Characterization Using Deep Learning. Piscataway, NJ: Institute of Electrical and Electronics Engineers; 2019: 489–494.
Fernandez-Granero M, Sarmiento A, Sanchez-Morillo D, Jiménez S, Alemany P, Fondón I. Automatic CDR estimation for early glaucoma diagnosis. J Healthc Eng. 2017; 2017: 5953621. [CrossRef] [PubMed]
Guo J, Azzopardi G, Shi C, Jansonius NM, Petkov N. Automatic determination of vertical cup-to-disc ratio in retinal fundus images for glaucoma screening. IEEE Access. 2019; 7: 8527–8541. [CrossRef]
Wang J, Xia B. CDRNet: accurate cup-to-disc ratio measurement with tight bounding box supervision in fundus photography using deep learning. Multimedia Tools Appl. 2023; 82(11): 16455–16477. [CrossRef]
Romero M, Lim V, Loon SC. Reliability of graders and comparison with an automated algorithm for vertical cup-disc ratio grading in fundus photographs. Ann Acad Med Singap. 2019; 48(9): 282–289. [PubMed]
Rasheed HA, Davis T, Morales E, et al. RimNet: a deep neural network pipeline for automated identification of the optic disc rim. Ophthalmol Sci. 2023; 3(1): 100244. [CrossRef] [PubMed]
Nugroho HA, Kirana T, Pranowo V, Hutami AHT. Optic cup segmentation using adaptive threshold and morphological image processing. Commun Sci Technol. 2019; 4(2): 63–67. [CrossRef]
Mittapalli PS, Kande GB. Segmentation of optic disk and optic cup from digital fundus images for the assessment of glaucoma. Biomed Signal Process Control. 2016; 24: 34–46. [CrossRef]
Das S, Jain A, Durai A, Gabbita S, Vasantharao A, Kotha V. Cross-dataset evaluation of multimodal neural networks for glaucoma diagnosis. In: 2022 IEEE 9th International Conference on Data Science and Advanced Analytics (DSAA). Piscataway, NJ: Institute of Electrical and Electronics Engineers; 2022: 1–2.
Li C, Chua J, Schwarzhans F, et al. Assessing the external validity of machine learning-based detection of glaucoma. Sci Rep. 2023; 13(1): 558. [CrossRef] [PubMed]
Chaurasia AK, Greatbatch CJ, Hewitt AW. Diagnostic accuracy of artificial intelligence in glaucoma screening and clinical practice. J Glaucoma. 2022; 31(5): 285–299. [CrossRef] [PubMed]
Kovalyk O, Morales-Sánchez J, Verdú-Monedero R, Sellés-Navarro I, Palazón-Cabanes A, Sancho-Gómez JL. PAPILA: dataset with fundus images and clinical data of both eyes of the same patient for glaucoma assessment. Sci Data. 2022; 9(1): 291. [CrossRef] [PubMed]
Almazroa A, Alodhayb S, Osman E, et al. Retinal fundus images for glaucoma analysis: the RIGA dataset. In: Proceedings of SPIE 10579: Medical Imaging 2018: Imaging Informatics for Healthcare, Research, and Application. Bellingham, WA: SPIE; 2018: 55–62.
Kumar JH, Seelamantula CS, Gagan J, et al. Chákṣu: a glaucoma specific fundus image database. Sci Data. 2023; 10(1): 70. [CrossRef] [PubMed]
Sivaswamy J, Krishnadas S, Joshi GD, Jain M, Tabish AUS. Drishti-GS: Retinal Image Dataset for Optic Nerve Head (ONH) Segmentation. Piscataway, NJ: Institute of Electrical and Electronics Engineers; 2014: 53–56.
Bajwa MN, Singh GAP, Neumeier W, Malik MI, Dengel A, Ahmed S. G1020: A Benchmark Retinal Fundus Image Dataset for Computer-Aided Glaucoma Detection. Piscataway, NJ: Institute of Electrical and Electronics Engineers; 2020: 1–7.
Zhang Z, Yin FS, Liu J, et al. ORIGA-light: An Online Retinal Fundus Image Database for Glaucoma Analysis and Research. Piscataway, NJ: Institute of Electrical and Electronics Engineers; 2010: 3065–3068.
Orlando JI, Fu H, Breda JB, et al. REFUGE challenge: a unified framework for evaluating automated methods for glaucoma assessment from fundus photographs. Med Image Anal. 2020; 59: 101570. [CrossRef] [PubMed]
Batista FJF, Diaz-Aleman T, Sigut J, Alayon S, Arnay R, Angel-Pereira D. RIM-ONE DL: a unified retinal image database for assessing glaucoma using deep learning. Image Anal Stereol. 2020; 39(3): 161–167. [CrossRef]
Warfield SK, Zou KH, Wells WM. Simultaneous Truth and Performance Level Estimation (STAPLE): an algorithm for the validation of image segmentation. IEEE Trans Med Imaging. 2004; 23: 903–921. [CrossRef] [PubMed]
Althomali TA. Relative proportion of different types of refractive errors in subjects seeking laser vision correction. Open Ophthalmol J. 2018; 12(1): 53–62. [CrossRef] [PubMed]
Varghese R, Sambath M. YOLOv8: a novel object detection algorithm with enhanced performance and robustness. In: International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS). 2024: 1–6.
Nugraha MH, Chahyati D. Tourism object detection around Monumen Nasional (Monas) using YOLO and RetinaNet. In: 2020 International Conference on Advanced Computer Science and Information Systems (ICACSIS). Piscataway, NJ: Institute of Electrical and Electronics Engineers; 2020: 317–322.
Kshirsagar V, Bhalerao RH, Chaturvedi M. Modified YOLO module for efficient object tracking in a video. IEEE Latin Am Trans. 2023; 21: 389–398. [CrossRef]
Santos C, Aguiar M, Welfer D, Belloni B. A new approach for detecting fundus lesions using image processing and deep neural network architecture based on YOLO model. Sensors. 2022; 22: 6441. [CrossRef] [PubMed]
Cheng B, Misra I, Schwing AG, Kirillov A, Girdhar R. Masked-attention mask transformer for universal image segmentation. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2022: 1280–1289.
Liu Z, Lin Y, Cao Y, et al. Swin Transformer: hierarchical vision transformer using shifted windows. IEEE/CVF International Conference on Computer Vision (ICCV). 2021: 9992–10002.
Sharma A, Agrawal M, Dutta Roy S, Gupta V. Inter-dataset performance analysis of generative adversarial networks for optic disc segmentation using digital fundus images. Res Biomed Eng. 2023; 39: 863–875. [CrossRef]
Morris E, Larrabide I, Orlando JI. Semi-supervised learning with Noisy Students improves domain generalization in optic disc and cup segmentation in uncropped fundus images. Proc Mach Learn Res. 2024; 250: 1056–1072.
Zhou W, Ji J, Jiang Y, Wang J, Qi Q, Yi Y. EARDS: EfficientNet and attention-based residual depth-wise separable convolution for joint OD and OC segmentation. Front Neurosci. 2023; 17: 1139181. [CrossRef] [PubMed]
Mohan D, Harish Kumar JR, Sekhar Seelamantula C. Optic disc segmentation using cascaded multiresolution convolutional neural networks. In: 2019 IEEE International Conference on Image Processing (ICIP). Piscataway, NJ: Institute of Electrical and Electronics Engineers; 2019: 834–838.
Liu Q, Hong X, Ke W, Chen Z, Zou B. DDNet: Cartesian-polar dual-domain network for the joint optic disc and cup segmentation. arXiv preprint arXiv:1904.08773, 2019.
Tadisetty S, Chodavarapu R, Jin R, Clements RJ, Yu M. Identifying the edges of the optic cup and the optic disc in glaucoma patients by segmentation. Sensors (Basel). 2023; 23: 4668. [CrossRef] [PubMed]
Jeon YS, Yang H, Feng M. FCSN: global context aware segmentation by learning the Fourier coefficients of objects in medical images. IEEE J Biomed Health Inform. 2024; 28: 1195–1206. [CrossRef] [PubMed]
Savini G, Carbonelli M, Parisi V, Barboni P. Repeatability of optic nerve head parameters measured by spectral-domain OCT in healthy eyes. Ophthalmic Surg Lasers Imaging. 2011; 42: 209–215. [CrossRef] [PubMed]
Agrawal A, Baxi J, Calhoun W, et al. Optic nerve head measurements with optical coherence tomography: a phantom-based study reveals differences among clinical devices. Invest Ophthalmol Vis Sci. 2016; 57: OCT413–OCT420. [CrossRef] [PubMed]
Sharma A, Oakley JD, Schiffman JC, Budenz DL, Anderson DR. Comparison of automated analysis of Cirrus HD OCT spectral-domain optical coherence tomography with stereo photographs of the optic disc. Ophthalmology. 2011; 118: 1348–1357. [CrossRef] [PubMed]