Artificial Intelligence | October 2024 | Volume 13, Issue 10 | Open Access
Clinical Utility of Deep Learning Assistance for Detecting Various Abnormal Findings in Color Retinal Fundus Images: A Reader Study
Author Affiliations & Notes
  • Joo Young Shin
    Department of Ophthalmology, Seoul Metropolitan Government Seoul National University Boramae Medical Centre, Seoul, Republic of Korea
  • Jaemin Son
    VUNO Inc., Seoul, Republic of Korea
  • Seo Taek Kong
    VUNO Inc., Seoul, Republic of Korea
  • Jeonghyuk Park
    VUNO Inc., Seoul, Republic of Korea
  • Beomhee Park
    VUNO Inc., Seoul, Republic of Korea
  • Kyu Hyung Park
    Department of Ophthalmology, Seoul National University College of Medicine, Seoul National University Bundang Hospital, Seongnam, Republic of Korea
  • Kyu-Hwan Jung
    VUNO Inc., Seoul, Republic of Korea
    Department of Medical Device Research and Management, Samsung Advanced Institute for Health Sciences and Technology, Sungkyunkwan University, Seoul, Republic of Korea
  • Sang Jun Park
    Department of Ophthalmology, Seoul National University College of Medicine, Seoul National University Bundang Hospital, Seongnam, Republic of Korea
  • Correspondence: Sang Jun Park, Department of Ophthalmology, Seoul National University College of Medicine, Seoul National University Bundang Hospital, 82, Gumi-ro 173 Beon-gil, Bundang-gu, Seongnam-si, Gyeonggi-do 13415, Republic of Korea. e-mail: [email protected] 
  • Kyu-Hwan Jung, Department of Medical Device Research and Management, Samsung Advanced Institute for Health Sciences and Technology, Sungkyunkwan University, 115 Irwon-ro, Gangnam-gu, Seoul 06355, Republic of Korea. e-mail: [email protected] 
  • Footnotes
     JYS and JS contributed equally to this work.
Translational Vision Science & Technology October 2024, Vol.13, 34. doi:https://doi.org/10.1167/tvst.13.10.34
Abstract

Purpose: To evaluate the clinical usefulness of a deep learning–based detection device for multiple abnormal findings on retinal fundus photographs for readers with varying expertise.

Methods: Fourteen ophthalmologists (six residents, eight specialists) assessed 399 fundus images with respect to 12 major ophthalmologic findings, with or without the assistance of a deep learning algorithm, in two separate reading sessions. Sensitivity, specificity, and reading time per image were compared.

Results: With algorithmic assistance, readers' sensitivity improved significantly for all 12 findings (P < 0.05), but specificity tended to decrease, significantly so (P < 0.05) for hemorrhage, drusen, membrane, and vascular abnormality and more profoundly in residents. Sensitivity without algorithmic assistance was significantly lower in residents (23.1%–75.8%) than in specialists (55.1%–97.1%) for nine findings, but it improved to similar levels with algorithmic assistance (67.8%–99.4% in residents, 83.2%–99.5% in specialists), with only hemorrhage remaining statistically significantly lower. Variances in sensitivity were significantly reduced for all findings. Reading time per image decreased for images with fewer than three findings, more profoundly in residents. In a simulation based on images acquired from a health screening center, average reading time was estimated to be reduced by 25.9% (from 16.4 to 12.1 seconds per image) for residents and by 2.0% (from 9.6 to 9.4 seconds) for specialists.

Conclusions: Deep learning–based computer-assisted detection devices increase sensitivity, reduce inter-reader variance in sensitivity, and reduce reading time in less complicated images.

Translational Relevance: This study evaluated how algorithmic assistance in detecting abnormal findings on retinal fundus photographs influences clinicians, which may help predict its impact in clinical application.

Introduction
Retinal fundus examination allows recognition of potentially vision-threatening conditions such as diabetic retinopathy, age-related macular degeneration, and glaucoma.1–4 Driven by recent developments in deep learning technologies,5 attempts to automate retinal fundus screening procedures have been implemented to detect ophthalmic diseases from macula-centered color fundus photographs.6–11 These may help address the previously questioned cost-effectiveness of retinal fundus screening programs, which stems from a lack of trained ophthalmologists to interpret these images. In an effort to estimate the impact of these algorithms on human readers, one study compared the performance of readers in controlled experimental settings with and without algorithmic assistance,12 and the results showed that algorithmic assistance could increase sensitivity while maintaining specificity. 
Previous studies reporting outstanding standalone performance of deep learning–based models on macula-centered fundus images have focused on identifying the presence of individual ocular diseases. However, the opacity of these models with regard to how they reach their decisions has been a leading cause of reluctance to use them as computer-aided diagnosis systems. Another limitation of preexisting algorithms for fundus image analysis is that they examine only a few ophthalmologic diseases (e.g., diabetic retinopathy), whereas more comprehensive coverage of multiple common abnormal retinal conditions is necessary for practical deployment of these systems in clinical settings. Instead of making direct assessments of a specific disease, aiming to recognize lesions of various major ophthalmologic abnormal findings in macula-centered color fundus photographs13 offers some benefits in clinical application. This approach may enable screening for retinal pathologies beyond the trained diseases, such as diabetic retinopathy, age-related macular degeneration, and glaucoma, among the many diseases that can lead to vision impairment and blindness.14,15 Furthermore, identification of lesions may mitigate the inter-reader variability inherent in grading the severity of a specific disease (e.g., diabetic retinopathy), as readers may not agree on the consequent severity grade but may concur on the existence of lesions.16–21 Also, identifying findings rather than directly presenting a diagnosis may aid ophthalmologists in interpreting the algorithm's decisions. 
Various algorithms have been developed and are being considered for application in various clinical conditions, including as screening programs where ophthalmologists are unavailable and as triage systems to lighten the workload of ophthalmologists. VUNO Med-Fundus AI is an algorithmic detection device designed to detect multiple abnormal retinal fundus findings, and it has been reported to achieve high sensitivity and specificity.13 However, little is known about how such algorithmic devices can affect clinical decisions in clinical applications. In this study, we investigated the effectiveness of deep learning–based guidance for readers in assessing major abnormal findings in retinal fundus photographs, evaluating the effect of algorithmic guidance on the performance of clinicians with various levels of expertise when using computer-assisted detection devices. The impact of algorithmic assistance on sensitivity, specificity, and reading time was measured, with comparisons made between an experienced reader group and a less experienced reader group. 
Methods
Deep Learning Algorithm
VUNO Med-Fundus AI is the commercially available, advanced version of a previously reported deep learning algorithm that classifies and localizes 12 major ophthalmologic findings (hemorrhage, hard exudate, cotton wool patch, drusen, membrane, macular hole, myelinated nerve fiber, chorioretinal atrophy or scar, any vascular abnormality, retinal nerve fiber layer defect, glaucomatous disc change, and non-glaucomatous disc change).13,22 Details of the algorithm are available in the Supplementary Material.
Study Images and Establishing the Reference Standards
The images used in the study were collected from a previously established dataset consisting of images from an ophthalmology clinic in a referral hospital setting. Details on the acquisition of study images and establishment of the reference standard are available in the Supplementary Material. Supplementary Table S1 presents demographic information for the study dataset. We used 399 labeled images containing at least 30 cases of every finding except myelinated nerve fiber (26 cases), as 25 was the calculated minimum sample size for a margin of error of 0.2. The total number of images was restricted to what could be evaluated by a single reader in a single session, to rule out confounding factors such as reader fatigue and the need for multiple reading sessions. The institutional review board at Seoul National University Bundang Hospital approved the usage of the data for this study (B-1703-386-103). Written consent was waived by the institutional review board due to the retrospective nature of the study, and the study was conducted in accordance with the tenets of the Declaration of Helsinki. 
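For context, the stated minimum of 25 cases per finding is consistent with the standard normal-approximation sample-size formula for estimating a proportion, if one assumes a 95% confidence level and the most conservative proportion p = 0.5 (these assumptions are ours; the text states only the margin of error of 0.2):

$$ n \;=\; \frac{z_{\alpha/2}^{2}\, p\,(1-p)}{E^{2}} \;=\; \frac{1.96^{2}\times 0.5\times 0.5}{0.2^{2}} \;\approx\; 24.0, \quad \text{rounded up to } 25. $$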
Reading Procedures in the Reader Study
Fourteen ophthalmologists participated as readers in the reader study: six were first-year residents training in ophthalmology at the time of the study, and eight were ophthalmologists licensed in South Korea (five retina specialists with 4 to 12 years of experience and three glaucoma specialists with 4 to 9 years of experience). All 14 readers assessed the same set of 399 fundus images in two separate reading sessions, with and without the assistance of algorithmic predictions, in random order. The two reading sessions were separated by a washout period of at least 2 weeks, and the images were randomly shuffled for every session. Readers were masked to the performance results of the algorithm and were told before the study that the performance of the algorithm was questionable. The same reading tool was used in both sessions, with and without algorithmic assistance (Supplementary Fig. S1). For the sessions with algorithmic assistance, the reading tool displayed the prediction results of the deep learning–based algorithm superimposed on the fundus image on the left side, with the findings listed on the right panel. All identified findings were displayed initially, but when the reader hovered the mouse over a finding on the right panel, only the corresponding contour was shown. Readers were asked to modify the predicted presence of the findings in the right panel. For the sessions without algorithmic assistance, only the fundus image was shown with the list of findings on the right panel, and the reader was asked to choose the findings present. Reading time per image was recorded. All reading was performed at the same reading center under identical conditions for all readers. Display monitors with a resolution of 3840 × 2160 pixels were used with desktop computers with sufficient computing and memory capacity for reading without delays (Intel Core i7-10700 processor at 2.90 GHz and 16 GB RAM; Intel Corporation, Santa Clara, CA). No external network latency existed, as each computer was directly connected to the server via a local private network. 
Age and gender for each fundus image were provided in the display. Information on all functions of the reading tool, including a shortcut key for turning the contours of detected lesions on and off and the use of mouse scrolling to adjust contrast and size, was provided to the readers in detail before the study. The readers were allowed time to practice with the reading tool until they felt comfortable with it. The institutional review board at Seoul Metropolitan Government Seoul National University Boramae Hospital approved the protocol for the reader study (IRB No. 30-2020-156). Written consent was waived by the institutional review board due to the retrospective nature of the study, and the study was conducted in accordance with the tenets of the Declaration of Helsinki. 
Simulation of Reading Time in a Health Screening Center Setting
To evaluate the realistic benefits of the algorithm-assisted reading tool in a clinical setting, we analyzed a set of 4628 fundus images acquired from the Seoul National University Bundang Hospital healthcare screening center over 2 consecutive years to estimate improvements in a real-world clinical environment. Predictions were generated by the algorithm to derive the distribution of the number of predicted findings in these images. Then, using each reader's average reading time according to the number of findings per image, the total reading time for the whole dataset was estimated. The difference in total reading time with and without algorithmic assistance was evaluated. 
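A minimal sketch of this kind of simulation, assuming hypothetical per-image prediction counts and per-group average reading times (the variable names and numbers below are illustrative placeholders, not study data), might look as follows:

```python
from collections import Counter

def estimate_total_time(finding_counts, avg_time_by_k):
    """Weight each image's expected reading time by how often its
    predicted finding count occurs in the screening dataset."""
    dist = Counter(finding_counts)
    return sum(n_images * avg_time_by_k[k] for k, n_images in dist.items())

# Hypothetical inputs (placeholders, not study data):
# number of algorithm-predicted findings for each screening-center image
finding_counts = [0, 0, 0, 1, 0, 2, 0, 1, 3, 0]
# a reader group's mean seconds per image, by predicted finding count
time_assisted   = {0: 6.0, 1: 18.0, 2: 25.0, 3: 40.0}
time_unassisted = {0: 16.0, 1: 24.0, 2: 33.0, 3: 35.0}

with_ai = estimate_total_time(finding_counts, time_assisted)
without_ai = estimate_total_time(finding_counts, time_unassisted)
print(f"Estimated reduction in total reading time: "
      f"{100 * (1 - with_ai / without_ai):.1f}%")
```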
Statistical Analyses
Sensitivities and specificities were measured for the deep learning algorithm and for the human readers in both settings and compared. The confidence intervals (CIs) for the algorithm were computed using the Clopper–Pearson exact method, and unbiased standard deviations were measured directly from the results of individual readers. Two-sided paired-sample t-tests were used to compare reader sensitivity and specificity with and without algorithmic assistance, and two-sided one-sample t-tests were used to compare the performance of readers with that of the algorithm. Two-sided Welch's t-tests were used to compare differences between residents and specialists, assuming that variances differed between the groups. One-way analyses of variance (ANOVAs) were used to test whether variances differed between the two groups. For the comparison of reading time with and without algorithmic assistance, the two-sided one-sample t-test was used. Mixed-effects regression analyses were conducted to quantify the effects of various factors (i.e., existence of algorithmic assistance, number of findings in the case, reading session, readers, and cases) on sensitivity, specificity, and reading time. Logistic regressors were used for sensitivity and specificity, and linear regressors were used for reading time. All factors were set as fixed-effects variables, and the regression coefficients and statistical significance were observed. We used the statistical modules in SciPy 1.2.1 (https://www.scipy.org) for the statistical tests, and statsmodels 0.12.1 (https://www.statsmodels.org) was used for mixed-effects regression with Python 3.6 (http://www.python.org). 
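As an illustration only, the kinds of SciPy and statsmodels calls described above could be assembled as follows; the data, column names, and operating values are placeholders of our own, not the study's:

```python
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.proportion import proportion_confint
import statsmodels.formula.api as smf

# Clopper-Pearson exact 95% CI for the algorithm's sensitivity on one finding
# (method="beta" gives the exact Clopper-Pearson interval).
ci_low, ci_high = proportion_confint(count=28, nobs=30, alpha=0.05, method="beta")

# Paired-sample t-test: the same readers with vs. without assistance.
sens_without = np.array([0.62, 0.55, 0.71, 0.48, 0.66, 0.59])
sens_with    = np.array([0.81, 0.78, 0.85, 0.74, 0.80, 0.79])
t_paired, p_paired = stats.ttest_rel(sens_with, sens_without)

# One-sample t-test: readers' sensitivities vs. the algorithm's fixed sensitivity.
t_one, p_one = stats.ttest_1samp(sens_with, popmean=0.83)

# Welch's t-test: residents vs. specialists, unequal variances assumed.
t_welch, p_welch = stats.ttest_ind(sens_with[:3], sens_with[3:], equal_var=False)

# Fixed-effects regressions on simulated per-decision records: logistic for
# correctness (sensitivity/specificity) and linear for reading time.
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "assisted": rng.integers(0, 2, n),
    "n_findings": rng.integers(0, 4, n),
    "session": rng.integers(1, 3, n),
})
p_correct = 1 / (1 + np.exp(-(-0.2 + 1.0 * df["assisted"] - 0.1 * df["n_findings"])))
df["correct"] = rng.binomial(1, p_correct)
df["seconds"] = 8 + 3 * df["n_findings"] - 2 * df["assisted"] + rng.normal(0, 2, n)

logit_fit = smf.logit("correct ~ assisted + n_findings + session", data=df).fit(disp=0)
ols_fit = smf.ols("seconds ~ assisted + n_findings + session", data=df).fit()
print(logit_fit.params)
print(ols_fit.params)
```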
Results
Effects of Algorithmic Assistance in Sensitivity and Specificity
Changes in sensitivity and specificity with algorithmic assistance are illustrated in Figure 1, and the exact point estimates, CIs, and P values are listed in Table 1 (Supplementary Table S2 shows results for the residents, and Supplementary Table S3 shows results for the specialists). Across all findings, sensitivity tended to increase when readers were provided with algorithmic assistance. Sensitivity gains were statistically significant (P < 0.05) for all findings when all readers were considered together. With algorithmic assistance, residents gained sensitivity significantly for all findings except non-glaucomatous disc change, and specialists experienced significant sensitivity gains for all findings except hemorrhage, chorioretinal atrophy, and glaucomatous disc change. Specificity tended to decrease when readers assessed images with algorithmic assistance, although the decrease was statistically significant only for hemorrhage, drusen, membrane, and vascular abnormality. Among residents, specificity dropped significantly for hemorrhage, whereas specialists lost significant specificity when assessing hard exudate, drusen, membrane, macular hole, vascular abnormality, glaucomatous disc change, and non-glaucomatous disc change. Variances in sensitivity also diminished significantly among readers with algorithmic assistance for all findings, but variance in specificity decreased significantly with algorithmic assistance only for drusen and vascular abnormality (Supplementary Table S4). 
Figure 1. Sensitivity and specificity for algorithmic assistance with and without computer-assisted detection (CAD). Performance of the algorithm is indicated by black crosses. *P < 0.05, **P < 0.01, ***P < 0.001 (paired-sample t-test). CWP, cotton wool patch; RNFL, retinal nerve fiber layer.
Table 1. Sensitivity and Specificity for the 12 Findings for All Readers
Table 2 compares differences in sensitivity and specificity between residents and specialists, with and without algorithmic assistance. Sensitivity without algorithmic assistance was significantly lower in residents (23.1%–75.8%) than in specialists (55.1%–97.1%) for all findings but macular hole, chorioretinal atrophy, and non-glaucomatous disc change. This significant difference in sensitivity between residents and specialists disappeared with algorithmic assistance (67.8%–99.4% in residents and 83.2%–99.5% in specialists), except for hemorrhage. As for specificity, significant differences between residents and specialists were found only for cotton wool patch and macular hole without assistance, and the difference for cotton wool patch lost significance when algorithmic support was provided. 
Table 2. Sensitivity and Specificity Comparison of the 12 Findings Among Residents and Specialists
To evaluate the effects of the algorithm's predictions on the readers' decisions, changes in sensitivity and specificity according to the algorithm's predictions are shown in Supplementary Figure S2. For cases that the algorithm predicted as positive, readers tended to gain sensitivity at the cost of specificity. Conversely, when assessing cases that the algorithm predicted as negative, readers tended to lose sensitivity and gain specificity. Exemplar cases in which readers' decisions were improved by algorithmic assistance are shown in Supplementary Figure S3. 
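For illustration, conditioning reader performance on the algorithm's prediction can be computed along these lines (a minimal sketch with hypothetical column names and toy values, not the study data):

```python
import pandas as pd

# Toy per-decision records for one finding: reference standard, algorithm
# prediction, and the reader's assisted decision (values are placeholders).
df = pd.DataFrame({
    "truth":       [1, 1, 0, 0, 1, 0, 1, 0],
    "algo_pred":   [1, 1, 1, 0, 0, 0, 1, 1],
    "reader_pred": [1, 1, 1, 0, 0, 0, 1, 1],
})

# Stratify by the algorithm's prediction, then compute reader sensitivity and
# specificity within each stratum.
for pred, grp in df.groupby("algo_pred"):
    pos, neg = grp[grp["truth"] == 1], grp[grp["truth"] == 0]
    sens = (pos["reader_pred"] == 1).mean() if len(pos) else float("nan")
    spec = (neg["reader_pred"] == 0).mean() if len(neg) else float("nan")
    print(f"algorithm predicted {pred}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```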
Effects of the Algorithmic Assistance in Reading Time
Figure 2A illustrates the average improvement in reading time according to the number of predicted findings. In cases with fewer than three findings, average reading time decreased with algorithmic assistance. These reductions in reading time were greater in residents (62.9%, 26.4%, and 24.1% for zero, one, and two predicted findings, respectively) than in specialists (37.1%, 2.5%, and 3.5%, respectively). However, reading time increased when the algorithm presented three or more findings. When the improvement in reading time was modeled by linear regression with the number of predicted findings as a single variable, the slope was steeper for residents (regression coefficient = −0.176, P < 0.001) than for specialists (regression coefficient = −0.123, P < 0.001). This indicates that residents are more likely to lose the reading-time benefit of algorithmic assistance in cases with many predicted findings. 
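A minimal sketch of such a single-variable fit, using SciPy's linregress on hypothetical per-case values (the numbers below are illustrative, not the study's):

```python
from scipy import stats

# Hypothetical per-case data: number of algorithm-predicted findings and the
# corresponding improvement (reduction) in reading time with assistance.
n_findings  = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
improvement = [0.63, 0.60, 0.26, 0.28, 0.24, 0.20, -0.05, -0.08, -0.15, -0.20]

fit = stats.linregress(n_findings, improvement)
# A negative slope means the benefit shrinks (and eventually reverses) as the
# number of predicted findings grows.
print(f"slope = {fit.slope:.3f}, p = {fit.pvalue:.3g}")
```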
Figure 2. (A) Improvement in reading time per image by the number of predicted findings. The improvement was defined as the amount of reduction in the average reading time when readers were supported by the algorithm. Each pair marks the average reading time for specialists and residents when a case included the specified number of findings. (B) Distribution of cases in a health screening center simulated scenario by the number of predicted findings. (C) Reading time in the simulated scenario with and without algorithmic assistance. w/, with; w/o, without.
Changes in reading time with algorithmic assistance were simulated in a health screening center database to estimate annual gains in reading time. Figure 2B shows the distribution of the number of predicted findings per image in the dataset acquired from the health screening center. Most cases had fewer than three findings, for which reading time decreased with algorithmic assistance. Figure 2C shows the expected average reading time had the readers assessed these fundus images. The decrease was greater among residents, whose average reading time was reduced from 16.4 to 12.1 seconds per image, a 25.9% reduction, whereas specialists' reading time decreased by only 2.0%, from 9.6 to 9.4 seconds per image. 
Mixed-Effects Regression Analyses
Table 3 summarizes the mixed-effects regression analyses for sensitivity, specificity, and reading time. Whether the reading was assisted by the algorithm was the most significant variable affecting sensitivity and specificity. The positive coefficient for sensitivity suggests that algorithmic assistance improved sensitivity, and the negative coefficient for specificity indicates that readers' specificity decreased with use of the algorithm. The most significant variable affecting reading time was the number of findings in a given case; the positive coefficient indicates that readers took more time to assess cases that included multiple findings. In the second reading session, reading time was slightly longer, by less than a second (regression coefficient = 0.720), and sensitivity and specificity were slightly lower (regression coefficients = −0.096 and −0.089, respectively), indicating only slight experimental bias. Throughout all analyses, the effects of readers and cases were negligible, with coefficients close to 0, suggesting that readers and cases had been appropriately randomized in the study. 
Table 3. Mixed-Effects Regression Analyses for Sensitivity, Specificity, and Reading Time
Discussion
In assessing macula-centered retinal fundus images with respect to 12 major ophthalmologic abnormal findings, readers assisted by algorithmic predictions that included lesion contours gained sensitivity by significant margins without significantly compromising specificity. Reading time decreased in cases with fewer than three findings per image but increased in cases with three or more findings. Mixed-effects regression analyses indicated that use of algorithmic assistance was the most significant variable affecting sensitivity and specificity, and the number of findings per image was the most significant variable affecting reading time. 
In a previous reader study, algorithmic assistance contributed to the improvement of sensitivity in detecting referable diabetic retinopathy, as unassisted readers preferred to assign a lower severity level to ambiguous cases.12 Our study also showed the tendency of unassisted readers toward highly specific operating points. With algorithmic assistance, readers were able to detect pathological abnormalities that were otherwise difficult to identify, and false negatives were reduced. For example, retinal nerve fiber layer defects, which can be missed without careful attention, showed large improvements in sensitivity with algorithmic assistance. In other reader studies with mammography, assisted readers detected cancer more sensitively in dense breasts; unassisted readers had difficulty differentiating between benign tissue and cancer, but the algorithm discriminated between the two with higher accuracy.23,24 Even as sensitivity increased, readers managed to filter out false-positive algorithm predictions, effectively maintaining specificity. This maintenance of specificity in our study suggests that the significant sensitivity gains in the assisted readers did not originate simply from altered operating points. The assisted readers achieved significantly higher specificity than the algorithm in isolation by rejecting false-positive cases when the algorithm was incorrect, which may justify maintaining high-sensitivity operating points for the algorithm, as even residents in training could effectively rule out false-positive predictions. High-sensitivity operating points also align with the rigorous sensitivity standards required in fundus screening examinations.3,25 
Inter-reader variability in readings of medical images is a well-known issue.16–20 In our study, variance in sensitivity among readers was significantly reduced with algorithmic assistance, resulting in more consistent results among readers. In a previous study, inexperienced readers showed higher inter-reader variability when evaluating the severity of diabetic retinopathy.21 Similarly, in our study, residents tended to show greater variance in sensitivity and specificity without algorithmic assistance, but with algorithmic assistance the variance was reduced to levels below those of specialists without assistance. More intriguing are the patterns indicating that the sensitivity of the assisted readers converges to the sensitivity of the algorithm. This implies that algorithmic assistance may help establish consistent diagnostic standards at the desired sensitivity level. 
There was a strong negative linear relationship between the improvement in reading time and the number of findings predicted by the algorithm. Assisted readers took more time than unassisted readers in cases with more than two predicted findings, suggesting that assisted readers saved time only in assessing relatively healthy fundus images. This effect was more pronounced in residents than in specialists, with residents requiring more time at a steeper rate as the number of predicted findings increased, suggesting greater reliance on the algorithmic predictions. Based on qualitative analyses of the readers' usage of the reading tool (Supplementary Video), readers tended to turn contours on and off and to hover over individual findings more often in complicated images with more predicted findings. From this observation, we hypothesize that better user interface (UI) and user experience (UX) designs may help reduce this increase in reading time for images with more findings. Assisted readers took less time in cases with fewer than three predicted findings, indicating a potential role for algorithmic assistance in evaluating fundus images for screening purposes. Given that the majority of cases in screening centers come from asymptomatic healthy people, readers should be able to reduce reading time with algorithmic assistance, as determined in our simulated scenario, with a decrease as great as 25.9% for less experienced readers. Increased sensitivity and reduced total reading time may provide economic benefits in clinics, as longer reading times mean more work for physicians and greater expenses for patients, and overlooked pathologies could have catastrophic results for both patients and physicians. Several studies in pathology align with our results in that the average assisted reader reduced reading time by roughly half in datasets with predominantly benign cases.26,27 In bone age estimation, deep learning–assisted readers were reported to have reduced their reading time by up to 40%.28 With higher sensitivity settings, reliance on negative predictions of the algorithm can be enhanced, which may further reduce total reading time in screening center settings. Future studies may focus on observing the impact of various operating points (e.g., sensitive, specific, balanced) on correctness, inter-reader variability, and reading time and could identify the optimal settings for various clinical needs. 
The mixed-effects regression analyses agree with the other results in demonstrating that use of algorithmic assistance was the most significant variable, increasing sensitivity (positive coefficient) and decreasing specificity (negative coefficient). Complex cases with a greater number of positive findings per the reference standards negatively affected specificity and positively affected reading time, even when use of algorithmic assistance was considered as a separate variable. Increased numbers of positive findings may have caused the readers to include surplus findings, resulting in decreased specificity and increased time to assess each finding, regardless of algorithmic assistance. The specific conditions of the real clinical settings in which computer-assisted diagnosis devices would be used should be considered thoroughly to optimize their usefulness. 
A major shortcoming of this study is that the use of algorithmic assistance could not be masked from the readers, which may have significantly affected the results. The algorithm itself has limitations in that only a limited number of abnormal findings were included, and lesions of varying severity were grouped into single categories. For example, microaneurysms were not discernible from dot hemorrhages on fundus images alone, so they were grouped into hemorrhage, and various levels and forms of hemorrhage were all included in one category. Other limitations include the limited variety of images with regard to ethnicity and disease, and the number of readers may not have been sufficient to detect smaller differences between the groups. The study was conducted on a previously established dataset with high image quality, which may differ from actual clinical settings and limits direct application of the results to real-world practice. Also, this study aimed to evaluate how algorithmic assistance affects clinicians' performance when evaluating fundus photographs only, thus excluding clinical information other than the age and gender of the patient. For this reason, the positive influence of computer-guided detection may have been overestimated in the study setting compared with real-life clinical settings, where various other information aiding clinical decisions is available. Cost analysis was also not performed in this study. Larger prospective studies using algorithmic assistance in real-world clinical settings should be conducted in the future to evaluate its actual effectiveness. 
This study showed that readers could benefit from the assistance of deep learning with regard to sensitivity without compromising specificity when evaluating multiple findings in retinal fundus images. Also, assisted readers could spend less reading time on healthy fundus images, and less experienced readers could save even more time and gain greater sensitivity with algorithmic assistance. Our observations in a simulated scenario, though preliminary, suggest that deep learning assistance could economize fundus examinations in screening centers by reducing reading time, the number of false negatives, and variances in reading performance between readers. 
Acknowledgments
Supported by VUNO Inc. 
Disclosure: J.Y. Shin, None; J. Son, VUNO Inc. (E); S.T. Kong, VUNO Inc. (E); J. Park, VUNO Inc. (E); B. Park, VUNO Inc. (E); K.H. Park, None; K.-H. Jung, VUNO Inc. (F); S.J. Park, VUNO Inc. (F) 
References
1. Early Treatment Diabetic Retinopathy Study Research Group. Grading diabetic retinopathy from stereoscopic color fundus photographs—an extension of the modified Airlie House classification. ETDRS report number 10. Ophthalmology. 1991; 98: 786–806.
2. Zhang X, Saaddine JB, Chou CF, et al. Prevalence of diabetic retinopathy in the United States, 2005–2008. JAMA. 2010; 304: 649–656.
3. Chakrabarti R, Harper CA, Keeffe JE. Diabetic retinopathy management guidelines. Exp Rev Ophthalmol. 2012; 7: 417–439.
4. Solomon SD, Chew E, Duh EJ, et al. Diabetic retinopathy: a position statement by the American Diabetes Association. Diabetes Care. 2017; 40: 412–418.
5. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015; 521: 436–444.
6. Grassmann F, Mengelkamp J, Brandl C, et al. A deep learning algorithm for prediction of age-related eye disease study severity scale for age-related macular degeneration from color fundus photography. Ophthalmology. 2018; 125: 1410–1420.
7. Peng Y, Dharssi S, Chen Q, et al. DeepSeeNet: a deep learning model for automated classification of patient-based age-related macular degeneration severity from color fundus photographs. Ophthalmology. 2019; 126: 565–575.
8. Gargeya R, Leng T. Automated identification of diabetic retinopathy using deep learning. Ophthalmology. 2017; 124: 962–969.
9. Gulshan V, Peng L, Coram M, et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA. 2016; 316: 2402–2410.
10. Ting DSW, Cheung CY, Lim G, et al. Development and validation of a deep learning system for diabetic retinopathy and related eye diseases using retinal images from multiethnic populations with diabetes. JAMA. 2017; 318: 2211–2223.
11. Li Z, He Y, Keel S, Meng W, Chang RT, He M. Efficacy of a deep learning system for detecting glaucomatous optic neuropathy based on color fundus photographs. Ophthalmology. 2018; 125: 1199–1206.
12. Taly A, Joseph A, Sood A, et al. Using a deep learning algorithm and integrated gradient explanation to assist grading for diabetic retinopathy. Ophthalmology. 2019; 126: 552–564.
13. Son J, Shin JY, Kim HD, Jung KH, Park KH, Park SJ. Development and validation of deep learning models for screening multiple abnormal findings in retinal fundus images. Ophthalmology. 2020; 127: 85–94.
14. Congdon N, O'Colmain B, Klaver C, et al. Causes and prevalence of visual impairment among adults in the United States. Arch Ophthalmol. 2004; 122: 477–485.
15. Congdon NG, Friedman DS, Lietman T. Important causes of visual impairment in the world today. JAMA. 2003; 290: 2057–2060.
16. Scott IU, Bressler NM, Bressler SB, et al. Agreement between clinician and reading center gradings of diabetic retinopathy severity level at baseline in a phase 2 study of intravitreal bevacizumab for diabetic macular edema. Retina. 2008; 28: 36–40.
17. Elmore JG, Longton GM, Carney PA, et al. Diagnostic concordance among pathologists interpreting breast biopsy specimens. JAMA. 2015; 313: 1122–1132.
18. Lytwyn A, Salit IE, Raboud J, et al. Interobserver agreement in the interpretation of anal intraepithelial neoplasia. Cancer. 2005; 103: 1447–1456.
19. Ruamviboonsuk P, Teerasuwanajak K, Tiensuwan M, Yuttitham K, Thai Screening for Diabetic Retinopathy Study Group. Interobserver agreement in the interpretation of single-field digital fundus images for diabetic retinopathy screening. Ophthalmology. 2006; 113: 826–832.
20. Elmore JG, Wells CK, Lee CH, Howard DH, Feinstein AR. Variability in radiologists' interpretations of mammograms. N Engl J Med. 1994; 331: 1493–1499.
21. Krause J, Gulshan V, Rahimy E, et al. Grader variability and the importance of reference standards for evaluating machine learning models for diabetic retinopathy. Ophthalmology. 2018; 125: 1264–1272.
22. Park SJ, Shin JY, Kim S, Son J, Jung KH, Park KH. A novel fundus image reading tool for efficient generation of a multi-dimensional categorical image database for machine learning algorithm training. J Korean Med Sci. 2018; 33: e239.
23. Kim H-E, Kim HH, Han B-K, et al. Changes in cancer detection and false-positive recall in mammography using artificial intelligence: a retrospective, multireader study. Lancet Digit Health. 2020; 2: e138–e148.
24. Freer PE. Mammographic breast density: impact on breast cancer risk and implications for screening. Radiographics. 2015; 35: 302–315.
25. Marmor MF, Kellner U, Lai TY, Lyons JS, Mieler WF. Revised recommendations on screening for chloroquine and hydroxychloroquine retinopathy. Ophthalmology. 2011; 118: 415–422.
26. Park J, Jang BG, Kim YW, et al. A prospective validation and observer performance study of a deep learning algorithm for pathologic diagnosis of gastric tumors in endoscopic biopsies. Clin Cancer Res. 2021; 27: 719–728.
27. Biscotti CV, Dawson AE, Dziura B, et al. Assisted primary screening using the automated ThinPrep Imaging System. Am J Clin Pathol. 2005; 123: 281–287.
28. Kim JR, Shim WH, Yoon HM, et al. Computerized bone age estimation using deep learning based program: evaluation of the accuracy and efficiency. Am J Roentgenol. 2017; 209: 1374–1380.