Abstract
Purpose:
To evaluate the clinical usefulness of a deep learning–based detection device for multiple abnormal findings on retinal fundus photographs for readers with varying expertise.
Methods:
Fourteen ophthalmologists (six residents, eight specialists) assessed 399 fundus images with respect to 12 major ophthalmologic findings, with or without the assistance of a deep learning algorithm, in two separate reading sessions. Sensitivity, specificity, and reading time per image were compared.
Results:
With algorithmic assistance, readers significantly improved in sensitivity for all 12 findings (P < 0.05) but tended to be less specific (P < 0.05) for hemorrhage, drusen, membrane, and vascular abnormality, more profoundly so in residents. Sensitivity without algorithmic assistance was significantly lower in residents (23.1%–75.8%) than in specialists (55.1%–97.1%) for nine findings, but it improved to similar levels with algorithmic assistance (67.8%–99.4% in residents, 83.2%–99.5% in specialists), with only hemorrhage remaining statistically significantly lower. Variances in sensitivity were significantly reduced for all findings. Reading time per image decreased in images with fewer than three findings per image, more profoundly in residents. When simulated based on images acquired from a health screening center, average reading time was estimated to be reduced by 25.9% (from 16.4 seconds to 12.1 seconds per image) for residents, and by 2.0% (from 9.6 seconds to 9.4 seconds) for specialists.
Conclusions:
Deep learning–based computer-assisted detection devices increase sensitivity, reduce inter-reader variance in sensitivity, and reduce reading time in less complicated images.
Translational Relevance:
This study evaluated how algorithmic assistance in detecting abnormal findings on retinal fundus photographs influences clinicians' readings, which may help predict its effect in clinical application.
In assessing macula-centered retinal fundus images with respect to 12 major ophthalmologic abnormal findings, readers gained sensitivity with significant margins without compromising specificity significantly with the assistance of algorithmic predictions including lesion contours. Reading time decreased in cases with fewer than three findings per image, but increased in cases with three or more findings. Mixed-effects regression analyses indicated that usage of algorithmic assistance was the most significant variable affecting sensitivity and specificity, and the number of findings per image was the most significant variable affecting reading time.
In a previous reader study, algorithmic assistance contributed to the improvement of sensitivity in detecting referable diabetic retinopathy, as unassisted readers preferred to assign a lower severity level for ambiguous cases [12]. Our study also showed the tendency of unassisted readers toward highly specific operating points. With algorithmic assistance, readers were able to detect pathological abnormalities that are otherwise difficult to identify, and false negatives were reduced. For example, identifying retinal nerve fiber layer defects, which can be missed without careful attention, showed large improvements in sensitivity with algorithmic assistance. In other reader studies with mammography, assisted readers detected cancer more sensitively in dense breasts; unassisted readers had difficulty in differentiating between benign tissue and cancer, but the algorithm discriminated between the two with higher accuracy [23,24]. As sensitivity was increased, readers managed to filter out false-positive algorithm predictions, effectively maintaining specificity. This maintenance of specificity in our study suggests that the significant sensitivity gains in the assisted readers did not originate from simply altering operating points. The assisted readers achieved significantly higher specificity than the algorithm in isolation by rejecting false-positive cases when the algorithm was incorrect, which may justify maintaining high-sensitivity operating points for the algorithm, as even resident readers still in training could effectively rule out false-positive predictions. High-sensitivity operating points also coincide with the requirement of rigorous sensitivity standards in fundus screening examination [3,25].
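To make the notion of an operating point concrete, the following minimal sketch shows how moving a decision threshold on an algorithm's output score trades sensitivity against specificity. The scores, labels, thresholds, and the function name `sens_spec` are hypothetical and chosen only for illustration; they are not the algorithm or data evaluated in this study.

```python
def sens_spec(scores, labels, threshold):
    """Sensitivity and specificity when scores >= threshold are called positive."""
    pairs = list(zip(scores, labels))
    tp = sum(s >= threshold and y == 1 for s, y in pairs)
    fn = sum(s < threshold and y == 1 for s, y in pairs)
    tn = sum(s < threshold and y == 0 for s, y in pairs)
    fp = sum(s >= threshold and y == 0 for s, y in pairs)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical algorithm scores and reference labels (1 = finding present).
# These values are illustrative only and are not study data.
scores = [0.95, 0.80, 0.60, 0.35, 0.70, 0.40, 0.20, 0.10, 0.05, 0.30]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# A low threshold gives a sensitive operating point, a high one a specific point.
sensitive_point = sens_spec(scores, labels, threshold=0.30)
specific_point = sens_spec(scores, labels, threshold=0.75)

print("sensitive operating point (sens, spec):", sensitive_point)
print("specific operating point  (sens, spec):", specific_point)
```

In this toy setting, the low threshold catches every positive case at the cost of more false positives, which is the situation in which a reader who rejects incorrect predictions can restore specificity.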
Inter-reader variability in readings of medical images is a well-known issue [16–20]. In our study, variance in sensitivity among readers was significantly reduced with algorithmic assistance, resulting in more consistent results among readers. In a previous study, inexperienced readers showed higher inter-reader variability when evaluating the severity of diabetic retinopathy [21]. Similarly, in our study, residents tended to show greater variance in sensitivity and specificity without algorithmic assistance, but with algorithmic assistance the variance was reduced to levels below those of specialists without assistance. More intriguing are the patterns indicating that the sensitivity of the assisted readers converges to the sensitivity of the algorithm. This implies that algorithmic assistance may help establish consistent diagnostic standards with the desired sensitivity level.
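As an illustration of the quantity discussed here, the sketch below computes per-reader sensitivity for a single finding and the sample variance of that sensitivity across readers, with and without assistance. The counts are hypothetical and the helper names are ours. The sketch computes only the descriptive variance and does not reproduce the statistical test used in the study to compare variances.

```python
from statistics import mean, variance


def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of reference-positive images that a reader marked positive."""
    return true_pos / (true_pos + false_neg)


def summarize(counts):
    """Return mean sensitivity and sample variance across readers."""
    values = [sensitivity(tp, fn) for tp, fn in counts]
    return mean(values), variance(values)


# Hypothetical (true positive, false negative) counts for one finding,
# one tuple per reader. These are NOT data from the study.
unassisted = [(12, 28), (20, 20), (30, 10), (36, 4)]
assisted = [(34, 6), (35, 5), (37, 3), (38, 2)]

mean_u, var_u = summarize(unassisted)
mean_a, var_a = summarize(assisted)

print(f"unassisted: mean={mean_u:.3f}, variance={var_u:.4f}")
print(f"assisted:   mean={mean_a:.3f}, variance={var_a:.4f}")
```

A smaller variance in the assisted condition corresponds to readers clustering around a common sensitivity level, which is the pattern described above.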
There was a strong negative linear relationship between the improvement in reading time and the number of findings predicted by the algorithm. Assisted readers took more time than unassisted readers in cases with more than two predicted findings, suggesting that the assisted readers only saved time in assessing relatively healthy fundus images. This was more pronounced in residents than in specialists, with residents requiring more time at a steeper rate as the number of predicted findings given by the algorithm increased, suggesting greater reliance on the algorithmic predictions. Based on qualitative analyses of the readers' usage of the reading tool (Supplementary Video), readers tended to turn contours on and off and hover over individual findings more often in complicated images with more predicted findings. From this observation, we could hypothesize that better user interface (UI) and user experience (UX) designs may help reduce this increase in reading time with more given findings. Assisted readers took less time in cases with fewer than three predicted findings, indicating a potential role of algorithmic assistance in evaluating fundus images for screening purposes. Given that the majority of cases in screening centers are taken from asymptomatic healthy people, readers will be able to reduce reading time with algorithmic assistance, as determined in our simulated scenario, with a decrease as great as 25.9% for less experienced readers. Increased sensitivity and reduced total reading time may provide economic benefits in clinics, as longer reading times mean more work for physicians and greater expenses for patients, and overlooked pathologies could have catastrophic results for both patients and physicians. Several studies in pathology align with our results in that the average assisted reader reduced their reading time by roughly half in datasets with predominantly benign cases [26,27]. In bone age estimation, deep learning–assisted readers were reported to have reduced their reading time by up to 40% [28]. With higher sensitivity settings, reliance on negative predictions of the algorithm can be enhanced, which may further reduce the total reading time in screening center settings. Future studies may focus on observing the impact of various operating points (e.g., sensitive, specific, balanced) on correctness, inter-reader variability, and reading time and could identify the optimal settings for various clinical needs.
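The logic of such a screening simulation can be sketched as a weighted average: the mean reading time at each number of predicted findings is weighted by the share of screening images with that number of findings. The shares and per-count reading times below are hypothetical placeholders, not values from this study, and the function names are ours. The sketch is meant only to show why a case mix dominated by healthy images favors assisted reading even when complex images take longer.

```python
def expected_reading_time(time_by_count, share_by_count):
    """Average seconds per image, weighting each findings count by its share."""
    total_share = sum(share_by_count.values())
    weighted = sum(time_by_count[k] * share_by_count[k] for k in share_by_count)
    return weighted / total_share


def relative_reduction(before, after):
    """Fractional decrease from `before` to `after`."""
    return (before - after) / before


# Hypothetical values, NOT taken from the study.
# Share of screening images with 0, 1, 2, and 3 or more predicted findings:
share = {0: 0.70, 1: 0.15, 2: 0.10, 3: 0.05}
# Mean seconds per image at each findings count:
unassisted = {0: 12.0, 1: 16.0, 2: 20.0, 3: 24.0}
assisted = {0: 7.0, 1: 13.0, 2: 19.0, 3: 30.0}

t_unassisted = expected_reading_time(unassisted, share)
t_assisted = expected_reading_time(assisted, share)

print(f"unassisted: {t_unassisted:.2f} s per image")
print(f"assisted:   {t_assisted:.2f} s per image")
print(f"reduction:  {relative_reduction(t_unassisted, t_assisted):.1%}")
```

Applying `relative_reduction` to the rounded means reported in the abstract gives about 26.2% for residents (16.4 to 12.1 seconds) and about 2.1% for specialists (9.6 to 9.4 seconds), close to the reported 25.9% and 2.0%. The small differences are presumably due to rounding of the published means.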
The results of the mixed-effects regression analyses agree with the other results in demonstrating that usage of algorithmic assistance, the most significant and influential variable, increases sensitivity (positive coefficients) and decreases specificity (negative coefficients). Complex cases with a greater number of positive findings based on reference standards negatively affected specificity and increased reading time, even when considering usage of algorithmic assistance as a separate variable. Increased numbers of positive findings may have caused the readers to include surplus findings, resulting in decreased specificity and increased time to assess each finding, regardless of whether algorithmic assistance was used. Individual conditions of real clinical settings where computer-assisted diagnosis devices could be used should be considered thoroughly to optimize their usefulness.
A major shortcoming of this study is that the usage of algorithmic assistance could not be masked from the readers, which may have significantly affected the results of the study. The algorithm itself has limitations in that only a limited number of abnormal findings were included, and lesions of various severity were all grouped into one category. For example, microaneurysms were not discernible from dot hemorrhages with fundus images only; they were grouped into hemorrhage, and various levels and forms of hemorrhage were all included in one category. Other limitations include the limited variety of images with regard to ethnicity and diseases included in the study, and the number of readers included in the study may not have been sufficient for the detection of smaller differences among the groups. The study was conducted on a previously established dataset with high image quality, which may differ from actual clinical settings and may limit direct application of the results in real-world settings. Also, this study aimed to evaluate how algorithmic assistance affects clinicians' performance in evaluating only fundus photographs, thus excluding clinical information other than age and gender of the patient. For this reason, the positive influence of computer-guided detection may have been overestimated in the study setting compared to real-life clinical settings, where various other information aiding clinical decisions is available. Cost analysis was also not performed in this study. Larger prospective studies using algorithmic assistance in real-world clinical settings should be conducted in the future to evaluate its actual effectiveness.
This study showed that readers could benefit from the assistance of deep learning with regard to sensitivity without compromising specificity when evaluating multiple findings in retinal fundus images. Also, assisted readers could spend less reading time on healthy fundus images, and less experienced readers could save even more time and gain greater sensitivity with algorithmic assistance. Our observations in a simulated scenario, though preliminary, suggest that deep learning assistance could economize fundus examinations in screening centers by reducing reading time, the number of false negatives, and variances in reading performance between readers.
Disclosure: J.Y. Shin, None; J. Son, VUNO Inc. (E); S.T. Kong, VUNO Inc. (E); J. Park, VUNO Inc. (E); B. Park, VUNO Inc. (E); K.H. Park, None; K.-H. Jung, VUNO Inc. (F); S.J. Park, VUNO Inc. (F)