Abstract
Purpose:
Retinopathy of prematurity (ROP) is a leading cause of childhood blindness. An accurate and timely diagnosis of the early stages of ROP allows ophthalmologists to recommend appropriate treatment while blindness is still preventable. The purpose of this study was to develop an automatic deep convolutional neural network–based system that provided a diagnosis of stage I to III ROP with feature parameters.
Methods:
We developed three data sets containing 18,827 retinal images of preterm infants. These retinal images were obtained from the ophthalmology department of Jiaxing Maternal and Child Health Hospital in China. After segmenting images, we calculated the region of interest (ROI). We trained our system based on segmented ROI images from the training data set, tested the performance of the classifier on the test data set, and evaluated the widths of the demarcation lines or ridges extracted by the system, as well as the ratios of vascular proliferation within the ROI on a comparison data set.
Results:
The trained network achieved a sensitivity of 90.21% with 97.67% specificity for the diagnosis of stage I ROP, 92.75% sensitivity with 98.74% specificity for stage II ROP, and 91.84% sensitivity with 99.29% specificity for stage III ROP. When the system diagnosed normal images, the sensitivity and specificity reached 95.93% and 96.41%, respectively. The widths (in pixels) of the demarcation lines or ridges for stage I, stage II, and stage III ROP were 15.22 ± 1.06, 26.35 ± 1.36, and 30.75 ± 1.55, respectively. The ratios of vascular proliferation within the ROI were 1.40 ± 0.29, 1.54 ± 0.26, and 1.81 ± 0.33, respectively. All parameters were statistically different among the groups. When physicians integrated the quantitative parameters of the extracted features with their clinical diagnosis, the κ score was significantly improved.
Conclusions:
Our system achieved a high accuracy of diagnosis for stage I to III ROP. It used the quantitative analysis of the extracted features to assist physicians in providing classification decisions.
Translational Relevance:
The high performance of the system suggests potential applications in ancillary diagnosis of the early stages of ROP.
Retinopathy of prematurity (ROP) is a vasoproliferative disorder that occurs in premature infants with lower birth weights and shorter gestational ages.
1 This disease is a leading cause of childhood blindness. As the survival rate of preterm infants is increasing, the number of children with ROP is also increasing.
2 In the 1980s, the International Classification of Retinopathy of Prematurity was developed,
3 which was revised in 2005.
4 In 2021, the third edition was published.
According to the guidelines, the diagnosis of ROP involves three dimensions: stages I to V, zones I to III, and the presence of pre-plus or plus disease.
It is important to diagnose ROP accurately and in a timely manner, either by clinical fundus examination or by reading retinal images. However, because the classification guidelines provide only qualitative indications, the diagnostic result depends mainly on the ophthalmologists' subjective judgment.
6 In addition, diagnostic differences arise when different experts use different hardware in different regions. All of these factors lead to inconsistent diagnostic results for ROP.
To address this problem, many experts have developed semiautomated quantitative analysis tools to diagnose ROP more objectively. Their results include ROP Tool,
7 principal spanning forests algorithms,
8 computer-aided retinal image analysis,
and so on. However, these methods were not fully automatic: features and cut points had to be determined manually. In general, the output did not correlate well enough with clinical diagnoses to be widely used.10
Deep convolutional neural networks (DCNNs)
11 have shown great advantages in many medical image applications.
12–15 DCNNs provide a fully automated, end-to-end solution and do not need manual input, which is a huge advantage.
Plus disease, which has been studied by many experts, is an important feature in determining the need for treatment for ROP. In 2016, Worrall et al.
16 began to apply DCNNs to the diagnosis of plus disease in premature infants. Brown et al.
17 studied a completely automatic system, which was able to classify retinal images as normal, pre-plus disease, and plus disease with great accuracy.
In another direction, the diagnosis of early-stage ROP has also been researched. This is not only because the diagnosis of stages relies mainly on subjective interpretation,
6 but also because the diagnosis of stage I to III ROP is crucial,
4 as it allows doctors to recommend appropriate treatment while blindness is still preventable. In contrast, patients with stage IV to V ROP have already suffered irreversible retinal damage.
In 2018, Hu et al.
18 applied DCNNs in the diagnosis of stage I to III ROP. Mulay et al.
19 and Ding et al.
20 diagnosed the stages of ROP using segmented images based on DCNNs. This is not only because stage I to III ROP and normal retinas are distinguished by subtle differences in the existence, size, and shape of the demarcation line or ridge, as well as in vascular proliferation, but also because these features are well suited to extraction by DCNN-based segmentation.
DCNNs have been found to have improved performance in medical image fields.
21 However, they also have limitations in that the features on which DCNNs rely are not transparent or explainable.
22 This limitation presents a great challenge for the adoption of DCNNs, because medical accountability is important and errors may lead to serious legal consequences. An ideal system should provide not only objective results but also the reasons behind them. Many experts have tried to make DCNNs more explainable by combining them with traditional feature extraction. Similar work has been done by Mao et al.
23 and Yildiz et al.
24 in the diagnosis of plus disease.
In this study, we developed an automated DCNN-based system. Using segmented images, we trained a classifier to categorize images into four categories: normal, stage I ROP, stage II ROP, and stage III ROP. By evaluating the feature parameters extracted by our system, we showed significant differences among different categories. In addition, we showed the role of these parameters in improving the consistency of the manual diagnosis. To the best of our knowledge, this was the first attempt to quantitatively analyze the segmented features for diagnosis of early stage ROP.
All images of premature infants were collected from January 2018 to December 2020 at the Ophthalmology Department of Jiaxing Maternal and Child Health Hospital with the Retcam3 imaging system (Natus Medical, Inc., Pleasanton, CA, USA). These images were collected from five standard fields of view (posterior, nasal, temporal, superior, and inferior) and were 1600 × 1200 pixels in size.
After discarding low-quality images, we retained 18,827 retinal images collected from preterm infants with a gestational age of less than 37 weeks and a birth weight of less than 2000 g. We invited several experts (more than three) to remove low-quality images by consensus according to the following criteria:
- 1. More than 25% of the peripheral area of the retina is unobservable due to artifacts, including the presence of foreign bodies, out-of-focus imaging, blurring, and extreme light conditions.25
- 2. Insufficient focus of the image, with the blood vessels as the reference.
We constructed three data sets: a training data set to train the DCNNs, a test data set to test the performance of the network, and a comparison data set to compare the DCNN predictions with manual diagnoses. We assigned a reference diagnosis (stage I ROP, stage II ROP, stage III ROP, or normal) to each image in the training and test data sets. The reference diagnoses were determined by the consensus of three ROP experts and compared with the previous clinical diagnosis.
Table 1 describes the characteristics of the three data sets, with 14,626, 3680, and 521 retinal images originating from 2260, 567, and 73 different preterm infants, respectively. Multiple images of different standard view fields were acquired for each eye, which led to a significant increase in the number of images.
Table 1. Characteristics of the Three Data Sets
To train the vessel segmentation network, we selected 1825 retinal images (204 infants) from the training data set and split them into 1464 images (162 infants) for training and 361 (42 infants) for testing. An ophthalmologist annotated the retinal vessels using the brush annotation tool in dedicated annotation software. To train the segmentation network for the demarcation line or ridge, we selected 2738 retinal images (306 infants) from the training data set and split them into 2196 (243 infants) for training and 542 (63 infants) for testing. Using the same annotation software, an ophthalmologist drew a boundary polygon around the demarcation line or ridge and labeled the polygon regions.
The test data set consisted of 3680 images (567 infants), of which 2893 were normal (447 infants), 378 were stage I ROP (58 infants), 262 were stage II ROP (40 infants), and 147 were stage III ROP (22 infants). The timing of retinal screening, the follow-up intervals, and the time points for treatment of threshold and prethreshold ROP strictly followed the guidelines. Children were treated as soon as they developed threshold or prethreshold ROP. Therefore, there were no cases of stage IV or V ROP in this study.
All images in the comparison data set were collected from 73 infants. We extracted feature parameters to analyze the differences among the groups and the correlation between the features and the stages. We invited two ophthalmologists with different levels of experience to diagnose the images independently, in order to study how the system's quantitative analysis assists the clinical diagnosis of ROP stages.
We used the area under the curve (AUC) score under the receiver operating characteristic curve to measure the performance of the classifier during training. To avoid overfitting and underfitting, we divided the training data set into five parts, randomly selected four parts for training, and used the remaining part for testing. The cross-validations were repeated five times (fivefold cross-validation) to obtain AUC scores, and 95% confidence intervals were calculated using the formula of Hanley and McNeil.
35 We used the Scikit-Learn library tools (French Institute for Research in Computer Science and Automation, Rocquencourt, France) to calculate the AUC scores.
On the basis of the AUC scores, we selected the best configuration and conducted performance tests. To measure the performance of the classifier, we calculated the sensitivity and specificity of the results on the test data set.
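For illustration, the sketch below shows one way this evaluation could be reproduced in Python with scikit-learn: fivefold cross-validated AUC scores, a 95% confidence interval from the Hanley and McNeil standard error, and per-class (one-vs-rest) sensitivity and specificity. The function and variable names (model_factory, X, y) are ours for illustration; this is not the authors' code.

```python
# Minimal sketch (not the authors' code): fivefold cross-validated AUC,
# a Hanley-McNeil 95% confidence interval, and per-class sensitivity/specificity.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score, confusion_matrix

def hanley_mcneil_ci(auc, n_pos, n_neg, z=1.96):
    """95% CI for an AUC using the Hanley & McNeil (1982) standard error."""
    q1 = auc / (2 - auc)
    q2 = 2 * auc ** 2 / (1 + auc)
    se = np.sqrt((auc * (1 - auc)
                  + (n_pos - 1) * (q1 - auc ** 2)
                  + (n_neg - 1) * (q2 - auc ** 2)) / (n_pos * n_neg))
    return auc - z * se, auc + z * se

def sensitivity_specificity(y_true, y_pred):
    """Sensitivity and specificity for binary (one-vs-rest) labels."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return tp / (tp + fn), tn / (tn + fp)

def cross_validated_auc(model_factory, X, y, n_splits=5, seed=0):
    """Fivefold cross-validation; model_factory() builds a fresh classifier."""
    aucs = []
    splitter = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, val_idx in splitter.split(X, y):
        model = model_factory()
        model.fit(X[train_idx], y[train_idx])
        scores = model.predict_proba(X[val_idx])[:, 1]
        aucs.append(roc_auc_score(y[val_idx], scores))
    return float(np.mean(aucs))
```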
The guidelines
4 indicate that stage I to III ROP and normal retinas are distinguished by subtle differences in the existence, size, and shape of the demarcation line or ridge, as well as in vascular proliferation. Therefore, our system evaluated the widths of the demarcation line or ridge and the ratios of vessel proliferation within the ROI.
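The text does not give explicit formulas for these two parameters, so the following is only a plausible sketch of how they could be computed from binary segmentation masks: mean ridge width as mask area divided by skeleton (centerline) length, and the proliferation ratio as vessel density inside the ROI relative to the rest of the image. Both definitions are assumptions for illustration, not the system's documented implementation.

```python
# Illustrative only: one plausible way to derive the two feature parameters
# from binary segmentation masks. The exact definitions used by the system are
# not spelled out in the text, so these formulas are assumptions.
import numpy as np
from skimage.morphology import skeletonize

def mean_ridge_width(ridge_mask):
    """Approximate mean width (pixels) of the demarcation line/ridge:
    mask area divided by the length of its skeleton (centerline)."""
    skeleton = skeletonize(ridge_mask > 0)
    length = skeleton.sum()
    return float(ridge_mask.sum()) / length if length else 0.0

def vessel_ratio_in_roi(vessel_mask, roi_mask):
    """Vessel pixel density inside the ROI relative to the rest of the image
    (one possible reading of the 'ratio of vascular proliferation')."""
    inside = vessel_mask[roi_mask > 0].mean()
    outside = vessel_mask[roi_mask == 0].mean()
    return float(inside) / float(outside) if outside else float('inf')
```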
A one-way analysis of variance (ANOVA) was conducted on the feature parameters extracted from images of the comparison data set to assess the differences among the stage I to III groups. For the ANOVA, we performed a χ2 test and used the S-N-K and Duncan post hoc tests under the assumption of equal variances.
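A minimal sketch of the ANOVA step, using SciPy rather than SPSS, is shown below. Levene's test stands in here as a generic homogeneity-of-variance check (our choice, not necessarily the test the authors used), and the S-N-K and Duncan post hoc comparisons are not reproduced.

```python
# Minimal sketch of the one-way ANOVA step using SciPy rather than SPSS.
from scipy.stats import f_oneway, levene

def anova_by_stage(stage1_values, stage2_values, stage3_values):
    """One-way ANOVA across the stage I-III groups for one feature parameter."""
    # Generic homogeneity-of-variance check (prerequisite for equal-variance post hoc tests)
    _, p_equal_var = levene(stage1_values, stage2_values, stage3_values)
    f_stat, p_value = f_oneway(stage1_values, stage2_values, stage3_values)
    return f_stat, p_value, p_equal_var
```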
Due to the ordered, sequential nature of the stages, we conducted an ordered logistic regression on the extracted feature parameters. We performed a parallel lines test to confirm that the regression coefficients of the independent variables were constant across response categories. We also set 95% confidence intervals, the maximum number of step-halvings to 5, and the maximum number of iterations to 100. Finally, we chose the feature parameters of stage III as the reference.
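For readers who prefer code to settings dialogs, the sketch below shows an equivalent ordered logistic regression in Python with statsmodels. The column names and data frame layout are assumptions for illustration, and the SPSS-specific options (step-halving limit, iteration cap) are only approximated by the optimizer arguments.

```python
# Minimal sketch of an ordered (ordinal) logistic regression on the two
# feature parameters using statsmodels; not the authors' SPSS analysis.
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

def fit_ordinal_model(df):
    """df columns (assumed): 'width', 'vessel_ratio', and 'stage' coded 1, 2, 3."""
    stage = df['stage'].astype(pd.CategoricalDtype(categories=[1, 2, 3], ordered=True))
    model = OrderedModel(stage, df[['width', 'vessel_ratio']], distr='logit')
    result = model.fit(method='bfgs', maxiter=100, disp=False)
    # Odds ratios for the two predictors (the threshold parameters are excluded)
    odds_ratios = np.exp(result.params[['width', 'vessel_ratio']])
    return result, odds_ratios
```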
We also invited two ophthalmologists with different levels of experience to perform manual diagnoses on the comparison data set. We used the DCNN predictions and the manual diagnoses to calculate κ values and investigate the role of our system in improving the consistency of clinical diagnostic results.
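The agreement statistic itself is a standard Cohen's κ; a minimal sketch with scikit-learn follows, assuming both raters' labels are coded with the same four categories (normal and stages I to III).

```python
# Cohen's kappa between DCNN predictions and a physician's labels,
# assuming both label vectors use the same four category codes.
from sklearn.metrics import cohen_kappa_score

def agreement(dcnn_labels, physician_labels):
    """Unweighted Cohen's kappa; 1.0 indicates perfect agreement."""
    return cohen_kappa_score(dcnn_labels, physician_labels)
```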
All networks were implemented in TensorFlow 1.10 (Google LLC, Mountain View, CA, USA) and evaluated on a computer with an NVIDIA GeForce TITAN XP GPU (NVIDIA, Santa Clara, CA, USA). All statistical analyses were performed using SPSS Statistics 26.0 (IBM, Armonk, NY, USA).
On the basis of the segmented retinal images of the comparison data set, we evaluated the widths of the demarcation line or ridge and the ratios of vessel proliferation within the ROI. We performed a one-way ANOVA on feature data, and the results are shown in
Table 3.
Table 3. One-Way ANOVA for Feature Parameters
The results of the one-way ANOVA test are shown in
Table 3. The widths (in pixels) of the demarcation line or ridge for stage I, stage II, and stage III ROP were 15.22 ± 1.06, 26.35 ± 1.36, and 30.75 ± 1.55, respectively, while the ratios of vascular proliferation within the ROI were 1.40 ± 0.29, 1.54 ± 0.26, and 1.81 ± 0.33, respectively. All values were statistically different among the groups (
P < 0.001, 95% confidence interval). Note that the segmented images of normal retinas were entirely black (background) and were therefore not counted in
Table 3.
The mean parameters of the segmented features are shown in
Figure 5. All parameters increased significantly from stages I to III and reached the highest values in stage III.
We also performed an ordered logistic regression on the feature parameters. The parallel lines test met the requirements (P > 0.05), the model fit χ2 value was 429.112 (P < 0.001), and the Cox–Snell value was 0.882, which indicated that the regression model explained approximately 88.2% of the variance.
We used the feature parameters of stage III ROP as the reference to divide the model into two binary logistic models. The results are shown in
Table 4. We found that the classification of stage I to III ROP was related to the widths of the demarcation line or ridge and to the ratios of vessel proliferation within the ROI (
P < 0.001, 95% confidence interval), and images with larger widths were 10.89 times more likely to be classified as stage II or III. Meanwhile, images with larger vascular proliferation ratios were 45.02 times more likely to be classified as stage II or III.
Table 4. Ordered Logistic Regression for Feature Parameters
We invited two ophthalmologists, one with over 10 years of experience and the other with only 3 years of standardized training, to diagnose retinal images of the comparison data set.
In
Table 5, we calculated the κ score between the DCNN predictions and the diagnosis of the ophthalmologist with 10 years of experience (κ = 0.9425), which indicated near-perfect agreement. We also calculated scores for the ophthalmologist with only 3 years of training. When this ophthalmologist diagnosed the original retinal images (
Table 6), the score was 0.8385; when the images were presented together with the quantitative feature parameters, the score was 0.9268 (
Table 7). We found that ophthalmologists could use the quantitative segmentation features as a basis for their clinical diagnostic decisions, combining manual diagnosis with the quantitative parameters to improve the consistency of diagnostic results for the early stages of ROP.
Table 5. DCNN Predictions and Manual Diagnosis (with 10 Years of Experience)
Table 6. DCNN Predictions and Manual Diagnosis (with 3 Years of Training)
Table 7. DCNN Predictions and Manual Diagnosis (Images with Quantitative Parameters)
In this study, we developed an automatic diagnostic system based on DCNNs. We trained the system using segmented images within the ROI, which could provide a diagnosis of stage I to III ROP with extracted parameters. We also performed a quantitative analysis of these parameters.
Unlike the Mask R-CNN architecture used by Ding et al.,
20 we used two Retina U-Nets and a Dense Net. Retina U-Net combined the Retina Net,
37 a one-stage detector, and the structure of U-Net,
38 which preserves the location information in images well. We calculated the ROI to extract the features, and in this way we achieved data compression. Throughout the whole process, the images were always kept at their original size, instead of being resized to 299 × 299 pixels and randomly sliced before training, as Ding et al.
20 did. A study done by Kim et al.
39 showed that retinal appearance assessment based on the whole image provided a more accurate and reliable DCNN classification compared to quadrant-based assessment.
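To summarize the structure described above, the sketch below gives a high-level view of the inference pipeline as we understand it from the text: two segmentation networks (vessels and demarcation line/ridge), an ROI derived from the segmentations, and a DenseNet classifier applied to the ROI-masked image at its original resolution. The function names and the ROI rule are placeholders, not the authors' implementation.

```python
# High-level pipeline sketch; the concrete networks and the ROI rule are
# placeholders standing in for the system's two Retina U-Nets and Dense Net.
import numpy as np

def diagnose(image, vessel_net, ridge_net, classifier, roi_fn):
    """Return one of {'normal', 'stage I', 'stage II', 'stage III'}."""
    vessel_mask = vessel_net.predict(image)        # binary vessel segmentation
    ridge_mask = ridge_net.predict(image)          # demarcation line/ridge segmentation
    roi_mask = roi_fn(vessel_mask, ridge_mask)     # region of interest around the ridge
    roi_image = image * roi_mask[..., np.newaxis]  # keep the original 1600 x 1200 size
    probs = classifier.predict(roi_image)          # four-category classifier
    labels = ['normal', 'stage I', 'stage II', 'stage III']
    return labels[int(np.argmax(probs))]
```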
With the Dense Net as a classifier, we achieved an overall accuracy of 97.98% for all four categories of the test data set, with a κ score of 0.9425. In a similar work, Ding et al.
20 obtained an overall accuracy of 67%. Of course, we cannot simply compare the performance metrics because of the different data sets.
More importantly, we not only provided an automatic classifier based on DCNNs but also performed a quantitative analysis of the extracted feature parameters. The statistical analysis of the widths of the demarcation lines or ridges and the ratios of vascular proliferation within the ROI showed that all quantitative parameters increased significantly across the groups. This may help make the DCNN predictions more explainable.
The results of the ordered logistic regression for these parameters showed that the ratios of vascular proliferation within the ROI had a greater odds ratio (45.015 vs. 10.892), which suggests that the ratios played a greater role in the diagnosis of stages II and III. Second, the quantitative parameters of the ratios in
Figure 5 showed a smaller difference between stages I and II, suggesting that the system relied more on the quantitative width parameters when diagnosing stage I, which explains why we obtained only 90.21% sensitivity for stage I. Finally, the logistic model explained only 88.2% of the variance, which indicates that the DCNNs learned more features than the two extracted ones. In the future, the dissection and visualization of the features learned by DCNNs will be an interesting direction.
We also performed comparative tests on the comparison data set. When physicians integrated the quantitative parameters of the extracted features with their clinical diagnosis, the κ score improved from 0.8385 to 0.9268. This suggests that our study may help alleviate the current shortage of hospital ophthalmologists.40
The input images were a potential limitation of our system. We used only retinal images of sufficient quality, which were acquired at Jiaxing Maternal and Child Health Hospital using Retcam3 from preterm infants with lower birth weights and shorter gestational ages. The Retcam3 imaging system is expensive, so many hospitals use alternative devices such as PanoCam (Visunex Medical Systems, Suzhou, Jiangsu, China), and the different devices introduce differences in images. Second, during newborn screening, ROP has been detected in many heavier and full-term infants, so extracting features from images of preterm infants alone may not be sufficient. Finally, selecting only images of sufficient quality may not reflect real clinical conditions, and features present in suboptimal images may be lost.
The authors thank Jia Liu at Jiaxing Maternal and Child Health Hospital, who provided the retinal images of the preterm infants used in this study. We also thank doctors in ophthalmology who provided much help, including manual diagnosis.
Supported by the Jiaxing Science and Technology Project, “Exploration of Eye Disease Screening Model for Infants and Children Aged 0–3 Years and Its Application in Primary Eye Care Work” (2019AD32156), and the Zhejiang Province Medical and Health System Science and Technology Project, “Exploration of Eye Disease Screening Model for Infants and Children Aged 0–3 Years and Its Application in Primary Eye Care Work” (2020KY965).
Disclosure: P. Li, None; J. Liu, None