In this work, we have shown that a deep learning segmentation model can be used for detection of retinal abnormalities associated with DR, achieving similar or better recall and F1 scores for several abnormality types compared to a retinal expert. The output of the segmentation model can in turn be leveraged to improve the performance of a classification network for full scale grading of DR in six-field retinal images. The increased performance was likely due to the segmentation model's ability to detect microvascular abnormalities that otherwise suffer from diminishing feature resolution when images are downscaled prior to development of the classification model.
The ability to recognize and accurately detect microvascular features, such as MA and HEM, is important, as these lesions present in early stages of DR and indicate the risk of progression to more severe levels of disease.1 The segmentation model demonstrated higher recall for detection of individual abnormalities of both types at similar levels of precision compared to a retinal expert. The segmentation model also closely matched the expert in its ability to identify images containing any of these abnormalities, suggesting that segmentation models alone may be used as a tool for identifying patients at risk of progressing to more severe stages of disease.
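For reference, the lesion level detection metrics referred to throughout this discussion follow their standard definitions in terms of true positives (TP), false positives (FP), and false negatives (FN):

\[
\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.
\]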
IRMA and NV indicate more severe DR. The segmentation model was able to identify more IRMA changes than the retinal expert, with higher F1 scores, although at lower levels of precision. Because IRMA alone can indicate level 3 DR, the segmentation model may also assist in detecting more severe levels of disease. Compared to the retinal expert, the segmentation model struggled to accurately detect NV, with both lower recall and lower precision. Identifying NV is crucial, as these may result in acute loss of vision due to vitreous hemorrhages. NV was the abnormality type with the fewest examples in the segmentation data, with only 157 instances in the training set. This is far fewer than the number of MA and PC, with 11,024 and 7,148 training examples, respectively. Improving performance for this abnormality could then simply be a matter of collecting more data; this process, however, is very cumbersome, as it involves annotating individual pixels.
Generally, precision was lower for the segmentation model than for the expert, both for image level detection and for detection of individual abnormalities. The low model precision could in some cases be attributed to confusion between abnormalities with similar characteristics. This was also the case for the second expert but was more pronounced for the model. For both the model and the expert, MA were often confused with HEM and IRMA, and HEM were likewise confused with IRMA and MA, and for the model in some cases also with NV. For the model, IRMA was confused with NV more often than other abnormalities were, indicating a similarity between these two types. This can perhaps be appreciated by looking at the example in Figure 4. As seen in the supplementary material, precision increased significantly for both the expert and the model when all abnormalities were treated as a single class, that is, when the task was formulated as a binary segmentation/detection problem.
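A minimal sketch of this binary reformulation, assuming the segmentation model outputs a per-pixel class index map (the index scheme below is hypothetical):

```python
import numpy as np

# Illustrative per-pixel class map from the segmentation model; the index
# scheme (0 = background, 1..7 = MA, HEM, IRMA, NV, HE, CWS, PC) is an
# assumption and may differ from the actual model output.
pred = np.random.randint(0, 8, size=(1024, 1024), dtype=np.uint8)

# Collapsing all abnormality classes into one "abnormal" class reformulates
# the task as binary segmentation: confusions between similar lesion types
# (e.g., MA vs. HEM, IRMA vs. NV) no longer count as errors, which is why
# precision rises for both model and expert under this formulation.
binary_pred = (pred > 0).astype(np.uint8)
```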
Highly sensitive models can lead to many false positives, which may be problematic. It could be argued, however, that slightly oversensitive models are not necessarily problematic for detection of retinal abnormalities. As it stands, most decisions regarding treatment of DR are made by humans. Few machines are given full autonomy when it comes to diagnosing DR, and most deep learning models developed for automatic retinal image analysis will therefore operate as clinical decision support tools. As opposed to classification algorithms, segmentation models yield semantically meaningful information directly interpretable by humans, and predictions from overly sensitive models can quickly be verified or dismissed. For image level detection, the segmentation model raised a false alarm in 10 of the 46 test images in the segmentation data set. That is, in 10 images the model detected abnormalities where none were present according to the expert reference. Of these 10 images, 5 were incorrectly predicted to contain MA, 1 CWS, 2 HE, and 2 PC. Neither HE nor CWS alone indicates DR, but both may be used as indicators of other types of diabetic eye disease, for example, diabetic macular edema. Of the seven images that were incorrectly predicted by the model to contain IRMA, all contained at least MA; six also contained HEM, and five additionally contained HE according to the reference. Similarly, 11 images were incorrectly predicted to contain NV, but 8 of these had been annotated with IRMA, HEM, MA, HE, CWS, and PC by the expert annotator. No images without any abnormalities were incorrectly predicted to contain either IRMA or NV by the model.
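Image level detections follow from the segmentation output; the exact rule is not restated here, but a minimal any-pixel sketch could look as follows (the class ordering is an assumption):

```python
import numpy as np

CLASS_NAMES = ["MA", "HEM", "IRMA", "NV", "HE", "CWS", "PC"]  # assumed order

def image_level_detections(pred):
    """Flag an image as containing an abnormality type if the model predicted
    at least one pixel of that class. A minimum-area threshold could be added
    to suppress single-pixel noise and reduce false alarms."""
    return {name: bool(np.any(pred == idx))
            for idx, name in enumerate(CLASS_NAMES, start=1)}

pred = np.zeros((1024, 1024), dtype=np.uint8)
pred[500:503, 600:603] = 1                   # a small predicted MA cluster
print(image_level_detections(pred))          # {'MA': True, 'HEM': False, ...}
```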
Although the image level precision for detecting photocoagulation scar tissue was low, this may not be cause for much concern. In most cases, clinicians will have access to patient health records in which it is documented whether patients have received prior treatment. As such, this marker is not the most vital for clinical decision support. On the other hand, NV are indicative of DR requiring treatment, and it is therefore problematic that the model had lower recall than the expert for image level detection, along with lower precision. As only seven images with NV were present in the test data, the conclusions drawn from these results have to be considered with some degree of uncertainty.
Segmentation and detection of retinal abnormalities can be leveraged for automatic full scale DR disease staging. Presegmentation likely helps minimize the adverse effect of diminishing feature resolution caused by downscaling images prior to development of grading models. The segmentation mask makes it easier for the grading model to recognize relevant features, as these are more visible in the color-coded segmentation masks. Intuitively, from the point of view of a grading network, recognizing identically colored pixels indicative of specific abnormalities, for example, cyan, magenta, and green for IRMA, HEM, and MA, respectively, is a much more reasonable task than interpreting raw pixel values that are affected by pigmentation and image artifacts, such as illumination. This is exemplified in Figure 7, where the feature resolution of microvascular changes in the form of MA is shown in the original resolution image and compared with downsampled images with and without presegmented abnormalities. The problem of reduced feature resolution is also discussed by both Sahlsten et al.12 and Krause et al.13 In both studies, increased input image resolution during model development led to improved full scale grading performance.
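As an illustration of how such a color-coded mask might be constructed, a minimal sketch follows; the class index scheme and the colors for abnormality types other than MA, HEM, and IRMA are assumptions:

```python
import numpy as np

# Hypothetical class-to-color table; only the cyan/magenta/green assignments
# for IRMA/HEM/MA are mentioned in the text, the rest would follow the same
# pattern with other saturated colors.
CLASS_COLORS = {
    1: (0, 255, 0),        # MA   -> green
    2: (255, 0, 255),      # HEM  -> magenta
    3: (0, 255, 255),      # IRMA -> cyan
}

def colorize_mask(pred):
    """Map per-pixel class indices to fixed, saturated RGB colors. Unlike raw
    fundus pixels, these colors are invariant to pigmentation and illumination,
    which is what makes the mask easy for the grading network to exploit."""
    rgb = np.zeros(pred.shape + (3,), dtype=np.uint8)
    for idx, color in CLASS_COLORS.items():
        rgb[pred == idx] = color
    return rgb
```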
Individual grades on the ICDR scale are, in some cases, defined by specific lesions and microvascular abnormalities. In the case of ICDR levels 1 and 3, MA, HEM, and IRMA may be used as indicators, and by including all of these abnormalities in the segmentation data set, the classification model was more likely to take them into consideration when leveraging the outputs from the segmentation model. As again illustrated in Figure 7, MA and IRMA are especially sensitive to the adverse effects of downsampling. The chart in Figure 6 shows how presegmentation of abnormalities leads to improved grading accuracy for these two levels in particular.
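A back-of-the-envelope example of this sensitivity (the image width and lesion extent below are assumed for illustration and are not measured from the study data):

```python
# A microaneurysm spanning ~10 px in a 2592-px-wide fundus photograph
# shrinks to a couple of pixels when the image is downscaled to the
# input sizes typical of classification networks.
orig_width, lesion_px = 2592, 10
for input_width in (2592, 1024, 512, 299):
    scale = input_width / orig_width
    print(f"{input_width:>4}-px input: MA spans ~{lesion_px * scale:.1f} px")
```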
The general idea of leveraging segmentation of retinal abnormalities for improved disease staging is analogous to the method by De Fauw et al.34 for diagnosis of retinal disease in optical coherence tomography images and to that by Ling et al.35 for grading DR across multiple levels in retinal images.
As illustrated by the examples in Figures 3, 4, and 5, the segmentation masks created from model predictions were not fully accurate. In some cases, segmentation errors were caused by noise or artifacts in the images, such as underexposure. Hence, it was beneficial to include the raw features as well, rather than relying solely on the segmentation mask, when developing the grading model. In some cases, the segmentation model incorrectly detected PC or NV in images that otherwise contained no abnormalities or only mild levels of pathology; had the classification model relied solely on the segmentation mask, this could have caused it to incorrectly classify such images as level 4 based on the presence of these abnormalities. We believe that including the raw image features enabled the classification model to reason about the general makeup of an image and take into account the image artifacts that may have caused the segmentation model to fail.
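One plausible way to feed both sources to the grading network is a channel-wise stack, sketched below; the exact fusion mechanism is not restated here, so this should be read as an assumption rather than the method used:

```python
import numpy as np

def build_grading_input(image_rgb, mask_rgb):
    """Stack the raw photograph and the color-coded segmentation mask
    channel-wise into a single 6-channel input. The raw channels let the
    grading network reason about artifacts, such as underexposure, that
    may have caused the segmentation model to fail."""
    assert image_rgb.shape == mask_rgb.shape, "expect matching H x W x 3 arrays"
    return np.concatenate([image_rgb, mask_rgb], axis=-1)  # H x W x 6
```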
Using the segmentation masks in the grading pipeline also helped to decrease the opaqueness of model predictions by providing semantically meaningful information on what features the model considered when grading images. When comparing the images in Figure 8, it can be seen that the Grad-CAM heatmap from the network trained on presegmented images is more focused on the area where the segmentation model had identified microaneurysms, which are defining of level 1 DR, whereas the corresponding heatmap for the model trained on raw features is more spread out. In the same way, it can be seen in Figure 9 that the model trained on presegmented images was seemingly able to use the IRMA changes detected by the segmentation model to correctly identify the image as representing DR level 3. In comparison, the heatmap from the model trained on raw image features reveals that this network has more or less ignored the regions with IRMA changes, likely causing it to misclassify the image as level 2.
Increased model interpretability is likely going to be a factor in implementing computer assisted diagnostic tools in clinical practice in the future. The proverbial black box nature of convolutional neural networks may serve as a barrier in this regard. Methods such as Grad-CAM aim to resolve this issue by using internal model representations of the image to compute the features most relevant for predictions. Although this method has worked very well for images of a more general nature, for example, pictures of animals or everyday objects such as vehicles and household items,33 it does not yield the same degree of meaningful information when dealing with the high-resolution retinal images used in this study. This is again illustrated by the example in Figure 8, where the heatmaps for both the raw feature image and the presegmented image are very coarse. Were it not for the presegmented MA, neither image would provide much useful insight into the model's decision making. The shortcomings of the method in the context of medical imaging likely relate to the combination of high-resolution images and more or less microscopic disease markers. To get a good indication of important image features, the internal representation is taken from the deep layers of the network, where the spatial resolution, that is, the height and width of the feature maps, is much smaller than the input resolution of the network. When the information from this layer is projected back onto the input image, granularity is lost, resulting in these types of coarse heatmaps. Thus, the presegmentation approach not only helps to improve grading accuracy but also significantly increases model interpretability.
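This loss of granularity can be reproduced in a few lines; the layer and input sizes below are assumed for illustration:

```python
import numpy as np

# The class-activation map has the spatial resolution of a deep layer,
# e.g., 16 x 16, and is simply upsampled back to the input size, so each
# activation value covers a large uniform block of input pixels.
cam = np.random.rand(16, 16)                      # stand-in activation map
block = 512 // cam.shape[0]                       # 32 x 32 input px per cell
heatmap = np.kron(cam, np.ones((block, block)))   # nearest-neighbor upsample
print(heatmap.shape)   # (512, 512), but only 16 x 16 distinct values remain
```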
When using the segmentation masks in the classification pipeline, there was a risk that the grading model would become overly reliant on these and perhaps ignore other relevant features. We attempted to avoid this issue by including most of the known retinal abnormalities in DR in the segmentation data set, including HE and CWS, and not only those defined in the ICDR scale as indicators of different DR disease levels. Currently, disease staging is based on definitions made by human experts. Although these definitions are built on years of cumulative knowledge by many experts, they still may not be perfect in regard to accurately estimating the risk of disease progression or blindness. We imagine that deep learning models may be used for constructing better risk stratification models in the future, and it was therefore a priority to include as much information as possible in the data set to allow models to take this into account.
Making pixel level annotations of abnormalities is an enormously straining and tiresome task, not least in the case of DR, as the abnormalities are mostly microscopic and hard to discern, even for domain experts. When comparing one expert against another, or a model against an expert as in this study, there is a risk that the results have been influenced by annotator fatigue. Based on the high level of agreement and consistency demonstrated in our previous study on the agreement between the two experts,29 it is our view that fatigue has not affected the results presented here.
In this study, full scale grading of DR in six-field retinal images from Danish patients has been demonstrated for the first time, with results comparable to those demonstrated for two-field retinal images by Sahlsten et al.12 and for single field images by Krause et al.13 The average per class accuracy for the five levels of the ICDR scale was 60.2% in Sahlsten et al.12 and 72.6% in Krause et al.13 Sahlsten et al.12 also report a multiclass macro area under the curve of 0.96. Quadratic weighted kappa is reported by both Sahlsten et al.12 and Krause et al.,13 with values of 0.91 and 0.84, respectively. In comparison, the method described in this study using presegmented images achieved an average per class accuracy of 70.4%, a macro area under the curve of 0.92, and a quadratic weighted kappa of 0.90.
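For reference, the quadratic weighted kappa used in these comparisons is conventionally defined over the confusion matrix of predicted versus reference grades (here, N = 5 ICDR levels):

\[
\kappa_w = 1 - \frac{\sum_{i,j} w_{ij}\, O_{ij}}{\sum_{i,j} w_{ij}\, E_{ij}}, \qquad w_{ij} = \frac{(i - j)^2}{(N - 1)^2},
\]

where \(O_{ij}\) counts images with reference grade \(i\) and predicted grade \(j\), and \(E_{ij}\) is the corresponding count expected by chance from the marginal grade distributions. The quadratic weights penalize disagreements more heavily the further apart the two grades are.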
In this study, the classification models were developed for grading DR across all levels of disease in the ICDR scale. Deep learning models such as that by Gulshan et al.,5 which performs binary classification of nonreferable or referable DR, can serve as tools for reducing the strain on healthcare systems by referring only patients with moderate or worse DR to consultations with retinal experts. Automatically grading disease across all levels may hold additional value in regard to reducing healthcare expenditure. Although some countries perform regular screenings of patients regardless of their level of disease, the screening system in Denmark assigns individualized screening intervals based on, among other things, the specific ICDR disease level.36 In this setup, the difference in screening interval between level 2 DR and level 3 DR could be as long as 21 months, and, in either case, the patient is not deemed to be in immediate need of medical attention.
Although comparisons are made in this study between the segmentation model and a human expert for detection of retinal abnormalities, this is not the case for full scale grading of DR. At the time of writing, the image grades in the classification data set have been assigned on the basis of electronic health records, and the data set has not been subject to further adjudication by retinal experts. The importance of adjudication and expert validation of data sets has been discussed by Gulshan et al.5 and Krause et al.,13 and this is something that will need to be addressed in the future. Comparisons between a CNN and retinal specialists for full scale grading of DR are made by Krause et al.,13 where the quadratic weighted kappa values for human experts ranged from 0.80 to 0.91. As such, the method presented here could be argued to perform at the level of human experts.
The results presented in this study suggest that segmentation models can serve as an additional tool for clinical decision support and automated grading of DR. By virtue of the unique segmentation data set presented here, along with adjudication of the classification data, it should be possible to develop more effective models in the future.