Abstract
Purpose:
To introduce a new technique that improves deep learning (DL) models for automatic grading of diabetic retinopathy (DR) from retinal fundus images by enhancing the consistency of their predictions.
Methods:
A convolutional neural network (CNN) was optimized in three different ways to predict DR grade from eye fundus images. The optimization criteria were (1) the standard cross-entropy (CE) loss; (2) CE supplemented with label smoothing (LS), a regularization approach widely employed in computer vision tasks; and (3) our proposed non-uniform label smoothing (N-ULS), a modification of LS that models the underlying structure of expert annotations.
Results:
Performance was measured in terms of the quadratic-weighted κ score (quad-κ) and the average area under the receiver operating characteristic curve (AUROC), as well as with metrics suitable for analyzing diagnostic consistency: weighted precision, recall, and F1 score, and the Matthews correlation coefficient (MCC). While LS generally harmed the performance of the CNN, N-ULS statistically significantly improved performance with respect to CE in terms of quad-κ score (73.17 vs. 77.69, P < 0.025), without any performance decrease in average AUROC. N-ULS achieved this while simultaneously increasing performance on all other analyzed metrics.
Conclusions:
For extending standard modeling approaches from DR detection to the more complex task of DR grading, it is essential to consider the underlying structure of expert annotations. The approach introduced in this article can be easily implemented in conjunction with deep neural networks to increase their consistency without sacrificing per-class performance.
Translational Relevance:
A straightforward modification of current standard training practices of CNNs can substantially improve consistency in DR grading, better modeling expert annotations and human variability.
Despite the accuracy benefits provided by label smoothing in a wide array of computer vision tasks, this technique is not universally suitable for every problem. In particular, the smoothing scheme should ideally depend on any underlying structure present in the data annotations. While previous works have mainly applied label smoothing to annotations that do not contain such structure, in the case of DR grading a conceptual distance between disease stages exists. Consequently, we propose to modify the standard label smoothing regularization technique by simply replacing the one-hot encoded annotation y_k by a Gaussian distribution centered at y_k, with a decay factor (standard deviation) σ selected such that 95% of the probability mass still falls within the neighboring grades. Mathematically, this is described as
\begin{eqnarray*}
y_k = G_{k,\sigma}(y_k).
\end{eqnarray*}
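As an illustrative sketch of this construction (the exact σ and discretization used in our implementation may differ), the smoothed target for a given grade can be built as follows:

```python
import numpy as np

NUM_GRADES = 5  # DR grades 0..4

def n_uls_target(grade, sigma=0.765, num_grades=NUM_GRADES):
    """Replace a one-hot DR grade annotation by a discretized Gaussian.

    sigma here is an assumed value, picked so that roughly 95% of the
    probability mass falls on the annotated grade and its immediate
    neighbors. Weights are normalized against an untruncated Gaussian,
    so corner grades (0 and 4) keep some "missing probability".
    """
    grades = np.arange(num_grades)
    weights = np.exp(-0.5 * ((grades - grade) / sigma) ** 2)
    # Mass an untruncated Gaussian would place on all integer grades:
    full_mass = np.exp(-0.5 * ((np.arange(-50, 51) - grade) / sigma) ** 2).sum()
    return weights / full_mass
```

For an interior grade such as 2, the resulting target sums to approximately 1; for the corner grades it sums to less than 1, matching the unnormalized variant discussed below.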
A graphical representation of the proposed regularization scheme is displayed in Figure 1 (top right). We refer to this modified technique as non-uniform label smoothing (N-ULS) in the remainder of this article.
It should be noted that the “degree of truth” remaining in each label after smoothing is kept constant across all grades for both the uniform label smoothing (ULS) and N-ULS approaches. There is, however, some “missing probability” at the corner grades (grades 0 and 4) in the N-ULS case; this could easily be handled by renormalizing the probability mass so that it sums to 1 for these two grades. However, renormalization would also induce a somewhat asymmetric behavior of N-ULS compared with ULS, as it would place a greater “degree of truth” on these particular grades. A decision must necessarily be made here, and in this article we opt for the unnormalized implementation, a choice also supported by our preliminary experimental analysis (we did not observe any noticeable performance difference between the unnormalized and renormalized strategies).
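A minimal sketch of the two options for a corner grade, under an assumed σ (grades run 0 to 4, and the annotation is grade 0):

```python
import numpy as np

sigma = 0.765  # assumed value for illustration
grades = np.arange(5)
# Gaussian weights centered at the corner grade 0
weights = np.exp(-0.5 * ((grades - 0) / sigma) ** 2)
# Mass an untruncated Gaussian would place on all integer grades
full_mass = np.exp(-0.5 * (np.arange(-50, 51) / sigma) ** 2).sum()

unnormalized = weights / full_mass      # keeps "missing probability" at the corner
renormalized = weights / weights.sum()  # mass forced to sum to 1
```

The unnormalized target sums to less than 1, while the renormalized one places correspondingly more “degree of truth” on grade 0, which is the asymmetry discussed above.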
By implementing N-ULS, we expect to bias the learning of a DR grading CNN toward a model that, when mistaken, produces more consistent errors. This is because N-ULS more suitably reflects interobserver disagreements: two human graders who differ in their opinion are far more likely to do so by neighboring grades than by distant ones. N-ULS thereby introduces new information into the optimization process: when the CNN observes a new data point with its associated annotation, it must also learn a notion of the underlying DR grading structure.
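To sketch how such smoothed targets enter the optimization, note that the CE loss readily generalizes to soft labels (a NumPy illustration, not our training code; the target values below are hypothetical):

```python
import numpy as np

def soft_cross_entropy(logits, target):
    """Cross-entropy between a softmax over `logits` and a soft label.

    With a one-hot `target` this reduces to the standard CE loss; with
    an N-ULS target, mistakes on neighboring grades are penalized less
    than mistakes on distant ones.
    """
    logits = logits - logits.max()                     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())  # log-softmax
    return -(target * log_probs).sum()

# A Gaussian-like target centered on grade 2 (illustrative values):
target = np.array([0.03, 0.22, 0.50, 0.22, 0.03])
near = soft_cross_entropy(np.array([0., 5., 0., 0., 0.]), target)  # peaked on grade 1
far = soft_cross_entropy(np.array([0., 0., 0., 0., 5.]), target)   # peaked on grade 4
```

A prediction peaked on a neighboring grade incurs a lower loss than one peaked on a distant grade, which is precisely the bias toward consistent errors described above.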
The above analysis leads to several conclusions. First, the results reported in Table 2 for the ResNet50 case demonstrate that introducing standard LS in the training of this network generally harms performance compared with using conventional CE alone. This was also verified in a separate statistical test (not included for brevity), in which we found that using the CE loss in this case resulted in an increase of 1.91 percentage points in the κ score (P = 0.046) with respect to using LS. This was not the case for the N-ULS strategy introduced in this article: for all considered metrics, either the performance was significantly increased or the performance decrease was not statistically significant. Second, and more importantly, the quadratic κ score was substantially higher for the N-ULS approach, which confirms our hypothesis that error consistency can be improved by means of a simple domain-specific label smoothing strategy. In both cases, the quadratic-weighted κ score was statistically significantly better when optimizing the network with the N-ULS technique than with the other two approaches, verifying the validity of our findings.
Our performance analysis on other metrics for the ResNet50 case clearly demonstrates the benefits of implementing N-ULS over training a CNN with the ordinary CE loss. We also observe for these metrics that LS actually harms performance when compared to CE, but N-ULS recovers much of this performance loss, even rising slightly above CE results. Remarkably, while similar performance levels are obtained by CE and N-ULS in terms of average AUROC, precision, and recall, a model trained with N-ULS significantly outperforms the standard CE version in terms of correlation measurements like F1 score or MCC, aside from the greater quadratic-weighted κ scores observed above.
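For reference, the quadratic-weighted κ on which this consistency analysis rests can be sketched in a few lines (a minimal NumPy version; scikit-learn's `cohen_kappa_score` with `weights='quadratic'` computes the same quantity):

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, num_grades=5):
    """Quadratic-weighted Cohen's kappa for ordinal DR grades.

    Disagreements are penalized by the squared distance between grades,
    so neighboring-grade errors cost far less than distant ones: the
    notion of consistency that N-ULS is designed to improve.
    """
    O = np.zeros((num_grades, num_grades))
    for t, p in zip(y_true, y_pred):
        O[t, p] += 1                                # observed confusion matrix
    E = np.outer(O.sum(1), O.sum(0)) / O.sum()      # expected under chance agreement
    i, j = np.indices((num_grades, num_grades))
    W = (i - j) ** 2 / (num_grades - 1) ** 2        # quadratic penalty weights
    return 1 - (W * O).sum() / (W * E).sum()
```

Under this metric, a model whose mistakes land on neighboring grades scores markedly higher than one making the same number of mistakes on distant grades, which is why N-ULS improves quad-κ without needing to improve raw per-class accuracy.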
Repeating the above analysis with the ResNet101 CNN yields further interesting observations. First, overall performance when using N-ULS is maintained for every considered metric. Second, the performance of the CE loss without any regularization is considerably degraded in terms of quadratic κ, dropping from 0.732 to 0.719. In contrast, the quadratic κ attained by LS increased from 0.712 to 0.745. Separate statistical testing of LS versus CE in this case showed an increase in quadratic κ score of 2.54 percentage points (P = 0.019). A similar trend can be observed in all the other performance metrics. In general, training a more powerful architecture comes with a greater risk of overfitting, and in this case LS seems successful in reducing this phenomenon, which degrades the performance of the same network trained with standard CE. In any case, N-ULS remains equal or superior to both approaches in every considered performance metric, hinting at its usefulness as a regularization technique regardless of CNN complexity.
The reported experimental results demonstrate that the CNN regularization method introduced in this article is useful in the context of DR grading. One of the major challenges in extending conventional deep learning–based approaches from DR detection or screening (binary problems) to DR grading (a multiclass scenario) lies in ensuring that the underlying structure of expert annotations is well captured by the network. The approach introduced in this article is a straightforward step toward this goal. As a secondary benefit, N-ULS helps combat data imbalance (one or several classes having disproportionately fewer training samples than the others), a typical obstacle in DR grading (see Table 1). It does so by attaching extra information to each example: the smoothed label corresponding to an image annotated with a particular DR grade conveys not only its own grade but also which grades are its neighbors.
It is worth mentioning that N-ULS might also be useful for handling the disproportionate difficulty of correctly classifying fundus images corresponding to the DR1 class. The typically low performance in this category across all existing techniques is explained by the fact that symptoms of mild DR involve the presence of a few microaneurysms, which are subtle, easily confused with other visual artifacts, and hard to find even for human experts.16 Since algorithms are trained on data sets annotated by human experts, annotations inherit such ambiguity, which is particularly high in this grade of the disease. This is another motivation for the proposed technique: since formulating perfect predictions is not possible even for experts, it might be more useful to at least ensure that a model formulates reasonable predictions, in line with the error consistency improvements of N-ULS.
N-ULS can be incorporated into existing methodologies that employ standard CE loss functions in order to reflect the ordinal structure of expert annotations more appropriately. It is important to remark that the N-ULS regularization scheme is independent of the CNN architecture and could be equally useful in other grading problems, such as diabetic macular edema prediction. Future work will involve extending and validating this technique on other disease grading problems.
Disclosure: A. Galdran, None; J. Chelbi, Diagnos INC (E); R. Kobi, Diagnos INC (E); J. Dolz, None; H. Lombaert, None; I. ben Ayed, None; H. Chakor, Diagnos INC (E)