The effectiveness of the method is validated in two ways by ninefold cross-validation, as shown in
Table 2. First, we validate the algorithm on a three-class disease classification (the first-level nodes of the taxonomy), representing vascular diseases, OD diseases, and macular diseases. On this task the CNN achieves 85.2% ± 0.7% (mean ± SD) overall accuracy (the average of the individual inference class accuracies), whereas two comprehensive ophthalmologists achieve 78.56% and 76.3% accuracy on a subset of the validation set. Second, we validate the algorithm on an 18-class pathology classification (the second-level nodes of the tree). The CNN achieves 81.4% ± 1.7% overall accuracy, and the same two ophthalmologists achieve 75.2% and 73.2% accuracy, respectively.
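For concreteness, the "overall accuracy" reported above (the unweighted mean of the individual class accuracies) can be sketched as follows. This is a minimal illustration, not the paper's code; the function name and the toy labels are hypothetical.

```python
from collections import defaultdict

def overall_accuracy(y_true, y_pred):
    """Overall accuracy as the unweighted mean of per-class accuracies,
    i.e., the average of the individual inference class accuracies."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    per_class = [correct[c] / total[c] for c in total]
    return sum(per_class) / len(per_class)

# Hypothetical labels for the three first-level classes:
# 0 = vascular, 1 = OD, 2 = macular
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 0, 2, 2]
# Per-class accuracies are 1.0, 0.5, 1.0, so the mean is 0.8333.
print(round(overall_accuracy(y_true, y_pred), 4))  # → 0.8333
```

Note that this unweighted mean weights each class equally regardless of how many validation images it contains, which differs from plain image-level accuracy when the classes are imbalanced.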
Figure 3 shows a few example images, illustrating the difficulty of distinguishing between fundus diseases that share many similar visual features. The comparison metrics are sensitivity and specificity:
\begin{eqnarray}\begin{array}{@{}l@{}} sensitivity = \frac{{true\;positive}}{{positive}}\\ specificity = \frac{{true\;negative}}{{negative}} \end{array}\nonumber\end{eqnarray}
where “true positive” is the number of correctly predicted fundus diseases, “positive” is the total number of images with the disease (true-positive plus false-negative results), “true negative” is the number of correctly eliminated diseases, and “negative” is the total number of images without the disease (true-negative plus false-positive results). When an image is input into the CNN, the output is a probability
P for each of the 18 diseases. By setting a threshold probability
t, the image is assigned a predicted disease
\(\hat y\) whenever its probability satisfies
\(P \ge t\). The sensitivity and specificity can then be computed at each threshold; by varying
t over the interval 0–1, a curve of the CNN's sensitivities and specificities can be generated, as shown in
Figure 4. The area under the curve (AUC), whose maximum value is 1, measures the performance of the CNN. As shown by the AUC in
Figure 4a, the deep learning CNN exhibits reliable fundus disease classification. Each red point on the plots represents the sensitivity and specificity of a single ophthalmologist; because these points lie below the CNN's blue curve, the CNN's performance is superior to that of the two ophthalmologists. When tested on a larger dataset (macular diseases: 800 images; vascular diseases: 720 images; OD diseases: 808 images;
Fig. 4b), we found only tiny changes in the AUC compared with the smaller dataset, which shows that the classification performance remains robust and reliable on the larger dataset.
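The threshold sweep described above can be sketched as follows: for one disease (one-vs-rest), compute sensitivity and specificity at each threshold t in [0, 1], then integrate the resulting curve with the trapezoidal rule to obtain the AUC. This is a minimal illustration under the conventional definitions (sensitivity = TP / (TP + FN), specificity = TN / (TN + FP)); the labels and scores are made up, and in the paper the scores would be the CNN's per-disease probabilities.

```python
def sens_spec(y_true, scores, t):
    """Sensitivity and specificity at threshold t for one disease
    vs. the rest; assumes both classes are present in y_true."""
    tp = sum(s >= t and y == 1 for y, s in zip(y_true, scores))
    fn = sum(s < t and y == 1 for y, s in zip(y_true, scores))
    tn = sum(s < t and y == 0 for y, s in zip(y_true, scores))
    fp = sum(s >= t and y == 0 for y, s in zip(y_true, scores))
    return tp / (tp + fn), tn / (tn + fp)

def roc_auc(y_true, scores, steps=1000):
    """Sweep t over [0, 1], collect (1 - specificity, sensitivity)
    points, and integrate them with the trapezoidal rule."""
    points = []
    for i in range(steps + 1):
        t = i / steps
        sens, spec = sens_spec(y_true, scores, t)
        points.append((1.0 - spec, sens))
    points.sort()
    auc = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        auc += (x1 - x0) * (y0 + y1) / 2.0
    return auc

# Illustrative labels (1 = has the disease) and CNN-style scores:
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.4, 0.2, 0.1]
# One positive (0.3) scores below one negative (0.4), so 8 of the
# 9 positive/negative pairs are ranked correctly: AUC = 8/9.
print(round(roc_auc(y_true, scores), 4))  # → 0.8889
```

An AUC of 1 corresponds to a classifier whose curve passes through the top-left corner (perfect sensitivity and specificity at some threshold), which is why the text describes it as the maximum value.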