In the present work, a deep transfer learning approach with the Inception-v3 network was presented to automatically identify the level of DR from retinal fundus photographs with high accuracy, sensitivity, and specificity. By fine-tuning the weights of an Inception-v3 network pretrained on the ImageNet dataset, this approach avoided the need for a large number of example images for model convergence and achieved performance matching or exceeding that of retinal specialists in detecting DR images.26,27 The results attained indicated that our approach could provide more consistent predictions and highly reliable detection without having to specify lesion-based features, and that it could serve as an automated screening tool for early DR using retinal fundus images, in addition to assisting ophthalmologists in making referral decisions.
Retinal fundus image interpretation is often subjective and liable to significant inter- and intraobserver variability, even among experienced ophthalmologists. Considering these limitations, automated DR detection methods would be of enormous value. The proposed approach for DR detection offered consistency of interpretation for a given image. Its performance derived directly from the training data and the grading decisions of human experts, without the need to model the underlying disease process of DR. In addition, when performing large-scale screening for DR, improving sensitivity and specificity is critical for minimizing misdiagnosed cases. Our approach offered good sensitivity and specificity, while also providing near-instantaneous reporting of results. In this study, 93.49% (95% CI, 93.13%–93.85%) accuracy, 96.93% (95% CI, 96.35%–97.51%) sensitivity, and 93.45% (95% CI, 93.12%–93.79%) specificity were achieved, while the AUC reached 0.9905 (95% CI, 0.9887–0.9923), representing comparable or slightly better performance than previous studies.17,18,28 Moreover, perhaps the most significant merit of our approach was that it simultaneously predicted five levels of DR with improved performance compared to previous studies in which only four DR grades were considered,29 making it suitable for more timely and reliable detection of DR.
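The paper does not state how the 95% CIs were computed; a common choice is a percentile bootstrap over the test cases. The sketch below, in Python with scikit-learn (an assumption; the paper's tooling is not specified), shows how accuracy, sensitivity, specificity, AUC, and bootstrap CIs could be obtained from hypothetical arrays y_true (reference labels), y_pred (thresholded predictions), and y_score (predicted probabilities).

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

def dr_metrics(y_true, y_pred, y_score):
    """Accuracy, sensitivity, and specificity from a binary confusion
    matrix, plus AUC from the continuous scores."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # true-positive rate
        "specificity": tn / (tn + fp),   # true-negative rate
        "auc": roc_auc_score(y_true, y_score),
    }

def bootstrap_ci(y_true, y_pred, y_score, key, n_boot=2000, seed=0):
    """Percentile 95% CI by resampling test cases with replacement
    (assumes each resample contains both classes)."""
    rng = np.random.default_rng(seed)
    n, stats = len(y_true), []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        stats.append(dr_metrics(y_true[idx], y_pred[idx], y_score[idx])[key])
    return np.percentile(stats, [2.5, 97.5])
```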
Most artificial intelligence studies using retinal fundus images concentrated on explicit handcrafted feature engineering involving the computation and extraction of complex features,30 which was time-consuming, required considerable skill and professional knowledge for annotating the imaging data, and easily resulted in misclassification owing to minor errors in the handcrafted features. In contrast, the key advantage of our approach was that it automatically learned richer and more distinctive image features from the retinal fundus image data to achieve more accurate identification, without manual feature extraction or feature optimization. This autonomous behavior presents a potential opportunity for capturing subtle characteristics or patterns of DR in clinical settings that may not be identified by retinal experts. Additionally, the approach developed in this study did not require any specialized or advanced computer equipment to classify fundus photographs; it could be deployed on standard low-cost computing equipment to offer reproducible evaluation of DR images in patients with suspected DR.
In our system, we transferred the pretrained Inception-v3 model and adjusted the last fully connected layer to five output categories corresponding exactly to our multiclass identification task, instead of the 1000 output categories of ImageNet. Subsequently, the weights of the pretrained Inception-v3 model were loaded into the transferred model, while the weight parameters of the last fully connected layer were randomly initialized. During fine-tuning, the strategies for hyperparameter setting and searching differed from those used when training a model from scratch. First, the initial learning rate was set much lower than 0.1 so that the optimization algorithm could train the model well during fine-tuning. Moreover, given the limited training data, there was no need to update all weight parameters of the model. The most effective way to fine-tune the pretrained weight parameters of the CNN model was to adjust only those parameters in the fully connected layer most relevant to the specific fundus photograph classification task, while fixing the weight parameters of the convolutional layers and the corresponding pooling layers. In our system, the fine-tuning process was performed for 50,000 steps using the SGD optimizer with a batch size of 100. The learning rate was initially set to 0.001 and then decreased linearly to 0.0001 over 150 epochs of training. The categorical cross-entropy loss function was utilized, and the weight decay and momentum were set to 0.0005 and 0.95, respectively. During retraining, we attempted to further fine-tune the frozen layers by unfreezing them and adjusting the corresponding pretrained weight parameters on the developed DR photograph dataset using back propagation until the performance on the validation dataset could no longer be improved. Once the optimal learned weights were determined, the working procedure of our system was in accordance with that of a conventional CNN.
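As an illustration, the setup described above could be expressed as follows in tf.keras (an assumed framework; the paper does not name its implementation). The hyperparameters are those reported in the text; the dataset objects and steps-per-epoch value are placeholders.

```python
import tensorflow as tf
from tensorflow.keras import layers, models, optimizers
from tensorflow.keras.applications import InceptionV3

NUM_CLASSES = 5  # no apparent DR, mild/moderate/severe NPDR, PDR

# Load Inception-v3 pretrained on ImageNet, dropping its 1000-way head,
# and freeze the convolutional and pooling layers.
base = InceptionV3(weights="imagenet", include_top=False, pooling="avg",
                   input_shape=(299, 299, 3))
base.trainable = False

# Replace the last fully connected layer with a randomly initialized
# five-way softmax matching the DR grades.
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(base.output)
model = models.Model(base.input, outputs)

# Linear decay of the learning rate from 0.001 to 0.0001, as reported;
# the steps-per-epoch value here is a placeholder.
lr = tf.keras.optimizers.schedules.PolynomialDecay(
    initial_learning_rate=1e-3,
    decay_steps=150 * 333,   # 150 epochs x assumed steps per epoch
    end_learning_rate=1e-4,
    power=1.0,               # a power of 1.0 gives linear decay
)

# SGD with momentum 0.95 and categorical cross entropy, as reported.
# The 0.0005 weight decay would be added via kernel regularizers or a
# decoupled-weight-decay optimizer; it is omitted here for brevity.
model.compile(optimizer=optimizers.SGD(learning_rate=lr, momentum=0.95),
              loss="categorical_crossentropy", metrics=["accuracy"])

# model.fit(train_ds, validation_data=val_ds, epochs=150, ...)
# After the new head converges, selected frozen layers can be unfrozen
# (base.trainable = True) and retrained at a low learning rate until
# validation performance stops improving.
```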
Limitations of the transfer learning approach must be considered. First, although the transfer learning approach achieved satisfactory results in the detection of DR and decreased the training time needed for model convergence on a relatively small training dataset, it exhibited slightly inferior classification power compared with a model trained from scratch on a huge training dataset. This was mainly because the weight parameters of a model trained from scratch could be directly updated and optimized for DR feature identification. Second, when the source and target domains have little relevance to each other, transfer learning may lead to a decline in performance. In addition, the performance of the developed model was determined to a large extent by the weight parameters of the pretrained Inception-v3 model, which could be further improved by pretraining on a larger ImageNet dataset. Also, when attempting to fine-tune the network by unfreezing and updating pretrained weight parameters on the developed DR image dataset with the back propagation method, overfitting was prone to occur, resulting in a decline in model performance. Nevertheless, transfer learning accelerated the training of the model, reduced memory complexity, and yielded high classification accuracy among no apparent DR, mild NPDR, moderate NPDR, severe NPDR, and PDR on a relatively small DR photograph dataset. The deep transfer learning approach with the Inception-v3 network could accurately capture the features of DR images; as a result, relatively high performance in automated DR identification could be achieved. Unfortunately, it could be extremely expensive or unfeasible to collect a large number of DR images as underlying datasets with a gold standard qualified by ophthalmologists. Even if such images could be collected, training a deep CNN from scratch would require extensive memory, substantial computational resources, and several weeks to update the model's numerous parameters so that it converged to a high accuracy. In contrast, a multiclass holdout model trained with the deep transfer learning approach could save memory, reduce computational resources, and take only approximately 2 hours to finish training, validation, and testing on the corresponding datasets. We also independently trained four constituent binary classifiers, each identifying mild NPDR, moderate NPDR, severe NPDR, or PDR versus no apparent DR, as well as the limited model. Each binary classifier and the limited model showed excellent performance and could also attain relatively high accuracy in about 1 hour. Thus, initializing models with the deep transfer learning approach should be regarded as a critical method when training a CNN for a new task, especially with limited data.
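For reference, one of the constituent binary classifiers described above could be configured as in the following sketch, which reuses the frozen Inception-v3 backbone from the earlier example and swaps the five-way softmax for a single sigmoid unit; the early-stopping callback reflects the criterion of halting when validation performance stops improving and guards against the overfitting noted above. Framework and names are assumptions, as before.

```python
import tensorflow as tf
from tensorflow.keras import layers, models, optimizers, callbacks
from tensorflow.keras.applications import InceptionV3

# Same frozen Inception-v3 backbone as the five-class model.
base = InceptionV3(weights="imagenet", include_top=False, pooling="avg",
                   input_shape=(299, 299, 3))
base.trainable = False

# A single sigmoid unit replaces the five-way softmax head, e.g. for
# severe NPDR versus no apparent DR.
out = layers.Dense(1, activation="sigmoid")(base.output)
binary_model = models.Model(base.input, out)
binary_model.compile(
    optimizer=optimizers.SGD(learning_rate=1e-3, momentum=0.95),
    loss="binary_crossentropy",
    metrics=["accuracy"],
)

# Stop when validation loss no longer improves, keeping the best weights.
stop = callbacks.EarlyStopping(monitor="val_loss", patience=5,
                               restore_best_weights=True)
# binary_model.fit(train_ds, validation_data=val_ds, epochs=150,
#                  callbacks=[stop])
```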
Nevertheless, there are also several limitations to our approach in the current study. First, we selected the retinal fundus images from only two hospitals. Varying device settings, camera systems, and population characteristics affect DR images and hence the model's performance. To further evaluate our approach, we need to collect retinal fundus image data from more hospitals and use larger patient cohorts in future studies. Second, since deep learning models are often referred to as black boxes, it was difficult to know how the algorithm analyzed features and made predictions for DR images. An objective interpretation would be particularly useful in difficult and ambiguous cases. Therefore, visualization of the model's decision-making process needs to be studied further; visualizing model decisions could potentially aid both patients and physicians in real-time clinical verification. Third, our approach learned features based only on the fundus images and their associated grades, rather than on explicit, defined features. It is therefore possible that the algorithm used features ignored by humans to predict classification results. In subsequent studies, we need to gain insight into how the deep neural network analyzes patterns and makes image-wise predictions.
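As a pointer for the visualization work proposed above, Grad-CAM is one widely used technique for highlighting the image regions that drive a prediction. It is not part of this study; the sketch below is only a minimal illustration for a tf.keras Inception-v3 model like the earlier ones (layer and variable names are assumptions).

```python
import numpy as np
import tensorflow as tf

def grad_cam(model, image, conv_layer_name="mixed10", class_index=None):
    """Coarse heatmap of the regions driving a prediction.
    `mixed10` is the last convolutional block of Keras's Inception-v3."""
    grad_model = tf.keras.models.Model(
        model.input,
        [model.get_layer(conv_layer_name).output, model.output],
    )
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[None, ...])  # add batch axis
        if class_index is None:
            class_index = int(tf.argmax(preds[0]))      # predicted grade
        score = preds[:, class_index]
    grads = tape.gradient(score, conv_out)
    # Channel weights: global average of the gradients.
    weights = tf.reduce_mean(grads, axis=(1, 2))
    cam = tf.reduce_sum(conv_out * weights[:, None, None, :], axis=-1)
    cam = tf.nn.relu(cam)[0]
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()  # scale to [0, 1]
```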