Open Access
Artificial Intelligence  |   May 2025
Grading of Foveal Hypoplasia Using Deep Learning on Retinal Fundus Images
Author Affiliations & Notes
  • Tsung-Ying Tsai
    Department of Ophthalmology, Chang Gung Memorial Hospital, Taoyuan, Taiwan
  • Ying-Feng Chang
    Artificial Intelligence Research Center, Chang Gung University, Taoyuan, Taiwan
    Department of Gastroenterology and Hepatology, New Taipei Municipal Tu Cheng Hospital (Built and Operated by Chang Gung Medical Foundation), New Taipei City, Taiwan
  • Eugene Yu-Chuan Kang
    Department of Ophthalmology, Chang Gung Memorial Hospital, Taoyuan, Taiwan
    College of Medicine, Chang Gung University, Taoyuan, Taiwan
  • Kuan-Jen Chen
    Department of Ophthalmology, Chang Gung Memorial Hospital, Taoyuan, Taiwan
    College of Medicine, Chang Gung University, Taoyuan, Taiwan
  • Nan-Kai Wang
    Department of Ophthalmology, Chang Gung Memorial Hospital, Taoyuan, Taiwan
    College of Medicine, Chang Gung University, Taoyuan, Taiwan
  • Laura Liu
    Department of Ophthalmology, Chang Gung Memorial Hospital, Taoyuan, Taiwan
    College of Medicine, Chang Gung University, Taoyuan, Taiwan
  • Yih-Shiou Hwang
    Department of Ophthalmology, Chang Gung Memorial Hospital, Taoyuan, Taiwan
    College of Medicine, Chang Gung University, Taoyuan, Taiwan
  • Chi-Chun Lai
    Department of Ophthalmology, Chang Gung Memorial Hospital, Taoyuan, Taiwan
    College of Medicine, Chang Gung University, Taoyuan, Taiwan
  • Sin-You Chen
    Artificial Intelligence Research Center, Chang Gung University, Taoyuan, Taiwan
  • Jenhui Chen
    Department of Computer Science and Information Engineering, Chang Gung University, Taiwan
    Division of Breast Surgery and General Surgery, Department of Surgery, Chang Gung Memorial Hospital, Taiwan
    Department of Electronic Engineering, Ming Chi University of Technology, Taiwan
  • Chao-Sung Lai
    Department of Electronic Engineering, Chang Gung University, Taoyuan, Taiwan
    Department of Nephrology, Chang Gung Memorial Hospital, Taiwan
    Department of Materials Engineering, Ming Chi University of Technology, Taiwan
  • Wei-Chi Wu
    Department of Ophthalmology, Chang Gung Memorial Hospital, Taoyuan, Taiwan
    College of Medicine, Chang Gung University, Taoyuan, Taiwan
  • Correspondence: Wei-Chi Wu, Department of Ophthalmology, Chang Gung Memorial Hospital, Linkou, Taiwan, No. 5, Fuxing St., Kweishan Dist., Taoyuan City, Taiwan 333423, Republic of China. e-mail: [email protected] 
  • Chao-Sung Lai, Department of Electronics Engineering, College of Engineering, Chang Gung University, No. 259, Wenhua 1st Rd., Guishan Dist., Taoyuan City, 333323 Taiwan, Republic of China. e-mail: [email protected] 
Translational Vision Science & Technology May 2025, Vol.14, 18. https://doi.org/10.1167/tvst.14.5.18
Abstract

Purpose: This study aimed to develop and evaluate a deep learning model for grading foveal hypoplasia using retinal fundus images.

Methods: This retrospective study included patients with foveal developmental disorders, using color fundus images and optical coherence tomography scans taken between January 1, 2001, and August 31, 2021. In total, 605 retinal fundus images were obtained from 303 patients (male, 55.1%; female, 44.9%). After augmentation, the training, validation, and testing data sets comprised 1229, 527, and 179 images, respectively. A deep learning model was developed for binary classification (normal vs. abnormal foveal development) and six-grade classification of foveal hypoplasia. The model's performance was compared with that of senior and junior clinicians.

Results: Higher grades of foveal hypoplasia were associated with worse visual outcomes (P < 0.001). The binary classification achieved a best testing accuracy of 84.36% using the EfficientNet_b1 model, with 84.51% sensitivity and 84.26% specificity. The six-grade classification achieved a best testing accuracy of 78.21% with the same model. The model achieved an area under the receiver operating characteristic curve (AUROC) of 0.9441 and an area under the precision-recall curve (AUPRC) of 0.9654 (both P < 0.0001) in the validation set and an AUROC of 0.8777 and an AUPRC of 0.8327 (both P < 0.0001) in the testing set. Compared with junior and senior clinicians, the EfficientNet_b1 model exhibited superior performance in both binary and six-grade classification (both P < 0.00001).

Conclusions: The deep learning model in this study proved more efficient and accurate than assessments by junior and senior clinicians for identifying foveal developmental diseases in retinal fundus images. With the aid of the model, we were able to accurately assess patients with foveal developmental disorders.

Translational Relevance: This study underscores the importance of a pediatric deep learning system to support clinical evaluation, particularly in cases reliant on retinal fundus images.

Introduction
Foveal development, occurring from fetal week 25 to 45 months after birth, involves the following processes: (1) centrifugal displacement of inner retinal cells toward the periphery, (2) centripetal migration of cone photoreceptors toward the location of the incipient fovea, and (3) specialization of the foveolar cones.1,2 Disruption of any of these processes leads to foveal hypoplasia (FH), which occurs in conditions such as albinism,3 prematurity,4 achromatopsia,5,6 PAX6 mutations,7 AHR mutations,8 SLC38A8 mutations,9–12 and other forms of inherited retinal dystrophy.
The Leicester Grading System describes four grades of FH, along with an atypical grade showing degenerative changes caused by photoreceptor apoptosis.11 Grading FH severity is crucial for prognosis and for predicting visual acuity,11,13,14 especially in young patients with nystagmus.14 A recent multicenter observational study characterized the phenotypic and genotypic spectrum of FH and found genotypic correlations with foveal development and its consequences for vision.13 A spectrum of FH has been identified in albinism and PAX6 variants, whereas SLC38A8 and AHR variants were consistently associated with high grades of FH. FRMD7 variants were associated with grade 1 FH or normal foveal morphology, which had significantly better median visual acuity.13 We summarize the phenotypes and genotypes reported to date in Supplementary Table S1.
Optical coherence tomography (OCT) transformed retinal diagnosis by providing a prompt and noninvasive imaging method,15,16 but it requires expertise to ensure precise grading and to distinguish normal foveal development from the subtle defects of atypical FH. Identifying foveal abnormalities is critical, particularly in nonspecialist centers without access to OCT. Retinal fundus images have become vital tools for evaluating patients' eyes, not only in the general population but also in pediatric patients, who are less cooperative, and in progressive diseases requiring frequent follow-up.
Deep learning (DL) is a subset of artificial intelligence (AI) focused on algorithms inspired by the structure and function of the brain, called artificial neural networks. A convolutional neural network (CNN) is a class of deep neural network highly effective for analyzing visual imagery. CNNs are structured in layers, each of which recognizes different hierarchical features of the input images. The initial layers focus on low-level features such as edges and textures, while the deeper layers identify more complex and abstract features such as shapes and specific pathological structures.17,18 This hierarchical feature extraction enables a comprehensive understanding of the images, facilitating the nuanced detection and classification of pathological features without explicit manual feature extraction by human experts. DL is widely used in medical imaging, including ophthalmology, covering fundus photography, visual fields, and adult OCT.19–26 Fundus images have been used to identify eye diseases23,26 and their systemic connections.24,25 Recent studies show DL's high accuracy in distinguishing sight-threatening papilledema from pseudopapilledema in fundus photography, outperforming clinicians in diagnosis.20 DL with region guidance showed reliable performance for detecting multiple findings in retinal fundus images, rivaling human experts, especially for hemorrhages, hard exudates, membranes, macular holes, myelinated nerve fibers, and glaucomatous disc changes.27
We therefore aimed to develop a system to grade FH in retinal fundus images by implementing a multilayered CNN model to learn and accurately detect the relevant features. We evaluated our model's performance and compared it with human grading to assess its effectiveness. Our goal was to enhance grading accuracy and consistency, reduce the subjectivity associated with human grading, and provide a scalable solution for widespread clinical use.
Material and Methods
Study Population
We retrospectively included patients with foveal developmental disorders who underwent retinal fundus imaging and optical coherence tomography (OCT) at any time between January 1, 2001, and August 31, 2021, at Chang Gung Memorial Hospital (CGMH), Linkou Medical Center, Taoyuan, Taiwan. The retinal fundus images were captured with fundus cameras (Digital Non-Mydriatic Retinal Camera; Canon, Inc., Tokyo, Japan). The OCT data were acquired with an OCT device (Carl Zeiss Meditec, Jena, Germany) over the study period (2001-2021) during follow-up. This study was approved by the CGMH Institutional Review Board (CGMH IRB No. 202300036B0); the requirement for informed consent was waived because patient data were anonymized. The study was conducted in accordance with the Declaration of Helsinki.
Inclusion and Exclusion Criteria
We included all diagnoses known to be associated with FH (including albinism, aniridia, retinopathy of prematurity, incontinentia pigmenti, achromatopsia, optic nerve hypoplasia, familial exudative vitreoretinopathy, and persistent fetal vasculature syndrome) and did not subcategorize by genetic diagnosis owing to the rarity of these conditions.11 Previous literature suggests that a critical element in determining prognosis is the retinal structure, irrespective of genetic subcategory.11,14 Demographic data, including age (unrestricted) and sex, were retrieved from the electronic medical records system. Baseline fundus images and OCT scans were acquired. Exclusion criteria were poor-quality scans or images in which the fovea was ungradable, as determined by experts other than the clinician graders before grading was conducted, and data from participants below the age of one year. This age cutoff was based on previous literature, because an immature outer retinal structure and incomplete foveal specialization preclude reliable grading.12,28–30 All clinical information, along with fundus images, was analyzed in an anonymized and deidentified manner. Both clinicians and scientists were blinded to the data.
Definition of Foveal Development Abnormalities
The FH grade was determined according to the FH grading system,11 which is based on the unique developmental processes occurring at the fovea. A four-grade system is used to classify typical FH. Grade 1 is characterized by continuation of the inner retinal layers, the presence of a foveal pit, outer segment lengthening, and outer nuclear layer widening. Grade 2 has all features of grade 1 except that there is no foveal pit. Grade 3 has all features of grade 2 except outer segment lengthening. Grade 4, the most advanced form of FH in which none of the aforementioned processes occur, shares all grade 3 features except outer nuclear layer widening; it may therefore mimic the appearance of the peripheral retina. Beyond this typical grading, atypical FH is characterized by photoreceptor degradation with disruption of the inner segment ellipsoid. The grading was confirmed by a retinal specialist other than the clinical graders (WCW) and carefully cross-checked by another co-author retinal specialist (EYK).
Definition of the Best-Corrected Visual Acuity
The best-corrected visual acuity (BCVA) was measured by any available method (with appropriate refractive correction, using a Snellen E, Landolt C, or other Snellen chart) at 6 m, and the result was converted to logMAR VA. We used a logMAR score of 2 to denote counting fingers and logMAR scores of 2.3, 2.8, and 3 to denote hand movement, light perception, and no light perception, respectively, as per previous literature.31
Study Design
This study had two stages of classification of foveal development: (1) binary and (2) six-grade. The binary classification differentiated normal from abnormal foveal development based on OCT findings. In the second stage, the six-grade classification further subdivided the abnormal foveal status into grades 1 to 4 FH and atypical FH. It is important to clarify that OCT scans were used solely to categorize fundus images for the purpose of creating labels for training and validation. These fundus images, once categorized, served as the sole input to the CNN model; OCT data were not used as input in any deep learning process. The ground truth for each fundus image was established from its corresponding OCT categorization, as shown in Figure 1. To achieve the highest validation accuracy, seven CNN models were evaluated on both the binary and six-grade classification tasks using this fundus image dataset; a minimal sketch of the label construction is given below.
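To make the two-stage labeling concrete, the following minimal Python sketch shows how OCT-derived grades could be mapped to per-image labels for both tasks. The CSV layout, column names, and helper function are illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch: derive binary and six-grade labels for fundus images
# from their OCT-based grades (0-4 plus "atypical"). The CSV schema
# (columns image_path, oct_grade) is an assumption for illustration.
import csv

GRADE_TO_INDEX = {"0": 0, "1": 1, "2": 2, "3": 3, "4": 4, "atypical": 5}

def load_labels(csv_path):
    binary, six_grade = {}, {}
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            idx = GRADE_TO_INDEX[row["oct_grade"]]
            six_grade[row["image_path"]] = idx          # stage 2: six classes
            binary[row["image_path"]] = int(idx != 0)   # stage 1: normal (0) vs. abnormal (1)
    return binary, six_grade
```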
Figure 1.
Representative OCT and color fundus images of each grade of foveal hypoplasia in the dataset.
Data Augmentation
In this study, each group of fundus images was augmented at a different rate to balance the number of images across groups. The datasets included a proportional representation of both normal and pathological images to rigorously test the model's accuracy at each evaluation stage. The numbers of raw and augmented images are given in Supplementary Table S2. The data augmentation methods selected (rotation, color shift, and flip32) were chosen to preserve the clinical relevance of the images and to reflect realistic scenarios encountered in ophthalmic practice. For each raw image, one of these methods was randomly applied to produce one new image, and this process was repeated a number of times equal to the group's augmentation rate, as sketched below.
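The following sketch illustrates this scheme with torchvision; the specific rotation angle, jitter strength, and per-group augmentation rates are assumptions, since the paper does not report them.

```python
# Illustrative augmentation sketch (not the authors' code): each raw image
# yields `augmentation_rate` new images, each produced by one randomly
# chosen method among rotation, color shift, and flip.
import random
from PIL import Image
from torchvision import transforms

AUG_METHODS = [
    transforms.RandomRotation(degrees=15),                   # rotation (assumed angle)
    transforms.ColorJitter(brightness=0.2, saturation=0.2),  # color shift (assumed strength)
    transforms.RandomHorizontalFlip(p=1.0),                  # flip
]

def augment_group(image_paths, augmentation_rate):
    augmented = []
    for path in image_paths:
        img = Image.open(path).convert("RGB")
        for _ in range(augmentation_rate):
            method = random.choice(AUG_METHODS)  # pick one method at random
            augmented.append(method(img))
    return augmented
```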
Constructed Procedure of the Foveal Hypoplasia Prediction Model
The construction procedure of the proposed prediction model is illustrated in Figure 2. First, we selected seven ImageNet-pretrained deep learning models frequently used for classification (EfficientNet_b0, EfficientNet_b1, Vgg16, Vgg19, ResNet-18, ResNet-50, and ConvNeXt_base) as candidates for the FH prediction model. The seven candidate models were then fine-tuned on the augmented training dataset with fivefold cross-validation, and the parameters were optimized by random search during fine-tuning. Although advanced deep learning models such as transformers and deeper ResNet variants perform well in many tasks, they were not used here because the training dataset was too small for them to achieve optimal performance.
Figure 2.
The diagram of the constructed procedure for the proposed prediction model.
The ranges of the parameters for the random search during fine-tuning are provided in Supplementary Table S3. During fine-tuning, we randomly searched approximately 450 parameter sets for each of the seven models. All fine-tuned models were tested on the testing dataset to select the best FH prediction model. The best parameter sets of the prediction models are listed in Supplementary Table S4. A hedged sketch of this fine-tuning setup follows.
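The sketch below shows the main ingredients of such a setup: an ImageNet-pretrained EfficientNet_b1 with its classification head resized to the task, and hyperparameters drawn by random search. The search ranges are placeholders, not the values from Supplementary Table S3.

```python
# Sketch of the fine-tuning ingredients (assumed ranges, not the paper's):
# a pretrained backbone, a replaced head, and randomly sampled hyperparameters.
import random
import torch
import torch.nn as nn
from torchvision import models

def sample_params():
    return {
        "lr": 10 ** random.uniform(-5, -3),            # log-uniform learning rate
        "batch_size": random.choice([16, 32, 64]),
        "weight_decay": 10 ** random.uniform(-6, -3),
    }

def build_model(num_classes):
    weights = models.EfficientNet_B1_Weights.IMAGENET1K_V1
    model = models.efficientnet_b1(weights=weights)        # ImageNet-pretrained backbone
    in_features = model.classifier[1].in_features
    model.classifier[1] = nn.Linear(in_features, num_classes)  # 2 or 6 output classes
    return model

params = sample_params()
model = build_model(num_classes=6)
optimizer = torch.optim.Adam(model.parameters(), lr=params["lr"],
                             weight_decay=params["weight_decay"])
```

In practice, each sampled parameter set would be trained and scored with fivefold cross-validation on the augmented training data, and the configuration with the best averaged validation accuracy retained.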
Evaluation of the Prediction Performance for the Proposed Deep Learning Models
For the binary classification, five standard evaluation metrics, namely accuracy, sensitivity (also known as recall), specificity, precision, and F1 score, were used to assess the prediction performance of the proposed deep learning models. These metrics are defined as follows:
\begin{eqnarray*}
\mathrm{Accuracy_{binary}} &=& (\mathrm{TP}+\mathrm{TN})/(\mathrm{TP}+\mathrm{FP}+\mathrm{TN}+\mathrm{FN})\\
\mathrm{Sensitivity\ (recall)} &=& \mathrm{TP}/(\mathrm{TP}+\mathrm{FN})\\
\mathrm{Specificity} &=& \mathrm{TN}/(\mathrm{TN}+\mathrm{FP})\\
\mathrm{Precision} &=& \mathrm{TP}/(\mathrm{TP}+\mathrm{FP})\\
\mathrm{F1\ score} &=& 2\times\mathrm{sensitivity}\times\mathrm{precision}/(\mathrm{sensitivity}+\mathrm{precision})
\end{eqnarray*}
where TP is the number of true positives, TN the number of true negatives, FP the number of false positives, and FN the number of false negatives.
For the multiclass classification, we used the overall accuracy and balanced accuracy, which were defined as follows:  
\begin{eqnarray*}
\mathrm{Accuracy_{multi}} &=& \sum_{i=0}^{5}\mathrm{TP}_i\Big/\sum_{i=0}^{5}N_i\\
\mathrm{Balanced\ accuracy} &=& \frac{1}{6}\sum_{i=0}^{5}\frac{\mathrm{TP}_i}{\mathrm{TP}_i+\mathrm{FN}_i}
\end{eqnarray*}
where \(N_i\) denotes the number of images in class \(i\) and 6 is the number of classes.
For model evaluation, we calculated the area under the precision-recall curve (AUPRC) and the area under the receiver operating characteristic curve (AUROC) on the validation and test datasets; a sketch of these computations is given below.
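As a concrete reference, the sketch below computes all of the reported metrics with scikit-learn on hypothetical arrays y_true, y_pred, and y_score; it restates the definitions above rather than reproducing the authors' evaluation code.

```python
# Sketch: the paper's evaluation metrics via scikit-learn.
from sklearn.metrics import (accuracy_score, average_precision_score,
                             balanced_accuracy_score, confusion_matrix,
                             f1_score, precision_score, recall_score,
                             roc_auc_score)

def binary_report(y_true, y_pred, y_score):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "sensitivity": recall_score(y_true, y_pred),  # TP / (TP + FN)
        "specificity": tn / (tn + fp),                # TN / (TN + FP)
        "precision": precision_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "auroc": roc_auc_score(y_true, y_score),
        "auprc": average_precision_score(y_true, y_score),  # approximates the AUPRC
    }

def multiclass_report(y_true, y_pred):
    return {
        "overall_accuracy": accuracy_score(y_true, y_pred),
        "balanced_accuracy": balanced_accuracy_score(y_true, y_pred),
    }
```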
Comparisons of the Model's Grading Against Human Grading
To assess the model's grading against human graders, six clinicians with varying experience in retinal image interpretation were selected. Junior clinicians were trainees (ophthalmology residents) with less than three years of OCT reading practice, whereas senior clinicians were retinal specialists (ophthalmology attendings) with more than three years of experience. All clinicians, blinded to clinical details, graded the testing dataset, and their accuracy, sensitivity, specificity, precision, and F1 score were computed for the binary and six-grade classifications. Agreement between graders and the true grades was assessed using kappa statistics with 95% confidence intervals, indicating different levels of agreement; a sketch of this computation is given below.
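The sketch below illustrates one way such agreement statistics could be computed; the paper does not state how its confidence intervals were derived, so the bootstrap here is an assumption.

```python
# Sketch: Cohen's kappa between a grader (or the model) and the ground
# truth, with a bootstrap 95% CI (an assumed method for the interval).
import numpy as np
from sklearn.metrics import cohen_kappa_score

def kappa_with_ci(y_true, y_rater, n_boot=2000, seed=0):
    rng = np.random.default_rng(seed)
    y_true, y_rater = np.asarray(y_true), np.asarray(y_rater)
    kappa = cohen_kappa_score(y_true, y_rater)
    boots = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))  # resample with replacement
        boots.append(cohen_kappa_score(y_true[idx], y_rater[idx]))
    low, high = np.percentile(boots, [2.5, 97.5])
    return kappa, (low, high)
```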
Results
Dataset Characteristics
In this study, we included 303 patients (male, 55.1%; female, 44.9%) with 605 retinal fundus images. Patient ages ranged from 2 to 80 years (mean ± standard deviation, 13.33 ± 14.30 years). The majority of cases (263/303, 86.8%) were children under 18 years of age. The BCVA (logMAR VA scores) of each group is presented in Table 1. In general, higher grades of FH were associated with less favorable visual acuity: grades 3 and 4 showed significantly worse logMAR VA scores than the earlier grades (0, 1, and 2). Atypical FH also had poor BCVA, with a mean BCVA (logMAR score) between those of grade 3 and grade 4 FH.
Table 1.
The BCVA of Eyes With Corresponding Foveal Status
Within the dataset, 360 retinal fundus images (59.5%) were categorized as normal foveal development (grade 0), 154 (25.5%) as grade 1 foveal hypoplasia, 11 (1.8%) as grade 2, 8 (1.3%) as grade 3, 50 (8.3%) as grade 4, and 22 (3.6%) as atypical foveal hypoplasia.
After data augmentation, the training, validation, and testing sets comprised 1229, 527, and 179 images, respectively, as described in the Methods section.
Model Performance
The top five averaged accuracies of the different CNNs are listed in Table 2. Overall, validation accuracy ranged from 88.12% to 90.58% for binary classification and from 84.71% to 90.59% for six-grade classification. In our comparative analysis, the EfficientNet_b1 model emerged as the most effective, delivering superior accuracy across both the validation and testing phases. Specifically, during testing, EfficientNet_b1 achieved an accuracy of 84.36% in the binary classification of foveal development and 78.21% in the more complex six-grade classification. These results (Tables 2A and 2B) underscore EfficientNet_b1 as the optimally tuned model for our dataset, demonstrating its robust capability to handle the varied challenges presented by our classification tasks.
Table 2.
Top Five Averaged Accuracy of Different Convolutional Neural Networks
The AUROC and AUPRC for the validation and test sets of the EfficientNet_b1 model in binary classification are shown in Figure 3. In the validation set, the model achieved an AUROC of 0.9441 (P < 0.0001) (Fig. 3A) and an AUPRC of 0.9654 (P < 0.0001) (Fig. 3B). In the testing set, the model achieved an AUROC of 0.8777 (P < 0.0001) (Fig. 3C) and an AUPRC of 0.8327 (P < 0.0001) (Fig. 3D). The AUROCs of the other models are displayed in Supplementary Figure S1.
Figure 3.
The AUROC and AUPRC for the validation and test sets of the EfficientNet_b1 model. (A) ROC curve for the validation set. (B) Precision-recall curve for the validation set. (C) ROC curve for the test set. (D) Precision-recall curve for the test set.
Heatmaps in the CNN (EfficientNet_b1) Model
Representative color fundus photographs with heatmaps are presented in Figure 4, in which the regions responsible for the prediction of each FH grade are highlighted in warm colors. These heatmaps allow us to visualize the model's decision-making process. The focus regions consistently appeared in the macular area, with the warmest colors centered on the fovea and a gradient toward cooler tones radiating from the parafovea, perifovea, and macula to the periphery. A minimal sketch of the underlying Grad-CAM computation follows.
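For readers unfamiliar with Grad-CAM, the sketch below shows the core computation for an EfficientNet_b1-style network: gradients of the target-class logit are spatially averaged to weight the last convolutional feature maps. The layer choice and preprocessing are assumptions, not the authors' exact setup.

```python
# Minimal Grad-CAM sketch (after Selvaraju et al.); assumes a torchvision
# EfficientNet whose last convolutional block is model.features[-1].
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_class):
    """image: (1, 3, H, W) tensor; returns an (H, W) heatmap scaled to [0, 1]."""
    feats, grads = {}, {}
    layer = model.features[-1]
    h1 = layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))

    logits = model(image)
    model.zero_grad()
    logits[0, target_class].backward()   # gradient of the chosen class score
    h1.remove(); h2.remove()

    weights = grads["a"].mean(dim=(2, 3), keepdim=True)   # global-average the gradients
    cam = F.relu((weights * feats["a"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear",
                        align_corners=False)[0, 0]        # upsample to input size
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
```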
Figure 4.
Visualization of heatmaps generated by Grad-CAM in the CNN (EfficientNet_b1) model. (A) Heatmaps of normal and abnormal foveal status. The upper row shows representative color fundus images (left), with a heatmap (middle) depicting the areas of the model's focus. The rightmost images are superimposed heatmaps combining all heatmaps in each group. (B) Representative color fundus images (left) with heatmaps (middle) and superimposed heatmaps (right) for the six grades of foveal hypoplasia.
For grades 1 and 2 FH, which differ by subtle anatomical features such as the presence or absence of a foveal pit, the heatmaps typically highlighted the central regions with warm colors, indicating high relevance to the model's classification decisions. As FH progresses to grades 3 and 4, in which more significant anatomical disruptions occur, the heatmap activity shifted to reflect these structural anomalies, such as the absence of outer segment lengthening or outer nuclear layer widening. This transition appeared as a change in heatmap intensity and spread, moving from concentrated central activity in the early grades to more dispersed and varied patterns in severe cases. For atypical FH, which involves photoreceptor degradation and other complex changes, the heatmaps displayed a more scattered pattern, reflecting the irregular and unpredictable nature of these pathological features.
The model sometimes misclassified images; examples are shown in Figure 5. In these cases the heatmaps reveal different focus patterns, such as a shift nasal to the macular area (Fig. 5A), focus along the main retinal vessels (Figs. 5B, 5C), or focus around the disc area (Fig. 5D), which impaired the model's ability to grade FH accurately.
Figure 5.
Examples of heatmaps with misclassification generated by Grad-CAM in the CNN (EfficientNet_b1) model. (A) The patient's left eye (normal foveal status) was misclassified as having grade 1 FH. The heatmap shifts to the zone nasal to the macula. (B) The patient's left eye (normal foveal status) was misclassified as having grade 3 FH. The hotspot focuses along the main retinal vessels of the inferior arcade and shows an enlarging square zone of the heatmap. (C) The patient's right eye (grade 4 FH) was misclassified as having grade 3 FH. The heatmap focuses along the main retinal vessels of the superior arcade without the definite central hotspot seen in the other cases. (D) The patient's right eye (atypical FH) was misclassified as having a normal foveal status. The hotspot focuses around the disc area.
Comparison of CNN (EfficientNet_b1) Performance Against Clinician Performance
The binary classification (normal vs. abnormal) achieved a validation accuracy of 90.17% and a best testing accuracy of 84.36% with the EfficientNet_b1 model, with an 84.51% sensitivity and an 84.26% specificity (Table 3A). The six-grade classification achieved a validation accuracy of 90.36%, a best testing overall accuracy of 78.21%, and a balanced accuracy of 59.35% (Table 3B).
Table 3A.
Comparison of CNN (EfficientNet_b1) Diagnostic Performance Against Clinician's Diagnostic Performance: Binary Classification
Table 3B.
Comparison of CNN (EfficientNet_b1) Diagnostic Performance Against Clinician's Diagnostic Performance: Six-Grade Classification of Foveal Hypoplasia
The accuracy among clinicians ranged from 40.54% to 54.05% for junior clinicians and from 35.14% to 43.24% for senior clinicians, showing similar levels of accuracy in binary classification (Table 3A). Although diagnostic sensitivity was high (92.31%-100%), specificity was notably low (0.00%-29.17%). In contrast, EfficientNet_b1 demonstrated near-perfect agreement with the true diagnosis (κ statistic, 0.83; 95% confidence interval, 0.73 to 0.92). Junior and senior clinicians exhibited significantly lower and more variable accuracy, precision, and F1 scores in binary classification (κ statistic, −0.05 to 0.22), indicating no more than slight agreement with the true grading (Table 3A).
The overall accuracy among clinicians ranged from 10.81% to 35.13%, and balanced accuracy ranged from 7.64% to 47.92%, with both junior and senior clinicians showing similarly low accuracy in the six-grade classification; despite their difference in clinical experience, their performance showed no discernible trend (Table 3B). Similarly, in the six-grade classification, the EfficientNet_b1 model (κ statistic, 0.58; 95% confidence interval, 0.45 to 0.71) showed significantly better agreement with the true grading than junior and senior clinicians (κ statistic, −0.002 to 0.23) (Table 3B).
Discussion
In this study, we developed a deep learning grading system to distinguish normal from abnormal foveal structures on retinal fundus images. After rigorous testing and comparison of various CNNs, the EfficientNet_b1 model emerged as the top-performing algorithm, showing high diagnostic accuracy, sensitivity, and specificity in both the binary and six-grade classifications. Notably, our grading system outperformed both junior and senior clinicians.
Grading FH provides significant diagnostic and prognostic value.9,11–14 Our results align with the findings of Thomas et al.,11 indicating that a higher grade of FH is associated with progressively poorer VA. However, interpreting fundus images or foveal scans requires considerable expertise and training. Grading of FH can help differentiate diseases9,11–14 and guide referral to ocular genetics for targeted genetic testing, further reducing time and financial costs. Given the low accuracy and sensitivity of genetic testing for FH,33,34 services lacking an inherited retinal dystrophy or ocular genetics specialist would benefit the most.
Our system, trained and validated on retinal fundus images, offers high accessibility and can be operated by medical staff without specialist training. Capturing a patient's retinal images takes less than 10 minutes, allowing prompt specialist referral when an issue is detected.35 Prior deep learning studies in ophthalmology showed the utility of fundus images for identifying eye diseases (e.g., glaucoma, macular degeneration, refractive errors, and diabetic retinopathy)36 as well as systemic factors such as cardiovascular risk25 and renal function impairment.24 These findings motivated us to expand deep learning applications across various retinal diseases.
Traditionally, OCT scans of the fovea have been the primary source for FH detection and grading, and they served as this study's ground truth. Our study used retinal fundus images, aligned with the OCT scans in our dataset, as the learning material. Human interpreters, even when well trained, may miss subtle retinal image details, leading to misinterpretation and misdiagnosis. Because of the rarity of the disease and the difficulty of interpretation, grading errors can occur: OCT measurements may not encompass the full extent of detectable disease, and inexperienced interpreters may assign grades that are too high or too low, or even read the scan as normal, especially when FH is the sole clinical manifestation of the condition. Our system overcomes these challenges, demonstrating high accuracy in both binary (84.36%) and six-grade (78.21%) classification. The superimposed heatmaps revealed that the center of the retinal fundus, the macular area, was influential in determining the FH grade. This is expected, as the fovea, situated at the center of the macula, corresponds to the model's region of interest. Prior studies described a distinct sign, termed "concentric macular rings," in patients with FH on ultra-widefield fundus photography37–40 and found a significant correlation between the horizontal diameter of the largest outer ring and the FH grade. These reports suggest that identifiable macular features in fundus photography, not currently discernible to the human eye, may help our model differentiate grades of FH. By contrast, misclassified cases showed different heatmap patterns: a focus region farther from the macular area tended to produce wrong predictions, consistent with the findings above. Grad-CAM41 thus provided visual explanations via heatmaps, aiding interpretation of what the neural network learned and mitigating concerns about deep learning's black-box nature.
The EfficientNet_b1 model outperformed junior and senior clinicians in binary classification, excelling in accuracy, precision, and F1 score with high sensitivity and specificity. Interestingly, there was little difference in accuracy between junior and senior clinicians, suggesting that clinician experience did not significantly affect grading accuracy in this context. This finding underscores the inherent challenge of grading FH from retinal fundus images alone, irrespective of experience level. The variability in human grading can be attributed to several factors. First, the subtle differences in retinal images that indicate the various grades of FH are often difficult to discern even for experienced clinicians, which leads to inconsistencies and errors in manual grading. Second, the lack of standardized training and criteria for grading FH from fundus images invites subjective interpretation, further contributing to variability. These results suggest that our system could assist FH grading or serve as a screening tool for distinguishing normal from abnormal foveae in patients who can be assessed only with retinal fundus images.
Previous deep learning studies in adults have shown high performance,19–26 but there is a gap in research focused on pediatric populations, especially pediatric retinal disorders that may change over time. Assessing pediatric patients is challenging because of the reliance on objective findings and the limited communication and cooperation of young children and infants. Developing pediatric deep learning systems is therefore crucial to aid clinical assessment, especially with the increasing use of hand-held devices, which make diagnosis more accessible for previously hard-to-examine populations such as infants and young children. Future research should test our findings using fundus images from various devices, including hand-held and table-mounted systems.
Although our model demonstrated good performance, our study had several limitations. First, its retrospective design might have missed potential candidates, and the different methods of visual acuity assessment could pose limitations despite conversion to logMAR. Second, the lack of serial follow-up data leaves incomplete information about disease progression. Third, excluding missing or poor-quality images could limit generalizability, as clinical settings often encounter such images owing to factors like patient cooperation, small pupils, or media opacities. Prospective clinical trials in real-world settings are needed to validate the model's real-time diagnostic performance, especially in scenarios involving more complex foveal pathologies. Fourth, the imbalance in group sizes across grades during training was addressed through data augmentation to minimize its impact. Fifth, patient matching was not conducted across the training, validation, and testing sets, which could affect model performance because of varying clinical characteristics among groups. Sixth, this study was conducted in a real-world setting but in a population with a restricted ethnic background and a specific genetic pool; the model may therefore not perform as well in populations with different genetic and environmental factors, and further validation in more diverse ethnic groups and clinical settings is essential to establish its robustness and applicability. Finally, heatmaps in deep learning are limited by their lack of precision, potential for overgeneralization, and sensitivity to input variations, which may lead to inconsistent or misleading interpretations. Despite these limitations, we believe our results offer a practical deep learning model for FH diagnosis in this population.
In conclusion, our study formulated and evaluated a deep learning model for grading and distinguishing normal and abnormal foveal development using retinal fundus images. Compared to human graders, our model demonstrated superior efficiency and accuracy in identifying foveal developmental diseases. With the increasing scan volume and workload, easily obtained color fundus images offer a feasible option for detection in pediatric and adult populations. We anticipate that future clinical implementation could provide deeper insight into the model's utility and reliability in diverse clinical scenarios. We hope our deep learning model marks substantial progress in foveal hypoplasia diagnosis, aiding future evaluation and management of foveal developmental diseases. 
Acknowledgments
The authors thank Pin-Hsuan Huang for statistical consultation and acknowledge the Center for Big Data Analytics and Statistics, Chang Gung Memorial Hospital, Linkou, for assistance with statistical and data analysis and interpretation.
Supported by Chang Gung Memorial Hospital Research Grants (CMRPG3L0151∼3, CMRPG3M0131∼2, and CORPD2J0073) and a Ministry of Science and Technology Research Grant (MOST 109-2314-B-182A-019-MY3). The sponsors had no role in the design or conduct of this research. The authors declare that they have no conflicts of interest.
Disclosure: T.-Y. Tsai, None; Y.-F. Chang, None; E.Y.-C. Kang, None; K.-J. Chen, None; N.-K. Wang, None; L. Liu, None; Y.-S. Hwang, None; C.-C. Lai, None; S.-Y. Chen, None; J. Chen, None; C.-S. Lai, None; W.-C. Wu, None 
References
1. Hendrickson AE, Yuodelis C. The morphological development of the human fovea. Ophthalmology. 1984; 91: 603–612.
2. Yuodelis C, Hendrickson A. A qualitative and quantitative analysis of the human fovea during development. Vision Res. 1986; 26: 847–855.
3. McAllister JT, Dubis AM, Tait DM, et al. Arrested development: high-resolution imaging of foveal morphology in albinism. Vision Res. 2010; 50: 810–817.
4. Hammer DX, Iftimia NV, Ferguson RD, et al. Foveal fine structure in retinopathy of prematurity: an adaptive optics Fourier domain optical coherence tomography study. Invest Ophthalmol Vis Sci. 2008; 49: 2061–2070.
5. Thomas MG, McLean RJ, Kohl S, Sheth V, Gottlob I. Early signs of longitudinal progressive cone photoreceptor degeneration in achromatopsia. Br J Ophthalmol. 2012; 96: 1232–1236.
6. Thomas MG, Kumar A, Kohl S, Proudlock FA, Gottlob I. High-resolution in vivo imaging in achromatopsia. Ophthalmology. 2011; 118: 882–887.
7. Hingorani M, Williamson KA, Moore AT, van Heyningen V. Detailed ophthalmologic evaluation of 43 individuals with PAX6 mutations. Invest Ophthalmol Vis Sci. 2009; 50: 2581–2590.
8. Mayer AK, Mahajnah M, Thomas MG, et al. Homozygous stop mutation in AHR causes autosomal recessive foveal hypoplasia and infantile nystagmus. Brain. 2019; 142: 1528–1534.
9. Kuht HJ, Han J, Maconachie GDE, et al. SLC38A8 mutations result in arrested retinal development with loss of cone photoreceptor specialization. Hum Mol Genet. 2020; 29: 2989–3002.
10. Poulter JA, Al-Araimi M, Conte I, et al. Recessive mutations in SLC38A8 cause foveal hypoplasia and optic nerve misrouting without albinism. Am J Hum Genet. 2013; 93: 1143–1150.
11. Thomas MG, Kumar A, Mohammad S, et al. Structural grading of foveal hypoplasia using spectral-domain optical coherence tomography: a predictor of visual acuity? Ophthalmology. 2011; 118: 1653–1660.
12. Thomas MG, Papageorgiou E, Kuht HJ, Gottlob I. Normal and abnormal foveal development. Br J Ophthalmol. 2022; 106: 593–599.
13. Kuht HJ, Maconachie GDE, Han J, et al. Genotypic and phenotypic spectrum of foveal hypoplasia: a multicenter study. Ophthalmology. 2022; 129: 708–718.
14. Rufai SR, Thomas MG, Purohit R, et al. Can structural grading of foveal hypoplasia predict future vision in infantile nystagmus? A longitudinal study. Ophthalmology. 2020; 127: 492–500.
15. Fujimoto J, Swanson E. The development, commercialization, and impact of optical coherence tomography. Invest Ophthalmol Vis Sci. 2016; 57(9): OCT1–OCT13.
16. Huang D, Swanson EA, Lin CP, et al. Optical coherence tomography. Science. 1991; 254(5035): 1178–1181.
17. Sarvamangala DR, Kulkarni RV. Convolutional neural networks in medical image understanding: a survey. Evol Intell. 2022; 15: 1–22.
18. Mall PK, Singh PK, Srivastav S, et al. A comprehensive review of deep neural networks for medical image processing: recent developments and future opportunities. Healthc Anal. 2023; 4: 100216.
19. Kucur SS, Hollo G, Sznitman R. A deep learning approach to automatic detection of early glaucoma from visual fields. PLoS One. 2018; 13(11): e0206081.
20. Milea D, Najjar RP, Zhubo J, et al. Artificial intelligence to detect papilledema from ocular fundus photographs. N Engl J Med. 2020; 382: 1687–1695.
21. Ting DSW, Pasquale LR, Peng L, et al. Artificial intelligence and deep learning in ophthalmology. Br J Ophthalmol. 2019; 103: 167–175.
22. Zhen Y, Chen H, Zhang X, et al. Assessment of central serous chorioretinopathy depicted on color fundus photographs using deep learning. Retina. 2020; 40: 1558–1564.
23. Gulshan V, Peng L, Coram M, et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA. 2016; 316: 2402–2410.
24. Kang EY, Hsieh YT, Li CH, et al. Deep learning-based detection of early renal function impairment using retinal fundus images: model development and validation. JMIR Med Inform. 2020; 8(11): e23472.
25. Poplin R, Varadarajan AV, Blumer K, et al. Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning. Nat Biomed Eng. 2018; 2: 158–164.
26. Ting DSW, Cheung CY, Lim G, et al. Development and validation of a deep learning system for diabetic retinopathy and related eye diseases using retinal images from multiethnic populations with diabetes. JAMA. 2017; 318: 2211–2223.
27. Son J, Shin JY, Kim HD, Jung KH, Park KH, Park SJ. Development and validation of deep learning models for screening multiple abnormal findings in retinal fundus images. Ophthalmology. 2020; 127: 85–94.
28. Hendrickson A, Possin D, Vajzovic L, Toth CA. Histologic development of the human fovea from midgestation to maturity. Am J Ophthalmol. 2012; 154: 767–778.
29. Lee H, Purohit R, Patel A, et al. In vivo foveal development using optical coherence tomography. Invest Ophthalmol Vis Sci. 2015; 56: 4537–4545.
30. Vajzovic L, Hendrickson AE, O'Connell RV, et al. Maturation of the human fovea: correlation of spectral-domain optical coherence tomography findings with histology. Am J Ophthalmol. 2012; 154: 779–789.
31. Schulze-Bonsel K, Feltgen N, Burau H, Hansen L, Bach M. Visual acuities "hand motion" and "counting fingers" can be quantified with the Freiburg visual acuity test. Invest Ophthalmol Vis Sci. 2006; 47: 1236–1240.
32. Shorten C, Khoshgoftaar TM. A survey on image data augmentation for deep learning. J Big Data. 2019; 6(1): 60.
33. Jiang Y, Li S, Xiao X, Sun W, Zhang Q. Genotype-phenotype of isolated foveal hypoplasia in a large cohort: minor iris changes as an indicator of PAX6 involvement. Invest Ophthalmol Vis Sci. 2021; 62(10): 23.
34. Ehrenberg M, Bagdonite-Bejarano L, Fulton AB, Orenstein N, Yahalom C. Genetic causes of nystagmus, foveal hypoplasia and subnormal visual acuity, other than albinism. Ophthalmic Genet. 2021; 42: 243–251.
35. Beede E, Baylor E, Hersch F, et al. A human-centered evaluation of a deep learning system deployed in clinics for the detection of diabetic retinopathy. In: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems. Honolulu, HI; 2020.
36. Ting DSW, Peng L, Varadarajan AV, et al. Deep learning in ophthalmology: the technical and clinical considerations. Prog Retin Eye Res. 2019; 72: 100759.
37. Cornish KS, Reddy AR, McBain VA. Concentric macular rings sign in patients with foveal hypoplasia. JAMA Ophthalmol. 2014; 132(9): 1084–1088.
38. Ramtohul P, Comet A, Denis D. Multimodal imaging correlation of the concentric macular rings sign in foveal hypoplasia: a distinctive Henle fiber layer geometry. Ophthalmol Retina. 2020; 4: 946–953.
39. Ramtohul P, Denis D. Concentric macular rings sign in Chediak-Higashi syndrome. Ophthalmology. 2019; 126: 1616.
40. Sisk RA, Parekh PK, Riemann CD. Fingerprint macula artifact on Optos fundus imaging in nystagmus. Ophthalmology. 2020; 127: 96.
41. Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D. Grad-CAM: visual explanations from deep networks via gradient-based localization. Int J Comput Vis. 2020; 128: 336–359.