Open Access
Artificial Intelligence  |   July 2025
Equitable Deep Learning for Diabetic Retinopathy Detection Using Multidimensional Retinal Imaging With Fair Adaptive Scaling
Author Affiliations & Notes
  • Min Shi
    Harvard Ophthalmology AI Lab, Schepens Eye Research Institute of Massachusetts Eye and Ear, Harvard Medical School, Boston, MA, USA
  • Muhammad Muneeb Afzal
    Tandon School of Engineering, New York University, New York, NY, USA
  • Hao Huang
    Tandon School of Engineering, New York University, New York, NY, USA
  • Congcong Wen
    Tandon School of Engineering, New York University, New York, NY, USA
  • Yan Luo
    Harvard Ophthalmology AI Lab, Schepens Eye Research Institute of Massachusetts Eye and Ear, Harvard Medical School, Boston, MA, USA
  • Muhammad Osama Khan
    Tandon School of Engineering, New York University, New York, NY, USA
  • Yu Tian
    Harvard Ophthalmology AI Lab, Schepens Eye Research Institute of Massachusetts Eye and Ear, Harvard Medical School, Boston, MA, USA
  • Leo Kim
    Massachusetts Eye and Ear, Harvard Medical School, Boston, MA, USA
  • Yi Fang
    Tandon School of Engineering, New York University, New York, NY, USA
  • Mengyu Wang
    Harvard Ophthalmology AI Lab, Schepens Eye Research Institute of Massachusetts Eye and Ear, Harvard Medical School, Boston, MA, USA
  • Correspondence: Mengyu Wang, Schepens Eye Research Institute, 20 Staniford Street, Boston, MA 02114, USA. e-mail: [email protected] 
  • Footnotes
     MS, MMA, HH, CW, and YL contributed equally as co-first authors.
  • Footnotes
YF and MW contributed equally as co-senior authors.
Translational Vision Science & Technology July 2025, Vol.14, 1. doi:https://doi.org/10.1167/tvst.14.7.1
      Min Shi, Muhammad Muneeb Afzal, Hao Huang, Congcong Wen, Yan Luo, Muhammad Osama Khan, Yu Tian, Leo Kim, Yi Fang, Mengyu Wang; Equitable Deep Learning for Diabetic Retinopathy Detection Using Multidimensional Retinal Imaging With Fair Adaptive Scaling. Trans. Vis. Sci. Tech. 2025;14(7):1. https://doi.org/10.1167/tvst.14.7.1.

      © ARVO (1962-2015); The Authors (2016-present)

Abstract

Purpose: To investigate the fairness of existing deep models for diabetic retinopathy (DR) detection and introduce an equitable model to reduce group performance disparities.

Methods: We evaluated the performance and fairness of various deep learning models for DR detection using fundus images and optical coherence tomography (OCT) B-scans. A Fair Adaptive Scaling (FAS) module was developed to reduce group disparities. Model performance was evaluated using the area under the receiver operating characteristic curve (AUC), and equity across various groups was assessed by the equity-scaled AUC, which accounts for both the overall AUC and the AUCs of individual groups.

Results: Using color fundus images, the integration of FAS with EfficientNet improved the overall AUC and equity-scaled AUC from 0.88 and 0.83 to 0.90 and 0.84 (P < 0.05) by race. AUCs for Asians and Whites increased by 0.05 and 0.03, respectively (P < 0.01). For gender, both metrics improved by 0.01 (P < 0.05). Using DenseNet121 on OCT B-Scans by race, FAS improved the overall AUC and equity-scaled AUC from 0.875 and 0.81 to 0.884 and 0.82, with gains of 0.03 and 0.02 for Asians and Blacks (P < 0.01). For gender, DenseNet121's metrics rose by 0.04 and 0.03, with gains of 0.05 and 0.04 for females and males (P < 0.01).

Conclusions: Deep learning models demonstrate varying accuracies across different groups in DR detection. FAS improves equity and accuracy of deep learning models.

Translational Relevance: The proposed deep learning model has a potential to improve both model performance and equity of DR detection.

Introduction
Diabetic retinopathy (DR), a common complication of diabetes, affects retinal blood vessels1,2 and is the leading cause of blindness among adults aged 20–74 years in the United States.3–5 Timely detection through regular eye exams is crucial to preserve vision, but access to ophthalmic care is often limited by resource scarcity and high costs. Racial and ethnic minorities, such as Blacks and Hispanics, are disproportionately affected, with a DR prevalence 50% higher than that of non-Hispanic Whites.6–9 Additionally, Blacks and Hispanics with DR are more likely to experience severe vision loss (odds ratios = 1.78 and 1.68, respectively).10,11 Despite this higher disease burden, screening rates remain significantly lower in minority groups (49%) compared with non-Hispanic Whites (59%).12 
Automated DR detection using deep learning13–16 through retinal imaging has emerged as an affordable solution to provide regular screenings, which aims to reduce health disparities among demographic groups. Although deep learning models for DR detection have demonstrated promising results,13,14,16,17 it is unclear whether they perform equitably across identity groups. This is critical to ensure fairness and uphold social justice in disease screening.18–20 Disparities in model performance may stem from data inequality (e.g., underrepresentation of Black and Asian patients in datasets5) and data characteristic variability (e.g., sex- and race-related anatomical differences in retinas21,22). Addressing these factors is essential to reduce performance disparities and achieve equity in deep learning for DR detection. However, few studies have investigated these disparities or proposed solutions to enhance equity in DR screening. 
In this study, we evaluated state-of-the-art deep learning models for DR detection using two-dimensional (2D) fundus images and three-dimensional (3D) optical coherence tomography (OCT) B-scans. We analyzed performance disparities across race, gender, ethnicity, marital status, and preferred language. To address these disparities, we introduced an equitable deep learning model incorporating a Fair Adaptive Scaling (FAS) module (Fig. 1a). This module dynamically adjusts the significance of individual samples during training, improving equity in DR detection across identity groups. We tested these models using two proprietary datasets and two public datasets, which included wide-angle color fundus images, scanning laser ophthalmoscopy (SLO) fundus images, and OCT B-Scans. Performance was measured using the area under the receiver operating characteristic curve (AUC). To evaluate fairness, we introduced a novel equity-scaled AUC (ES-AUC) metric that balances overall AUC with performance disparities across groups. This extensive assessment verifies the potential of FAS in reducing disparities and improving automated DR detection. 
Figure 1.
 
The proposed equitable deep learning model for diabetic retinopathy detection. (a) Existing deep learning models demonstrate significant group performance disparities measured by equity-scaled AUC. In contrast, the proposed model with a fair adaptive scaling module can reduce group disparities. (b) The proposed fair adaptive scaling module utilizes demographic attributes to guide the model in dynamically adjusting the contributions of individual samples, enabling equitable DR detection across diverse identity groups. FAS incorporates individual scaling (l_i) and group scaling (β_a) to compute a combined loss weight (c_i) for every sample i, where c and (1 − c) balance the individual and group scaling.
Methods
Two proprietary datasets, comprising fundus and OCT data used to develop the equitable deep learning model, were collected from Massachusetts Eye and Ear (MEE) between 2021 and 2023. Retinal images from MEE were acquired using Cirrus devices (Carl Zeiss Meditec, Dublin, CA, USA). All images were reliable, with a signal strength of at least 6. The institutional review board of MEE approved the database for this retrospective study, which adhered to the Declaration of Helsinki. In addition, two public datasets were included in the evaluation. Informed consent was waived because of the retrospective nature of the study. 
Equitable Deep Learning Model With FAS
To enhance fairness in deep learning models for DR detection, we developed the FAS module (Fig. 1b). The model predicts DR or non-DR categories from an input image while considering associated identity attributes (e.g., gender, race, and ethnicity). FAS adjusts the training loss dynamically using learnable group weights and past individual loss values. Samples with higher group training error and higher prior loss are given greater importance in the current training batch. Specifically, FAS integrates both individual-level and group-level scaling mechanisms to adaptively modulate sample-wise loss weights during training. The individual scaling component adjusts loss contributions based on sample-specific difficulty, increasing weights for samples with either particularly high or low losses and decreasing weights for those with less informative gradients. This strategy enables the model to emphasize under-learned or ambiguous instances, thereby enhancing learning efficiency and promoting individualized fairness. In parallel, group scaling dynamically adjusts loss weights at the group level based on intergroup distributional discrepancies. Demographic groups exhibiting greater divergence from the global distribution receive higher weighting to amplify their influence during optimization, whereas groups with more aligned distributions are down-weighted. This directs model attention toward underrepresented or marginalized groups, promoting equitable performance across demographic subpopulations. FAS integrates seamlessly with state-of-the-art 2D and 3D models. In this work, we adopted EfficientNet23 and ViT-B24 as backbones to validate the effectiveness of FAS on 2D fundus images (EfficientNet + FAS and ViT-B + FAS) because they performed the best in most comparisons. To handle 3D OCT B-scans, we adopted two types of deep learning backbones combined with FAS. The first type is adapted from the 2D models EfficientNet23 and ViT-B24 by adding an initial mapping layer that transforms 200-channel OCT images into a corresponding three-channel image, whereas the remaining architecture is unchanged. The second type is the 3D versions of ResNet18 and DenseNet121, which perform 3D convolutions dedicated to 3D medical images.25 
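To make the scaling mechanism concrete, the sketch below shows one way the loss reweighting described above could be implemented in PyTorch. It is a minimal illustration under stated assumptions, not the authors' implementation: the function name, the use of previous-round losses for individual scaling, and the batch-level group error for group scaling are hypothetical simplifications of FAS.

```python
import torch
import torch.nn.functional as F

def fas_weighted_loss(logits, targets, groups, prev_losses, num_groups, c=0.5):
    """Hypothetical sketch of FAS-style loss reweighting (not the authors' code).

    logits:      (B, 2) model outputs for non-DR vs. DR
    targets:     (B,)   class labels
    groups:      (B,)   integer identity-group labels (e.g., 0=Asian, 1=Black, 2=White)
    prev_losses: (B,)   each sample's loss from the previous training round
    c:           scalar in [0, 1] balancing individual and group scaling
    """
    # Per-sample cross-entropy loss for the current batch
    losses = F.cross_entropy(logits, targets, reduction="none")

    # Individual scaling: samples with larger past loss receive larger weight
    ind_scale = prev_losses / (prev_losses.mean() + 1e-8)

    # Group scaling: groups with higher average error in this batch receive larger weight
    group_err = torch.stack([
        losses[groups == g].mean() if (groups == g).any() else losses.mean()
        for g in range(num_groups)
    ])
    grp_scale = (group_err / group_err.mean())[groups]

    # Combine the two scalings with c and (1 - c) and reweight the batch loss
    weights = (c * ind_scale + (1.0 - c) * grp_scale).detach()
    return (weights * losses).mean()
```

Because the weights only rescale the standard loss, such a module can be attached to any backbone (e.g., EfficientNet or ViT-B) without changing its architecture.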
Baseline Models for Comparison
We compared the proposed models with seven state-of-the-art deep learning models: VGG-16,26 Swin-B,27 ResNet,28 ConvNeXt,29 DenseNet,30 EfficientNet,23 and ViT-B.24 These models were trained using consistent pipelines, and their hyperparameters were tuned on a validation set. To address performance disparities, adversarial training and data oversampling were also applied.31–33 Adversarial training adds an additional classifier that prevents the model from learning identity-specific features from the input images.32 Oversampling addresses data imbalance by randomly duplicating samples so that each identity group, such as each racial group (Asian, Black, and White), has an equal number of samples; a minimal illustration of this strategy is given below. Additionally, transfer learning with EfficientNet and ViT-B was explored,34,35 where a global model was initially trained using samples from all identity groups and then tailored to each specific identity group through fine-tuning on all available samples within that group. Detailed implementations of these comparative models are available in our open-source code (see Code Availability). 
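The sketch below illustrates the group-balanced oversampling baseline described above. The paper describes duplicating samples; a weighted sampler that draws with replacement achieves the same balanced exposure, and the helper name and group encoding here are assumptions for illustration only.

```python
import numpy as np
from torch.utils.data import WeightedRandomSampler

def group_balanced_sampler(group_labels):
    """Draw each identity group with equal probability by weighting samples
    inversely to their group's size (illustrative helper, not the authors' code).

    group_labels: one integer group ID per training sample
                  (e.g., 0 = Asian, 1 = Black, 2 = White).
    """
    group_labels = np.asarray(group_labels)
    counts = np.bincount(group_labels)
    # Samples from small groups get proportionally larger draw weights
    sample_weights = 1.0 / counts[group_labels]
    return WeightedRandomSampler(
        weights=sample_weights.tolist(),
        num_samples=len(group_labels),
        replacement=True,
    )
```

The returned sampler can be passed to a standard DataLoader so that, in expectation, every group contributes equally to each training epoch.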
Evaluation Metrics and Statistical Analysis
Statistical analyses were conducted in Python 3.8. Model performance was assessed using the area under the receiver operating characteristic curve (AUC). To evaluate fairness, we introduced the equity-scaled AUC (ES-AUC), defined as the overall AUC divided by one plus the sum of the absolute differences between the overall AUC and each group's AUC: \(AUC_{ES} = AUC_{overall} / ( 1 + \sum_{i=1}^{K} | AUC_{overall} - AUC_{i} | )\), where K is the number of groups (e.g., Asian, Black, White) based on a demographic attribute (e.g., race) and AUC_i is the AUC for group i. This metric balances the overall AUC with performance disparities among different groups; the ES-AUC is maximized when each group's AUC is close to the overall AUC, that is, when group AUC disparities are minimized. We also reported mean disparity and max disparity as complementary fairness measures. Mean disparity was calculated as the ratio of the standard deviation of individual group AUCs to the adjusted overall AUC, obtained by subtracting 0.5 from the overall AUC. Similarly, max disparity was the ratio of the difference between the highest and lowest individual group AUCs to this adjusted overall AUC. Lower max and mean disparities signify better model equity. In addition to these primary metrics, statistical significance was evaluated using t-tests and bootstrapping to compare AUC and ES-AUC values between models with and without FAS; bootstrapping provided confidence intervals and standard error estimates. Results with P < 0.05 were considered statistically significant. Sensitivity and ES-sensitivity were also calculated at fixed specificities of 0.90 and 0.95; detailed sensitivity analyses are provided in the supplemental material. For visualization, the model-learned features for different identity groups were projected into a 2D feature space using uniform manifold approximation and projection (UMAP). 
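As a concrete reference for the metrics just defined, the sketch below computes the ES-AUC and the mean and max disparities from predicted scores and group labels. It assumes scikit-learn for the AUC and that every group contains both DR and non-DR samples; the function names are illustrative rather than taken from the study's code.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def equity_scaled_auc(y_true, y_score, groups):
    """ES-AUC = overall AUC / (1 + sum over groups of |overall AUC - group AUC|)."""
    y_true, y_score, groups = map(np.asarray, (y_true, y_score, groups))
    auc_overall = roc_auc_score(y_true, y_score)
    group_aucs = [
        roc_auc_score(y_true[groups == g], y_score[groups == g])
        for g in np.unique(groups)
    ]
    gap = sum(abs(auc_overall - a) for a in group_aucs)
    return auc_overall / (1.0 + gap), auc_overall, group_aucs

def disparities(auc_overall, group_aucs):
    """Mean and max disparity relative to the adjusted overall AUC (overall AUC - 0.5)."""
    adjusted = auc_overall - 0.5
    mean_disparity = np.std(group_aucs) / adjusted
    max_disparity = (max(group_aucs) - min(group_aucs)) / adjusted
    return mean_disparity, max_disparity
```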
Results
A Summary of Collected Datasets
MEE In-House Data
The MEE in-house data comprise a wide-angle color fundus dataset and an SLO fundus dataset paired with OCT B-scans from the same eye at the same visit. The wide-angle color fundus images were collected from 22,622 patients, with an average age of 57.4 ± 19.4 years. The demographic distribution included 51.6% females and 48.4% males; 6.5% Asians, 9.3% Blacks, and 84.2% Whites; 96.7% non-Hispanic and 3.3% Hispanic; and 93.1% preferred English, 1.3% Spanish, and 5.6% other languages (Supplemental Fig. S1). Regarding DR status, 95.3% were identified as non-vision-threatening DR, and 4.7% were vision-threatening DR. The SLO fundus images and OCT B-scans were collected from 49,164 patients, with an average age of 63.9 ± 17.4 years. Demographics included 58.3% females and 41.7% males; 7.9% Asians, 12.4% Blacks, and 79.6% Whites; 96.2% non-Hispanic and 3.8% Hispanic; and 91.4% preferred English, 1.7% Spanish, and 6.9% other languages (Supplemental Fig. S2). For DR status, 97.7% were identified as non-vision-threatening DR, and 2.3% were vision-threatening DR. Patients were categorized based on DR status: non-vision-threatening DR included normal, mild, and moderate nonproliferative DR (NPDR), whereas vision-threatening DR included severe NPDR and proliferative DR (PDR). Diagnoses were extracted from International Classification of Diseases codes in electronic health records. 
Harvard-FairVision30k
This is a public dataset from MEE for studying fairness in eye disease screening, including SLO fundus images and OCT B-scans from 10,000 patients, with an average age of 64.5 ± 16.5 years. Demographics included 55.5% females and 44.5% males; 7.6% Asians, 14.6% Blacks, and 77.8% Whites; 96.1% non-Hispanic and 3.9% Hispanic; and 90.9% preferred English, 2.0% Spanish, and 7.1% other languages (Supplemental Fig. S3). Regarding DR status, 90.9% were non-vision-threatening DR, and 9.1% were vision-threatening DR. 
ODIR-5K
This is a public dataset used for eye disease screening with color fundus images. After processing, it included 6392 images from 3358 patients, with an average age of 57.9 ± 11.7 years. Demographics included 46.4% females and 53.6% males. For DR status, 66.8% were non-DR, and 33.2% were DR. 
Results for Color Fundus Images
For the racial attribute, ViT-B achieved the best overall AUC of 0.90 and ES-AUC of 0.84 among the seven state-of-the-art baseline deep learning models, followed by EfficientNet with an overall AUC and ES-AUC of 0.88 and 0.83, respectively (Fig. 2a1). Data oversampling and adversarial training significantly improved the overall AUC of VGG, ResNet, and ConvNeXt (P < 0.05) but were not useful for the other models (Supplemental Fig. S4). With transfer learning, EfficientNet significantly improved the AUCs for Asians and Whites by up to 0.04 (P < 0.01) and 0.01 (P < 0.05), respectively (Fig. 2a2). After integrating FAS, the overall AUC and ES-AUC of EfficientNet improved from 0.88 and 0.83 to 0.90 and 0.84 (P < 0.05), where the AUCs for Asians and Whites improved by 0.05 and 0.03, respectively (P < 0.01, Fig. 2a2). Similarly, with FAS, the AUCs of ViT-B for Asians and Whites improved by 0.02 and 0.01 (P < 0.05), respectively. For the gender attribute, ViT-B and EfficientNet remained the best-performing baseline models, both with an ES-AUC of 0.88 (Fig. 2b1). Adversarial training boosted the AUC of EfficientNet for females by 0.02 (P < 0.01), whereas oversampling and transfer learning did not bring significant AUC improvements for either EfficientNet or ViT-B (Figs. 2b2, 2b3). With FAS, the overall AUC and ES-AUC of EfficientNet both improved by 0.01 (P < 0.05), whereas the AUC and ES-AUC of ViT-B increased by 0.02 and 0.03 (P < 0.01), respectively. For the ethnic attribute, oversampling, transfer learning, and adversarial training could not improve EfficientNet or ViT-B (Figs. 2c2, 2c3). In contrast, the overall AUC and the non-Hispanic group AUC of EfficientNet with FAS both improved by 0.01 (P < 0.05, Fig. 2c2), and the overall AUC, ES-AUC, and Hispanic group AUC of ViT-B with FAS all increased by 0.01 (P < 0.05, Fig. 2c3). 
Figure 2.
 
Results on in-house color fundus images. (a1, b1, c1) Accuracy by race, gender, and ethnicity. (a2, b2, c2) Accuracy of EfficientNet with oversampling, adversarial training, transfer learning, and FAS by race, gender, and ethnicity. (a3, b3, c3) Accuracy of ViT-B with oversampling, adversarial training, transfer learning, and FAS by race, gender, and ethnicity.
Similar results were observed for the gender attribute on the ODIR-5K dataset. With FAS, the overall AUC and ES-AUC of EfficientNet both improved by 0.01 (P < 0.05, Supplemental Fig. S8), and the overall AUC and ES-AUC of ViT-B improved by 0.03 and 0.02, respectively, after integrating FAS (P < 0.01, Supplemental Fig. S8). The AUCs for females and males improved from 0.75 and 0.76 to 0.78 and 0.78, respectively. 
Results for SLO Fundus Images
Using the in-house MEE dataset on the racial attribute, ViT-B achieved the highest overall AUC of 0.82, whereas Swin-B achieved the highest ES-AUC of 0.77 (Fig. 3a1). In general, oversampling, transfer learning, and adversarial training could not improve the overall AUC and ES-AUC of either EfficientNet or ViT-B (Figs. 3a2, 3a3). In contrast, with FAS, the overall AUC of EfficientNet significantly improved from 0.80 to 0.83 (P < 0.01), where the AUCs for Asians, Blacks, and Whites improved by 0.02, 0.01, and 0.04, respectively (P < 0.05, Fig. 3a2). The overall AUC and ES-AUC of ViT-B with FAS increased from 0.82 and 0.71 to 0.84 and 0.75, respectively; in subgroups, the AUCs for Asians, Blacks, and Whites significantly improved by 0.02, 0.03, and 0.02, respectively (P < 0.01, Fig. 3a3). On the gender attribute, conventional strategies such as oversampling, transfer learning, and adversarial training failed to boost model performance and equity, whereas FAS significantly boosted EfficientNet and ViT-B (Figs. 3b2, 3b3). Specifically, FAS improved EfficientNet's overall AUC and ES-AUC by 0.02 (P < 0.01), with the same improvement of 0.02 achieved for females and males (P < 0.01, Fig. 3b2). Similarly, with FAS, the overall AUC and ES-AUC of ViT-B improved by 0.02 and 0.01, with improvements of 0.03 and 0.02 for females and males, respectively (P < 0.05, Fig. 3b3). On the ethnic attribute, after integrating FAS, the overall AUC and ES-AUC of EfficientNet improved by 0.02 and 0.04, respectively (P < 0.01, Fig. 3c2). The AUC for the non-Hispanic group improved by 0.02, but no improvement was observed for the Hispanic group (Fig. 3c2). With FAS, the overall AUC of ViT-B improved from 0.82 to 0.84, where the non-Hispanic group improved by 0.03 (P < 0.01, Fig. 3c3), although no improvement was observed for the Hispanic group. 
Figure 3.
 
Results on in-house SLO fundus images. (a1, b1, c1) Accuracy by race, gender, and ethnicity. (a2, b2, c2) Accuracy of EfficientNet with oversampling, adversarial training, transfer learning, and FAS by race, gender, and ethnicity. (a3, b3, c3) Accuracy of ViT-B with oversampling, adversarial training, transfer learning, and FAS by race, gender, and ethnicity.
On the Harvard-FairVision30k dataset, FAS was also effective in boosting the overall AUC and reducing group performance disparities. For example, for the racial attribute, the AUC and ES-AUC of EfficientNet with FAS improved from 0.79 and 0.67 to 0.81 and 0.74, respectively. Notably, significant AUC improvements of 0.04 and 0.07 were achieved for Asians and Blacks, respectively (P < 0.01, Supplemental Fig. S13). Similarly, the performance disparities of ViT-B were significantly reduced after integrating FAS, with the ES-AUC and the AUCs for Asians and Blacks each improving by 0.02 (P < 0.01). 
Results for 3D OCT B-Scans
DenseNet121 and ResNet18, based on 3D convolutions, were evaluated with and without FAS on race, gender, and ethnicity. On the racial attribute using in-house OCT B-scans, DenseNet121 + FAS improved the overall AUC and ES-AUC over DenseNet121 from 0.875 and 0.81 to 0.890 and 0.83, respectively (P < 0.01, Table 1), where the AUCs for Asians and Blacks improved by 0.032 and 0.02. Similarly, for ResNet18 with FAS, the overall AUC and ES-AUC both improved by 0.012 (P < 0.05, Table 1), with a more prominent AUC improvement for Asians (0.026) than for Blacks (0.011) and Whites (0.011). On the gender attribute, FAS improved the overall AUC and ES-AUC of DenseNet121 by 0.044 and 0.027, where the AUCs for females and males improved by 0.054 and 0.035, respectively (P < 0.01, Table 1). After integrating FAS with ResNet18 on gender, the overall AUC and ES-AUC significantly increased from 0.872 and 0.856 to 0.903 and 0.882, respectively. On the ethnic attribute, the overall AUC of DenseNet121 with FAS improved by 0.019, although the ES-AUC showed no improvement (Table 1). The overall AUC and ES-AUC of ResNet18 + FAS improved over ResNet18 by 0.022 and 0.062, respectively, with the non-Hispanic group improving from 0.87 to 0.904 and the Hispanic group remaining nearly unchanged (Table 1). 
Table 1.
 
Experimental Results on OCT B-Scans Using 3D Deep Learning Models
On the Harvard-FairVision30k dataset, consistent improvements were observed after integrating FAS with DenseNet121 and ResNet18 (Table 1). On the racial attribute, FAS improved the overall AUC of DenseNet121 by 0.01 and the ES-AUC by 0.055, with significant AUC improvements of 0.051 and 0.024 for Asians and Blacks, respectively (P < 0.01, Table 1). For ResNet18 + FAS, the overall AUC and ES-AUC improved by 0.023 and 0.047, respectively, and the AUC for Blacks significantly improved from 0.825 to 0.879. On the gender attribute, the improvements in model performance and disparity were marginal for DenseNet121 after integrating FAS. In contrast, with FAS, ResNet18's overall AUC and ES-AUC significantly improved by 0.017 and 0.02 (P < 0.01, Table 1). For the ethnic attribute, the overall AUC of DenseNet121 after integrating FAS improved from 0.876 to 0.895, whereas the improvement for ResNet18 + FAS was not significant. 
Discussion
As deep learning models are increasingly used for automated disease screening, it is crucial to ensure equitable performance across diverse identity groups. In this study, we evaluated seven state-of-the-art deep learning models for DR detection using three datasets comprising 2D wide-angle color fundus, SLO fundus, and 3D OCT B-Scans. Results revealed significant performance disparities, particularly for underrepresented groups. To address these disparities, we developed the FAS module, which dynamically adjusts the contribution of each sample based on associated identity attributes. We demonstrated the effectiveness of FAS through its integration with EfficientNet and ViT-B and compared it with three conventional strategies for reducing group performance disparities, including data oversampling, transfer learning, and adversarial training. 
Compared with existing strategies such as data oversampling, transfer learning, and adversarial training, FAS demonstrated superior effectiveness and robustness across various datasets and identity attributes. Although oversampling addresses data imbalance and adversarial training reduces identity-specific biases, FAS offers a dynamic mechanism that balances fairness and performance more comprehensively. It is worth mentioning that FAS differs from the existing model DcardNet.36 First, DcardNet aims to reduce overfitting by smoothing classification labels, whereas FAS aims to achieve fair performance across different groups by adjusting the weights of individual samples. Second, DcardNet adaptively smooths the class labels by emphasizing the optimization of misclassified samples individually, whereas FAS adopts both group and individual scaling to capture both group similarities and individual variations. Unlike the fixed group weights adopted in existing works, FAS updates weights dynamically to balance fairness at the group level while accounting for within-group sample variations, which avoids the over- and under-weighting issues caused by outliers. FAS is also essentially different from another existing model, Fair Identity Normalization,37 which adopts optimal transport using real-time data to measure and address fairness across groups. However, integrating FAS into EfficientNet and ViT-B adds to the complexity of the model (Table 2) and slightly decreases computational efficiency. 
Table 2.
 
Comparison of Parameter Counts and Computational Efficiency Between EfficientNet and ViT-B Models, With and Without Fair Adaptive Scaling Integration, Evaluated on SLO Fundus Images
FAS affects the way the model learns features from the input images, which contributes to improved model performance and reduced group disparities (Fig. 4). The UMAP-based distribution of features learned by the existing deep learning model was largely indistinguishable across identity groups and concentrated in the center of the feature space (Figs. 4a–c). In contrast, the distribution of features from the deep learning model with FAS had clearer boundaries and was more spread out in the feature space. Such a reformed feature distribution, induced by FAS, may have contributed to the improved overall model performance and reduced group disparities in DR detection (Figs. 4d–f). 
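For readers who wish to reproduce this kind of inspection, the sketch below projects learned feature vectors into 2D with UMAP and colors points by identity group. It is a minimal example assuming features have already been extracted from a model's penultimate layer and that the umap-learn and matplotlib packages are installed; the function name and plotting details are illustrative.

```python
import numpy as np
import umap  # from the umap-learn package
import matplotlib.pyplot as plt

def plot_feature_umap(features, group_labels, title="UMAP of learned features"):
    """Project (N, D) feature vectors into 2D with UMAP and color points by group
    to visually inspect how samples from different identity groups distribute."""
    features = np.asarray(features)
    group_labels = np.asarray(group_labels)
    embedding = umap.UMAP(n_components=2, random_state=0).fit_transform(features)
    for g in np.unique(group_labels):
        mask = group_labels == g
        plt.scatter(embedding[mask, 0], embedding[mask, 1], s=4, label=str(g))
    plt.legend(title="Group")
    plt.title(title)
    plt.show()
```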
Figure 4.
 
UMAP-generated distribution of features learned from in-house SLO fundus images by the existing baseline EfficientNet model and the EfficientNet + FAS model. (a) EfficientNet on Race. (b) EfficientNet on Gender. (c) EfficientNet on Ethnicity. (d) EfficientNet + FAS on Race. (e) EfficientNet + FAS on Gender. (f) EfficientNet + FAS on Ethnicity.
The experimental results generally showed that 2D and 3D models with FAS improved the AUC and ES-AUC on sensitive attributes, including race, gender, and ethnicity, on the in-house MEE and Harvard-FairVision30k datasets. However, FAS did not show consistent improvements on other demographic attributes and datasets. Several potential factors contribute to these results: (1) patients may fall into different subgroups depending on which demographic attribute is considered, making it difficult for fairness learning models such as FAS to consistently enhance performance and equity across diverse groups and attributes; and (2) disparities in group performance can arise for various reasons, including data imbalances, noisy labels, and variations in anatomical and pathological features in fundus images across demographic groups. Different datasets may be affected by one or several of these factors, making it difficult for a unified fairness learning model to be effective across all datasets and settings. These potential factors highlight that mitigating model unfairness is a complicated problem that necessitates systematic consideration and sophisticated modeling to account for various groups and attributes. 
Our study has several limitations. First, FAS did not consistently improve all metrics across sensitive attributes such as language and marital status. Structural variance in retinal images across subgroups may have contributed to these inconsistencies. Second, equity metrics such as mean and max disparities showed inconsistencies compared with ES-AUC. In this work, the ES-AUC is treated as a more comprehensive equity measurement than mean and max disparities, given that mean and max disparities do not fully consider the variance of subgroup performances. Other fairness metrics, such as demographic parity, equalized odds, and equal opportunity, could also be adopted. Third, we have not thoroughly validated how model fairness depends on data sample size for different sensitive attributes. The data sample sizes involved in this study were relatively large, which could bias the model performance and equity. However, we tested the influence of sample size using the in-house color fundus dataset on race, gender, and ethnicity; these experiments demonstrated that model inequity existed at different scales of data samples and that the proposed deep learning model with FAS helped to mitigate performance disparities across identity groups (Supplemental Figs. S24–S26). Last, we did not fully explore FAS's compatibility with other supervised models, such as the Swin network, or unsupervised models such as masked autoencoders, even though FAS has the versatility to be paired with various learning frameworks. 
Conclusions
In this work, we investigated the performance and fairness of various deep learning models, which showed varying group accuracies. To mitigate this issue, we proposed a FAS technique that adjusted feature importance for demographic attributes. FAS is a versatile module that improves model performance equity for DR detection. Extensive evaluations across multiple datasets and comparisons with conventional strategies highlighted FAS's effectiveness in boosting both overall accuracy and fairness, particularly for underrepresented groups. This work underscores the potential of FAS to enhance equity in automated disease screening. 
Acknowledgments
Supported by NIH R00 EY028631, NIH R21 EY035298, Research To Prevent Blindness International Research Collaborators Award, and Alcon Young Investigator Grant. 
Author Contributions: M.W., Y.F., and M.S. conceived the study. M.S., C.W., M.M.A., H.H., Y.L., and M.W. wrote the manuscript and performed the data processing, experiments, and analysis. M.W., Y.F., M.S., C.W., M.M.A., H.H., Y.L., M.O.K., and L.K. contributed materials and clinical expertise. M.W. and Y.F. supervised the work. All authors contributed to the experimental design, the interpretation of the results, and the editing of the final manuscript. All authors accept the final responsibility to submit for publication and take responsibility for the contents of the manuscript. 
Data Availability: The Harvard-FairVision30k dataset is available through the public link https://ophai.hms.harvard.edu/datasets/harvard-fairvision30k and was used with approvals. The ODIR-5K dataset is publicly available at https://www.kaggle.com/datasets/andrewmvd/ocular-disease-recognition-odir5k. The in-house data were provided by the Massachusetts Eye and Ear (MEE). The institutional review boards (IRB) of MEE approved the creation of the database in this retrospective study. 
Code Availability: All code is open source and available at https://github.com/Harvard-Ophthalmology-AI-Lab/FairAdaptiveScaling.
Disclosure: M. Shi, None; M.M. Afzal, None; H. Huang, None; C. Wen, None; Y. Luo, None; M.O. Khan, None; Y. Tian, None; L. Kim, None; Y. Fang, None; M. Wang, None 
References
Fong DS, Aiello L, Gardner TW, et al. Retinopathy in Diabetes. Diabetes Care. 2004; 27(suppl_1): s84–s87. [CrossRef] [PubMed]
Mohamed Q, Gillies MC, Wong TY. Management of diabetic retinopathy: a systematic review. JAMA. 2007; 298: 902–916. [CrossRef] [PubMed]
Lee R, Wong TY, Sabanayagam C. Epidemiology of diabetic retinopathy, diabetic macular edema and related vision loss. Eye Vis. 2015; 2: 1–25. [CrossRef]
Kempen JH, O'Colmain BJ, Leske MC, et al. The prevalence of diabetic retinopathy among adults in the United States. Arch Ophthalmol. 2004; 122: 552–563. [PubMed]
Zhang X, Saaddine JB, Chou CF, et al. Prevalence of diabetic retinopathy in the United States, 2005–2008. JAMA. 2010; 304: 649–656. [CrossRef] [PubMed]
Harris EL, Feldman S, Robinson CR, Sherman S, Georgopoulos A. Racial differences in the relationship between blood pressure and risk of retinopathy among individuals with NIDDM. Diabetes Care. 1993; 16: 748–754. [CrossRef] [PubMed]
Wong TY, Klein R, Islam FMA, et al. Diabetic retinopathy in a multi-ethnic cohort in the United States. Am J Ophthalmol. 2006; 141: 446–455.e1. [CrossRef] [PubMed]
Harris MI, Klein R, Cowie CC, Rowland M, Byrd-Holt DD. Is the risk of diabetic retinopathy greater in non-Hispanic Blacks and Mexican Americans than in Non-Hispanic Whites with type 2 diabetes?: A U.S. population study. Diabetes Care. 1998; 21: 1230–1235. [CrossRef] [PubMed]
Harris EL, Sherman SH, Georgopoulos A. Black-white differences in risk of developing retinopathy among individuals with type 2 diabetes. Diabetes Care. 1999; 22: 779–783. [CrossRef] [PubMed]
Barsegian A, Kotlyar B, Lee J, Salifu M, McFarlane S. Diabetic retinopathy: focus on minority populations. Int J Clin Endocrinol Metab. 2017; 3: 034–045. [CrossRef] [PubMed]
Zhang X. Diabetes mellitus and visual impairment. Arch Ophthalmol. 2008; 126: 1421–1427. [CrossRef] [PubMed]
Shi Q, Zhao Y, Fonseca V, Krousel-Wood M, Shi L. Racial disparity of eye examinations among the U.S. working-age population with diabetes: 2002–2009. Diabetes Care. 2014; 37: 1321–1328. [CrossRef] [PubMed]
Gulshan V, Peng L, Coram M, et al. Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs. JAMA. 2016; 316: 2402–2410. [CrossRef] [PubMed]
Gargeya R, Leng T. Automated identification of diabetic retinopathy using deep learning. Ophthalmology. 2017; 124: 962–969. [CrossRef] [PubMed]
Dai L, Wu L, Li H, et al. A deep learning system for detecting diabetic retinopathy across the disease spectrum. Nat Commun. 2021; 12(1): 3242. [CrossRef] [PubMed]
Bora A, Balasubramanian S, Babenko B, et al. Predicting the risk of developing diabetic retinopathy using deep learning. Lancet Digit Health. 2021; 3(1): e10–e19. [CrossRef] [PubMed]
Bellemo V, Lim ZW, Lim G, et al. Artificial intelligence using deep learning to screen for referable and vision-threatening diabetic retinopathy in Africa: a clinical validation study. Lancet Digit Health. 2019; 1(1): e35–e44. [CrossRef] [PubMed]
Luo Y, Shi M, Khan MO, et al. FairCLIP: harnessing fairness in vision-language learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2024: 12289–12301.
Tian Y, Wen C, Shi M, et al. FairDomain: achieving fairness in cross-domain medical image segmentation and classification. arXiv. Preprint posted online July 11, 2024, doi:2407.08813.
Tian Y, Shi M, Luo Y, Kouhana A, Elze T, Wang M. FairSeg: a large-scale medical image segmentation dataset for fairness learning using segment anything model with fair error-bound scaling. arXiv. Preprint posted online November 3, 2023, doi:2311.02189.
Coyner AS, Singh P, Brown JM, et al. Association of biomarker-based artificial intelligence with risk of racial bias in retinal images. JAMA Ophthalmol. 2023; 141: 543–552. [CrossRef] [PubMed]
Betzler BK, Yang HHS, Thakur S, et al. Gender prediction for a multiethnic population via deep learning across different retinal fundus photograph fields: retrospective cross-sectional study. JMIR Med Inform. 2021; 9(8): e25165. [CrossRef] [PubMed]
Tan M, Le QV. EfficientNet: rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning. New York: PMLR; 2019: 6105–6114.
Dosovitskiy A, Beyer L, Kolesnikov A, et al. An image is worth 16x16 words: transformers for image recognition at scale. arXiv. Preprint posted online October 22, 2020, doi:2010.11929.
Yang J, Huang X, He Y, et al. Reinventing 2D convolutions for 3D images. IEEE J Biomed Health Inform. 2021; 25: 3009–3018. [CrossRef] [PubMed]
Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv. Preprint posted online September 4, 2014, doi:1409.1556.
Liu Z, Lin Y, Cao Y, et al. Swin transformer: hierarchical vision transformer using shifted windows. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV). Piscataway, NJ: IEEE; 2021: 9992–10002.
He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE; 2016: 770–778.
Liu Z, Mao H, Wu CY, Feichtenhofer C, Darrell T, Xie S. A ConvNet for the 2020s. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway, NJ: IEEE; 2022: 11966–11976.
Huang G, Liu Z, Van Der Maaten L, Weinberger KQ. Densely Connected Convolutional Networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway, NJ: IEEE; 2017: 2261–2269, doi:10.1109/CVPR.2017.243.
Xu H, Liu X, Li Y, Jain AK, Tang J. To be robust or to be fair: towards fairness in adversarial training. In: International Conference on Machine Learning. New York: PMLR; 2021: 11492–11501.
Yang J, Soltan AAS, Eyre DW, Yang Y, Clifton DA. An adversarial training framework for mitigating algorithmic biases in clinical machine learning. NPJ Digit Med. 2023; 6(1): 55. [CrossRef] [PubMed]
Qraitem M, Saenko K, Plummer BA. Bias Mimicking: A Simple Sampling Approach for Bias Mitigation. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Piscataway, NJ: IEEE; 2023: 20311–20320.
Serener A, Serte S. Transfer learning for early and advanced glaucoma detection with convolutional neural networks. In: 2019 Medical Technologies Congress (TIPTEKNO). Piscataway, NJ: IEEE; 2019: 1–4.
Asaoka R, Murata H, Hirasawa K, et al. Using deep learning and transfer learning to accurately diagnose early-onset glaucoma from macular optical coherence tomography images. Am J Ophthalmol. 2019; 198: 136–145. [CrossRef] [PubMed]
Zang P, Gao L, Hormel TT, et al. DcardNet: diabetic retinopathy classification at multiple levels based on structural and angiographic optical coherence tomography. IEEE Trans Biomed Eng. 2021; 68: 1859–1870. [CrossRef] [PubMed]
Luo Y, Tian Y, Shi M, Elze T, Wang M. Harvard Eye Fairness: A Large-Scale 3D Imaging Dataset for Equitable Eye Diseases Screening and Fair Identity Scaling. Available at OpenReview.net.