Translational Vision Science & Technology
June 2025, Volume 14, Issue 6
Open Access | Artificial Intelligence
Robust Uncertainty-Informed Glaucoma Classification Under Data Shift
Author Affiliations & Notes
  • Homa Rashidisabet
    Department of Biomedical Engineering, University of Illinois Chicago, Chicago, IL, USA
    Artificial Intelligence in Ophthalmology (Ai-O) Center, University of Illinois Chicago, Chicago, IL, USA
    Illinois Eye and Ear Infirmary, Department of Ophthalmology and Visual Sciences, University of Illinois Chicago, Chicago, IL, USA
  • R. V. Paul Chan
    Artificial Intelligence in Ophthalmology (Ai-O) Center, University of Illinois Chicago, Chicago, IL, USA
    Illinois Eye and Ear Infirmary, Department of Ophthalmology and Visual Sciences, University of Illinois Chicago, Chicago, IL, USA
  • Yannek I. Leiderman
    Duke Eye Center, Department of Ophthalmology, Duke University, Durham, NC, USA
  • Thasarat Sutabutr Vajaranant
    Artificial Intelligence in Ophthalmology (Ai-O) Center, University of Illinois Chicago, Chicago, IL, USA
    Illinois Eye and Ear Infirmary, Department of Ophthalmology and Visual Sciences, University of Illinois Chicago, Chicago, IL, USA
  • Darvin Yi
    Artificial Intelligence in Ophthalmology (Ai-O) Center, University of Illinois Chicago, Chicago, IL, USA
    Illinois Eye and Ear Infirmary, Department of Ophthalmology and Visual Sciences, University of Illinois Chicago, Chicago, IL, USA
  • Correspondence: Darvin Yi, Illinois Eye and Ear Infirmary, Department of Ophthalmology and Visual Sciences, University of Illinois Chicago, 1855 W. Taylor St., Ste 1, Chicago, IL 60612, USA. e-mail: [email protected] 
Translational Vision Science & Technology June 2025, Vol.14, 3. doi:https://doi.org/10.1167/tvst.14.6.3
Abstract

Purpose: Standard deep learning (DL) models often suffer significant performance degradation on out-of-distribution (OOD) data, where test data differs from training data, a common challenge in medical imaging due to real-world variations.

Methods: We propose a unified self-censorship framework as an alternative to the standard DL models for glaucoma classification using deep evidential uncertainty quantification. Our approach detects OOD samples at both the dataset and image levels. Dataset-level self-censorship enables users to accept or reject predictions for an entire new dataset based on model uncertainty, whereas image-level self-censorship refrains from making predictions on individual OOD images rather than risking incorrect classifications. We validated our approach across diverse datasets.

Results: Our dataset-level self-censorship method outperforms the standard DL model in OOD detection, achieving an average 11.93% higher area under the curve (AUC) across 14 OOD datasets. Similarly, our image-level self-censorship model improves glaucoma classification accuracy by an average of 17.22% across 4 external glaucoma datasets relative to baselines, while censoring 28.25% more data.

Conclusions: Our approach addresses the challenge of generalization in standard DL models for glaucoma classification across diverse datasets by selectively withholding predictions when the model is uncertain. This method reduces misclassification errors compared to state-of-the-art baselines, particularly for OOD cases.

Translational Relevance: This study introduces a tunable framework that explores the trade-off between prediction accuracy and data retention in glaucoma prediction. By managing uncertainty in model outputs, the approach lays a foundation for future decision support tools aimed at improving the reliability of automated glaucoma diagnosis.

Introduction
Glaucoma, a leading cause of irreversible blindness, currently affects over 80 million people globally.1,2 Despite glaucoma's severity and prevalence, early detection can prevent vision loss in affected patients.2,3 Recent advancements in deep learning (DL) have demonstrated promising results in automated glaucoma detection.4–20 Most prior studies, using the standard DL framework, have reported over 90% accuracy for DL-based glaucoma diagnosis from fundus photographs. However, these state-of-the-art standard DL models for glaucoma diagnosis lack robustness across datasets.9,11,21 Their performance degrades significantly when the training and testing datasets differ due to a shift in distribution,9,11,21 a challenge commonly known as out-of-distribution (OOD) generalization.22 Yet, the standard DL framework remains the predominant approach in current research.
In real-world medical imaging, OOD shifts can occur at both the dataset level and the individual image level. At the dataset level, OOD challenges can arise from data collected at different institutions, under different protocols and cameras, from different patient populations, or containing retinal diseases not included in the training set.21,23 At the image level, individual images may fall outside the categories represented in the training set, exhibit atypical features or poor image quality, or not be fundus images at all.21,24 Regardless of the source, standard DL models exhibit significant performance degradation and poor generalization on OOD data.22,23,25,26 In medical diagnosis, this limitation can lead to critical errors, increasing the risk of misdiagnoses that may compromise patient health.
The persistence of the OOD challenge stems from the absence of a universal dataset that encompasses all potential real-world variations, including diverse races, ethnicities, ages, retinal diseases, imaging devices, protocols, institutions, and more. If such an all-inclusive dataset were available to the research community, DL models could effectively learn the full range of variability present in test data, making OOD a non-issue. However, the collection of such a dataset faces numerous real-world barriers, including cross-institutional data-sharing limitations, patient privacy concerns, regulatory restrictions, and logistical challenges. As a result, in the absence of such a comprehensive dataset, algorithmic advancements remain essential for improving the generalizability of standard DL models on OOD data in automated medical diagnoses. 
One promising algorithmic advancement aimed at addressing the OOD generalization challenge is federated learning (FL).27,28 FL enables collaborative model training across institutions while allowing clients to keep their data local and private. In this framework, clients train their own models and securely share only the model weights with the server, which aggregates them to train a global model without the need for centralized data collection.27 Although FL shows potential in enhancing OOD robustness, it also introduces challenges, such as parameter leakage, the potential for reconstructing training data from model weights, dataset heterogeneity, communication overhead, and practical deployment issues. These concerns must be resolved to fully realize FL's potential in overcoming OOD challenges in glaucoma diagnoses.27 
Another emerging framework in the medical domain to address the lack of OOD generalization in standard DL models is uncertainty quantification (UQ).29–33 In particular, UQ has been applied to various glaucoma diagnosis-related tasks to improve OOD robustness, including uncertainty-informed multi-source learning,34,35 multi-task learning,36 and optic cup and disc segmentation.37 For instance, Araújo et al.23 explored a UQ framework for glaucoma classification, specifically focusing on OOD detection using artificially created OOD data. However, their approach does not directly address the performance degradation of standard DL models on OOD data. Similarly, Wang et al.24 developed an uncertainty-informed glaucoma classifier that self-censors high-uncertainty predictions at the image level to mitigate performance degradation on OOD data. Yet, this work does not incorporate self-censorship at the dataset level. Furthermore, neither study validated their approaches across publicly available benchmark glaucoma datasets.
In this work, we propose a novel unified approach that addresses standard DL performance degradation on OOD data. Our pipeline is designed to detect OOD cases at both the dataset and image levels and to refrain from making predictions in those cases rather than producing incorrect predictions as standard DL models do. Our method leverages a novel UQ framework to perform OOD detection, which constitutes dataset-level self-censorship: it quantifies how much a dataset deviates from the training data based on the model's uncertainty in its predictions, allowing users to decide whether to self-censor that dataset entirely. We validated our dataset-level self-censorship method on 14 publicly available external datasets, including glaucoma datasets, fundus datasets of other retinal diseases, and non-medical natural image datasets. Extending this dataset-level pipeline, we incorporate image-level self-censorship by tuning thresholds on the receiver operating characteristic (ROC) curve of predicted probabilities to automatically censor individual images, optimizing performance. Such selective-prediction methods, which aim to reduce misdiagnosis risk, are nascent yet increasingly common in medical diagnosis.29,30,32,33,36,38–40
In summary, our work makes the following contributions: (1) presenting a tunable DL-based uncertainty quantification approach that unifies dataset-level and image-level self-censorship for glaucoma classification by balancing the trade-off between performance and data retention, thereby enhancing OOD robustness, (2) addressing performance degradation and lack of OOD generalization in standard DL models for glaucoma classification, validated on 4 external glaucoma test datasets, and (3) enhancing standard DL model performance in OOD detection, validated on 14 publicly available glaucoma, other retinal disease fundus, and non-fundus image datasets. The study source code is publicly available at https://github.com/homairs/UQ_SelfCensorship.git
Methods
Internal In-Distribution Data Source
We trained all DL models solely on a university-based hospital dataset of fundus images from the Illinois Eye and Ear Infirmary of the University of Illinois Chicago (UIC), which we will refer to as the UIC dataset. Therefore, the internal UIC dataset is considered the reference in-distribution (IND) data. The UIC data were anonymized, and the study was approved by an internal institutional review board and adhered to the tenets of the Declaration of Helsinki. We used billing information (International Classification of Diseases [ICD] codes) as binary glaucoma labels. We randomly selected 361 images from each class, where non-glaucoma images were defined as those without diagnosed glaucoma or identified as glaucoma suspects without optic nerve head damage. We split the data (n = 722) into train, validation, and test sets, comprising 70% (n = 520), 10% (n = 62), and 20% (n = 140) of the images, respectively. Table 1 presents an overview of the UIC dataset along with a random sample of fundus images from it.
Table 1. Internal Dataset and Fourteen External Datasets Used for Training and Validation in This Study, Respectively
External OOD Data Source
We validated the models on 14 external datasets, whose distributions relative to the in-distribution UIC dataset are unknown, making it unclear to what extent they are OOD. These datasets are publicly available benchmark image classification datasets, comprising four with glaucoma labels, five with fundus images without glaucoma diagnoses, and five non-medical natural image datasets representing extreme OOD cases for glaucoma models. Table 1 provides an overview of each glaucoma and fundus dataset and displays a random sample of these images. Additional details and visualizations for all three sets of datasets can be found in Section 1.1 of the Supplementary Materials
Standard DL Model for Glaucoma Classification
In Figures 1A and 1B, the standard DL framework for glaucoma classification is illustrated. As shown, the model uses a VGG-16 architecture initialized with ImageNet-pretrained40 weights as a feature extractor backbone. We fine-tuned the entire network on fundus images to optimize performance specifically for glaucoma classification; the pretrained weights serve as a warm start that aids optimization, but the model undergoes further training to adapt to the domain-specific features of fundus images. A detailed description of the implementation process can be found in the Implementation Details. As shown in Figure 1B, the standard DL model uses the softmax function to calculate the predicted class probabilities. The softmax function outputs a point estimate of the class probabilities, which captures the model's relative confidence across classes but cannot reflect uncertainty arising from a lack of knowledge about a given image.25,41 Further, the standard DL model uses a cross-entropy loss function.
Figure 1. Glaucoma classification pipeline. (A) Feature extractor backbone VGG-16 architecture, (B) standard DL framework using softmax function, and (C) Dirichlet model framework using evidential deep learning for uncertainty quantification.
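For concreteness, the following is a minimal sketch of such a standard DL baseline, assuming PyTorch and torchvision; the classifier-head replacement and the hyperparameter values shown are illustrative assumptions, not the authors' exact configuration:

```python
import torch
import torch.nn as nn
from torchvision import models

# Illustrative standard DL baseline: ImageNet-pretrained VGG-16, fine-tuned
# end-to-end for binary glaucoma classification with a softmax output.
model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
model.classifier[6] = nn.Linear(4096, 2)  # glaucoma vs. non-glaucoma head (assumed)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, weight_decay=5e-4)

x = torch.randn(8, 3, 224, 224)            # a batch of resized fundus images
logits = model(x)
probs = torch.softmax(logits, dim=1)       # point-estimate class probabilities
max_p = probs.max(dim=1).values            # MaxP confidence score
```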
Uncertainty-Informed Model for Glaucoma Classification
Figures 1A to 1C depict the uncertainty-informed glaucoma classifier, inspired by the method of Shen et al.41 Unlike the standard DL model, which gives a single probability for each class, the uncertainty-informed classifier provides a distribution over possible class probabilities. In a binary classification scenario, it parameterizes a Beta distribution over these probabilities, meaning each possible pair of class probabilities is associated with a likelihood. Generally, this framework extends to more than two classes, where it uses a Dirichlet distribution instead of a Beta distribution. We refer to our proposed uncertainty-informed model as the Dirichlet model because this generalized framework is agnostic to the number of classes. As depicted in Figure 1A, the Dirichlet model uses the same VGG-16 architecture as the standard DL baseline, initialized with ImageNet-pretrained weights, and, as with the standard DL model, the network is further fine-tuned on fundus images. A detailed description of the implementation process can be found in the Implementation Details. In this architecture, three linear classifiers, denoted g1, g2, and g3, are attached to the feature representations ϕ1, ϕ2, and ϕ3, respectively, from the final three layers of the feature extractor backbone, as indicated in Figure 1C. A final fully connected layer then processes the combined features from g1, g2, and g3, followed by batch normalization and ReLU activation, to predict α(x), the concentration parameter of the Dirichlet distribution. The Dirichlet model uses an Evidence Lower Bound (ELBO) loss, whose derivation can be found in Appendix A of Shen et al.41
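A minimal sketch of this Dirichlet head under our reading of the description above is given below; the feature dimensions, hidden width, layer ordering, and the softplus used to keep α(x) positive are all assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DirichletHead(nn.Module):
    # Three linear classifiers g1-g3 attached to the backbone's last three
    # feature representations phi1-phi3; their outputs are combined by a final
    # fully connected layer with batch normalization and ReLU to predict
    # alpha(x), the Dirichlet concentration parameter.
    def __init__(self, feat_dims=(512, 512, 4096), hidden=64, num_classes=2):
        super().__init__()
        self.g = nn.ModuleList([nn.Linear(d, hidden) for d in feat_dims])
        self.bn = nn.BatchNorm1d(hidden * len(feat_dims))
        self.fc = nn.Linear(hidden * len(feat_dims), num_classes)

    def forward(self, phi1, phi2, phi3):
        z = torch.cat([g(p) for g, p in zip(self.g, (phi1, phi2, phi3))], dim=1)
        z = F.relu(self.bn(z))
        return F.softplus(self.fc(z)) + 1.0  # keep concentrations alpha(x) > 0
```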
OOD Detection: Dataset-Level Self-Censorship
Alongside glaucoma classification, discussed in the Uncertainty-Informed Model for Glaucoma Classification section, our Dirichlet model is optimized to maximize the area under the curve (AUC) between training IND data and synthetically generated OOD data, effectively training the model to recognize OOD data based on its uncertainty scores. Figure 2 illustrates our Dirichlet model's OOD detection pipeline, building upon the method of Shen et al.41 Specifically, during training for the OOD detection task, we created a noisy validation set by adding noise to the original validation samples and treating these noisy samples as OOD data. To generate this noisy validation set, various noise techniques were applied, including Gaussian blur, pixel permutation, and contrast rescaling; a sketch of these perturbations is given after the figure below. We then computed uncertainty scores for both the original and noisy samples using the Dirichlet model, aiming to maximize the AUC for distinguishing between the original and noisy validation samples, thereby sensitizing the model to OOD data.
Figure 2. Dirichlet OOD detection training pipeline. Dirichlet's OOD detection is optimized to maximize the AUC between IND data and synthetically generated OOD data using noise, effectively training the model to recognize OOD data based on its uncertainty scores.
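The sketch below illustrates the three noise types used to synthesize OOD validation samples; the parameter ranges follow the Implementation Details, but the function names and exact formulations are assumptions rather than the authors' code:

```python
import numpy as np
import torch
import torchvision.transforms.functional as TF

def pixel_permutation(img: torch.Tensor, frac: float = 0.10) -> torch.Tensor:
    # Randomly permute up to `frac` of the pixels (applied identically per channel).
    c, h, w = img.shape
    n = int(frac * h * w)
    idx = torch.randperm(h * w)[:n]
    flat = img.reshape(c, -1).clone()
    flat[:, idx] = flat[:, idx[torch.randperm(n)]]
    return flat.reshape(c, h, w)

def gaussian_blur(img: torch.Tensor, sigma_range=(1.0, 3.0)) -> torch.Tensor:
    # Gaussian blur with a standard deviation sampled from the stated range.
    sigma = float(np.random.uniform(*sigma_range))
    return TF.gaussian_blur(img, kernel_size=9, sigma=sigma)

def contrast_rescale(img: torch.Tensor, gain_range=(5.0, 25.0)) -> torch.Tensor:
    # Sigmoid contrast rescaling; `gain` is the sigmoid argument from the text.
    gain = float(np.random.uniform(*gain_range))
    return torch.sigmoid(gain * (img - 0.5))
```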
For both the Dirichlet and standard DL models, OOD detection is performed at inference time. Given a new test dataset, we calculate the uncertainty scores and evaluate the AUC relative to the IND data (see the sketch below). Our approach does not impose a fixed threshold on the AUC values to classify a test dataset as OOD or IND; instead, it provides a flexible framework in which users determine the OOD status of a test dataset based on the sensitivity requirements of their diagnostic applications and other decision-making factors. Further implementation details can be found in the Implementation Details.
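A sketch of this dataset-level scoring, assuming per-image uncertainty scores have already been computed; the SciPy/scikit-learn calls are standard, and the placeholder score arrays are purely illustrative:

```python
import numpy as np
from scipy.stats import wasserstein_distance
from sklearn.metrics import roc_auc_score

# u_ind, u_test: per-image uncertainty scores on the IND test set and on a new
# external dataset (e.g. differential entropy from the Dirichlet model).
u_ind = np.random.rand(140)          # placeholder IND scores
u_test = np.random.rand(200) + 0.4   # placeholder external-dataset scores

labels = np.concatenate([np.zeros_like(u_ind), np.ones_like(u_test)])
scores = np.concatenate([u_ind, u_test])
auc = roc_auc_score(labels, scores)              # OOD-vs-IND separability
wd = wasserstein_distance(u_ind, u_test)         # WD(u_IND, u_OOD): larger = more OOD
```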
Glaucoma Classification: Image-Level Self-Censorship
We extend our dataset-level self-censorship model to the image level for improved OOD sample management in glaucoma classification. This enhancement aims to refine classification accuracy by censoring unreliable predictions at the image level. Our approach incorporates a post-processing step into both the standard DL and Dirichlet models, shown in Figures 1B and 1C. Figure 3 schematically illustrates our proposed image-level self-censorship method. The process begins by splitting each OOD dataset into a 90% test set and a 10% validation set. Our method then identifies the minimal subset of images whose removal maximizes accuracy on the validation set. To achieve this, we compute ROC curves on the validation set using predicted probability values from either a standard DL model, a Dirichlet model, or both. In the individual self-censorship setting, the scoring function from a single model (either standard DL or Dirichlet) guides censorship. In the selective self-censorship setting, we evaluate scoring functions from both models and select the one with the highest AUC on the validation set; this best-performing scoring function is then used to define thresholds for censoring uncertain predictions and is represented by the blue ROC curve in Figure 3.
Figure 3. Our image-level selective self-censorship pipeline. It utilizes an ensemble of ROC curves generated from predicted probabilities of both standard DL and Dirichlet models. By tuning thresholds on the ROC curve associated with the best AUC in the validation phase, the method identifies optimal thresholds to accept or reject validation images, thereby maximizing accuracy and enhancing performance during glaucoma versus non-glaucoma inference on test data.
Thresholds are then tuned on this ROC curve to censor the predictions whose removal optimizes validation glaucoma classification accuracy. This process consists of five key steps. First, we traverse all possible intervals formed by threshold pairs on the highest-AUC ROC curve. Second, for each interval, we remove the subset of images whose predicted probabilities fall within that range. Third, predicted probabilities below the lower limit of the interval are mapped to 0, whereas those above the upper limit are mapped to 1. Fourth, the glaucoma classification accuracy and the percentage of retained data are computed for each interval. Fifth, the optimal interval is selected based on a trade-off between maximizing accuracy and retaining the highest percentage of data.
As indicated in Figure 3, to refine this selection, we define two key regions of interest. Region 1 consists of data removals that result in an accuracy of at least 80%, whereas region 2 includes data removals that still retain at least 80% of the total dataset. The intersection of these two regions represents the subset of data that, when removed, ensures both at least 80% accuracy and at least 80% data retention. Within this overlap, the point achieving the highest accuracy is identified as the optimal removal subset. If no intersection exists between regions 1 and 2, we prioritize maximizing accuracy within region 1 and select the first point along this path as the removal subset. The definitions of these regions can be adjusted to specific application requirements, allowing for a flexible optimization framework that balances accuracy and data retention according to user needs. The resulting optimal thresholds are applied to the best-performing scoring function on the test set to censor predictions falling within the identified range. This image-level self-censorship aims to enhance the robustness of glaucoma classification by more accurately managing predictions at the individual image level. A sketch of this interval search is given below.
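The following sketch illustrates the interval search under the stated 80%/80% region constraints; the function and variable names are hypothetical, and the fallback path for the no-intersection case is omitted for brevity:

```python
import numpy as np

def tune_censorship_interval(probs, labels, acc_min=0.80, keep_min=0.80):
    # Traverse all threshold pairs (lo, hi) on the validation scores; censor
    # predictions strictly inside (lo, hi), map those <= lo to class 0 and
    # >= hi to class 1, and pick the interval with the highest accuracy that
    # keeps accuracy >= acc_min and data retention >= keep_min.
    thresholds = np.unique(np.concatenate([[0.0, 1.0], probs]))
    best = None
    for i, lo in enumerate(thresholds):
        for hi in thresholds[i:]:
            keep = (probs <= lo) | (probs >= hi)
            if not keep.any():
                continue
            preds = (probs[keep] >= hi).astype(int)
            acc = float((preds == labels[keep]).mean())
            frac = float(keep.mean())
            if acc >= acc_min and frac >= keep_min:
                if best is None or acc > best[0]:
                    best = (acc, frac, lo, hi)
    return best  # (accuracy, retained fraction, lower, upper), or None if no overlap
```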
Experimental Setup
Our study is organized around two tasks: (1) dataset-level OOD detection and (2) glaucoma classification. In both tasks, models are trained exclusively on the internal UIC dataset and evaluated on external datasets to assess generalization. For the OOD detection task, we compare the performance of the standard DL model and the Dirichlet model across 14 external datasets, which include 4 glaucoma datasets, 5 fundus image datasets from other retinal diseases, and 5 natural image datasets. 
For the glaucoma classification task, we compare the performance of the standard DL, Dirichlet, standard DL with image-level self-censorship, Dirichlet with image-level self-censorship, and selective self-censorship models across the internal UIC test set and the four external glaucoma datasets.
Evaluation Metrics
Max Probability (MaxP): The maximum predicted class probability from either the standard DL or Dirichlet model. We used MaxP as a measure of a model's confidence in its most confident prediction.41 Larger MaxP indicates greater confidence.
Uncertainty (u): We used standard entropy metrics to measure the models' uncertainty.41 The standard DL model's uncertainty was assessed using the Shannon entropy in Equation 1, which measures the uncertainty inherent in the predicted class probabilities, where pi denotes the probability of class i. The Dirichlet model's uncertainty was measured using the differential entropy in Equation 2, which is the entropy of the Dirichlet distribution over the predicted labels; here αi, Γ, ψ, and K denote the concentration parameters of the Dirichlet distribution, the gamma function, the digamma function, and the number of classes, respectively. Higher entropy values indicate greater uncertainty: for the standard DL model, higher Shannon entropy indicates greater uncertainty in the predicted class, whereas in the Dirichlet model, higher differential entropy reflects greater uncertainty in the predicted label distribution.
\begin{equation}
H_S = -\sum_{i=1}^{K} p_i \log(p_i)
\end{equation}
(1)

\begin{eqnarray}
H_D &=& \log\!\left(\frac{\prod_{i=1}^{K}\Gamma(\alpha_i)}{\Gamma\!\left(\sum_{i=1}^{K}\alpha_i\right)}\right) + (\alpha_0 - K)\,\psi(\alpha_0) \nonumber\\
&& -\,\sum_{i=1}^{K}(\alpha_i - 1)\,\psi(\alpha_i), \qquad \alpha_0 = \sum_{i=1}^{K}\alpha_i.
\end{eqnarray}
(2)
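For reference, a small NumPy/SciPy sketch of Equations 1 and 2; the clipping constant guarding log(0) is an added assumption:

```python
import numpy as np
from scipy.special import gammaln, digamma

def shannon_entropy(p, eps=1e-12):
    # Eq. 1: Shannon entropy of the predicted class probabilities.
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0)
    return -np.sum(p * np.log(p))

def dirichlet_differential_entropy(alpha):
    # Eq. 2: differential entropy of a Dirichlet(alpha) distribution.
    alpha = np.asarray(alpha, dtype=float)
    alpha0 = alpha.sum()
    K = alpha.size
    log_beta = gammaln(alpha).sum() - gammaln(alpha0)  # log B(alpha)
    return (log_beta
            + (alpha0 - K) * digamma(alpha0)
            - np.sum((alpha - 1.0) * digamma(alpha)))
```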
Wasserstein Distance (WD): We used the common WD metric to quantify the distance between two distributions. Specifically, in the dataset-level self-censorship task, we used WD to measure the distance between the predicted uncertainty on OOD data (uOOD) and the predicted uncertainty on IND data (uIND). Larger WD indicates greater disparity.
Area Under the Curve (AUC): AUC measures the ability of the model to distinguish between classes. It is the area under the ROC curve, which plots the true positive rate against the false positive rate at various threshold settings. Larger AUC indicates better performance. 
Accuracy (Acc): Accuracy is the ratio of correctly predicted instances to the total instances in the dataset. It provides a straightforward measure of the model’s overall predictive performance. Higher accuracy indicates better performance. 
Percentage Data (%Data): Percentage data shows the proportion of the dataset retained after applying the self-censorship approach. This metric indicates the percentage of data considered reliable enough to be retained for analysis or classification. Higher %Data signifies less data being censored. 
Implementation Details
Models were implemented in PyTorch. We conducted 30 hyperparameter searches for each method (standard DL and Dirichlet). The best standard DL model was selected based on maximum validation accuracy across hyperparameters; the best Dirichlet model was selected based on maximum AUC for OOD detection and maximum validation accuracy for glaucoma classification. We fine-tuned both models' feature extractor backbones, initialized with ImageNet-pretrained VGG-16 weights, on fundus images for the specific tasks of glaucoma classification and OOD detection. The models were optimized with stochastic gradient descent (SGD), using the cross-entropy loss function for the standard DL models and the Evidence Lower Bound loss function for the Dirichlet models. Images were resized to 224 × 224 pixels. The hyperparameters included learning rate ({10−2, 10−3, 10−4}), batch size ({32, 64, 128}), weight decay for l2 regularization ({0, 10−5, 10−4, 5 × 10−4}), and data augmentation techniques ({random horizontal flip, random rotation, random translation}). The rotation angle was randomly selected from {0°, 90°, 180°, 270°}, and images were randomly translated by up to 65 pixels in both the x and y directions; a sketch of this augmentation pipeline is given below. During training of the Dirichlet model's OOD detection, both the types and levels of noise were treated as hyperparameters and randomly selected, similar to data augmentation. The optimal noise parameters included up to 10% randomly permuted pixels, Gaussian blur with a standard deviation randomly sampled from the range 1 to 3, and contrast rescaling with a sigmoid argument randomly sampled from the range 5 to 25.
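A sketch of the augmentation pipeline described above, assuming torchvision transforms; the composition order is an assumption:

```python
import torchvision.transforms as T

# Hypothetical augmentation pipeline matching the search space above.
augment = T.Compose([
    T.Resize((224, 224)),
    T.RandomHorizontalFlip(p=0.5),
    # Rotation angle restricted to {0, 90, 180, 270} degrees.
    T.RandomChoice([T.RandomRotation((a, a)) for a in (0, 90, 180, 270)]),
    # Random translation by up to 65 pixels (65/224 of the image size) in x and y.
    T.RandomAffine(degrees=0, translate=(65 / 224, 65 / 224)),
    T.ToTensor(),
])
```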
Results
OOD Detection: Dataset-Level Self-Censorship
Table 2 presents the OOD detection performance of the Dirichlet model compared with the standard DL model across nine external datasets with unknown distributions, relative to the in-distribution UIC data. As shown in Table 2, the Dirichlet model outperforms the standard DL model on all nine datasets in classifying the external datasets as OOD, achieving average mean AUC improvements of 15.4% across glaucoma datasets and 15.9% across fundus datasets. For glaucoma datasets, the Dirichlet model achieves a mean AUC (95% confidence interval [CI]) ranging from 62.1% (95% CI = 61.9% to 62.3%) to 82.1% (95% CI = 81.6% to 82.6%), whereas the standard DL model achieves a mean AUC ranging from 42.4% (95% CI = 42.1% to 42.7%) to 69.2% (95% CI = 68.6% to 69.9%). For fundus datasets, the Dirichlet model achieves a mean AUC ranging from 67.7% (95% CI = 67.5% to 68.0%) to 99.8% (95% CI = 99.7% to 99.8%), whereas the standard DL model achieves a mean AUC ranging from 65.1% (95% CI = 64.7% to 65.6%) to 84.3% (95% CI = 83.9% to 85.1%). Moreover, as indicated in Table 2, the Dirichlet model shows 96.2% and 97.3% higher Wasserstein distances between the uncertainties of OOD versus IND datasets, WD(uIND, uOOD), averaged across glaucoma datasets and fundus datasets, respectively. Further, Supplementary Table S3 presents OOD detection results on natural image datasets, and Supplementary Figure S3 and Supplementary Table S7 elaborate on correlation analyses among WD(uIND, uOOD), OOD detection accuracy, glaucoma classification accuracy, mean uncertainty, and confidence for each model.
Table 2. Comparison of Out-of-Distribution (OOD) Classification AUC [95% Confidence Interval] Between Standard DL and Dirichlet Models Across Nine External Datasets
Glaucoma Classification: Image-Level Self-Censorship
Figure 4 shows the trade-off between accuracy and retained data percentage on the validation set using our proposed image-level selective self-censorship approach. It highlights the minimal subset of images whose removal maximizes accuracy across external glaucoma datasets (RIMONE-DL, O-RIGA, REFUGE, and LAG). The purple points represent the efficient frontier, where no improvement in one objective can be made without sacrificing the other, with arrows indicating the optimal accuracy-data balance for each dataset. 
Figure 4. Trade-off between validation accuracy and percentage of censored data during threshold tuning using the proposed image-level selective self-censorship method across four datasets: (A) RIMONE-DL, (B) O-RIGA, (C) REFUGE, and (D) LAG. Each point represents a threshold setting, showing the corresponding accuracy and percentage of censored images. Purple points indicate the efficient frontier, where improving accuracy requires sacrificing data retention. Arrows mark the optimal accuracy–retention trade-off for each dataset.
Table 3 compares glaucoma classification accuracy versus data retention for the standard DL model, Dirichlet model, and image-level self-censorship across glaucoma datasets. As indicated in Table 3, on the IND UIC dataset, both the standard DL and Dirichlet models achieve 85.3% accuracy. For the external glaucoma datasets, the standard DL model achieves accuracies of 55.3% on RIMONE-DL, 73.0% on O-RIGA, 78.5% on REFUGE, and 76.3% on LAG. In comparison, the Dirichlet model achieves accuracies of 52.5%, 66.3%, 71.9%, and 77.1% on the same datasets, respectively. Both models retain 100% of the data across these datasets. As indicated in Tables 3 and 4, image-level selective self-censorship improves the classification accuracies by 6.6%, 11.3%, 19.2%, 25.0%, and 13.4%, to 90.9%, 60.0%, 83.0%, 94.0%, and 87.0% on the UIC, RIMONE-DL, O-RIGA, REFUGE, and LAG datasets, respectively, while retaining 85%, 63%, 80%, 66%, and 78% of the data. Further, we present other evaluation metrics across these datasets for each model in Supplementary Table S2 and compare the predicted uncertainties with glaucoma classification accuracy for the standard DL versus Dirichlet baselines on OOD versus IND data in Supplementary Figure S2.
Table 3. Comparison of Glaucoma Classification Accuracy (ACC) and Percentage of Remaining Data (%Data) for the Standard DL, Dirichlet, and Image-Level Self-Censorship Models on Glaucoma Datasets
Further, Table 5 compares glaucoma classification accuracy on the same retained data for our selective self-censorship model against the standard DL and Dirichlet baseline models across glaucoma datasets. The selective self-censorship model achieves average improvements of 0.0%, 6.8%, 13.7%, 9.7%, and 3.3% over the baselines without self-censorship on the internal UIC and external RIMONE-DL, O-RIGA, REFUGE, and LAG datasets, respectively, as indicated in Table 4.
Table 4. Comparison of Accuracy Improvement in Glaucoma Classification on 100% Data Versus the Same Retained Data Evaluated by our Selective Self-Censorship Model Against the Baselines Without Self-Censorship
Table 5. Comparison of Glaucoma Classification Accuracy on the Same Uncensored Data Evaluated by our Selective Self-Censorship Against the Baselines Without Self-Censorship
Additionally, Table 4 shows that the average accuracy of the individual baseline models increases by 5.6%, 2.3%, 3.4%, 10.5%, and 7.6% on the retained subset of images selected by the selective self-censorship method across the internal UIC dataset and external datasets (RIMONE-DL, O-RIGA, REFUGE, and LAG), respectively.
Discussion
Glaucoma is a severe and pervasive eye condition, ranking among the leading causes of irreversible blindness worldwide. Timely detection is crucial for preventing its progression. With advancements in DL, many researchers have leveraged standard DL models to automate glaucoma diagnosis using fundus images. Despite achieving over 90% accuracy, the standard DL models often lack robustness across different datasets. Most of these models are trained and tested on the same dataset, which limits their generalizability. In real-world medical imaging, training and test data often differ due to shifts in data distribution at the inference stage, known as the OOD challenge. These shifts can arise from variations in data collection protocols, institutions, devices, patient populations, or even differences in diseases seen during training. The OOD challenge persists due to the lack of a universal dataset that captures all these real-world variations (e.g. race, ethnicity, age, diseases, imaging devices, and protocols). Collecting such a dataset is hindered by barriers like data-sharing limitations, patient privacy concerns, and regulatory issues. Consequently, algorithmic advancements are crucial for enhancing the generalizability of standard DL models and addressing performance degradation on OOD data. For DL models to be effectively translated into clinical settings for automated glaucoma diagnosis, they need to be robust and generalizable against OOD shifts to minimize risks to patient safety. 
Accurate glaucoma diagnosis is particularly challenging due to high false-positive and false-negative rates, variability in expert grading, and difficulties in generalizing deep learning models to OOD data. These challenges can lead to misdiagnoses or unnecessary referrals, impacting both patient outcomes and clinical workflows. In medical diagnosis, the cost of incorrect predictions often outweighs the burden of withholding a decision altogether, making selective prediction a safer approach. We therefore proposed a self-censorship approach that addresses these limitations by ensuring that only high-confidence predictions are acted upon while uncertain cases are referred to specialists. Our method detects OOD cases (i.e. uncertain cases) at both the dataset and image levels to avoid producing incorrect predictions as standard DL models do. This trade-off ensures that when the model does make a prediction, it is more reliable. Our proposed self-censorship approach addresses the significant performance degradation and poor generalization of state-of-the-art standard DL models on OOD data in glaucoma classification. As illustrated in Figure 1, our method extends the approach of Shen et al.,41 leveraging uncertainty quantification, and is referred to as the Dirichlet model. Our dataset-level self-censorship performs OOD detection by quantifying how much an external dataset deviates from the training data, allowing the user to decide whether to censor entire datasets that significantly differ from the training set, as illustrated in Figure 2. At a finer scale, our image-level self-censorship approach censors individual OOD images, as illustrated in Figure 3, thereby enhancing the accuracy of glaucoma classification. Together, these self-censorship approaches ensure that both broad data shifts and individual uncertain images are addressed, improving the robustness and reliability of the DL model in glaucoma diagnosis.
For dataset-level self-censorship, we performed OOD detection optimized on the IND validation data and evaluated it on 14 publicly available datasets, including those with glaucoma labels, fundus data of other retinal diseases, and non-medical natural images representing extreme OOD cases for glaucoma models, comparing it to the widely used standard DL model. For image-level self-censorship, we proposed a post-processing approach that determines thresholds on the ROC curves of the predicted probabilities from either the standard DL model, the Dirichlet model, or both. In the latter case, the scoring function with the highest AUC on a validation set is selected (selective self-censorship). At inference, the optimal thresholds are applied to the test set.
For dataset-level self-censorship, our results in Table 2 and Supplementary Table S3 showed that the Dirichlet model consistently outperforms the standard DL model across all 14 external OOD datasets. On average, it achieved 15.4%, 15.9%, and 5.2% higher AUC scores in the OOD detection task on glaucoma, fundus, and natural image datasets, respectively. As shown in Supplementary Table S5 and Supplementary Figure S1, the standard DL model exhibits significant overconfidence on OOD datasets, erroneously labeling 36.32%, 17.38%, 17.71%, 16.81%, and 11.68% of natural images from the CIFAR-10, Omniglot, F-MNIST, SVHN, and KMNIST datasets as glaucoma or non-glaucoma, despite their lack of medical relevance. In contrast, the Dirichlet model assigns such confident labels to only 0.03% of these images on average. Consequently, the standard DL model's lack of OOD-awareness leads to reduced OOD-robustness. Additionally, Supplementary Table S3 indicates that the standard DL model has up to 7.1% error in detecting significant shifts from glaucoma datasets to unrelated natural images. This overconfidence in diagnosing glaucoma on clearly OOD datasets with substantial distribution shifts poses risks to patient safety. The Dirichlet model, with its more calibrated MaxP distribution, effectively mitigates these risks by improving detection of distribution shifts, as shown in Table 2, Supplementary Table S3, and Supplementary Table S5.
For image-level self-censorship, our results in Table 3 showed that applying self-censorship to either model improved glaucoma classification performance compared to using the same model without self-censorship. As summarized in Table 3, the average performance of the individual standard DL and Dirichlet models degraded by 36.2%, 20.2%, 12.1%, and 11.0% on the OOD datasets, respectively, compared to their average performance on the IND UIC dataset. Notably, applying selective self-censorship improved accuracy by 10.13% on UIC, 9.58% on RIMONE-DL, 21.25% on O-RIGA, 27.23% on REFUGE, and 14.92% on LAG, averaged across the individual models without self-censorship. This improvement, however, came at the cost of censoring 15%, 37%, 20%, 34%, and 22% of data from the IND UIC and OOD RIMONE-DL, O-RIGA, REFUGE, and LAG datasets, respectively, with the highest censorship on RIMONE-DL, likely because its ONH-cropped images differ from the full-view UIC fundus images used in training (see Table 1). On the uncensored data retained by the selective self-censorship model, Table 4 compares the performance of the proposed image-level self-censorship against baselines. On this subset, our self-censorship model achieves 0.0%, 6.8%, 13.7%, 9.7%, and 3.3% higher accuracy than the averaged accuracy of the standard DL and Dirichlet baselines across the internal UIC and external RIMONE-DL, O-RIGA, REFUGE, and LAG datasets, respectively. Additionally, even the individual baseline models benefit from self-censorship, with their accuracies increasing by 5.6%, 2.3%, 3.4%, 10.5%, and 7.6% on these datasets after uncertain cases are removed. Thus, the image-level self-censorship approach balances improved glaucoma classification accuracy against the percentage of retained data.
Furthermore, Section 1.4 of the Supplementary Materials provides a detailed analysis of the impact of noise type and level on the Dirichlet model's OOD detection performance. As shown in Supplementary Figure S6, both factors significantly influence performance, with their effects varying across datasets. Additionally, Section 1.3 of the Supplementary Materials presents correlation results between Dirichlet and standard DL performance for both glaucoma classification and OOD detection, including their mean predicted uncertainty, confidence scores, and distance scores across OOD datasets. As shown in Supplementary Figure S3, unlike the Dirichlet model, the standard DL model does not show strong correlations between the extent of OOD-ness of a dataset (i.e. WD(uIND, uOOD)) and its predicted uncertainty or confidence (MaxP). Specifically, as data become more OOD, the standard DL model exhibits lower uncertainty and lower MaxP on average, whereas the Dirichlet model tends to show higher uncertainty and lower MaxP. Further, Supplementary Figure S3 shows that as data become more OOD relative to the training data, the standard DL model's OOD detection accuracy decreases whereas its glaucoma classification accuracy increases; conversely, the Dirichlet model's OOD detection accuracy increases whereas its glaucoma classification accuracy decreases. Therefore, the Dirichlet model assigns higher uncertainty to datasets with greater distribution shifts from the training data, effectively detecting those datasets as OOD, although this shift adversely affects its glaucoma classification accuracy.
Although our study proposed the self-censorship framework as an alternative to the standard DL framework, addressing its significant performance degradation and lack of OOD generalization, our approach does come with limitations. It leverages the Dirichlet model, which is challenging to train because it must simultaneously balance two objectives, glaucoma classification and OOD detection, unlike the standard DL model, which optimizes only for glaucoma classification. Additionally, although our self-censorship approach accurately identifies OOD data, it does not explicitly identify the factors contributing to a dataset or an image being OOD. Future research should investigate these factors to provide deeper insights into the nature of OOD data and further improve model interpretability. Our approach also does not guarantee generalization to all possible OOD cases, as no existing method can. Finally, there is currently no comprehensive dataset containing all possible variations in the research landscape, which limits DL models from generalizing across all potential variability in the test set; providing such a dataset would eliminate OOD concerns, making specialized OOD-handling methods unnecessary.
Because self-censorship excludes a portion of the data, it introduces a trade-off between accuracy and data retention. Our method reduces the risk of incorrect predictions by withholding outputs for cases where the model is uncertain. This behavior is tunable, allowing users to adjust the balance between prediction confidence and data coverage: retaining more data may yield a modest accuracy gain, while prioritizing higher confidence requires discarding a greater share of predictions. However, we acknowledge that achieving high accuracy thresholds (e.g. ≥95%) currently requires censoring a substantial portion of the data, which may increase the burden of manual review. As such, the current approach may not yet be clinically viable.
Despite this key limitation, our method improves over state-of-the-art DL baselines for glaucoma classification by reducing the number of incorrect automated predictions. This reallocation of effort, from correcting misclassifications to reviewing uncertain cases, offers a different framing of model deployment, especially in scenarios where workload and reliability must be balanced. We also recognize the potential for bias in self-censorship, as certain subgroups (e.g. age, gender, and race) may be disproportionately affected due to training data imbalances. Addressing this will require more diverse datasets and targeted analyses of censored cases. 
Although our current method is not yet ready for clinical adoption, it offers a foundation for future work aimed at improving the accuracy–data retention trade-off. Advancing this balance will be critical for making such selective prediction frameworks more practical in real-world applications. 
Conclusions
Standard DL models often struggle with performance degradation when encountering OOD data, a common challenge in real-world medical imaging. In medical diagnosis, where reliability is critical, the potential harm of incorrect predictions can outweigh the drawbacks of withholding uncertain decisions. Therefore, in this study, we proposed a unified self-censorship framework based on deep evidential UQ, specifically for glaucoma classification. Our method detects OOD data at both the dataset and image levels, deliberately refraining from making predictions on highly uncertain cases prone to errors. We demonstrated that our approach enhances standard DL performance, improving dataset-level OOD detection by up to 27.5% across 14 external image datasets. Additionally, our image-level self-censorship improved glaucoma classification accuracy by up to 25.0% across 4 external glaucoma datasets, addressing the lack of generalizability in standard models. Our approach introduces a tunable trade-off between accuracy and data retention and demonstrates improved performance over state-of-the-art baselines for glaucoma classification. While this framework opens a novel direction for adaptive selective prediction in medical imaging, further work is needed to enhance its clinical applicability and facilitate real-world translation.
Acknowledgments
Supported by an unrestricted grant from Research to Prevent Blindness and a generous donation from the Cless Family Foundation. 
Disclosure: H. Rashidisabet, None; R.V.P. Chan, None; Y.I. Leiderman, Alcon (C), Genentech (C), Regeneron (C), RegenXBio (C), Microsurgical Guidance Solutions (O); T.S. Vajaranant, None; D. Yi, None 
References
1. Tham YC, Li X, Wong TY, Quigley HA, Aung T, Cheng CY. Global prevalence of glaucoma and projections of glaucoma burden through 2040: a systematic review and meta-analysis. Ophthalmology. 2014; 121(11): 2081–2090. [CrossRef] [PubMed]
2. Parihar JKS. Glaucoma: the ‘black hole' of irreversible blindness. Med J Armed Forces India. 2016; 72(1): 3–4. [CrossRef] [PubMed]
3. Wang W, He M, Li Z, Huang W. Epidemiological variations and trends in health burden of glaucoma worldwide. Acta Ophthalmol. 2019; 97(3): e349–e355. [PubMed]
4. Alawad M, Aljouie A, Alamri S, et al. Machine learning and deep learning techniques for optic disc and cup segmentation – a review. Clin Ophthalmol. 2022; 16: 747–764. [PubMed]
5. Camara J, Neto A, Pires IM, Villasana MV, Zdravevski E, Cunha A. Literature review on artificial intelligence methods for glaucoma screening, segmentation, and classification. J Imaging. 2022; 8: 19.
6. Phasuk S, Tantibundhiti C, Poopresert P, et al. Automated glaucoma screening from retinal fundus image using deep learning. Annu Int Conf IEEE Eng Med Biol Soc. 2019; 2019: 904–907.
7. Nakahara K, Asaoka R, Tanito M, et al. Deep learning-assisted (automatic) diagnosis of glaucoma using a smartphone. Br J Ophthalmol. 2022; 106(4): 587–592. [CrossRef] [PubMed]
8. Cho H, Hwang YH, Chung JK, et al. Deep learning ensemble method for classifying glaucoma stages using fundus photographs and convolutional neural networks. Curr Eye Res. 2021; 46(10): 1516–1524. [CrossRef] [PubMed]
9. Mojab N, Noroozi V, Yi D, et al. Real-world multi-domain data applications for generalizations to clinical settings. arXiv preprint. Available at: https://arxiv.org/abs/2007.12672.
10. Panda R, Puhan NB, Mandal B, Panda G. GlaucoNet: patch-based residual deep learning network for optic disc and cup segmentation towards glaucoma assessment. SN Comput Sci. 2021; 2(2): 1–17. [CrossRef]
11. Rashidisabet H, Sethi A, Jindarak P, et al. Validating the generalizability of ophthalmic artificial intelligence models on real-world clinical data. Transl Vis Sci Technol. 2023; 12(11): 8. [CrossRef] [PubMed]
12. Sun X, Xu Y, Tan M, et al. Localizing optic disc and cup for glaucoma screening via deep object detection networks. Computational Pathology and Ophthalmic Medical Image Analysis: First International Workshop, COMPAY 2018, and 5th International Workshop, OMIA 2018, Held in Conjunction with MICCAI 2018, Granada, Spain. 2018: 236–244.
13. Mojab N, Noroozi V, Yu PS, Hallak JA. Deep multi-task learning for interpretable glaucoma detection. Proceedings of the 2019 IEEE 20th International Conference on Information Reuse and Integration for Data Science (IRI 2019). 2019: 167–174. Available at: https://ieeexplore.ieee.org/document/8843459.
14. Hervella ÁS, Rouco J, Novo J, Ortega M. End-to-end multi-task learning for simultaneous optic disc and cup segmentation and glaucoma classification in eye fundus images. Appl Soft Comput. 2022; 116: 108347. [CrossRef]
15. Pascal L, Perdomo OJ, Bost X, Huet B, Otálora S, Zuluaga MA. Multi-task deep learning for glaucoma detection from color fundus images. Sci Rep. 2022; 12(1): 6–15. [CrossRef] [PubMed]
16. Fan R, Alipour K, Bowd C, et al. Detecting glaucoma from fundus photographs using deep learning without convolutions: transformer for improved generalization. Ophthalmol Sci. 2023; 3(1): 100233. [CrossRef] [PubMed]
17. Liao W, Zou B, Zhao R, Chen Y, He Z, Zhou M. Clinical interpretable deep learning model for glaucoma diagnosis. IEEE J Biomed Health Inform. 2020; 24(5): 1405–1412. [CrossRef] [PubMed]
18. Li L, Xu M, Wang X, Jiang L, Liu H. Attention based glaucoma detection: a large-scale database and CNN model. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2019: 10563–10572.
19. Chang J, Lee J, Ha A, et al. Explaining the rationale of deep learning glaucoma decisions with adversarial examples. Ophthalmology. 2021; 128(1): 78–88. [CrossRef] [PubMed]
20. Zhao R, Liao W, Zou B, Chen Z, Li S. Weakly-supervised simultaneous evidence identification and segmentation for automated glaucoma diagnosis. Proceedings of the 33rd AAAI Conference on Artificial Intelligence (AAAI 2019). 2019: 809–816, doi:10.1609/aaai.v33i01.3301809.
21. Castro DC, Walker I, Glocker B. Causality matters in medical imaging. Nat Commun. 2020; 11(1): 3673. [CrossRef] [PubMed]
22. Meinke A, Hein M. Towards neural networks that provably know when they don't know. arXiv preprint. 2020. Available at: https://arxiv.org/abs/1909.12180.
23. Araújo T, Aresta G, Bogunović H. Deep Dirichlet uncertainty for unsupervised out-of-distribution detection of eye fundus photographs in glaucoma screening. ISBIC 2022 - International Symposium on Biomedical Imaging Challenges, Proceedings. 2022. Available at: https://ieeexplore.ieee.org/document/9854763.
24. Wang M, Lin T, Wang L, et al. Uncertainty-inspired open set learning for retinal anomaly identification. Nat Commun. 2023; 14(1): 6757. [CrossRef] [PubMed]
25. Joo T, Chung U, Seo MG. Being Bayesian about categorical probability. Proceedings of the 37th International Conference on Machine Learning (ICML 2020). 2020: 4899–4910.
26. Rashidisabet H, Sethi A, Jindarak P, et al. Validating the generalizability of ophthalmic artificial intelligence models on real-world clinical data. Transl Vis Sci Technol. 2023; 12(11): 8. [CrossRef] [PubMed]
27. Ran AR, Wang X, Chan PP, et al. Developing a privacy-preserving deep learning model for glaucoma detection: a multicentre study with federated learning. Br J Ophthalmol. 2024; 108(8): 1114–1123. [CrossRef] [PubMed]
28. Ozdemir O, Russell RL, Berlin AA. A 3D probabilistic deep learning system for detection and diagnosis of lung cancer using low-dose CT scans. IEEE Trans Med Imaging. 2020; 39(5): 1419–1429. [CrossRef] [PubMed]
29. Lee J, Shin D, Oh SH, Kim H. Method to minimize the errors of AI: quantifying and exploiting uncertainty of deep learning in brain tumor segmentation. Sensors (Basel). 2022; 22(6): 2406. [CrossRef] [PubMed]
30. Liu S, Liang S, Huang X, Yuan X, Zhong T, Zhang Y. Graph-enhanced U-Net for semi-supervised segmentation of pancreas from abdomen CT scan. Phys Med Biol. 2022; 67(15): 6560. [CrossRef]
31. Abdar M, Samami M, Dehghani Mahmoodabad S, et al. Uncertainty quantification in skin cancer classification using three-way decision-based Bayesian deep learning. Comput Biol Med. 2021; 135: 104418. [PubMed]
32. Faghani S, Moassefi M, Rouzrokh P, et al. Quantifying uncertainty in deep learning of radiologic images. Radiology. 2023; 308(2): e222217. [CrossRef] [PubMed]
33. Ren K, Zou K, Liu X, et al. Uncertainty-informed mutual learning for joint medical image classification and segmentation. arXiv preprint. 2023. Available at: http://arxiv.org/abs/2303.10049.
34. Zou K, Yuan X, Shen X, Wang M, Fu H. TBraTS: trusted brain tumor segmentation. arXiv preprint. 2022. Available at: http://arxiv.org/abs/2206.09309.
35. Araújo T, Aresta G, Mendonça L, et al. DR|GRADUATE: uncertainty-aware deep learning-based diabetic retinopathy grading in eye fundus images. Med Image Anal. 2020; 63: 101715. [CrossRef] [PubMed]
36. Chai Y, Bian Y, Liu H, Li J, Xu J. Glaucoma diagnosis in the Chinese context: an uncertainty information-centric Bayesian deep learning model. Inf Process Manag. 2021; 58(2): 102454. [CrossRef]
37. Zhao R, Ma R, Zhou L, Jia X, Liu J. MS-EBDL: reliable glaucoma assessment via sufficient epistemic uncertainty. Proceedings of the 2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM 2023). 2023: 1732–1739.
38. Huang X, Sun J, Gupta K, et al. Detecting glaucoma from multi-modal data using probabilistic deep learning. Front Med (Lausanne). 2022; 9: 923096. [CrossRef] [PubMed]
39. Wundram AM, Fischer P, Wunderlich S, et al. Leveraging probabilistic segmentation models for improved glaucoma diagnosis: a clinical pipeline approach. MIDL 2024. Available at: https://github.com/annawundram/glaucoma-diagnosis-pipeline.
40. Russakovsky O, Deng J, Su H, et al. ImageNet large scale visual recognition challenge. Int J Comput Vis. 2015; 115(3): 211–252. [CrossRef]
41. Shen M, Bu Y, Sattigeri P, Ghosh S, Das S, Wornell G. Post-hoc uncertainty learning using a Dirichlet meta-model. Proceedings of the AAAI Conference on Artificial Intelligence. 2023; 37(8): 9772–9781. [CrossRef]