Open Access
Artificial Intelligence  |   January 2025
Explainable Deep Learning for Glaucomatous Visual Field Prediction: Artifact Correction Enhances Transformer Models
Author Affiliations & Notes
  • Kornchanok Sriwatana
    Department of Biomedical Engineering, Faculty of Engineering, Mahidol University, Nakhon Pathom, Thailand
    Faculty of Medicine Ramathibodi Hospital, Mahidol University, Bangkok, Thailand
  • Chanon Puttanawarut
    Chakri Naruebodindra Medical Institute, Faculty of Medicine Ramathibodi Hospital, Mahidol University, Samut Prakan, Thailand
    Department of Clinical Epidemiology and Biostatistics, Faculty of Medicine Ramathibodi Hospital, Mahidol University, Bangkok, Thailand
  • Yanin Suwan
    Department of Ophthalmology, Faculty of Medicine Ramathibodi Hospital, Mahidol University, Bangkok, Thailand
  • Titipat Achakulvisut
    Department of Biomedical Engineering, Faculty of Engineering, Mahidol University, Nakhon Pathom, Thailand
  • Correspondence: Titipat Achakulvisut, Department of Biomedical Engineering, Faculty of Engineering, Mahidol University, Engineering Building 3, 999 Phuttamonthon 4 Road, Salaya, Nakhon Pathom 73170, Thailand. e-mail: [email protected] 
  • Yanin Suwan, Department of Ophthalmology, Faculty of Medicine Ramathibodi Hospital, Mahidol University, 270 Rama VI Road, Bangkok 10400, Thailand. e-mail: [email protected] 
Translational Vision Science & Technology January 2025, Vol.14, 22. doi:https://doi.org/10.1167/tvst.14.1.22
Abstract

Purpose: The purpose of this study was to develop a deep learning approach that restores artifact-laden optical coherence tomography (OCT) scans and predicts functional loss on the 24-2 Humphrey Visual Field (HVF) test.

Methods: This cross-sectional, retrospective study used 1674 visual field (VF)-OCT pairs from 951 eyes for training and 429 pairs from 345 eyes for testing. Peripapillary retinal nerve fiber layer (RNFL) thickness map artifacts were corrected using a generative diffusion model. Three convolutional neural networks and two transformer-based models were trained on original and artifact-corrected datasets to estimate 54 sensitivity thresholds of the 24-2 HVF test.

Results: Predictive performance was evaluated using root mean square error (RMSE) and mean absolute error (MAE), with explainability assessed through GradCAM, attention maps, and dimensionality reduction techniques. The Distillation with No Labels (DINO) Vision Transformer (ViT) trained on the artifact-corrected dataset achieved the highest accuracy (RMSE = 4.44 decibel [dB], 95% confidence interval [CI] = 4.07–4.82; MAE = 3.46 dB, 95% CI = 3.14–3.79) and the greatest interpretability, showing improvements of 0.15 dB in global RMSE and MAE (P < 0.05) compared with its performance on the original maps. Feature maps and visualization tools indicate that artifacts compromise DINO-ViT's predictive ability, which improves with artifact correction.

Conclusions: Combining self-supervised ViTs with generative artifact correction enhances the correlation between glaucomatous structures and functions.

Translational Relevance: Our approach offers a comprehensive tool for glaucoma management, facilitates the exploration of structure-function correlations in research, and underscores the importance of addressing artifacts in the clinical interpretation of OCT.

Introduction
Glaucoma, a group of progressive optic neuropathies, stands as a leading cause of irreversible blindness worldwide.1 The disease is characterized by the degeneration of retinal ganglion cells and their axons, resulting in distinctive optic disc changes and specific patterns of vision loss. Visual field (VF) testing, which identifies subtle areas of vision loss, is considered the gold standard for glaucoma monitoring.2 However, standard automated perimetry (SAP) faces challenges such as subjectivity, variability, its time-consuming nature, and frequent failure to detect early damage.3–5 In contrast, optical coherence tomography (OCT) offers an objective measure of retinal nerve fiber layer (RNFL) thickness, providing a more reliable indicator of glaucoma progression.6 In glaucoma management, deep learning (DL) approaches have been proposed to explore the relationship between OCT-derived structural changes and visual function. 
The advent of artificial intelligence (AI), particularly DL, has revolutionized healthcare by automating the analysis of medical images. Convolutional neural networks (CNNs), a type of DL architecture, have been widely used in this domain due to their ability to automatically learn hierarchical features from raw images. CNNs excel at tasks such as image segmentation, feature extraction, and classification, facilitating accurate disease detection, localization, and diagnosis.7,8 More recently, transformer-based models, initially developed for natural language processing tasks, have shown promising results in computer vision. Transformers rely on self-attention mechanisms to capture long-range dependencies and global context, enabling them to learn more robust and expressive feature representations compared with CNNs.9 Among transformer-based models, Distillation with No Labels (DINO)10 has been successfully applied across various medical domains, handling diverse types of medical imaging data.11,12 Applications include regression tasks with hematoxylin and eosin (H&E)-stained histopathological images,13 disease detection in chest X-rays,14,15 and classification of brain magnetic resonance imaging (MRI) scans.11 Despite their potential, the application of transformer-based models in medical image analysis, particularly in glaucoma assessment using OCT scans, remains largely unexplored. DINO could potentially learn robust and interpretable features from OCT scans. 
Previous studies have focused on processing OCT scans, varying them by layer (e.g. ganglion cell-inner plexiform layer [GCIPL] versus RNFL), location (macula versus optic nerve head), and instrument type (spectral domain OCT [SD-OCT] versus swept source OCT [SS-OCT]), to predict global metrics and/or pointwise visual field sensitivities assessed by 24-2 Humphrey Visual Field (HVF).16–20 These studies, primarily using CNNs, have demonstrated remarkable accuracy in predicting VF defects using RNFL thickness maps. However, they have not fully addressed the negative impact of RNFL artifacts on predictive models.16,21 Furthermore, the potential of transformer-based models in this context remains largely unexplored. Transformers’ ability to capture long-range dependencies and global context could potentially lead to more accurate and robust predictions of visual field defects from OCT scans, especially in the presence of artifacts or other confounding factors. Additionally, the self-attention mechanisms of transformers could provide better interpretability by highlighting the most relevant regions in the OCT scans for predicting VF defects, which is crucial for clinical adoption and trust in AI-based systems. 
In this study, we introduce a novel DL strategy to predict 24-2 HVF from OCT-derived RNFL thickness maps. We hypothesize that preprocessing OCT scans with artifact correction could improve visual field prediction. To explore this, we sequentially developed and validated the artifact correction and visual field prediction models. First, we created the artifact correction model to restore artifact-laden scans. Then, we evaluated visual field prediction improvements after incorporating this correction step. Additionally, we compared how CNNs and Transformers benefit from artifact correction while performing the same VF prediction task. 
Materials and Methods
Data Collection
Our data collection included individuals suspected of having glaucoma, patients diagnosed with various subtypes of glaucoma, and healthy controls collected retrospectively at Ramathibodi Hospital, Bangkok, Thailand, from January 1, 2018, to August 30, 2023. The protocol was approved by the Institutional Review Board (IRB) of the Faculty of Medicine Ramathibodi Hospital, Mahidol University, Bangkok, Thailand (IRB number: MURA2023/751) and followed the principles outlined in the Declaration of Helsinki. All participants met the inclusion criteria of being aged 18 years or older and having OCT test results, with or without corresponding reliable VF examinations. 
Spectral Domain Optical Coherence Tomography
The input consists of peripapillary RNFL thickness maps obtained from the CIRRUS HD-OCT 6000 (Carl Zeiss Meditec, Dublin, CA, USA), which operates under the Optic Disc Cube 200 × 200 scanning protocol. This protocol captures a 6 × 6 mm square volume centered on the optic nerve, creating a scanned cube. Subsequently, the instrument's built-in RNFL analysis software automatically identifies the RNFL layer, quantifies its thickness at each point, and produces the final output as RNFL thickness maps. For quality control, we included only OCT scans with signal strength ≥6. We acquired OCT and VF printouts using ZEISS FORUM and saved them as PDFs. 
Visual Field Examination
The ground truth consists of 54 VF threshold values (THVs) from the 24-2 VF testing pattern. Subjects underwent testing using the Swedish Interactive Threshold Algorithm (SITA) 24-2 or 30-2 test pattern on the HVF analyzer 750i (Carl Zeiss Meditec). The testing protocol involved SITA Fast and SITA Standard, with testing occurring within 6 months of the OCT examination. For quality control, VF results required a false-positive rate of < 33%, a false-negative rate of < 33%, and fixation loss of < 20%. 
Healthy subjects were defined as those with an intraocular pressure (IOP) < 22 millimeters of mercury (mm Hg), no history of increased IOP, a Glaucoma Hemifield Test (GHT) within normal limits, and a VF mean deviation (MD) and pattern standard deviation (PSD) within the 95% confidence interval (CI) of the healthy population. Glaucomatous optic neuropathy was defined by the presence of a cluster of more than 3 points depressed at a 5% level of significance on the pattern deviation plot, with one or more of these points depressed at a 1% level, excluding points on the edge of the field; a GHT result outside normal limits; and an abnormal PSD with a probability value of < 5%. All glaucomatous eyes with VF defects were detected in at least two consecutive baseline visual field tests and were consistent with a glaucomatous pattern. Glaucoma severity was classified as mild (MD > −6 decibel [dB]), moderate (−12 ≤ MD ≤ −6 dB), and severe (MD < −12 dB) based on the VF MD. To ensure a non-overlapping range of disease, subjects clinically suspected of having glaucoma were defined based on their normal VF test results with any of the following criteria: IOP of 22 to 30 mm Hg, asymmetric optic nerve head (ONH) cupping, abnormal ONH appearance, or being the contralateral eye of unilateral glaucoma. We excluded patients with recent ocular surgery or trauma, corneal or ocular media opacity, refractive errors greater than ±6.0 diopters (D), and optic neuropathies other than glaucoma. 
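The severity grading above reduces to simple threshold rules on the VF mean deviation; a minimal sketch of how these cutoffs could be encoded is shown below (the function name and usage are ours, not part of the study's code).

```python
def classify_severity(md_db: float) -> str:
    """Classify glaucoma severity from the VF mean deviation (MD, dB)
    using the cutoffs defined in this study."""
    if md_db > -6:
        return "mild"       # MD > -6 dB
    if md_db >= -12:
        return "moderate"   # -12 dB <= MD <= -6 dB
    return "severe"         # MD < -12 dB

print(classify_severity(-4.5))   # mild
print(classify_severity(-9.0))   # moderate
print(classify_severity(-15.2))  # severe
```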
Dataset Preparation
To obtain input images, we automatically cropped a square region of the RNFL thickness map from the ONH and RNFL OU Analysis report. The cropped images were resized to 296 × 296 pixels for the artifact correction model and 224 × 224 pixels for the VF prediction model. These dimensions were optimized for pretrained artifact correction and prediction models, respectively. We also applied the albumentations22 UnsharpMask filter23 during the VF prediction training and evaluation, which improved model performance by sharpening the images. 
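As a concrete illustration of this preprocessing step, the sketch below resizes an RNFL thickness map and applies the albumentations UnsharpMask filter; the filter parameters and file path are placeholders rather than the study's exact settings.

```python
import albumentations as A
import cv2

# Hypothetical VF-prediction preprocessing: resize to 224 x 224 and sharpen with UnsharpMask.
# Default UnsharpMask parameters are used here; the study's exact values are not specified.
transform = A.Compose([
    A.Resize(224, 224),
    A.UnsharpMask(p=1.0),
])

image = cv2.cvtColor(cv2.imread("rnfl_map.png"), cv2.COLOR_BGR2RGB)  # placeholder path
sharpened = transform(image=image)["image"]
```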
We utilized the hvf-extraction-script Python module24 to extract ground truth THVs from the 24-2 and 30-2 HVF reports. We then converted THVs available only in the 30-2 format to the 24-2 format by spatially mapping the overlapping test points between the two grids, and we manually verified the extracted data to ensure accuracy. 
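The 30-2 to 24-2 conversion amounts to keeping the 30-2 test points whose coordinates also belong to the 24-2 grid. A toy sketch of this spatial mapping is shown below; the coordinates are illustrative only and do not reproduce the actual HFA grids.

```python
def convert_30_2_to_24_2(thv_30_2, points_24_2):
    """Keep only the THVs whose (x, y) degree coordinates appear in the 24-2 grid."""
    return {pt: thv_30_2[pt] for pt in points_24_2 if pt in thv_30_2}

# Toy example with made-up coordinates (the real grids contain 76 and 54 points).
thv_30_2 = {(-3, 3): 28, (3, 3): 30, (27, 3): 25}
points_24_2 = [(-3, 3), (3, 3)]
print(convert_30_2_to_24_2(thv_30_2, points_24_2))  # {(-3, 3): 28, (3, 3): 30}
```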
Model Development
OCT Scan Artifact Labeling
We labeled both paired and unpaired OCT scans as clean or artifact-laden using OpenCV’s simple thresholding on grayscale images. Pixels were classified as artifacts if their values were ≤ 0.05 × 255. A scan was labeled as artifact-laden if it contained any artifact pixels. 
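A minimal sketch of this labeling rule, assuming the RNFL thickness maps are read as 8-bit grayscale images, is shown below.

```python
import cv2

def label_scan(path, fraction=0.05):
    """Label an RNFL thickness map as 'artifact-laden' if any pixel value
    is <= fraction * 255 after grayscale conversion (simple thresholding)."""
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    # THRESH_BINARY_INV sets pixels <= threshold to 255 (artifact) and the rest to 0.
    _, artifact_mask = cv2.threshold(gray, int(fraction * 255), 255, cv2.THRESH_BINARY_INV)
    return ("artifact-laden" if artifact_mask.any() else "clean"), artifact_mask
```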
Artifact Correction Model
Dataset
We collected 1511 clean scans from 840 patients as ground truth data. Of these, 822 scans were unpaired clean scans from the unpaired OCT dataset (Fig. 1A, black cube), and 689 scans were paired clean scans from the paired VF-OCT dataset (see Fig. 1A, white cube). We trained the artifact correction model on clean scans combined with pseudo-artifact masks collected from artifact-laden scans. To produce the pseudo-artifact masks, we used a dataset of 945 artifact-laden scans from 625 patients, comprising 757 paired and 188 unpaired artifact-laden scans. 
Figure 1.
 
Schematic representation of artifact correction and VF prediction model development. (A) Data splitting. We split the dataset of unpaired OCT and paired VF-OCT scans and allocated the resulting sets of clean and artifact-laden scans for artifact correction and VF prediction. (B) Model training and validation. Model training includes artifact correction model and VF prediction model. We trained VF prediction models using five architectures on the artifact-corrected dataset (i.e. unprocessed clean scans plus artifact-corrected artifact-laden scans). (C) Model evaluation. After inferencing, we assessed the VF prediction using RMSE, MAE with model interpretability (i.e. GradCAM, GradCAM++, and attention map). DINO, Distillation with No Labels; HVF, Humphrey Visual Field; MAE, mean absolute error; OCT, optical coherence tomography; RMSE, root mean square error; ViT, Vision Transformer; VF, visual field.
Experimental Setup
During training, the artifact correction model restores pseudo-artifact-masked clean scans to their original artifact-free state (Fig. 1B, upper panel). To prepare input images, we extracted pseudo-artifact pixels from artifact-laden scans, randomly flipped them horizontally and vertically, and applied them to clean images. The dataset was divided in a 9:1 ratio for training and validation. We did not evaluate or report the model's performance on any additional datasets beyond the validation set. 
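A minimal sketch of this input preparation, assuming binary pseudo-artifact masks and image arrays of matching size, is given below; masked pixels are set to zero, which is our assumption for how missing data are represented.

```python
import numpy as np

def apply_pseudo_artifact(clean_img, artifact_mask, rng=None):
    """Randomly flip a pseudo-artifact mask extracted from an artifact-laden scan
    and apply it to a clean scan (masked pixels set to 0)."""
    rng = rng or np.random.default_rng()
    mask = artifact_mask.copy()
    if rng.random() < 0.5:
        mask = np.fliplr(mask)
    if rng.random() < 0.5:
        mask = np.flipud(mask)
    corrupted = clean_img.copy()
    corrupted[mask.astype(bool)] = 0  # assumption: artifacts appear as near-black (missing) pixels
    return corrupted, mask

# The diffusion model is then trained to restore `corrupted` back to the original clean scan.
```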
Model Setup
We used the UNet2D diffusion model, a state-of-the-art denoising diffusion model25 with a UNet-like architecture.26 We developed it with the open-source MONAI library and used the Denoising Diffusion Probabilistic Model (DDPM) inferer for both training and inference, setting a maximum of 300 training epochs. We applied Adam optimization with an initial learning rate of 2.5 × 10⁻⁵, used the MONAI DDPM scheduler, and implemented early stopping after 25 epochs. The checkpoint with the lowest validation mean squared error (MSE) loss was selected for inference. All experiments ran on an RTX 3060 GPU using the PyTorch framework. 
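The sketch below illustrates a single DDPM training step in the style of the MONAI Generative tutorials; the import paths and parameter names assume the monai-generative package, and the network width, attention settings, and batch contents are assumptions, with only the optimizer and learning rate taken from the text.

```python
import torch
import torch.nn.functional as F
from generative.networks.nets import DiffusionModelUNet
from generative.networks.schedulers import DDPMScheduler
from generative.inferers import DiffusionInferer

# UNet-based diffusion model for 3-channel RNFL thickness maps (architecture settings are assumptions).
model = DiffusionModelUNet(
    spatial_dims=2, in_channels=3, out_channels=3,
    num_channels=(64, 128, 256), attention_levels=(False, True, True),
    num_res_blocks=1, num_head_channels=64,
)
scheduler = DDPMScheduler(num_train_timesteps=1000)
inferer = DiffusionInferer(scheduler)
optimizer = torch.optim.Adam(model.parameters(), lr=2.5e-5)

images = torch.rand(4, 3, 296, 296)  # placeholder batch of pseudo-artifact-masked scans
noise = torch.randn_like(images)
timesteps = torch.randint(0, scheduler.num_train_timesteps, (images.shape[0],)).long()

# Predict the added noise and minimize MSE against it (standard DDPM objective).
noise_pred = inferer(inputs=images, diffusion_model=model, noise=noise, timesteps=timesteps)
loss = F.mse_loss(noise_pred.float(), noise.float())
loss.backward()
optimizer.step()
```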
Visual Field Prediction Model
Dataset
The VF prediction dataset consists of paired VF-OCT scans, including both clean and artifact-laden scans. The training set contains 1063 paired clean scans and 611 paired artifact-laden scans from 572 patients. The testing set contains 283 paired clean scans and 146 paired artifact-laden scans from 193 distinct patients. To ensure independence, we sourced all clean and artifact-laden paired VF-OCT scans used in the training (1063 + 611 = 1674) and testing (283 + 146 = 429) sets from different subjects. We note that the 689 paired clean scans were used only for training the artifact correction model. The artifact-corrected training set comprises the 1063 unprocessed clean scans and the 611 artifact-laden scans after artifact correction; the artifact-corrected testing set comprises the 283 unprocessed clean scans and the 146 artifact-laden scans after artifact correction. 
Experimental Setup
We included both CNN and transformer models for VF prediction. We trained and tested each model architecture under two conditions. The control condition utilized the original dataset (not shown in Fig. 1). In contrast, the experimental condition, which was partially interconnected with the artifact-correction model, utilized the artifact-corrected dataset (see Fig. 1B, lower panel; Fig. 1C). During training, we randomly divided the training set for training and validation at a ratio of 9:1. During testing, we evaluated VF prediction in terms of accuracy (i.e. root mean square error [RMSE] and mean absolute error [MAE]) and model interpretability (i.e. GradCAM, GradCAM++, and attention maps). 
Model Setup
For the architecture selection, previous studies have effectively utilized ocular imaging as inputs for VGG,27–29 DenseNet,27,29–31 and ResNet16,18,27,29,32 to extract features for tasks such as glaucoma detection, progression prediction, and VF loss estimation. More recently, transformer-based architectures have been introduced to the medical field, with studies indicating that these architectures may offer superior or comparable performance in ophthalmic imaging, particularly in terms of generalizability and interpretability.33–36 Here, we include three CNNs (DenseNet-121, ResNet-34, and VGG-16) and two transformers (Vision Transformer [ViT]-Base and DINO-ViT Base), all pre-trained on the ImageNet-1K37 dataset. Each model was customized for regression tasks. These adaptations involved replacing the original classification heads with a ReLU activation function and a linear regression head. The sources for each model architecture used in this study are summarized as follows: 
  • 1. DenseNet-121 (ref. 38): Adapted from the MONAI framework’s DenseNet-121.39
  • 2. ResNet-34 (ref. 40): Built on the ResNet-34 classification model from the Timm library.41
  • 3. VGG-16 (ref. 42): Primarily used for feature extraction, this model is sourced from the Timm library’s VGG-16 classification model41 and includes a pooling layer and a flattening layer for output shape conversion.
  • 4. ViT-Base, patch size 8 (ref. 9): Adapted from the ViT classification model from the Timm library.41
  • 5. DINO-ViT Base, patch size 8 (ref. 10): Originally sourced from the official DINO GitHub repository.
During training, we fine-tuned all model layers. Hyperparameters remained consistent across setups, except for the initial learning rate, which we adjusted to optimize validation performance for each model. We used the Adam optimizer with a weight decay of 0.0001, a batch size of 32, early stopping after 50 consecutive epochs, and a maximum of 500 epochs. RMSE was used as the loss function, and we selected the best-performing model during validation for final inference. 
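As an illustration of this adaptation, the sketch below builds a ViT-Base/8 regressor with a ReLU and a linear head predicting 54 THVs using the Timm library; the learning rate shown is a placeholder, since each model used its own tuned value.

```python
import timm
import torch
import torch.nn as nn

# Pretrained ViT-Base (patch size 8) backbone with the classification head removed.
backbone = timm.create_model("vit_base_patch8_224", pretrained=True, num_classes=0)
model = nn.Sequential(
    backbone,
    nn.ReLU(),
    nn.Linear(backbone.num_features, 54),  # one output per 24-2 test point
)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)  # lr is a placeholder

def rmse_loss(pred, target):
    # RMSE training loss over the 54 predicted THVs (dB)
    return torch.sqrt(torch.mean((pred - target) ** 2))
```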
Evaluation
We calculated the VF prediction error per scan using global RMSE and MAE across all 54 test points. The formulas applied were as follows:  
\begin{eqnarray*}
RMSE\ (dB) &=& \sqrt{\frac{1}{54}\sum_{n = 1}^{54} \left( True\ THV_n - Predicted\ THV_n \right)^2}\\
MAE\ (dB) &=& \frac{1}{54}\sum_{n = 1}^{54} \left| True\ THV_n - Predicted\ THV_n \right|,
\end{eqnarray*}
where n = nth test point of the 24-2 HVF, THV = visual field threshold value. 
We quantified the impact of OCT artifact correction by measuring the reduction in RMSE and MAE. We compared the control model, trained on the original unprocessed dataset, with the experimental model, trained on the artifact-corrected dataset. We denoted the differences, RMSEoriginal – RMSEartifact-corrected and MAEoriginal – MAEartifact-corrected, as RMSEdecr (dB) and MAEdecr (dB). We performed paired two-tailed t-tests and calculated 95% CIs using the critical t-value from the t-distribution with a significance level of 0.05 (Supplementary Table S1, Supplementary Table S2). The t-tests were performed using ttest_rel from scipy.stats. 
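A minimal sketch of the per-scan metrics and the paired comparison, using synthetic error arrays in place of the study's results, is shown below.

```python
import numpy as np
from scipy.stats import ttest_rel

def global_rmse(true_thv, pred_thv):
    """Per-scan RMSE (dB) over the 54 test points of the 24-2 HVF."""
    return np.sqrt(np.mean((true_thv - pred_thv) ** 2))

def global_mae(true_thv, pred_thv):
    """Per-scan MAE (dB) over the 54 test points of the 24-2 HVF."""
    return np.mean(np.abs(true_thv - pred_thv))

# Synthetic per-scan errors standing in for the two conditions (429 test scans).
rng = np.random.default_rng(0)
rmse_original = rng.normal(4.6, 1.0, size=429)
rmse_corrected = rmse_original - rng.normal(0.15, 0.3, size=429)

# Paired two-tailed t-test on RMSE_original - RMSE_artifact-corrected (RMSEdecr).
t_stat, p_value = ttest_rel(rmse_original, rmse_corrected)
print(f"mean RMSEdecr = {np.mean(rmse_original - rmse_corrected):.2f} dB, "
      f"t = {t_stat:.2f}, P = {p_value:.3g}")
```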
Model Interpretability
We utilized feature maps to identify specific areas in the scans that provide important features for VF prediction. To achieve this, we generated GradCAM43 and GradCAM++44 visualizations from the last normalization layer for all models. Additionally, we produced attention maps from the last attention layer for the two transformers. We compared these visualizations between models trained on the original dataset and those trained on the artifact-corrected dataset. All visualizations were adapted from the source code available in the ViT-PyTorch and PyTorch Grad-CAM GitHub repositories. 
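The sketch below shows how such visualizations could be produced with the PyTorch Grad-CAM package for a ViT-style regressor; the target layer, token reshape, and output-index choice are assumptions rather than the study's exact configuration, and random weights stand in for the trained model.

```python
import timm
import torch
import torch.nn as nn
from pytorch_grad_cam import GradCAM
from pytorch_grad_cam.utils.model_targets import ClassifierOutputTarget

def vit_reshape_transform(tensor, h=28, w=28):
    # Drop the class token and fold the 28 x 28 patch tokens (224 / patch size 8) into a 2D map.
    result = tensor[:, 1:, :].reshape(tensor.size(0), h, w, tensor.size(2))
    return result.permute(0, 3, 1, 2)

# Stand-in ViT regressor; in practice the trained VF prediction model is used.
backbone = timm.create_model("vit_base_patch8_224", pretrained=False, num_classes=0)
model = nn.Sequential(backbone, nn.ReLU(), nn.Linear(backbone.num_features, 54)).eval()

cam = GradCAM(model=model,
              target_layers=[backbone.blocks[-1].norm1],  # assumption: last block's normalization layer
              reshape_transform=vit_reshape_transform)

input_tensor = torch.rand(1, 3, 224, 224)                 # placeholder RNFL thickness map
targets = [ClassifierOutputTarget(0)]                     # explain the first of the 54 predicted THVs
heatmap = cam(input_tensor=input_tensor, targets=targets)[0]  # (224, 224) saliency map
```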
Results
Dataset Characteristics
Artifact Correction Dataset
We trained the artifact correction model on 1511 clean OCT scans from 840 patients (Table 1). This dataset comprises 689 paired OCT-VF scans and 822 unpaired clean OCT scans. Of the total 768 scans, 70.8% were glaucomatous eyes, 22.3% were glaucoma suspects, and 6.9% were healthy eyes. Demographic details are provided in Table 1. 
Table 1.
 
Demographics and Ophthalmic Characteristics of the Artifact Correction Dataset
Visual Field Prediction Dataset
For VF prediction, the training set includes 1674 VF-OCT pairs from 951 eyes of 572 subjects (Table 2). This consists of 1063 clean OCT scans and 611 artifact-laden OCT scans. The test set comprises 429 VF-OCT pairs from 345 eyes of 193 subjects, including 283 clean OCT scans and 146 artifact-laden OCT scans. The mean age of participants was 69.05 ± 10.33 years for the training set and 70.22 ± 9.94 years for the testing set. The VF MD was –8.17 ± 6.60 dB in the training set and –7.48 ± 6.30 dB in the test set. The training set includes 62 (8.0%) healthy eyes, 203 (21.2%) glaucoma suspect eyes, and 686 (70.8%) eyes with various subtypes of glaucoma (Table 3). The test set includes 20 (5.6%) healthy eyes, 72 (22.4%) glaucoma suspect eyes, and 253 (72.0%) glaucomatous eyes. Additional demographic characteristics are reported in Tables 2 and 3. 
Table 2.
 
Demographics and Ophthalmic Characteristics of the VF Prediction Dataset
Table 3.
 
Demographic Distribution of the VF Prediction Dataset, Categorized by Scan Quality and Glaucoma Diagnosis
Prediction Accuracy
The artifact correction model achieved the best validation MSE of 2 × 10⁻⁵ at epoch 134, effectively restoring affected areas while preserving artifact-free regions in most scans. However, it was less effective in scans with extensive artifact areas (see Supplementary Fig. S1). 
Transformers consistently outperformed CNNs in VF prediction across both original and artifact-corrected datasets (Table 4). DINO-ViT trained on artifact-corrected data achieved the highest accuracy with RMSE = 4.44 dB and MAE = 3.46 dB (Fig. 2A), whereas VGG, a CNN model trained on artifact-corrected data, performed the worst with RMSE = 5.03 dB. The second-best model was DINO-ViT on the original dataset (RMSE = 4.59 dB). Among CNNs, ResNet trained on the original dataset had the best performance but yielded lower accuracy than DINO-ViT, although the difference was not significant (t = –0.75, P = 0.45; see Supplementary Table S1). On the artifact-corrected dataset, however, DINO-ViT significantly outperformed ResNet (t = –3.42, P < 0.001). 
Table 4.
 
VF Prediction Accuracy of Five Models Using Original and Artifact-Corrected Datasets as Input
Figure 2.
 
DINO-ViT results using the artifact-corrected dataset. (A) Sample predictions. (B) Pointwise MAE (dB) and improvement compared to training without artifact correction. Pointwise MAE improvement (MAEdecr) is positive for improvements and negative for worsened MAE. dB, decibels; DINO-ViT, Distillation with No Labels Vision Transformer; MAE, mean absolute error.
We observed predictive improvements when the model utilized artifact correction preprocessing (see Supplementary Tables S1, S2). Both transformers demonstrated enhanced accuracy with artifact correction. Notably, DINO-ViT achieved significant reductions in error, with RMSEdecr = 0.15 dB (P = 0.04) and MAEdecr = 0.15 dB (P = 0.01). The greatest improvement was observed in artifact-laden scans, with MAEdecr = 0.17 dB (P = 0.045). In contrast, none of the CNN models showed significant changes in performance after artifact correction. 
Model Interpretability
Saliency Map Interpretation
Artifact-Laden Input
 Figure 3A compares Grad-CAM and Grad-CAM++ outputs from transformers and CNNs before and after artifact correction. Initially, both transformers highlight the artifact region while focusing on the optic cup and disc. After correction, DINO-ViT emphasizes the enlarged optic cup characteristic of glaucoma, ignoring the previously affected area. Grad-CAM++ further highlights the thinning of the neuroretinal rim. This finding aligns with patterns of glaucomatous RNFL thickness changes, which is critical for clinicians in diagnosing and monitoring disease progression.45 The ViT model shows scattered attention in the artifact-corrected image, shifting focus to areas outside the optic disc where significant RNFL damage indicates extensive visual impairment. 
Figure 3.
 
(A–C) Grad-CAM and Grad-CAM++ visualizations comparing five models in (A, B) and specifically from DINO-ViT in (C). (D) Attention maps from ViT and DINO-ViT. (A) An artifact-laden scan was input into the original dataset, while its artifact-corrected version was used in the artifact-corrected dataset. (B) A clean, unmodified scan served as input for both datasets. (C) DINO-ViT visualizations for five scans (a–e). (D) Attention maps of two artifact-laden scans (a, b), with corrected versions as inputs in the original and artifact-corrected datasets, respectively. ViT, Vision Transformer; DINO, Distillation with No Labels.
In contrast, CNNs generate similar saliency maps whether using inputs with or without artifact correction, each highlighting slightly different areas around the optic disc. They consistently exhibit minimal or no attention to the artifact region, indicating that CNNs do not depend on these areas for predictions. 
Clean Input
 Figure 3B showcases the analysis of a clean scan, utilizing the same input for both artifact-corrected and original datasets. Whereas CNNs remained focused on the optic disc, the transformers captured distinct features. In the artifact-corrected dataset, DINO-ViT Grad-CAM highlighted the intact superior nerve fiber bundle and optic disc, whereas Grad-CAM++ emphasized the thin neuroretinal rim more clearly. The ViT Grad-CAM displayed both intact superior fibers and the optic cup, whereas Grad-CAM++ scattered attention across the remaining areas. 
DINO-ViT
We observed that the Grad-CAM and Grad-CAM++ results from DINO-ViT closely align with critical areas for identifying abnormalities in optic disc features or RNFL changes (Fig. 3C). The model highlights intact nerve fiber bundles and the preserved neuroretinal rim in scans from the artifact-corrected dataset (see Figs. 3Ca–Cc). Notably, Figure 3Cd shows Grad-CAM identifying wedge-shaped defects of the inferior nerve fiber bundle, correlating with localized scotomas in the upper hemifield. Conversely, in cases of diffuse light sensitivity depression without localized neuroretinal rim notching, the model focuses on regions outside the optic disc and nerve fibers (see Fig. 3Ce). 
Attention Map Interpretation
We observed slight differences in the attention maps between the two transformers (Fig. 3D). The ViT trained on artifact-laden scans exhibits scattered attention, predominantly on the neuroretinal rim. After correction, attention becomes more localized at the superior or inferior poles. DINO-ViT shows more pronounced results. When trained on the original scans, it highlights less relevant areas, such as artifact regions (see Fig. 3Db). With artifact correction, DINO-ViT focuses on glaucomatous patterns, such as wedge-shaped RNFL thinning (see Fig. 3Da) and the intact superior nerve fiber bundles along with the remaining neuroretinal rim in Figure 3Db, while still attending to residual artifacts beneath the optic disc, as the artifact correction process does not completely eliminate all artifact pixels. 
Uniform Manifold Approximation and Projection Interpretation
We used Uniform Manifold Approximation and Projection (UMAP) to visualize feature representations extracted by DINO-ViT (Fig. 4). Each point corresponds to features from an individual OCT scan. Both models, trained with and without artifact correction, showed clustering patterns related to the degree of visual impairment. However, the model trained on the original dataset displayed a clear separation between clean and artifact-laden scans, whereas the artifact-corrected dataset exhibited a more homogeneous distribution (see Fig. 4B). 
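A minimal sketch of this projection, using random feature vectors in place of the DINO-ViT embeddings, is shown below.

```python
import numpy as np
import umap
import matplotlib.pyplot as plt

# Random stand-ins for per-scan DINO-ViT feature vectors and their VF mean deviations.
rng = np.random.default_rng(0)
features = rng.normal(size=(429, 768))
vf_md = rng.uniform(-30, 2, size=429)

# Project the high-dimensional features to 2D and color points by VF MD.
embedding = umap.UMAP(n_components=2, random_state=42).fit_transform(features)
plt.scatter(embedding[:, 0], embedding[:, 1], c=vf_md, cmap="viridis", s=8)
plt.colorbar(label="VF mean deviation (dB)")
plt.title("UMAP of DINO-ViT features")
plt.show()
```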
Figure 4.
 
DINO-ViT UMAP. (A) Original versus artifact-corrected dataset. (B) Subplots for mild, moderate, and severe VF loss. Points are color-coded based on scan quality (clean or artifact-laden) and disease severity as measured by VF mean deviation (MD): mild (MD > −6 dB), moderate (−12 ≤ MD ≤ −6 dB), and severe (MD < −12 dB). dB, decibels; DINO-ViT, Distillation with No Labels Vision Transformer; OCT, optical coherence tomography; UMAP, Uniform Manifold Approximation and Projection; VF, visual field.
Discussion
We developed a DL approach to predict pointwise threshold sensitivities in the 24-2 HVF test from peripapillary RNFL thickness maps. Our results indicated that transformers outperformed CNNs. However, visualization tools showed that transformer models often attended to artifact pixels. To mitigate this, we applied a generative technique to restore artifact-affected areas. We then evaluated VF prediction models with both original and artifact-corrected datasets, comparing three CNNs (DenseNet-121, ResNet-34, and VGG-16) alongside two transformers (ViT-Base and DINO-ViT Base). 
Artifact Correction
Our dataset, along with findings from other studies,46,47 indicates that over 40% of RNFL thickness maps contain artifacts, with higher prevalence in severe glaucoma cases,46–48 eyes with extensive peripapillary atrophy, myopic eyes48,49(p77), and inaccurate device measurements.50,51 These factors can lead to unreliable OCT results. However, in our dataset, artifact-laden scans mainly consist of blink artifacts and small floaters, which arise from localized signal interruptions leading to missing data areas.52 We hypothesize that restoring these areas could enhance the informativeness of the scans, thereby improving the structure-function relationship of VFs. Our findings suggest that this method is particularly effective for transformers, benefiting both accuracy and interpretability. 
In a related study, Shi et al. (2024)51 used unsupervised DL to correct RNFL thickness artifacts by applying vectorized feature representations to retain essential thickness patterns. Their method synthesizes plausible components to enhance RNFL maps. In contrast, we used a denoising diffusion model that preserves original scan details. The model is effective for medical image augmentation tasks53–57 and helps maintain OCT B-scans while enhancing anatomic structures.58 In terms of performance, Shi et al. (2023)28 reported pointwise MAE values of 2.56 to 4.85 dB for predicting total deviations across 52 test points. In comparison, our DINO-ViT, trained on artifact-corrected maps, achieved pointwise MAE ranges of 1.61 to 5.39 dB for the right eyes and 1.87 to 4.98 dB for the left eyes across 54 THVs (Fig. 2B, upper panel). Their model improved MAE by 0 to 0.07 dB post-correction, whereas our model showed a wider MAE improvement range, from a reduction of 0.22 dB to an increase of 0.54 dB (see Fig. 2B, lower panel). 
Visual Field Prediction Accuracy
In our study, DINO-ViT trained on the artifact-corrected dataset performed best (RMSE = 4.44 dB and MAE = 3.46 dB; see Table 4), showing significant improvement with artifact correction (RMSEdecr = 0.15 dB, P = 0.04 and MAEdecr = 0.15 dB, P = 0.01). The model particularly benefited artifact-laden groups, achieving an MAEdecr of 0.17 dB (P = 0.045). 
Inherent variability in VF testing can affect the reliability of results, underscoring the importance of test-retest data for confirming the reproducibility of our findings. Zulauf et al., cited by Boeglin et al. (1992),59 reported a mean fluctuation of 4.25 dB in VF sensitivity for stable glaucoma patients. Similarly, Iester et al. (2000)60 found fluctuations of 2.16 ± 0.5 dB in the short term and 3.23 ± 0.5 dB in the long term among healthy subjects. Given these figures, our model’s predictions, with an MAE of 3.46 dB, are close to this natural variability. Although this proximity is promising, further validation is needed to confirm clinical utility. 
Previous studies have used DL to predict visual fields from OCT imaging, but our approach differs in two key ways. First, we include all 54 SAP test points, whereas prior studies often exclude the two points associated with the physiological blind spot. Including these points is crucial, as the blind spot's position varies among populations and shows a bimodal distribution on the vertical axis in glaucomatous eyes.61 Our ground truth VF shows a bimodal distribution for the physiological blind spot at points 20 (left eye) and 26 (right eye), clustering near 0 dB and around 20 dB (see Supplementary Fig. S2). We concluded that physiological blind spot predictions are beneficial for patients with light responsiveness at this point and for those with altitudinal defects in moderate to advanced glaucomatous optic neuropathy. Including the natural blind spot location improves clinical interpretation, allowing for more personalized predictions, as it spatially influences visual field sensitivity.61 Second, we maintain the original laterality of OCT scans, unlike other studies that flip both OCT and VF data to a uniform left or right eye format.18,19,62–66 Due to these differences, the predictive accuracy of our study cannot be directly compared with previous research. 
In a similar study, Park, Kim, and Lee (2020)18 utilized a combination of macular GCIPL (mGCIPL) and RNFL thickness with the Inception V3 model to predict global HFA 24-2, reporting an RMSE of 4.79 ± 2.56 dB. Similarly, Shin et al. (2021)19 used Inception-ResNet-v2 with inputs from both spectral-domain and swept-source OCT, achieving RMSEs of 5.29 ± 2.68 dB and 4.51 ± 2.54 dB, respectively. Other studies, such as Guo et al. (2017)62 and Mariottoni et al. (2020),64 reported an RMSE of 5.42 dB and an MAE of 4.25 dB, respectively. Other relevant studies aim to predict central visual field sensitivity, targeting the 10-2 HFA test.67,68 Some studies indicate that OCT volumes and B-scans predict visual field outcomes more accurately than RNFL thickness maps.69–71 These approaches benefit from providing additional depth information and detailed blood vessel profiles, offering a more comprehensive understanding of structure-function relationships. 
Model Interpretability
We used explainable tools to assess the relationship between OCT and VF predictions, linking extracted features with model performance on evaluation metrics. This analysis addresses two key questions: first, we explored the unique features prioritized by transformers that contribute to their superior performance over CNNs. Second, we examined how model behavior changes with artifact-corrected data, aiming to understand the impact of artifact correction on prediction accuracy and interpretability. 
In predicting VF outcomes, Grad-CAM and Grad-CAM++ visualizations reveal that DINO-ViT and ViT focus on critical glaucomatous regions within OCT scans, including the optic cup, neuroretinal rim, and both intact and damaged nerve fiber bundles. Other studies on transformers, like DeiT, also emphasize the neuroretinal rim in glaucoma classification.36 Meanwhile, CNNs primarily rely on the optic disc for predictions. Using principal component analysis (PCA), we extracted three principal components corresponding to the optic cup and disc, intact retinal nerve fiber bundles, and remaining RNFL areas (Supplementary Fig. S3). Combined with saliency maps, the result suggests that DINO-ViT integrates these anatomic details independently yet cohesively in detecting glaucomatous changes. 
With artifact correction, DINO-ViT shows the best RMSEdecr and MAEdecr, achieving significant improvement (t = –2.09, P = 0.04; see Supplementary Table S1). We hypothesize that this enhancement arises from restoring missing parts with plausible pixels, making the scans more informative. However, the improvement might also be due to the reduction of bias introduced by artifacts, as observed in UMAP results (see Fig. 4). In the original dataset, artifact-laden and clean scans cluster distinctly, whereas the artifact-corrected dataset shows dispersed artifact-corrected scans among the clean ones, suggesting that the corrections help mitigate confounding biases. Other visualizations, such as Grad-CAM and attention maps, further support these findings. 
Here, we suggest that reliable AI applications in medical imaging should incorporate various visualization techniques to mitigate model pitfalls and enhance interpretability. Furthermore, preprocessing steps such as denoising and repairing dead pixels potentially improve medical image analysis outcomes. 
Limitations and Future Work
Our study has several potential limitations. First, we note that transformers typically require large datasets for training. However, we achieved favorable results with a relatively modest dataset comprising 572 patients and 1674 OCT scans. Second, there are pitfalls in the artifact correction model. Although the diffusion model can restore areas with similar colors and shapes, it is less effective for heavily artifacted, low-quality scans (see Supplementary Fig. S1). We identified potential factors that may impair the model’s performance. Our artifact removal pipeline using DDPM was trained without transfer learning, because the available pretrained weights were for grayscale images, whereas our data are in red, green, and blue (RGB). Additionally, the training set was limited and included some scans with small artifacts that were labeled as clean. We initially set a slightly low black-pixel threshold for labeling because clean scans for training were scarce, which could have affected the model’s learning process. Due to dataset limitations, we did not allocate clean scans specifically for evaluating the artifact correction model; a more thorough analysis of this model remains for future work. 
Last, although training and evaluating the model on an external dataset could assess its generalizability, the current unavailability of public external datasets limits this assessment. Future studies could broaden the scope by including diverse demographics, such as various ethnicities or nationalities, and incorporating OCT scans from different manufacturers, potentially improving the model's applicability across different populations and devices. 
Conclusions
In conclusion, we developed a DL approach to predict 24-2 VFs from RNFL thickness maps. Our results show that transformers, particularly DINO-ViT, outperform CNNs in predictive accuracy. Visualization techniques indicate that the model learned clinically relevant features, enhancing its reliability. We recognized that artifacts in RNFL thickness maps complicate the usability of OCT imaging in glaucoma management, as DL models may unintentionally learn correlations with these artifacts, leading to confounded VF predictions. To address this, we created a generative AI framework to restore artifact-laden pixels, improving predictive accuracy and interpretability of DINO-ViT. 
Acknowledgments
The authors thank Pawin Numthavaj, MD, PhD, from the Department of Clinical Epidemiology and Biostatistics at the Faculty of Medicine Ramathibodi Hospital, Mahidol University, Bangkok, Thailand, for providing advice on data analysis. We also thank Praewpailin Kaimuk, MD, from the Department of Ophthalmology at the Faculty of Medicine Ramathibodi Hospital, Mahidol University, Bangkok, Thailand, for assistance in reviewing electronic medical records and validating the glaucoma diagnosis for the study population. 
Funding for this research was provided by Tonkla Ramathibodi, Faculty of Medicine Ramathibodi Hospital, Mahidol University, Bangkok, Thailand (Dean’s Research Novice Award, grant no. 406/2566). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. 
Human Subjects: This study was approved by the Institutional Review Board (IRB) of the Faculty of Medicine Ramathibodi Hospital, Mahidol University, Bangkok, Thailand (IRB number: MURA2023/751) and was performed in accordance with all tenets of the Declaration of Helsinki. The Institutional Review Board waived the need for informed consent because of the retrospective nature of the study. 
We have published our code on GitHub at https://github.com/gaew25/OCT-VF-artifact-removal
Disclosure: K. Sriwatana, None; C. Puttanawarut, None; Y. Suwan, None; T. Achakulvisut, None 
References
Quigley HA, Broman AT. The number of people with glaucoma worldwide in 2010 and 2020. Br J Ophthalmol. 2006; 90(3): 262–267. [CrossRef] [PubMed]
Schroeder R, Lind JT, Budenz DL. 10.5 - Visual Fields. In: Ophthalmology. Sixth Edition. New York, NY: Elsevier Inc.; 2023: 973–981.
Bengtsson B, Heijl A, Olsson J. Evaluation of a new threshold visual field strategy, SITA, in normal subjects. Acta Ophthalmol Scand. 1998; 76(2): 165–169. [CrossRef] [PubMed]
Hu R, Racette L, Chen KS, Johnson CA. Functional assessment of glaucoma: uncovering progression. Surv Ophthalmol. 2020; 65(6): 639–661. [CrossRef] [PubMed]
Johnson CA . Visual fields: visual field test strategies. In: Giaconi JA, Law SK, Nouri-Mahdavi K, Coleman AL, Caprioli J, eds. Pearls of Glaucoma Management. New York, NY: Springer; 2016: 145–151.
Sathyan P, Anitha S. Optical coherence tomography in glaucoma. J Curr Glaucoma Pract. 2012; 6: 1–5. [CrossRef] [PubMed]
Ferro Desideri L, Rutigliani C, Corazza P, et al. The upcoming role of artificial intelligence (AI) for retinal and glaucomatous diseases. J Optom. 2022; 15: S50–S57. [CrossRef] [PubMed]
Li M, Jiang Y, Zhang Y, Zhu H. Medical image analysis using deep learning algorithms. Front Public Health. 2023; 11: 1273253. [CrossRef] [PubMed]
Dosovitskiy A, Beyer L, Kolesnikov A, et al. An image is worth 16x16 words: transformers for image recognition at scale. arXiv Preprint. Published online June 3, 2021, doi:10.48550/arXiv.2010.11929.
Caron M, Touvron H, Misra I, et al. Emerging properties in self-supervised vision transformers. arXiv Preprint. Published online May 24, 2021, doi:10.48550/arXiv.2104.14294.
Singh P, Sizikova E, Cirrone J. CASS: cross architectural self-supervision for medical image analysis. arXiv Preprint. Published online November 19, 2022, doi:10.48550/arXiv.2206.04170.
Truong T, Mohammadi S, Lenga M. How transferable are self-supervised features in medical image classification tasks? arXiv Preprint. Published online November 30, 2021, doi:10.48550/arXiv.2108.10048.
Wessels F, Schmitt M, Krieghoff-Henning E, et al. A self-supervised vision transformer to predict survival from histopathology in renal cell carcinoma. World J Urol. 2023; 41(8): 2233–2241. [CrossRef] [PubMed]
Park S, Kim G, Oh Y, et al. Self-evolving vision transformer for chest X-ray diagnosis through knowledge distillation. Nat Commun. 2022; 13(1): 3848. [CrossRef] [PubMed]
Prokop J, Tordera JM, Jaworek-Korjakowska J, Mohammadi S. Deep metric learning for few-shot X-ray image classification. arXiv Preprint. Published online August 29, 2023:2023.08.27.23294690, doi:10.1101/2023.08.27.23294690.
Christopher M, Bowd C, Belghith A, et al. Deep learning approaches predict glaucomatous visual field damage from OCT optic nerve head en face images and retinal nerve fiber layer thickness maps. Ophthalmology. 2020; 127(3): 346–356. [CrossRef] [PubMed]
George Y, Antony BJ, Ishikawa H, Wollstein G, Schuman JS, Garnavi R. Attention-guided 3D-CNN framework for glaucoma detection and structural-functional association using volumetric images. IEEE J Biomed Health Inform. 2020; 24(12): 3421–3430. [CrossRef] [PubMed]
Park K, Kim J, Lee J. A deep learning approach to predict visual field using optical coherence tomography. PLoS One. 2020; 15(7): e0234902. [CrossRef] [PubMed]
Shin J, Kim S, Kim J, Park K. Visual field inference from optical coherence tomography using deep learning algorithms: a comparison between devices. Transl Vis Sci Technol. 2021; 10(7): 4. [CrossRef] [PubMed]
Yu HH, Maetschke SR, Antony BJ, et al. Estimating global visual field indices in glaucoma by combining macula and optic disc OCT scans using 3-dimensional convolutional neural networks. Ophthalmol Glaucoma. 2021; 4(1): 102–112. [CrossRef] [PubMed]
Kamalipour A, Moghimi S, Khosravi P, et al. Combining optical coherence tomography and optical coherence tomography angiography longitudinal data for the detection of visual field progression in glaucoma. Am J Ophthalmol. 2023; 246: 141–154. [CrossRef] [PubMed]
Buslaev A, Iglovikov VI, Khvedchenya E, Parinov A, Druzhinin M, Kalinin AA. Albumentations: fast and flexible image augmentations. Information. 2020; 11(2): 125. [CrossRef]
Wang X, Xie L, Dong C, Shan Y. Real-ESRGAN: training real-world blind super-resolution with pure synthetic data. arXiv Preprint. Published online August 17, 2021. Accessed April 23, 2024, http://arxiv.org/abs/2107.10833.
Saifee M. hvf-extraction-script: Python extraction script for HVF report images. Accessed April 23, 2024, https://github.com/msaifee786/hvf_extraction_script.
Ho J, Jain A, Abbeel P. Denoising diffusion probabilistic models. arXiv Preprint. Published online December 16, 2020, doi:10.48550/arXiv.2006.11239.
Pinaya WHL, Graham MS, Kerfoot E, et al. Generative AI for medical imaging: extending the MONAI framework. arXiv Preprint. Published online July 27, 2023, doi:10.48550/arXiv.2307.15208.
Yi S, Zhang G, Qian C, Lu Y, Zhong H, He J. A multimodal classification architecture for the severity diagnosis of glaucoma based on deep learning. Front Neurosci. 2022; 16: 939472. [CrossRef] [PubMed]
Shi M, Sun JA, Lokhande A, et al. Artifact correction in retinal nerve fiber layer thickness maps using deep learning and its clinical utility in glaucoma. Transl Vis Sci Technol. 2023; 12(11): 12. [CrossRef] [PubMed]
Kim JA, Yoon H, Lee D, et al. Development of a deep learning system to detect glaucoma using macular vertical optical coherence tomography scans of myopic eyes. Sci Rep. 2023; 13(1): 8040. [CrossRef] [PubMed]
Fuentes-Hurtado F, Morales S, Mossi JM, et al. Deep-learning-based classification of rat OCT images after intravitreal injection of ET-1 for glaucoma understanding. In: Yin H, Camacho D, Novais P, Tallón-Ballesteros AJ, eds. Intelligent Data Engineering and Automated Learning – IDEAL 2018. New York, NY: Springer International Publishing; 2018: 27–34, doi:10.1007/978-3-030-03493-1_4.
Mohammadzadeh V, Vepa A, Li C, et al. Prediction of central visual field measures from macular OCT volume scans with deep learning. Transl Vis Sci Technol. 2023; 12(11): 5. [CrossRef] [PubMed]
Pham QTM , Han JC, Park DY, Shin J. Multimodal deep learning model of predicting future visual field for glaucoma patients. IEEE Access. 2023; 11: 19049–19058. [CrossRef]
Philippi D, Rothaus K, Castelli M. A vision transformer architecture for the automated segmentation of retinal lesions in spectral domain optical coherence tomography images. Sci Rep. 2023; 13(1): 517. [CrossRef] [PubMed]
Hwang EE, Chen D, Han Y, Jia L, Shan J. Multi-dataset comparison of vision transformers and convolutional neural networks for detecting glaucomatous optic neuropathy from fundus photographs. Bioengineering. 2023; 10(11): 1266. [CrossRef] [PubMed]
He J, Wang J, Han Z, Ma J, Wang C, Qi M. An interpretable transformer network for the retinal disease classification using optical coherence tomography. Sci Rep. 2023; 13(1): 3637. [CrossRef] [PubMed]
Fan R, Alipour K, Bowd C, et al. Detecting glaucoma from fundus photographs using deep learning without convolutions: transformer for improved generalization. Ophthalmol Sci. 2023; 3(1): 100233. [CrossRef] [PubMed]
Yalniz IZ, Jégou H, Chen K, Paluri M, Mahajan D. Billion-scale semi-supervised learning for image classification. arXiv Preprint. Published online May 1, 2019, doi:10.48550/arXiv.1905.00546.
Huang G, Liu Z, van der Maaten L, Weinberger KQ. Densely connected convolutional networks. arXiv Preprint. Published online January 28, 2018, doi:10.48550/arXiv.1608.06993.
Cardoso MJ, Li W, Brown R, et al. MONAI: an open-source framework for deep learning in healthcare. arXiv Preprint. Published online November 4, 2022, doi:10.48550/arXiv.2211.02701.
Wightman R, Touvron H, Jégou H. ResNet strikes back: an improved training procedure in Timm. arXiv Preprint. Published online October 1, 2021, doi:10.48550/arXiv.2110.00476.
Wightman R . PyTorch image models. GitHub. Published online 2019, doi:10.5281/zenodo.4414861.
Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv Preprint. Published online April 10, 2015, doi:10.48550/arXiv.1409.1556.
Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D. Grad-CAM: visual explanations from deep networks via gradient-based localization. Int J Comput Vis. 2020; 128(2): 336–359. [CrossRef]
Chattopadhyay A, Sarkar A, Howlader P, Balasubramanian VN. Grad-CAM++: improved visual explanations for deep convolutional networks. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). 2018: 839–847.
Asrani S, Essaid L, Alder BD, Santiago-Turla C. Artifacts in spectral-domain optical coherence tomography measurements in glaucoma. JAMA Ophthalmol. 2014; 132(4): 396–402. [CrossRef] [PubMed]
Figure 1. Schematic representation of artifact correction and VF prediction model development. (A) Data splitting. We split the dataset of unpaired OCT and paired VF-OCT scans and allocated the resulting sets of clean and artifact-laden scans to artifact correction and VF prediction. (B) Model training and validation. Model training includes the artifact correction model and the VF prediction models. We trained VF prediction models using five architectures on the artifact-corrected dataset (i.e., unprocessed clean scans plus artifact-corrected versions of artifact-laden scans). (C) Model evaluation. After inference, we assessed the VF predictions using RMSE and MAE, together with model interpretability methods (i.e., GradCAM, GradCAM++, and attention maps). DINO, Distillation with No Labels; HVF, Humphrey Visual Field; MAE, mean absolute error; OCT, optical coherence tomography; RMSE, root mean square error; ViT, Vision Transformer; VF, visual field.
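For reference, the global error metrics in panel C can be computed directly from the predicted and measured 24-2 sensitivity thresholds. The following is a minimal sketch, not the authors' code; the function name, array shapes, and the placeholder data are illustrative assumptions.

```python
import numpy as np

def global_errors(y_true: np.ndarray, y_pred: np.ndarray) -> tuple[float, float]:
    """Global RMSE and MAE (dB) over all eyes and all 54 test locations.

    y_true, y_pred: arrays of shape (n_eyes, 54) holding measured and
    model-predicted 24-2 HVF sensitivity thresholds in dB.
    """
    err = y_pred - y_true
    rmse = float(np.sqrt(np.mean(err ** 2)))
    mae = float(np.mean(np.abs(err)))
    return rmse, mae

# Example with random placeholder data (two eyes, 54 thresholds each).
rng = np.random.default_rng(0)
y_true = rng.uniform(0, 35, size=(2, 54))
y_pred = y_true + rng.normal(0, 3, size=(2, 54))
print(global_errors(y_true, y_pred))
```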
Figure 2. DINO-ViT results using the artifact-corrected dataset. (A) Sample predictions. (B) Pointwise MAE (dB) and its improvement relative to training without artifact correction. The pointwise MAE improvement (MAEdecr) is positive where MAE improves and negative where it worsens. dB, decibels; DINO-ViT, Distillation with No Labels Vision Transformer; MAE, mean absolute error.
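A sketch of the pointwise metric in panel B is given below. It assumes, based on the caption, that MAEdecr is the per-location MAE of the model trained on the original dataset minus that of the model trained on the artifact-corrected dataset; the function names and array shapes are illustrative.

```python
import numpy as np

def pointwise_mae(y_true: np.ndarray, y_pred: np.ndarray) -> np.ndarray:
    """Per-location MAE (dB) across eyes; inputs have shape (n_eyes, 54)."""
    return np.mean(np.abs(y_pred - y_true), axis=0)

def mae_improvement(mae_original: np.ndarray, mae_corrected: np.ndarray) -> np.ndarray:
    """Assumed definition: positive where the artifact-corrected model improves."""
    return mae_original - mae_corrected

# Usage: mae_improvement(pointwise_mae(y_true, pred_orig),
#                        pointwise_mae(y_true, pred_corrected))
# returns a length-54 array, one value per 24-2 test location.
```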
Figure 3. (A–C) Grad-CAM and Grad-CAM++ visualizations comparing the five models (A, B) and DINO-ViT specifically (C). (D) Attention maps from ViT and DINO-ViT. (A) An artifact-laden scan served as input for the original dataset, while its artifact-corrected version served as input for the artifact-corrected dataset. (B) A clean, unmodified scan served as input for both datasets. (C) DINO-ViT visualizations for five scans (a–e). (D) Attention maps of two artifact-laden scans (a, b), with their corrected versions used as inputs in the original and artifact-corrected datasets, respectively. ViT, Vision Transformer; DINO, Distillation with No Labels.
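The figure uses Grad-CAM and Grad-CAM++ heatmaps. As a generic illustration only (not the authors' implementation), the sketch below computes a standard Grad-CAM map with PyTorch hooks for a CNN-style regressor that outputs 54 thresholds; the choice of target layer, the use of the summed output as the backward target, and all names are assumptions.

```python
import torch
import torch.nn.functional as F

def grad_cam(model: torch.nn.Module, target_layer: torch.nn.Module,
             x: torch.Tensor) -> torch.Tensor:
    """Grad-CAM heatmap for a model mapping (1, C, H, W) images to (1, 54) outputs."""
    acts, grads = {}, {}

    def fwd_hook(_module, _inputs, output):
        acts["a"] = output.detach()          # feature maps of the target layer

    def bwd_hook(_module, _grad_in, grad_out):
        grads["g"] = grad_out[0].detach()    # gradients w.r.t. those feature maps

    h1 = target_layer.register_forward_hook(fwd_hook)
    h2 = target_layer.register_full_backward_hook(bwd_hook)
    try:
        model.zero_grad()
        out = model(x)                       # (1, 54) predicted thresholds
        out.sum().backward()                 # aggregate all 54 points as the target
    finally:
        h1.remove()
        h2.remove()

    weights = grads["g"].mean(dim=(2, 3), keepdim=True)   # channel-wise importance
    cam = F.relu((weights * acts["a"]).sum(dim=1))[0]     # (H', W') heatmap
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
```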
Figure 4. DINO-ViT UMAP. (A) Original versus artifact-corrected dataset. (B) Subplots for mild, moderate, and severe VF loss. Points are color-coded by scan quality (clean or artifact-laden) and by disease severity as measured by VF mean deviation (MD): mild (MD > −6 dB), moderate (−12 ≤ MD ≤ −6 dB), and severe (MD < −12 dB). dB, decibels; DINO-ViT, Distillation with No Labels Vision Transformer; OCT, optical coherence tomography; UMAP, Uniform Manifold Approximation and Projection; VF, visual field.
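The severity bins in the legend and a generic UMAP projection could be reproduced as sketched below. The feature source (e.g., DINO-ViT embeddings of dimension 768), the UMAP parameters, and the umap-learn dependency are assumptions for illustration, not the authors' configuration.

```python
import numpy as np
import umap  # from the umap-learn package

def severity(md_db: float) -> str:
    """Bin VF mean deviation (MD, dB) as in the figure legend."""
    if md_db > -6:
        return "mild"
    if md_db >= -12:
        return "moderate"      # covers -12 <= MD <= -6 dB
    return "severe"

# Hypothetical feature matrix: one row of model embeddings per OCT scan.
features = np.random.default_rng(0).normal(size=(200, 768))
embedding = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=0).fit_transform(features)
# embedding has shape (200, 2); points can then be colored by severity(MD)
# and by scan quality (clean vs. artifact-laden).
```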
Table 1. Demographics and Ophthalmic Characteristics of the Artifact Correction Dataset
Table 2. Demographics and Ophthalmic Characteristics of the VF Prediction Dataset
Table 3. Demographic Distribution of the VF Prediction Dataset, Categorized by Scan Quality and Glaucoma Diagnosis
Table 4. VF Prediction Accuracy of Five Models Using Original and Artifact-Corrected Datasets as Input