This study evaluated a robust DL-based framework designed for automatic anomaly detection in retinal OCT volumes, demonstrating promising results at both the volume and individual B-scan levels. Notably, we conducted an extensive evaluation of the method using a large and clinically relevant OCT dataset, ensuring a thorough assessment of its performance and potential clinical applicability. The method uses an unsupervised teacher-student (T-S) knowledge distillation approach to identify anomalous cases. As a reminder, in this study, we considered an OCT volume as anomalous if it showed signs of retinal disease, which in our test dataset appeared as iAMD, nAMD, GA, DME, Stargardt disease, RVO, or CSC. The dataset contains cases with different severity stages of each of these diseases.
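To make the scoring mechanism concrete, the sketch below illustrates how a teacher-student feature discrepancy can be turned into a B-scan anomaly map and score. It is a minimal illustration only: the single feature scale, the mean over channels, and the maximum-based scalar score are assumptions and do not necessarily match the exact architecture and aggregation used in this study.

```python
# Minimal sketch of T-S feature-discrepancy scoring (illustrative only).
# Assumptions: a frozen, pre-trained teacher and a student trained to regress
# the teacher's features on normal B-scans; backbones, feature scales, and
# score aggregation in the evaluated system may differ.
import torch
import torch.nn.functional as F

def anomaly_map_and_score(teacher, student, b_scan):
    """b_scan: tensor of shape (1, C, H, W)."""
    with torch.no_grad():
        t_feat = teacher(b_scan)  # (1, D, h, w) teacher features
        s_feat = student(b_scan)  # (1, D, h, w) student features
    # Pixel-wise squared regression error in feature space: large where the
    # student fails to imitate the teacher, i.e., outside normal anatomy.
    err = ((t_feat - s_feat) ** 2).mean(dim=1, keepdim=True)  # (1, 1, h, w)
    # Upsample to the input resolution to obtain an explanation map.
    amap = F.interpolate(err, size=b_scan.shape[-2:], mode="bilinear",
                         align_corners=False)
    # One scalar anomaly score per B-scan, here the maximum map response.
    score = amap.max().item()
    return amap, score
```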
Automated retinal OCT anomaly detection is an active research topic. For instance, Seebock et al.38 proposed to use an uncertainty-aware retinal layer segmentation model trained on normal data to identify potentially anomalous OCT regions. In anomalous cases, it becomes difficult for the network to properly segment the layers, increasing its uncertainty and thus allowing for anomaly detection. However, this method may miss small anomalies, such as drusen, if the network is still capable of properly segmenting the retinal layers. In addition, weakly supervised approaches are not as future-proof as desirable, because annotation efforts have to be repeated whenever shifts in acquisition settings occur, for example, due to new acquisition equipment. Because of this, another well-established approach is to use normal data to train generative models.39,40 In essence, these methods work by learning to reconstruct normal B-scans. At test time, when provided with anomalous cases, the networks fail in the reconstruction process because they were never exposed to such types of cases. As a consequence, measuring the reconstruction error between the input and generated images makes it possible to infer whether a B-scan is anomalous. However, reconstruction-based approaches are notoriously difficult to train due to their computational demand, need for careful hyper-parameter tuning, and potential training instabilities. In addition, although conceptually similar to the approach presented here, which only uses normal data for training, measuring differences in pixel space, instead of in feature space, makes these approaches very susceptible to noise and reconstruction errors, ultimately increasing the risk of FP detections.
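For contrast with the feature-space approach above, the following sketch shows how such reconstruction-based baselines typically score a B-scan. It is not the method evaluated in this work; the `autoencoder` callable and the mean absolute residual are illustrative assumptions.

```python
# Illustrative reconstruction-error scoring for a generative baseline
# (not the T-S method evaluated in this study). The autoencoder is assumed
# to have been trained to reconstruct normal B-scans only.
import torch

def reconstruction_score(autoencoder, b_scan):
    """b_scan: tensor of shape (1, 1, H, W) with intensities in [0, 1]."""
    with torch.no_grad():
        recon = autoencoder(b_scan)
    # Pixel-space residual: large for structures the model never learned to
    # reconstruct, but also sensitive to speckle noise and imperfect
    # reconstructions, which is why such scores are prone to FP detections.
    residual = (b_scan - recon).abs()
    return residual.mean().item()
```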
A notable example of an unsupervised retinal OCT anomaly detection method is presented by Tiosano et al.41 In their work, the authors propose to use a CNN pre-trained on natural images to create a region-level feature set from normal OCT cases. These local region representations are then processed to eliminate redundant samples, creating a representative feature bank. At test time, anomalies are detected by measuring the k-nearest-neighbors (kNN) distance between the extracted features and those in the feature bank. Similarly to our approach, having a region-based distance computation makes it possible to produce anomaly explanation maps. However, their method works on a single feature scale, making their out-of-the-box maps coarser. In addition, the performance of these types of multi-step approaches may be highly reliant on hyperparameter selection. In particular, changes in the number of samples in the feature bank and in the number of neighbors used during distance calculation can greatly impact the behavior of the approach.
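Based only on the description above, a feature-bank kNN scoring step in the spirit of Tiosano et al. could look like the following sketch; the choice of k, the use of scikit-learn, and the mean-distance score are illustrative assumptions, and they correspond exactly to the hyperparameters flagged as sensitive in the preceding paragraph.

```python
# Sketch of feature-bank kNN anomaly scoring (illustrative assumptions only).
import numpy as np
from sklearn.neighbors import NearestNeighbors

def build_feature_bank(normal_region_features, k=5):
    """normal_region_features: (N, D) array of region-level descriptors
    extracted from normal OCT scans (after redundancy removal)."""
    return NearestNeighbors(n_neighbors=k).fit(normal_region_features)

def region_anomaly_scores(bank, test_region_features):
    """Returns one score per test region: the mean distance to its k
    nearest normal regions; large distances indicate anomalous regions."""
    dists, _ = bank.kneighbors(test_region_features)
    return dists.mean(axis=1)
```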
Instead, in this work, we contribute to the existing research on unsupervised retinal OCT anomaly detection by assessing the performance of an easy-to-set-up, end-to-end system that requires no annotation effort (apart from the collection of normal cases) and minimal hyperparameter tuning. The system showed robust detection performance across different pathomorphological manifestations spanning AMD, RVO, and DME. Our results highlight the potential of such DL approaches to enhance the accuracy and efficiency of retinal OCT image analysis, potentially reducing the burden on clinicians and care providers by offering a reliable preliminary screening tool.
One of the significant advantages of this DL-based anomaly detection system is its unsupervised nature, requiring only normal scans for training. Indeed, this approach circumvents the need for a large, labeled dataset of pathological images, which can be challenging and time-consuming to compile. Instead, by training the model exclusively on normal retinal OCT images, it learns to recognize the standard anatomic structures and patterns of the retina. Consequently, any deviation from this learned normality is flagged as an anomaly. Besides simplifying the model preparation process, especially in terms of data collection and curation, such an approach also ensures that the system can generalize to a wide variety of anomalies, including previously unseen pathological conditions. The unsupervised nature of the studied approach allows it to be both adaptable and robust, making it a highly practical solution for real-world clinical settings where the diversity of abnormalities is vast and continually evolving.
By achieving a high detection performance at the volume level, as demonstrated by the 0.94 average volume-level AUC (see the Detection of Anomalous OCT Volumes section, Fig. 2), models such as this one ensure that entire retinal scans can be rapidly and accurately assessed for anomalies, facilitating early detection of various retinal diseases. At the B-scan level (see the Detection of Anomalous B-Scans section), the system's ability to pinpoint specific abnormal slices within a volume allows for more detailed examination and focused clinical attention on the most relevant areas of the dense OCT scan, optimizing the use of clinician time and expertise. In the context of daily clinical practice, this level of automation is essential for managing the increasing volume of retinal OCT scans, driven by a growing and aging population. On the other hand, the B-scan-level detection performance was not the same for all types of pathologies. Indeed, we verified that the detection performance is directly related to the amount of pathology present and its size (see the lesion size analysis in Fig. 5 and the iAMD performance in Fig. 6). A possible explanation is that smaller lesions are more subtle and therefore may not be adequately captured by the intermediate features of the teacher model T. Nevertheless, the B-scan-level detection AUC values >0.8 suggest that this approach can substantially reduce the workload of identifying slices of diagnostic relevance.
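As a concrete illustration of the volume-level evaluation, B-scan scores can be aggregated into a single volume score and then evaluated with the AUC; the top-fraction averaging below is an assumed aggregation rule and may differ from the one used in this study.

```python
# Aggregating B-scan anomaly scores into a volume-level score and computing
# the AUC; the top-fraction averaging rule is an assumption for illustration.
import numpy as np
from sklearn.metrics import roc_auc_score

def volume_score(b_scan_scores, top_fraction=0.1):
    """Average the highest-scoring fraction of B-scans so that a few clearly
    pathological slices dominate the volume-level score."""
    scores = np.sort(np.asarray(b_scan_scores))[::-1]
    n = max(1, int(len(scores) * top_fraction))
    return scores[:n].mean()

def volume_level_auc(per_volume_b_scan_scores, volume_labels):
    """volume_labels: 1 for anomalous volumes, 0 for normal volumes."""
    volume_scores = [volume_score(s) for s in per_volume_b_scan_scores]
    return roc_auc_score(volume_labels, volume_scores)
```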
Following the good B-scan-level anomaly detection performance, one of the standout features of our system is its explainability, which is crucial for clinical adoption. The explanation maps generated by the model provide a visual representation of the detected anomalies, enabling clinicians to understand the reasoning behind the model's decisions (see the Detection of Anomalous OCT Volumes section, in particular, Fig. 4). This transparency is vital to build trust in the system and facilitates its integration into diagnostic workflows. Moreover, the ability to visualize anomalies directly on retinal OCT images aids in the quick localization of potential issues, streamlining the diagnostic process. We additionally showed (see the Association of Anomaly Score With Disease Severity section) that the scores of the anomaly maps correlate well (Spearman's correlation coefficient >0.7) with the amount of disease activity. For these reasons, producing these explanation maps not only assists in verifying the model's findings but may also enhance the clinician's ability to make informed decisions regarding patient care. On the other hand, the produced anomaly maps still do not properly highlight all pathological manifestations. As shown in Figure 4 (nAMD and RVO cases), large fluid pockets are not fully highlighted. This behavior is most likely due to the similarity of the intensity and noise profiles between large fluid pockets and the background. As the method is inherently trained patch-wise, the student S has learned to properly represent low-intensity noisy regions (i.e., background) during training and is thus also partially capable of representing fluid. Despite this, the anomaly maps can still guide the attention of the user to the relevant location within the B-scan, and thus serve as a valuable tool for interpreting the algorithm's output.
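The association with disease activity reported above can be quantified as follows; the inputs are placeholders for the study's per-eye anomaly-map scores and the matched severity measure.

```python
# Rank correlation between anomaly-map scores and disease activity; the
# variable names are placeholders for the measurements used in this study.
from scipy.stats import spearmanr

def severity_association(anomaly_scores, disease_activity):
    """anomaly_scores: one aggregate anomaly-map score per eye/visit.
    disease_activity: a matched quantitative severity measure (e.g., graded
    stage or lesion burden). Returns Spearman's rho and the p-value."""
    rho, p_value = spearmanr(anomaly_scores, disease_activity)
    return rho, p_value
```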
Unsupervised anomaly detection is a particularly challenging problem, especially in comparison to supervised learning. Unlike supervised methods that rely on labeled samples, unsupervised anomaly detection requires distinguishing unseen pathological patterns from normal variations without any additional information. A key challenge in this task is encoding invariances, that is, determining what the model should be insensitive to. This includes, for example, accounting for variations in acquisition settings, anatomical diversity across individuals, or benign structural changes. This remains an open research problem in medical computer vision. In this work, we focus on a proof-of-concept, demonstrating the feasibility of unsupervised anomaly detection for retinal OCT screening.
The presented work has a few limitations. Being an anomaly detection approach, an obvious pitfall is that the system cannot produce a probability of a specific disease being present in a volume. Indeed, there is a trade-off between the studied T-S approach, which is better suited to non-specific screening tasks, and fully supervised approaches, which can accurately diagnose a known group of diseases while making obvious errors for cases outside their learning curriculum. In the same line of thought, due to its generic anomaly detection capability, the approach may not be able to distinguish anomalies caused by pathologies from those that result from the acquisition pipeline (e.g., noise and artifacts). Although not problematic for the studied datasets, in real-world data this limitation may lead to incorrect volume flagging. In addition, the anomaly maps produced by the system are not yet sufficiently comprehensive to enable fully automated detection of pathological biomarkers. In particular, they are not precise enough to pinpoint spatial locations, nor do they allow differentiation between various types of abnormalities. This limitation makes it difficult to reliably identify the underlying causes of anomalies, especially in cases where the maps highlight regions that are merely distorted or out of place rather than directly indicative of pathological biomarkers, such as fluid accumulation. For example, in large fluid pockets, the receptive field of the model, that is, the effective region of the image the model is assessing, is not large enough to cover the entirety of the pathological region, thus leading to subpar explanation maps. These limitations highlight an exciting opportunity for further work, which can focus on enhancing the system's explainability by incorporating methods that provide more interpretable insights into why certain regions are flagged as anomalous. For example, integrating domain-specific priors, such as the expected thickness of retinal layers, could improve the anomaly maps without sacrificing the generalization capabilities of the system. This would not only improve the system's reliability but also further increase its practical utility in clinical settings, where transparent and interpretable results are essential for effective decision making. Finally, the assessed system considers each B-scan individually and thus does not take advantage of the inherent inter-slice information available in volumetric OCT scans, hindering its ability to produce coherent results for neighboring scans. Incorporating 3-dimensional (3D) information into the system poses several challenges. First, leveraging 3D data significantly increases the computational complexity of the model. In particular, both memory and processing requirements grow because volumetric data involves processing not just individual 2D slices but also their spatial relationships in a 3D context. Such computational resources are not always available, making the management of 3D medical data a current topic of research in the scientific community. Second, the thickness of slices and their spacing in volumetric scans depend on the acquisition protocol, introducing variability that complicates the extraction of meaningful 3D features. For instance, thicker slices or uneven spacing can result in a loss of fine details and inconsistencies in the volumetric representation, which could degrade model performance.
Moreover, the alignment of slices to maintain spatial coherence during pre-processing and model inference is another technical challenge, particularly in cases where motion artifacts or structural distortions are present in the data. To address these issues, future work should focus not only on designing more efficient algorithms capable of processing 3D information but also on extending dataset size and diversity to better capture these complexities and ensure robust model performance across a range of scenarios. In addition, although we evaluated our system on seven different common pathologies, there is a risk that our findings do not generalize to other pathomorphological manifestations. Future research should thus also consider extending the validation of the system to a wider range of diseases. Likewise, the in-house dataset primarily consists of Caucasian individuals, which may limit the generalizability of our findings to populations with different demographic compositions.
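To illustrate the receptive-field limitation mentioned above, the small helper below computes the receptive field of a stack of convolutional layers; the layer configuration is hypothetical and is only meant to show why a patch-based model may not cover an entire large fluid pocket at once.

```python
# Back-of-the-envelope receptive-field calculation for a convolutional stack;
# the layer configuration below is hypothetical and purely illustrative.
def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, from input to output."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump  # each layer widens the field by (k-1)*jump
        jump *= s             # striding spreads subsequent contributions
    return rf

# Example: four 3x3 conv layers, two of them with stride 2.
print(receptive_field([(3, 1), (3, 2), (3, 1), (3, 2)]))  # -> 13 pixels
```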
Additionally, all data in the in-house dataset were acquired using a single OCT device type (Spectralis), which may introduce certain limitations in applicability when considering other acquisition systems. This is primarily due to potential differences in factors such as noise profiles, field of view, and resolution. However, it is important to note that OCT screening is commonly device-specific, as both the hardware and software configurations of different devices influence the imaging characteristics. Therefore, the use of data from a single device is not an unusually restrictive limitation. Indeed, it is reasonable to expect that the general principles underlying the assessed anomaly detection framework can generalize effectively to other OCT devices with proper fine-tuning or calibration. That is, whereas the in-house dataset reflects the characteristics of the Spectralis-acquired images, the broader approach remains adaptable and applicable to images obtained from other systems.
In conclusion, the presented DL-based anomaly detection system for retinal OCT images shows the potential of unsupervised anomaly detection approaches. The high accuracy at both the volume and the B-scan level ensures comprehensive and reliable anomaly detection, while the anomaly maps provide the transparency necessary for clinical use. These features collectively position the system as a valuable tool in the early detection and diagnosis of retinal diseases, potentially improving patient outcomes through timely and accurate interventions. The system's unsupervised nature, requiring only normal data for training, simplifies the data collection process and enhances its ability to detect a wide range of anomalies, including previously unseen conditions. Furthermore, by automating the analysis of large volumes of retinal OCT data, this system has the potential both to alleviate the workload of clinicians and to increase diagnostic success by providing an objective and explainable second opinion that could mitigate oversight caused by common factors such as fatigue, time stress, or subjective qualitative assessment. Ultimately, generic approaches such as the one discussed here can constitute a relevant complement to task-specific automated AI-based systems, improving clinical workflow and patient care.