Abstract
Purpose:
To evaluate the performance of the Pegasus-OCT (Visulytix Ltd) multiclass automated fluid segmentation algorithms on independent spectral domain optical coherence tomography data sets.
Methods:
The Pegasus automated fluid segmentation algorithms were applied to three data sets with edematous pathology, comprising 750, 600, and 110 b-scans, respectively. Intraretinal fluid (IRF), sub-retinal fluid (SRF), and pigment epithelial detachment (PED) were automatically segmented by Pegasus-OCT for each b-scan where ground truth from data set owners was available. Detection performance was assessed by calculating sensitivities and specificities, while Dice coefficients were used to assess agreement between the segmentation methods.
Results:
For two data sets, IRF detection yielded promising sensitivities (0.98 and 0.94, respectively) and specificities (1.00 and 0.98) but less consistent agreement with the ground truth (dice coefficients 0.81 and 0.59); likewise, SRF detection showed high sensitivity (0.86 and 0.98) and specificity (0.83 and 0.89) but less consistent agreement (0.59 and 0.78). PED detection on the first data set showed moderate agreement (0.66) with high sensitivity (0.97) and specificity (0.98). IRF detection in a third data set yielded less favorable agreement (0.46–0.57) and sensitivity (0.59–0.68), attributed to image quality and ground truth grader discordance.
Conclusions:
The Pegasus automated fluid segmentation algorithms were able to detect IRF, SRF, and PED in SD-OCT b-scans acquired across multiple independent data sets. Dice coefficients and sensitivity and specificity values indicate the potential for application to automated detection and monitoring of retinal diseases such as age-related macular degeneration and diabetic macular edema.
Translational Relevance:
The potential of Pegasus-OCT for automated fluid quantification and differentiation of IRF, SRF, and PED in OCT images has application to both clinical practice and research.
The analysis was performed on masks generated for every OCT b-scan within each data set representing the location of fluid identified by the Pegasus system. These were then compared to a further set of masks containing the “ground truth” fluid locations within every b-scan as determined by the data set owners. All analyses were performed using custom scripts in MATLAB R2019a (The MathWorks, Inc., Natick, MA, USA).
Dice coefficients
17 were used to assess agreement between the manual segmentation provided by the data set owners and the Pegasus system. These were calculated as follows:
\begin{eqnarray*}{\rm{Dice}}\,{\hbox{coefficient}} = 2\left( {{\rm{A}} \cap {\rm{B}}} \right)/\left( {{\rm{A}} + {\rm{B}}} \right),\end{eqnarray*}
where A represents the manual segmentation, B represents the automated segmentation, and A∩B is the number of common pixels between the two sets.
Sensitivity and specificity values were calculated based on comparing the presence or absence of fluid against the ground truth for each individual b-scan within the overall data set or for individual eyes where sufficient data were available. Sensitivity was determined based on the number of slices where any fluid was identified by Pegasus that corresponded to a slice where the ground truth also identified any fluid (true positive; TP), divided by this number (TP), plus the number of slices where Pegasus did not identify any fluid contrary to the ground truth (false negative; FN), that is, sensitivity = TP/(TP + FN).
Specificity was determined based on the number of slices where neither Pegasus nor the ground truth identified any fluid (true negative; TN), divided by this number (TN), plus the number of slices where Pegasus identified fluid contrary to the ground truth (false positive; FP), that is, specificity = TN/(TN + FP).
Dice coefficients, sensitivity, and specificity were determined separately for each data set overall (A, B, and C) and “per eye” for data sets B and C, where volumes contained sufficient b-scans (
n = 11 and
n = 24, respectively) to perform this within-eye analysis. Fluid type (e.g., SFR, IRF, and PED) was analyzed separately for each data set where ground truth data (manual segmentation) was available. For data set B, the OCT volume for every eye contained b-scans with at least one subtype of fluid in the ground truth, but not all volumes contained both fluid subtypes (i.e., IRF and SRF); hence, it was not possible to calculate Dice coefficients, sensitivity, or specificity for these individual eyes (see
Supplementary Table S1). IRF was evaluated on all three data sets, SRF was evaluated on data sets A and B, and PED was evaluated on data set A only.
The Pegasus automated fluid segmentation algorithms showed excellent sensitivity (0.98 and 0.94) and specificity (1.00 and 0.89) for data sets A and B, respectively, but a relatively low sensitivity for data set C (∼0.64). Within data set C, performance was particularly poor on subject 9, in whom there was very little overlap between the experts and automated segmentation. However, on a per-eye level for this data set (all 11 b-scans per eye considered together), all eyes were correctly identified as having IRF on at least one b-scan.
For data set B, only 7 of the 24 eyes contained IRF in the expert grading, but the Pegasus system identified 15 of the eyes as positive for IRF, yielding a high false-positive rate. Several of these were identified on a single b-scan only per eye, and 3 were cases of serous PED incorrectly labeled by the algorithms as IRF (
Fig. 2). There were no eyes in which the algorithms failed to detect IRF where it had been identified by the two experts (i.e., false negative rate = 0). On a per-eye level for data sets B and C, IRF detection had perfect sensitivity (1 for both), but the specificity for data set B was poorer (0.53). It was not possible to assess specificity for data set C on a per-eye level, since there were no true negatives for IRF.
A comparison of the manual segmentations performed by the two expert graders in data set C showed limited agreement (mean Dice coefficient 0.59; comparable to that of the Pegasus software, at 0.57 and 0.45 for experts 1 and 2, respectively). Sensitivity for fluid detection was high between experts (0.94), but specificity was low (0.59).
Evaluation of PED was only possible on data set A, since it was the only data set labeled for this feature by the expert graders. The sensitivity and specificity were excellent (0.97 and 0.98, respectively), with reasonable agreement in segmentation between the expert and automated segmentations (mean Dice coefficient 0.66). Despite being the largest data set available, a per-eye level analysis was not possible, as there were too few b-scans available per eye (only a single b-scan in many cases).
In this article, we assessed the detection and segmentation performance of the Pegasus automated fluid segmentation algorithms on three existing independent OCT data sets. These data sets contained up to three distinct types of fluid: IRF, SRF, and PED. Pegasus performance on each data set was benchmarked against expert-derived ground truth, with the algorithms providing good but not always consistent performance across the three data sets and fluid subtypes.
Overall, Pegasus’s ability for automated detection of all three fluid types based on sensitivity and specificity was favorable and compares well to figures reported in contemporary literature. For data set A, performance was very high for IRF and PED while slightly poorer for SRF, the same pattern presented by Lu et al.
7 Our results outperform those of Rashno et al.
15 for data set B in terms of sensitivity (0.81 versus 0.96) with a similar specificity found (0.91 versus 0.89). No sensitivity or specificity values to allow comparison were presented by Chiu et al.
14 for data set C.
Pegasus provided promising fluid segmentation performance, based on Dice coefficients, across the three fluid types evaluated. Data set A uniquely allowed evaluation on all three fluid subtypes, but the mean Dice coefficients were lower in the present study than those presented by Lu and colleagues
7 (0.59–0.78 versus 0.75–0.9). Both studies reported the best performance for IRF compared to the other two fluid subtypes, and likewise for data set B, although only a single Dice coefficient of 0.82 was reported by Rashno et al.,
15 despite the investigation of both IRF and SRF. For data set C, our Dice coefficients were comparable to those reported in the literature,
14 including a limited interobserver performance during ground truth labeling. The authors attribute this to reduced image quality and the severity of DME included.
14
The RETOUCH project presents the performance of eight different deep learning algorithms for (a) automated fluid detection and (b) automated fluid segmentation.
18 A comprehensive review of algorithm performance in these tasks for each of the three fluid subtypes, plus a table of other relevant works, can be found elsewhere. For the fluid detection task, the results were comparable for PED and SRF (on data set B), although Pegasus relatively outperformed on IRF (on data sets A and B) and underperformed on SRF (on data set A) and IRF (on data set C). For the fluid segmentation task, the mean Dice coefficients for Spectralis images were 0.69 for IRF, 0.57 for SRF, and 0.68 for PED. Again, these are comparable to Pegasus’s average performance across the three data sets. For this task, Pegasus relatively outperformed on SRF (particularly on data set B) and IRF (on data set A) but underperformed on IRF (on data sets B and C). Similarly to RETOUCH, the Pegasus system yielded a high performance for detection of PED, likely due to its distinct appearance and location in relation to the RPE.
It should be noted that all OCT images used in this evaluation were acquired with a Spectralis SD-OCT (Heidelberg Engineering). We are therefore unable to comment on the performance of these fluid segmentation algorithms on images acquired with other OCT devices. However, the Pegasus-OCT system is intended to be OCT device independent and has been developed and trained based on multiplatform OCT data. Of note, RETOUCH
18 found that best performances were achieved when neural networks were trained on data from each OCT device separately, highlighting the trade-off between device-specific performance and algorithm generalizability. The results of the present study are comparable to much of the contemporary published literature, including algorithms trained on data from single OCT platforms, which demonstrates the promise of the Pegasus system.
The interexpert comparison for data set C provides a mean Dice coefficient (0.59) that was not appreciably higher than either expert grading versus the Pegasus segmentation (0.57 and 0.45 for experts 1 and 2, respectively). It should be noted that the data set owners attributed their expert graders’ performance to “lower image quality and severe DME pathology present.”
14 The interexpert agreement for this data set is notably lower than other contemporary studies with Dice coefficients reported in the region of ∼0.75,
9,18 although not uniquely so,
19 with the severity of retinal pathology as an important factor. This suggests the interexpert variation and its influence on the ground truth is likely to be a pertinent factor in the Pegasus system's performance. The greater differences between mean and median Dice coefficients (see
Table 2), and wider Inter Quartile Range (IQR) attributable to this data set (C), compared to A and B, would be consistent with increased outliers due to poor agreement on individual scans. Furthermore, the higher sensitivity and limited specificity values for the interexpert comparison in relation to the Pegasus system could speculatively be explained if an internal “bias” within the expert clinicians for overdetecting (rather than underdetecting) retinal fluid was present. The clinical experience of each expert and how they balance the relative risk of a false-positive versus a false-negative result on clinical outcomes may influence their approach to the manual fluid segmentation of the images.
Visual inspection of labeled images suggests expert graders provide a more “granular” demarcation of the regions of fluid compared to the algorithm that appears smoother (
Fig. 1); these discrepancies, while small, would have contributed to reducing the Dice indices (agreement). The absolute fluid area is also likely to be affected; therefore, consistency of method (i.e., not used interchangeably) would be advised in applications such as progression monitoring, where small absolute differences may have clinical significance.
We also caution that the data sets used for validation in this study consisted of eyes with the most common pathologies characterized by fluid (AMD and DME). Fluid segmentation performance in eyes with other retinal pathologies or comorbidities (e.g., macular hole, epiretinal membrane, central serous retinopathy) was not evaluated due to restricted data availability for these conditions, which were independent of the algorithm's training.
A “true” sensitivity and specificity was only possible to evaluate on data set A, since this data set was the only of the three to contain b-scans from eyes entirely absent of retinal disease. Data sets B and C did not contain any healthy eyes, confirmed as having no fluid visible on any b-scan by expert graders. For these two data sets, performance metrics were evaluated on a “per-b-scan” basis only, rather than a “per-eye” basis as in data set A, which would be the usual clinical scenario.
The number of b-scans per eye included in the analysis also differed between data sets. This should be considered when interpreting these results, since data sets with a higher ratio of number of b-scans to number of eyes (e.g., data set B) will likely have greater homogeneity, which may influence algorithm performance. Data set A had by far the largest number of eyes and included both AMD and DME. It might therefore be considered that this data set provided the greatest challenge for the Pegasus-OCT system in variation of disease appearance.
This study is not typical of a classical automated feature detection task (i.e., presence or absence of a distinct image feature); rather, it represents a more complex feature detection and subtype classification task, with the subtypes having a similar appearance in many cases. This may be particularly problematic in low-quality images, where dark or low-contrast portions of the image resulting from signal loss (e.g., from media opacities) may be mistakenly identified as fluid. This, in part, may explain the limited performance of the algorithm for certain metrics and fluid subtypes.
Overall, the Pegasus-OCT fluid segmentation algorithms were shown to be capable of detecting IRF, SRF, and PED within spectral domain OCT images of independent data sets from multiple sites. The competitive performance against other published methods demonstrates the potential of this approach as a clinical tool for the autonomous detection and monitoring of retinal disease such as AMD and DME.
Disclosure: L. Terry, None; S. Trikha, Visultyix Ltd (E, I); K.K. Bhatia, Visultyix Ltd (E), Metalynx Ltd (E); M.S. Graham, Visultyix Ltd (E); A. Wood, None