A total of 630,000 OCT images from 2371 patients across 10,450 patient visits were extracted. The images were split into training, validation, and held-out test sets at the patient level as described in Figure 1. The baseline demographic factors for the study population are shown in the Table. Each of the 31 ensemble networks was constructed based on the methods described in Figure 2. An end-to-end deep learning approach was attempted but led to inferior performance compared to the random forest ensemble. All training was performed with a pre-trained VGG-BN-16 network for 50 epochs with an early-stopping patience of 10 epochs, and the best model was saved based on the lowest validation loss obtained (Supplementary Fig. S2). All images were then fed again into the saved best model for feature extraction at the image level. For the first model, feature vector concatenation was not performed, as no adjacent images were used. For the 31st and final model, zero padding was not performed, as the concatenated vectors already had the maximum size of 15,616 × 1. Each ensemble network was run 10 times with randomly set seed values. Incomplete OCT scans were removed from the dataset as per the methods described previously.
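The concatenation and zero-padding step can be illustrated with a minimal sketch. It assumes that each image contributes a 256-element feature vector (inferred from 15,616 / 61 B-scans, not stated explicitly) and that padding zeros are appended at the end of the concatenated vector; the function and constant names are hypothetical and are not taken from the study code.

```python
FEATURES_PER_IMAGE = 256  # assumption: 15,616 / 61 B-scans
MAX_BSCANS = 61           # B-scans in the full-coverage (31st) model
MAX_LENGTH = FEATURES_PER_IMAGE * MAX_BSCANS  # 15,616


def concatenate_and_pad(per_image_features, max_length=MAX_LENGTH):
    """Concatenate per-image feature vectors and zero-pad to max_length.

    Assumption: padding zeros are appended after the real features.
    """
    flat = [value for vector in per_image_features for value in vector]
    if len(flat) > max_length:
        raise ValueError("concatenated vector exceeds the maximum length")
    return flat + [0.0] * (max_length - len(flat))


# Placeholder features (all ones) standing in for real network outputs.
one_image = [1.0] * FEATURES_PER_IMAGE
central_only = concatenate_and_pad([one_image])              # first model
three_scans = concatenate_and_pad([one_image] * 3)           # 0.25-mm model
full_volume = concatenate_and_pad([one_image] * MAX_BSCANS)  # 31st model
```

Under these assumptions every model feeds the downstream classifier a vector of the same length, and only the full-coverage model contains no padding.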
We achieved AUROCs of 0.9556, 0.9735, and 0.8887 and AUPRs of 0.9514, 0.9530, and 0.8203 from the central foveal images for AMD, DME, and POAG, respectively. With the full 7.50-mm coverage, we achieved AUROCs of 0.9718, 0.9895, and 0.9211 and AUPRs of 0.9691, 0.9838, and 0.8749 for AMD, DME, and POAG, respectively. A plot of the percent change in the AUROC and AUPR of each model with respect to the first model (Fig. 3) revealed varying trends in information gain and points of diminishing return among the three diseases. The AUROC and AUPR values for each model were individually normalized so that the first model corresponded to 0% and the last model to 100%. The point of diminishing return was found by applying a running average over three consecutive models with a threshold set at ≥90% of the maximum gain obtained from the 31st model. The point of diminishing return is reflected in Figure 3 by the border of the gray-shaded background. AMD and DME had the most gain in AUROC from the central 2.75 mm (14 B-scans) and 4.50 mm (21 B-scans) of coverage, respectively, before diminishing returns, whereas POAG had continued information gain up to 6.25 mm (28 B-scans) of volume coverage. The results for AUPR followed a similar trend, with AMD, DME, and POAG having information gains up to 4.00 mm (19 B-scans), 4.25 mm (20 B-scans), and 4.50 mm (21 B-scans), respectively, before diminishing returns. After Bonferroni correction for 180 comparisons, all models except the 0.25-mm (three B-scans) model were found to have significantly higher AUROC and AUPR values for all three disease states.
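The normalization and running-average threshold described above can be sketched as follows. This is one illustrative reading of the procedure, not the study's implementation: the function names are hypothetical, the input series is made up, and the choice to report the last model in the three-model window (rather than the first or middle one) is an assumption.

```python
def normalized_gain(values):
    """Rescale metric values so the first model is 0% and the last is 100%."""
    first, last = values[0], values[-1]
    if last == first:
        raise ValueError("first and last model have the same value")
    return [100.0 * (v - first) / (last - first) for v in values]


def point_of_diminishing_return(values, window=3, threshold=90.0):
    """Return the 1-based index of the model at which the running mean of
    `window` consecutive normalized gains first reaches `threshold` percent.

    Assumption: the reported model is the last one in the window.
    Returns None if the threshold is never reached.
    """
    gains = normalized_gain(values)
    for end in range(window, len(gains) + 1):
        running_mean = sum(gains[end - window:end]) / window
        if running_mean >= threshold:
            return end
    return None


# Hypothetical AUROC values for seven models (not data from the study).
toy_auroc = [0.80, 0.85, 0.88, 0.89, 0.895, 0.90, 0.90]
toy_point = point_of_diminishing_return(toy_auroc)
```

For this toy series the normalized gains are approximately 0, 50, 80, 90, 95, 100, and 100%, so the three-model running mean first reaches 90% at the window ending on the sixth model, and `toy_point` is 6 under the last-in-window convention.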
Figure 4 shows the ROC and PR curves of the 31 models, color-coded as a function of increasing B-scan coverage on a scale from blue to red, with blue representing the central foveal slice only (0.00-mm coverage) and red representing the full set of 61 B-scans (7.50-mm coverage). Across all models, the overall trend was an increase in both AUROC and AUPR as macular coverage increased from 0.00 mm (one B-scan) to 7.50 mm (61 B-scans).