For image-level self-censorship, our results in
Table 3 show that applying self-censorship to either model improved glaucoma classification performance relative to the same model without self-censorship. As summarized in
Table 3, the average performance of the individual standard DL and Dirichlet models degraded by 36.2%, 20.2%, 12.1%, and 11.0% on the OOD RIMONE-DL, O-RIGA, REFUGE, and LAG datasets, respectively, compared to their average performance on the IND UIC dataset. Notably, applying selective self-censorship improved accuracy by 10.13% on UIC, 9.58% on RIMONE-DL, 21.25% on O-RIGA, 27.23% on REFUGE, and 14.92% on LAG, relative to the average accuracy of the individual models without self-censorship. This improvement, however, came at the cost of censoring 15%, 37%, 20%, 34%, and 22% of the data from the IND UIC and OOD RIMONE-DL, O-RIGA, REFUGE, and LAG datasets, respectively. Censorship was highest on RIMONE-DL, likely because its ONH-cropped images differ from the full-view UIC fundus images used in training, as shown in
Table 1. On the uncensored data retained by the ensemble + self-censorship model,
Table 4 compares the performance of the proposed image-level self-censorship against the baselines. On this subset, our self-censorship model achieves 0.0%, 6.8%, 13.7%, 9.7%, and 3.3% higher accuracy than the average accuracy of the standard DL and Dirichlet baselines on the IND UIC and OOD RIMONE-DL, O-RIGA, REFUGE, and LAG datasets, respectively. Additionally, even the individual baseline models benefit from self-censorship, with their accuracies increasing by 5.6%, 2.3%, 3.4%, 10.5%, and 7.6% on these datasets after uncertain cases are removed. Thus, image-level self-censorship balances improved glaucoma classification accuracy against the fraction of data retained.
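To make the selection mechanism concrete, the following is a minimal Python sketch of evaluating image-level self-censorship as selective prediction: samples whose predictive uncertainty exceeds a threshold are censored, and accuracy is computed on the retained subset alongside the retention rate. The entropy-based uncertainty score, the threshold `tau`, and the toy data are illustrative assumptions, not the exact implementation; the Dirichlet model in particular would supply its own uncertainty measure.

```python
import numpy as np

def selective_accuracy(probs: np.ndarray, labels: np.ndarray, tau: float):
    """Image-level self-censorship: abstain on samples whose predictive
    uncertainty exceeds a threshold, then score only the retained ones.

    probs  : (N, C) class probabilities (softmax or Dirichlet mean)
    labels : (N,) integer ground-truth labels
    tau    : uncertainty threshold (hypothetical; tuned on validation data)
    """
    # Predictive entropy as the uncertainty score (one common choice;
    # the models' actual uncertainty measures may differ).
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)

    keep = entropy <= tau                 # uncensored (retained) samples
    coverage = keep.mean()                # fraction of data retained
    if keep.sum() == 0:
        return float("nan"), 0.0          # everything was censored

    preds = probs[keep].argmax(axis=1)
    acc = (preds == labels[keep]).mean()  # accuracy on the retained subset
    return acc, coverage


# Toy example standing in for fundus-image predictions (binary: glaucoma vs. normal).
rng = np.random.default_rng(0)
probs = rng.dirichlet(alpha=[2.0, 2.0], size=500)
labels = rng.integers(0, 2, size=500)
acc, cov = selective_accuracy(probs, labels, tau=0.5)
print(f"accuracy on uncensored data: {acc:.3f}, retained: {cov:.1%}")
```

Sweeping `tau` over a validation set traces the accuracy-coverage trade-off discussed above: a stricter threshold raises accuracy on the retained images at the cost of censoring a larger share of the data.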