In this work, we evaluated three implementations of a hierarchical Bayesian model for VF progression. We separately investigated the effect of modeling the censored nature and the heteroskedastic behavior of VF data. The model for heteroskedasticity was based on a large test-retest dataset of glaucoma patients. The hierarchical models were then compared to more traditional approaches using a large clinical dataset with long VF series and permutations of stable VF series from an independent test-retest dataset. Our results highlight several important aspects of VF progression analysis that are often overlooked.
In terms of clinical performance, the largest gains were provided by the multilevel approach combining different locations. Indeed, all three hierarchical models performed very similarly in detecting progression. This result is useful for interpreting previous findings obtained with similar approaches. Zhu et al.10 also used a Bayesian approach to model VF progression. This method, called ANSWERS, was more focused on modeling heteroskedasticity (discussed later), but the parameter used to detect progression was a combination of the results obtained at different locations to calculate an estimated “probability of no progression.” In the evaluation of the method, the differential effect of modeling heteroskedasticity and combining locations was not quantified. Interestingly, when compared to simple linear regression, we obtained an improvement in performance very similar to that of ANSWERS for all the hierarchical models tested in our analysis, reinforcing the idea that the strength of these methods comes from the efficient combination of pointwise information. Importantly, we used the same dataset as Zhu et al.10 to obtain stable VF series to determine specificity, so our results can be easily compared. Of note, ANSWERS did not use a multilevel approach to obtain an estimate of the global progression slope but rather relied on a combination of pointwise regression models fitted individually (although linked through spatial correlations). This is similar in principle to the S statistic used by PoPLR, which also improved clinical performance over simple linear regression in our dataset.
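The specificity calculation described above rests on a simple idea: reshuffling the test order of a genuinely stable VF series destroys any time trend, so every "progression" detected on a permuted series is a false positive by construction. The sketch below illustrates this with ordinary linear regression as the detection criterion; the function names, the one-sided slope test, and all numbers are illustrative stand-ins, not the methods evaluated in this work.

```python
import numpy as np
from scipy.stats import t as student_t

rng = np.random.default_rng(1)

def progression_p(y, t):
    """One-sided p-value for a negative slope from simple linear regression
    (an illustrative stand-in for any pointwise progression criterion)."""
    t = np.asarray(t, float)
    y = np.asarray(y, float)
    tc, yc = t - t.mean(), y - y.mean()
    beta = (tc @ yc) / (tc @ tc)
    resid = yc - beta * tc
    se = np.sqrt((resid @ resid) / (len(y) - 2) / (tc @ tc))
    return student_t.cdf(beta / se, df=len(y) - 2)  # small p => "progressing"

def permutation_specificity(stable_series, times, alpha=0.05, n_perm=200):
    """Estimate specificity: fraction of permuted stable series NOT flagged.
    Permuting the visit order removes any real trend, so each detection
    counts as a false positive."""
    flags = [progression_p(rng.permutation(y), times) < alpha
             for y in stable_series for _ in range(n_perm)]
    return 1.0 - np.mean(flags)
```

For stable series with independent noise, the slope test is calibrated, so the estimated specificity should sit near 1 − alpha; comparing methods at matched specificity on such permuted series is what makes detection rates comparable across studies.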
Other authors have successfully used a multilevel approach, similar to ours, to estimate a global progression slope.14,15 Betz-Stablein et al.15 in particular used the estimated posterior distribution of the global slope to assess progression. In their analysis, the hierarchical model did not show a large improvement over pointwise methods. However, they relied on clinical judgment to assess specificity in their calculations instead of permutations of stable series. Their results are therefore difficult to compare to ours. Nevertheless, modeling the global progression rate has advantages beyond practical progression detection, because the rate has a meaningful and direct interpretation for clinicians; this is an additional strength of our approach.
Another novel aspect of our approach is the handling of the error distribution and response variability. Similarly to ANSWERS, we based our modeling on a test-retest dataset. In our case, however, the dataset used to model variability and the one used to generate permuted stable series were different, and this adds strength to our validation. Test-retest variability, especially at the pointwise level, is known to increase at lower sensitivities.4,5 This relationship is often explained by a change in the psychometric function4,5 at more damaged locations. Although this is certainly an incomplete characterization,5 sensitivity has been shown to be the best predictor of variability in glaucoma.4,5 Henson et al.4 modeled the psychometric function with a cumulative Gaussian function and reported the change of log(SD) with sensitivity, down to 10 dB. We adopted a similar modeling approach using a large test-retest perimetric dataset. The coefficients of our linear approximation were in good agreement with the results from Henson et al.4 (Appendix). Importantly, our approach was meant to model the expected variability around the predicted sensitivity, allowing us to rely on previous knowledge of the psychophysics of perimetric responses. In contrast, ANSWERS used test-retest data to model the variability of the observed response,10 effectively using the estimated variability as a weighting method for the observations. However, in our case, modeling heteroskedasticity only improved the detection of progression for individual clusters and locations (Fig. 5) but did not improve the global performance.
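The Henson-style relationship described above, with log(SD) linear in sensitivity, translates directly into observation weights for a heteroskedastic model. The sketch below shows the idea in minimal form; the coefficients A and B, the 10 dB clamp, and the function names are illustrative assumptions, not the fitted values from this study's Appendix.

```python
import numpy as np

# Hypothetical coefficients in the spirit of a Henson-style linear fit:
# log(SD) falls linearly as sensitivity rises, i.e., variability grows
# at damaged locations. A and B are illustrative, not the paper's values.
A, B = 3.0, -0.08  # assumed log-scale intercept and slope

def response_sd(sensitivity_db, floor_db=10.0):
    """SD of perimetric responses as a function of PREDICTED sensitivity.
    Below ~10 dB the linear fit is unsupported, so we clamp (an assumption)."""
    s = np.maximum(np.asarray(sensitivity_db, float), floor_db)
    return np.exp(A + B * s)

def weights(sensitivity_db):
    """Heteroskedastic weights for a weighted regression: w_i = 1 / SD_i^2,
    so noisier low-sensitivity observations count for less."""
    return 1.0 / response_sd(sensitivity_db) ** 2
```

Note the distinction drawn in the text: here the SD is driven by the model's predicted sensitivity, whereas a scheme like ANSWERS derives the weight from the observed response itself.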
One important feature of our hierarchical models is the handling of censored data. Different approaches have been proposed to address the 0 dB floor in perimetric measurements.9,13,16,17 However, few have addressed the actual censored nature of the data. This is important, because censoring implies that the true sensitivity could extend beyond the measurement limit. This aspect is willingly neglected by methods adopting an asymptotic modeling of the floor effect,13,16,17 creating problems in the estimation and interpretation of the rate of progression. These methods are particularly problematic when considering that the censoring level is completely arbitrary and can be changed without any bearing on the measurements recorded above the chosen limit.6 This is not captured by asymptotic models, such as exponential decay, in which the estimated rate of loss is tightly linked to the relative distance of the observations from the measurement limit. On the other hand, treating censored values as actual measurements of 0 dB sensitivity can introduce a positive bias in the rate of loss estimated with simple linear models. This is demonstrated by our first experiment on artificially shifted series (Fig. 3). It is important to note that, at least in a dataset drawn from real clinics, such as the one used in this analysis, this bias had little bearing on the ability of the models to predict future sensitivity, as demonstrated by our quantification of the MAE for predictions (Fig. 6), because a significant effect is only obtained when a large number of trailing 0 dB observations accumulate in a series (see example in Fig. 2). A similar result was reported by Bryan et al.9
Not accounting for censoring can, however, significantly affect the accurate estimation of the rate of loss in patients with advanced VF damage, because the floor is likely to affect many locations earlier in the VF series. Betz-Stablein et al.15 also accounted for the censored nature of the data. In contrast, ANSWERS used a transformation of 0 dB sensitivity, because such a value would be undefined under a Weibull error distribution (the one used in their model).10 This approach allowed modeling of the observed test-retest variability but would still be affected by the bias from trailing 0 dB values. However, despite reducing the estimation bias, accounting for censoring did not greatly improve clinical performance (Fig. 4 and Fig. 5). This result could have multiple explanations, including the fact that our clinical dataset was mostly composed of people with early loss (see Table 1). This is representative of many glaucoma clinics, but in our case the number of people with early damage could have been inflated by our exclusion of eyes undergoing glaucoma surgery before 10 VFs could be collected. However, the most likely explanation is that models accounting for censoring correctly interpret censored observations as being less informative. This increases the uncertainty around the final estimate of the rate of progression, despite reducing the bias. Noncensored models, instead, assume that complete information can be extracted from 0 dB measurements, leading to less uncertainty in the estimates. This is clearly illustrated by the example in Figure 2: both censored models provide more negative estimates of global progression, but the P-score is almost identical to that obtained with the Hi-linear model on account of the greater uncertainty. The comparison between the Hi-linear and the Hi-censored models is particularly useful, because they constitute two implementations of the same model that differ only in their handling of censored data. Other practical solutions could be applied to increase the dynamic range of VF testing itself at the lower end, for example, by using larger stimulus sizes for more advanced stages of damage.40
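The bias mechanism discussed above can be made concrete with a small Tobit-style sketch: a censored likelihood lets a 0 dB observation contribute only the probability that the latent sensitivity fell at or below the floor, rather than a full density term at 0 dB. This is an illustration of the general idea, not the Bayesian implementation used in this work; the series, the noise level, and the optimizer settings are all assumed for the example.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def tobit_slope(t, y, floor=0.0):
    """ML slope of a linear trend under left-censoring at `floor`.
    Censored points contribute P(latent <= floor); observed points
    contribute the usual Gaussian density."""
    t, y = np.asarray(t, float), np.asarray(y, float)
    cens = y <= floor
    def nll(params):
        a, b, log_s = params
        mu, s = a + b * t, np.exp(log_s)
        ll = norm.logcdf((floor - mu[cens]) / s).sum()
        ll += norm.logpdf((y[~cens] - mu[~cens]) / s).sum() - (~cens).sum() * log_s
        return -ll
    res = minimize(nll, x0=[y.max(), -0.5, 0.0], method="Nelder-Mead")
    return res.x[1]  # estimated slope (dB per unit time)

# A declining series that accumulates trailing 0 dB values (assumed numbers):
rng = np.random.default_rng(0)
t = np.arange(12.0)
true = 10 - 1.5 * t                                  # true slope: -1.5 dB
y = np.maximum(true + rng.normal(0, 1, 12), 0.0)     # 0 dB measurement floor
ols = np.polyfit(t, y, 1)[0]   # naive fit: zeros treated as real 0 dB data
tob = tobit_slope(t, y)        # censored fit: zeros treated as "<= floor"
```

Treating the trailing zeros as real measurements flattens the naive slope (the positive bias shown in Fig. 3 for shifted series), while the censored fit recovers a steeper, less biased estimate, at the cost of wider uncertainty, since censored points carry less information.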
Finally, our approach differs from previous attempts in its modeling of spatial correlations within the VF. We opted for a full hierarchical approach, in which the VF clusters represented an intermediate level of the hierarchy. This has some important advantages, especially for interpretability, because the model can provide a rate of progression for each individual cluster. In fact, Bayesian computing allows inference on random effects, and this is useful to assess localized progression (Fig. 5). Moreover, the rate of progression for clusters can be compared to structural measurements on optic nerve sectors for multimodal evaluations.25,41 One drawback is that clusters are modeled as hard-edged groups instead of “smooth” correlations. Therefore, proximity of locations within the same cluster does not affect the correlation structure, and adjacent locations in different clusters are modeled as completely independent. However, this also allowed us to avoid complex correlation structures in the model, greatly reducing the number of parameters and therefore improving efficiency compared with previous similar attempts.10,15 Moreover, when compared to results from ANSWERS, our discrimination performance appears very close to that obtained with spatial enhancement in their model. For example, with five VF tests, they reported a 2.6-fold improvement in the HR compared to simple linear regression at 95% specificity.10 In our analysis, the improvement was 2.8-fold with the Hi-HSK model and 2.6-fold with the Hi-censored model. Note that results with 5 VFs are reported here only for comparison and are not part of our main analysis. Finally, we showed that our hierarchical structure with random effects retained enough flexibility so as not to compromise its predictive ability (Fig. 6). These observations, together with the improved interpretability of the model, make our approach reasonable and, to some extent, preferable. Further flexibility could be introduced with customized structure-function clustering techniques.3,42
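The three-level structure described above, in which a global slope generates cluster slopes, which in turn generate location slopes, can be sketched as a simple simulation. The Bayesian machinery used to fit the actual models is not shown here; all dimensions, variance parameters, and the no-pooling summary at the end are illustrative assumptions, chosen only to make the hierarchy explicit.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative dimensions (not the study's actual cluster map):
N_CLUSTERS, LOCS_PER_CLUSTER, N_VISITS = 6, 8, 10

# Three-level slope hierarchy: global -> cluster -> location (dB/year).
beta_global = -0.5                                          # assumed
beta_cluster = beta_global + rng.normal(0, 0.3, N_CLUSTERS)
beta_loc = beta_cluster[:, None] + rng.normal(
    0, 0.2, (N_CLUSTERS, LOCS_PER_CLUSTER))

t = np.arange(N_VISITS) * 0.5                               # 6-monthly visits
base = rng.uniform(20, 32, (N_CLUSTERS, LOCS_PER_CLUSTER))  # baseline dB
y = (base[..., None] + beta_loc[..., None] * t
     + rng.normal(0, 1.5, (N_CLUSTERS, LOCS_PER_CLUSTER, N_VISITS)))
y = np.maximum(y, 0.0)                                      # 0 dB floor

# Naive per-level summaries (no partial pooling), to show what each
# level of the hierarchy reports: location, cluster, and global slopes.
slope_hat_loc = np.polynomial.polynomial.polyfit(
    t, y.reshape(-1, N_VISITS).T, 1)[1]
slope_hat_cluster = slope_hat_loc.reshape(
    N_CLUSTERS, LOCS_PER_CLUSTER).mean(axis=1)
slope_hat_global = slope_hat_cluster.mean()
```

In the hierarchical model proper, the cluster and location slopes are random effects shrunk toward their parent level rather than the simple averages computed here, which is what yields interpretable cluster-level rates alongside the global one.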
Other groups have proposed approximate Bayesian computation solutions to estimate VF progression, such as the work by Murata et al.43,44 Their evaluation mostly focused on the improvement in prediction accuracy and did not explore the effect of their model on progression detection. Their results are difficult to compare to ours, since they based their modeling on total deviation values. Interestingly, their MAE for prediction was similar to, but generally lower than, ours. However, this was also the case for simple pointwise linear regression, possibly indicating a difference in the composition of the datasets used for validation, such as a larger proportion of stable patients. One notable difference is that we calculated the prediction MAE for all subsequent VFs and not just the last one in the series, as in Murata et al.43,44