Abstract
Purpose:
To determine the impact of the number of visual field tests performed per visit on the detection of mean deviation change over time in patients with early glaucoma or suspected glaucoma, and to identify a practical approach to maximize change detection.
Methods:
Intrasession (n = 322) and intersession (n = 323) visual field results for patients with glaucoma or suspected glaucoma were used to model mean deviation change in 10,000 progressing and 10,000 non-progressing computer-simulated patients over time. Variables assessed in the model included follow-up interval (0.5, 1, or 2 years), reliability rate (70%, 85%, or 100%), and the number of visual field tests performed at each visit (one to four).
Results:
Two visual field tests per session, compared with one, provided higher case detection rates at 2 years (99%–99.8% vs. 34.7%–76.3%, respectively), reduced time to detection (three or four visits vs. six to 10, respectively), and less severe mean deviation loss at the point at which change was identified (−4 dB vs. −10 dB, respectively), especially in the context of unreliable results. Performing two tests per visit offered similar advantages to performing more tests. False positive change detection rates (<2.5%) were similar across all conditions. Patients followed up every 6 months had less severe mean deviation loss at follow-up than patients followed up at 1-year or 2-year intervals.
Conclusions:
Performing two tests per clinical visit at 6-month intervals is practical using SITA-Faster and provides higher detection rates of mean deviation change than one test per visit performed at more widely spaced intervals.
Translational Relevance:
This model provides guidance for selecting the number of tests per visit to detect mean deviation change.
Two cohorts of patients were used: patients who had undergone frontloading (two visual field tests) within the same clinical visit (the “intrasession” cohort) and patients who had historical longitudinal visual field data (the “intersession” cohort, which had at least one reliable visual field test per visit). Both cohorts comprised patients who were seen within the clinic as glaucoma suspects or as patients with manifest glaucoma.
The diagnosis of glaucoma was made as per current clinical guidelines.2 In short, this required the presence of glaucomatous structural defects (for example, cupping, diffuse or focal rim thinning, adjacent retinal nerve fiber layer defects) with or without accompanying reproducible concordant visual field defects on the 24-2 test grid, in the absence of other retinal or neurological pathologies. Glaucoma-suspect subjects were those in whom one or more signs of glaucoma were present but whose combination was insufficient for a diagnosis of glaucoma requiring therapeutic intervention. As per the clinical protocols of the Centre for Eye Health, the diagnosis was made by one examining clinician and confirmed by remote review by another clinician.16 For the patients with glaucoma, we selected the eye with the worse stage of glaucoma; for non-glaucoma patients, we randomly selected one eye for inclusion. Our goal was to focus on patients with early to moderate glaucoma due to the nature of the clinic from which patient data were derived, which limited the number of cases of advanced glaucoma (defined as a mean deviation score worse than −12 dB) included in the present study. Nonetheless, the composition of the cohort suited the purpose of the study, because in more advanced stages of glaucoma other strategies for detecting functional change may have to be used (such as changing to the 10-2 test grid), along with consideration of factors such as the measurement floor effect.
The “intrasession” cohort included patients who had undergone two SITA-Faster tests per eye within the same clinical visit, including a subset of patients reported in our previous study (the Frontloading Fields Study).11 For the purposes of the present study, we included only patients who had reliable results in both tests: <15% false positive rate, no seeding point errors (one or more primary seeding points with artificially reduced sensitivity in the absence of known pathology), and <20% of gaze tracker deviations exceeding 6°, as we have previously defined.11,12 We note that there is debate regarding the use of “traditional” reliability indices, such as recent work demonstrating the low contribution of elevated false positive rates to measurement variability.17 For the purposes of the present study, the above criteria were chosen to reflect both the protocols of the clinic in which the data were collected and the automatic exclusion criteria of the commercially available Guided Progression Analysis linked to the Humphrey Field Analyzer hardware (Carl Zeiss Meditec, Jena, Germany). However, for the ensuing Methods and Discussion sections, we note that the “reliability” nomenclature does not refer to specific criteria, such as a specific false positive rate, which may become antiquated with time and emerging evidence. Instead, references to “reliable” and “unreliable” results represent occurrences in which user-based criteria, based on the best evidence available at the time, can be applied to afford a clinical judgment of interpretability. The difference in mean deviation between test one and test two was calculated for each patient, and the distribution of these differences characterized the intrasession test–retest mean deviation variability.
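To make these criteria concrete, the screen can be expressed as a simple predicate. The sketch below is illustrative only; the field names (false_positive_rate, seeding_point_errors, gaze_deviation_fraction) are hypothetical stand-ins for values exported from the instrument, not an actual Humphrey Field Analyzer export format.

```python
# Hypothetical reliability screen mirroring the criteria described above.

def is_reliable(test: dict) -> bool:
    """Return True if a test meets the study's reliability criteria."""
    return (
        test["false_positive_rate"] < 0.15          # <15% false positive rate
        and test["seeding_point_errors"] == 0       # no seeding point errors
        and test["gaze_deviation_fraction"] < 0.20  # <20% of gaze deviations exceed 6 degrees
    )

# Example: 8% false positives, no seeding point errors, and 5% of gaze
# tracker deviations exceeding 6 degrees -> the result is retained.
print(is_reliable({"false_positive_rate": 0.08,
                   "seeding_point_errors": 0,
                   "gaze_deviation_fraction": 0.05}))  # True
```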
The “intersession” cohort included patients who had been seen more than once in the clinic. We extracted the visual field test results of patients who were clinically stable (either glaucoma or glaucoma-suspect patients). Clinical stability was defined through clinical examination, with no evidence of structural or functional deterioration and thus no modification to the treatment plan, if applicable. This was used to reflect the impression of real-world clinicians with the available clinical data. We did not use a prespecified quantitative cut-off value for inter-test differences to identify clinically stable patients, as this would introduce bias. Thus, the quantitative differences observed between visits in the extracted data would reflect not only instrument-based factors but also measurements captured in clinical practice. We extracted reliable SITA-Faster results from two adjacent visits for one eye. The difference in mean deviation between visit one and visit two was calculated for each patient, and the distribution of these differences characterized the intersession test–retest mean deviation variability.
For both the intrasession and intersession cohorts, we extracted and included the files of consecutive patients attending the glaucoma service within a 6-month period to reduce the probability of selection bias. Exclusion criteria included age <18 years, history of ocular trauma or surgery (aside from uncomplicated cataract surgery or selective laser trabeculoplasty), and the presence of macular or retinal pathology affecting the visual field (including age-related macular degeneration and diabetic retinopathy).
For both intrasession and intersession mean deviation variability, we also assessed its change as a function of average mean deviation severity, as previous reports have demonstrated increases in mean deviation variability with worsening stage of glaucoma.18,19 This change in mean deviation variability was incorporated into the model for the purposes of identifying significant progression.
Each naïve simulated patient started off with no significant visual field defect, defined by a mean deviation score of 0 dB. This was to represent the earliest stage of glaucoma or glaucoma suspects as defined by current visual field staging systems,2,14 in accordance with our purpose of examining the detection of mean deviation change in the earliest stages of glaucoma.
Our variables were progression rate (−0.5, −1, and −2 dB/yr), number of visual field tests per visit (1–4), follow-up interval (6 months, 1 year, and 2 years), and proportion of reliable visual field results (100% reliable or 0% unreliable, 85% reliable or 15% unreliable, and 70% reliable or 30% unreliable, reflecting a range at which rates of low test reliability may occur in clinical practice using SITA-Faster8,12,20).
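These variables define a fully crossed simulation design. A minimal sketch is shown below; the variable names and structure are ours, not the original simulation code.

```python
# Sketch of the factorial simulation design described above. Each tuple is
# one condition applied to 10,000 progressing (and, with a 0-dB/yr rate,
# 10,000 non-progressing) simulated patients.
from itertools import product

progression_rates_db_per_yr = [-0.5, -1.0, -2.0]  # 0 dB/yr modeled separately for false positives
tests_per_visit = [1, 2, 3, 4]
follow_up_interval_yr = [0.5, 1.0, 2.0]           # 6 months, 1 year, 2 years
proportion_reliable = [1.00, 0.85, 0.70]

conditions = list(product(progression_rates_db_per_yr, tests_per_visit,
                          follow_up_interval_yr, proportion_reliable))
print(len(conditions))  # 108 combinations of the four variables
```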
A notable difference between the present study and previous modeling exercises7,21,22 is the incorporation of low test reliability and its confounding effects. Note that these probabilities of low test reliability do not reflect specific cut-offs or parameters of reliability (for example, a specific false positive rate or gaze tracker deviations) but instead represent an approach where clinicians are compelled to discard unusable clinical data.
For conditions where the proportion of unreliable tests was >0%, we also assessed a “one in hand” approach, in which one repeat test was conducted to overcome an instance of an unreliable result; the repeat test replaced the result whose reliability indices fell outside expected limits. All unreliable results were otherwise treated as “missing” data points for the purposes of the linear regression analysis (see below); thus, it was possible to have visits at which the analysis was not performed and change went undetected. Such patients were retained in the analysis and followed until change was detected or until the end of the follow-up period.
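As a minimal sketch of how this could be simulated (assuming tests are independent and exactly one repeat is permitted per unreliable result; names are illustrative):

```python
# Per-visit reliability with an optional "one in hand" repeat. Each test is
# reliable with probability p_reliable; under the one-in-hand rule a single
# replacement test is attempted whenever a result is unreliable.
import random

def reliable_tests_at_visit(n_tests: int, p_reliable: float,
                            one_in_hand: bool) -> int:
    """Count the reliable results yielded by one simulated visit."""
    count = 0
    for _ in range(n_tests):
        if random.random() < p_reliable:
            count += 1   # reliable at the first attempt
        elif one_in_hand and random.random() < p_reliable:
            count += 1   # the single repeat replaced the unreliable result
    return count         # 0 means the visit contributes no data points

# e.g., reliable_tests_at_visit(2, 0.85, one_in_hand=True) usually returns 2
```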
At each “visit,” a patient's mean deviation score was adjusted by mean deviation variability, beginning with the baseline visit; this variability could then increase with worsening mean deviation score. To model this, we first determined the normality of the distribution of mean deviation differences (intersession and intrasession), which allowed us to use a suitable model from which to draw a random value from the underlying distribution. As each simulated patient was followed naively from baseline, they did not individually possess multiple visual field results from which individual variability indices could be derived. Instead, we relied upon population-based estimates of mean deviation variability as a function of mean deviation score for simulation purposes (see Discussion for further details).23
The second step was to determine the relationship between the underlying average mean deviation score and the difference between tests (inter- and intrasession). A Bland–Altman analysis can visualize the change in difference, but its linear regression result would not adequately capture the heteroskedasticity. Instead, we used a sliding-window analysis to overcome the limitations associated with “gaps” in mean deviation score within our cohort. We have previously used this method to overcome similar gaps in age-related analyses,24 where granular, independent-variable, real-world data (in the present study, mean deviation) are impractical to obtain. In brief, we ordered the average mean deviation results consecutively and generated standard deviation values from groups of 10 adjacent difference values. The resultant standard deviations were plotted as a function of average mean deviation, and their relationship was used to generate adjusted mean deviation variability scores for the model.
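A minimal sketch of this sliding-window calculation is shown below, assuming an overlapping window of 10 adjacent values; the function and variable names are ours, not the original analysis code.

```python
# Sliding-window estimate of test-retest variability as a function of
# average mean deviation (MD). Inputs are paired per-patient values: the
# average MD of the two tests and the MD difference between them.
import numpy as np

def sliding_window_sd(avg_md, md_diff, window=10):
    """Return window-center average MDs and the SD of MD differences."""
    order = np.argsort(avg_md)
    avg_sorted = np.asarray(avg_md, dtype=float)[order]
    diff_sorted = np.asarray(md_diff, dtype=float)[order]
    centers, sds = [], []
    for i in range(len(avg_sorted) - window + 1):
        centers.append(avg_sorted[i:i + window].mean())
        sds.append(diff_sorted[i:i + window].std(ddof=1))
    return np.array(centers), np.array(sds)

# A simple fit of SD against average MD (e.g., np.polyfit(centers, sds, 1))
# can then map any simulated MD score to an expected test-retest SD.
```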
In combination, the simulated mean deviation score at each visit was the combination of the ground-truth score (baseline plus the product of rate of change and follow-up duration) and the intra- or intersession mean deviation variability, adjusted for the severity of the mean deviation score. Then, for each patient, a linear regression analysis was performed at each visit to determine whether there was a significant downward trend in mean deviation score at that visit, defined as a negative slope statistically significantly different from 0 using an F-test at the P < 0.05 level of significance (two-sided, which would return a predicted false positive level of 0.025 for a negative slope). Although we recorded the slope value for mean deviation over time, we did not require that the slope be at or statistically significantly lower than the ground-truth rate of change (specified as our variable progression rates of −2, −1, or −0.5 dB/yr, or 0 dB/yr for false positives) for change to be identified, only that it be negative. This was to reflect clinical practice, in which it would be desirable to identify a patient who demonstrates any statistically significant visual field deterioration at follow-up visits, given certain rates of underlying mean deviation change. Another difference from conventional automated analyses is the minimum data requirement for change analysis, with some commercial tools requiring or recommending a minimum number of data points; we required only that a linear regression analysis and a corresponding extractable P value were available.
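Putting these pieces together, the per-visit rule can be summarized as: simulated MD(t) = baseline + rate × t + ε, where ε is drawn from a normal distribution whose standard deviation depends on the current mean deviation. Below is a minimal end-to-end sketch under simplifying assumptions: 100% test reliability, sd_for_md as a hypothetical stand-in for the fitted sliding-window relationship, and the two-sided t-test reported by scipy.stats.linregress, which is equivalent to the F-test for a single predictor.

```python
# End-to-end sketch: simulate visits, accumulate all tests as data points,
# and flag change at the first visit with a significantly negative MD slope.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def sd_for_md(md):
    """Hypothetical variability model: SD grows as MD worsens below 0 dB."""
    return 1.0 + 0.1 * max(0.0, -md)

def detect_change(rate_db_per_yr=-2.0, interval_yr=0.5,
                  tests_per_visit=2, max_years=6.0):
    """Return the first visit time (years) at which change is flagged, else None."""
    times, scores = [], []
    t = 0.0
    while t <= max_years:
        truth = 0.0 + rate_db_per_yr * t      # ground truth from a 0-dB baseline
        for _ in range(tests_per_visit):      # every test adds one data point
            times.append(t)
            scores.append(truth + rng.normal(0.0, sd_for_md(truth)))
        # Regression needs an extractable P value and >1 distinct time point.
        if len(times) >= 3 and len(set(times)) >= 2:
            fit = stats.linregress(times, scores)
            if fit.slope < 0 and fit.pvalue < 0.05:   # two-sided, so ~2.5% false positives
                return t
        t += interval_yr
    return None

print(detect_change())  # first visit (in years) at which change was flagged, or None
```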
Figure 2 shows the proportion of cases detected by each number of tests per session as a function of follow-up time for a simulated progression rate of −2 dB per year. Each combination of reliability (rows) and follow-up interval (columns) conditions is shown in a separate panel. Because there was no substantial improvement in detection with the one-repeat condition, these results are shown in Supplementary Figures S2 to S4. Extra sum-of-squares F-tests applied to each condition showed that no single asymmetric sigmoidal function fit all of the data (P < 0.0001 for all); thus, there were significant differences in the times at which cases were detected. We show the individual curves for each progression rate separated by follow-up intervals (6 months, 1 year, and 2 years) in Supplementary Figures S2 to S4.
The time (in years) to detect 95% of cases within the cohort was plotted as a function of the number of visual field tests per visit for each condition (Fig. 3). Similarly, the number of cases detected at 2 years was also plotted as a function of tests per visit for each progression rate (−2, −1, and −0.5 dB/yr, cumulative cases detected) and for false positives (at the 2-year visit) (Fig. 4). Because the 2-year follow-up interval resulted in significant delays in case detection, especially when one visual field test was done per visit, we have not shown these data in Figures 3 and 4. Several common themes are evident from both figures. First, shorter intervals between follow-up visits, faster progression rates, and higher reliability rates reduced the time to detect mean deviation changes and increased the number of cases detected at 2 years. There was some improvement in case detection when the “one in hand” approach (repeat testing of unreliable results) was used, but the difference between the one-test-per-visit and frontloaded approaches remained similar (Supplementary Figs. S2–S4).
Second, there was a notable plateau effect in benefit with an increasing number of visual field tests performed per visit, with two to four tests per visit demonstrating similar times to detection and proportions of case detection. For example, close to 100% of cases were predicted to be detected with two, three, or four tests per visit for −2-dB/yr progressors demonstrating 100% test reliability. The benefits of performing more than two tests per visit were most apparent under conditions of lower reliability rates, among slower progressors, and where longer intervals between visits occurred. The overall false positive rates were near the predicted rate of 2.5%, as described in Methods, and were similar across conditions, with a tendency for slightly lower rates with more tests performed.
Across the simulated cohort, the average test times per eye per session were 2.7 minutes, 5.5 minutes, 8.2 minutes, and 11.0 minutes for one, two, three, and four tests per visit, respectively.
Is More Better? Conducting More Visual Field Tests Per Visit to Overcome Testing Limitations
Similar to previous modeling work,7 the present study makes several assumptions for the purposes of this modeling exercise (see Table 1). For example, the model assumes that glaucoma functional change measured using the mean deviation result is linear and monotonic. In reality, patients may demonstrate abrupt change following an exponential decay due to the course of their disease and its management. Nonlinear, exponential worsening of visual field indices has been previously discussed by others.34,35 In the context of our models, it would follow that nonlinear models may predict more severe long-term visual field loss, and this would expectedly favor shorter follow-up intervals to “catch” abrupt and rapid changes before they significantly affect the patient. Similar to other modeling work, this requires empirical evaluation to determine its real-world validity and translation.
Another assumption made is that the variability characteristics of two frontloaded visual field tests can be extrapolated to more than two tests. It is possible that additional variability factors, such as fatigue, may have magnified contributions. If that were the case, we postulate that the diminishing returns with more than two results per visit would be more apparent, supporting our main finding that two frontloaded visual fields may be sufficient. Fully testing this hypothesis would require a different study design.
Our model used patients with suspected or early glaucoma with an assigned mean deviation score of 0 dB as they progressed toward moderate and more advanced stages of glaucoma, as this cohort of patients is most commonly seen in clinical practice,36 and among these patients changes in a 24-2 global index such as mean deviation may be more relevant. In more advanced stages of glaucoma, more variables must be considered, such as increased variability at pointwise locations and the measurement floor effect,31 as well as the potential need to change to alternative testing grids such as the 10-2.37 Despite its limitations, mean deviation remains a metric that is correlated with other important patient-related outcomes, such as quality of life.38 Other useful progression analysis methods may focus on pointwise change (such as change probability maps),39,40 in part captured by the mean deviation slope, but this is a focus of other simulation studies. As described in Methods, modeling the correlations in pointwise sensitivity may be useful for deriving robust mean deviation scores15 and may also advantageously reflect a greater diversity of scotomata expected in glaucoma.
Furthermore, mean deviation variability estimates were derived from the overall population rather than from the individual patient. Understanding the variability characteristics at the individual level may have an impact on estimates of change trajectory, due to the heterogeneity in variability within the population.23 In real-world practice, this could be obtained from patients with pre-existing clinical data. We note that, as a framework, the variables in the present model can be adjusted to further assess different levels of mean deviation and their associated variability and unreliability rates.
Although we report on practical recommendations for number of tests per visit and visit interval, there are jurisdictional differences in recommendations for eye examination frequency,2,41,42 which at times may not align with the intervals that we examined, including the need to perform other glaucoma-related tests that may not be coincident with visual field testing. Our goal was to provide guidelines as to when clinicians might predict a significant change to occur based on the progression rate and number of tests conducted.
We assessed a commercially available algorithm, SITA-Faster, for the implementation of frontloading. By understanding the benefits of having multiple sensitivity readings within a session, future alternative thresholding algorithms could incorporate methods for “interleaving” presentations to return such data, without the need to repeat the test. This is an area of ongoing interest in developments in perimetry.
Supported in part by a National Health and Medical Research Council Ideas Grant (1186915 to MK and JP) and a University of New South Wales Faculty of Science Early Career Academics Network Seeding Grant (JP). Guide Dogs NSW/ACT provided funding for the clinical services enabling data collection for this study. The funding organizations had no role in the design or conduct of this research.
Disclosure: J. Phu, None; M. Kalloniatis, None