Abstract
Purpose:
The purpose of this study was to evaluate the impact of two conventional reliability criteria (false positives [FPs] and seeding point errors [SPEs]) and the concurrent effect of low sensitivity points (≤19 dB) on intrasession SITA-Faster visual field (VF) result correlations.
Methods:
A total of 2320 intrasession SITA-Faster VF results from 1160 eyes of healthy subjects, glaucoma suspects, and subjects with glaucoma were separated into “both reliable” or “reliable-unreliable” pairs. VF results (mean deviation and pointwise sensitivity) were analyzed against the spectrum of FP rates and SPE, with and without censorship of sensitivity results ≤19 dB. Segmental linear regression was used to identify critical FP levels at which visual field results differed significantly between tests.
Results:
With increasing FP rate, there was a significant but small increase in mean deviation (0.09 dB per 1% exceeding 12%) and an increase in the number of points showing a >3 dB sensitivity increase (0.25–0.28 locations per 1% exceeding 12%). SPEs were almost exclusively related to a decrease in sensitivity at the primary seeding points but did not result in significant differences in other indices. Censoring sensitivity results ≤19 dB significantly improved the correlation between reliable and unreliable results.
Conclusions:
Current criteria for judging an unreliable VF result (FP rate >15% and SPE) can lead to data being erroneously excluded, as many results do not show significant differences compared to those deemed “reliable.” Censoring of sensitivity results ≤19 dB improves intrasession correlations in VF results.
Translational Relevance:
We provide guidelines for assessing the impact of FP, SPE, and low sensitivity results on VF interpretation.
Visual field testing remains an integral part of the clinical assessment of glaucoma, as it provides a means to diagnose, prognosticate, and determine the impact of the disease.1,2 Recommendations for visual field testing have highlighted the need to obtain at least 6 results within the first 2 years to obtain a robust impression of disease progression or stability.3 This recommendation reflects the inherent variability of results obtained during testing, arising from both patient- and instrument-related factors.4
Although historical methods of perimetric testing have not been conducive to obtaining this many results in routine clinical practice due to the long test duration, recent algorithmic changes leading to faster testing protocols, such as SITA-Faster, have provided an opportunity to meet these recommendations.5–7 The frontloading approach, in which more than one visual field test is performed per clinical visit, has been proposed as a practical method for capturing sufficient clinical data to facilitate more confident clinical diagnosis and patient management.8,9
SITA-Faster has been found to have a slightly greater propensity than its predecessor, SITA-Standard, to return results that do not meet commonly used “reliability” criteria.6 Such results are often disregarded, either manually by the clinician or automatically by computer-based progression analysis techniques. Because of this apparent unreliability, clinicians may question the value of SITA-Faster and its application in the frontloading approach, potentially favoring SITA-Standard instead.
However, the impact of failing clinically used indicators of reliability, such as excessive false positives and seeding point errors, on the repeatability of frontloaded field tests is not well understood.8,10 The question raised is whether all results that fail to meet reliability thresholds should indeed be discarded, or whether at least part of the data remains potentially useful. It is possible that such criteria, which are often regarded in a binarized pass/fail fashion, do not provide clinicians with the opportunity to retain at least some useful clinical data. For example, Heijl and colleagues11 have recently presented findings suggesting that false positive metrics are not strongly associated with output perimetric indices using the SITA family of algorithms, recommending that historical cutoffs such as the 15% false positive limit should be revised. Overall, such questions raise the possibility that conventionally used parameters for assessing reliability may not truly reflect the usefulness of the test result.
The purpose of the present study was to examine the impact of two reliability metrics (false positive errors and seeding point errors) on pairs of frontloaded visual field tests performed within the same clinical visit using SITA-Faster. The central hypothesis was that there is a threshold up to which reportedly unreliable clinical data remain comparable to those of their reportedly reliable counterparts. We used three approaches to test this hypothesis. First, we compared mean deviation and mean sensitivity between reliable-unreliable pairs of visual field tests, allowing us to determine whether there was a threshold at which the difference in mean deviation or mean sensitivity began to change significantly. Second, we analyzed pointwise sensitivity changes across the visual field, and additionally applied cluster analysis to determine whether particular areas were likely to show greater differences in sensitivity in reliable-unreliable pairs. Third, we examined the effect of “correcting” test locations exhibiting low test reliability on the correlation between reliable-unreliable pairs. Alongside these approaches, we also examined the contribution of low sensitivity values (at or below 19 dB) to the correlations between results. In combination, these results would allow us to provide recommendations on extracting useful information from apparently unreliable visual field results obtained using SITA-Faster.
For the present study, we utilized data from consecutive patients who returned the following combinations of visual field results: both tests reliable, first reliable and second unreliable, or first unreliable and second reliable.
We specifically focused on two criteria that are used to identify results as unreliable in clinical practice, which might lead to a result being excluded from analysis. We note that our use of “reliable” or “unreliable” nomenclature specifically refers to the current clinical perception of these metrics, rather than an objective measure of their ability to identify an uninterpretable visual field result. When performing our analysis, we instead use the terms “passed criteria” (as a surrogate for “reliable”) and “failed criteria” (as a surrogate for “unreliable”) with reference to the criteria set out below.
The first criterion was the presence of seeding point errors,10 a relatively common occurrence arising from the features of the SITA-Faster algorithm (Fig. 1A). In this error, at least one of the four primary seeding points initially tested in the grid is abnormally low in sensitivity in the absence of pathology. The result is one or more isolated points of artificial sensitivity reduction that may consequently affect calculation of the hill of vision and global indices. We used the following definition: at least one seeding point flagged at the P < 0.0001 level, or a product of the normative significance values of two or more seeding points equal to or less than 0.0001.8
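As an illustrative sketch only (not the study's actual implementation), this rule can be expressed programmatically, assuming the normative significance (P) values of the four primary seeding points have already been extracted:

```python
import numpy as np

def has_seeding_point_error(p_values, alpha=1e-4):
    """Flag a seeding point error from the normative significance (P)
    values of the four primary seeding points (illustrative only).

    Rule: at least one point at P < 0.0001, or the product of the
    significance values of two or more points at or below 0.0001.
    """
    p = np.asarray(p_values, dtype=float)
    if np.any(p < alpha):  # any single point at P < 0.0001
        return True
    # Each P value is <= 1, so the product over any subset of two or
    # more points is bounded below by the product of all four; checking
    # the full product therefore covers every such subset.
    return float(np.prod(p)) <= alpha

# Example: one moderately low point alone does not trigger the error,
# but two jointly improbable points do.
print(has_seeding_point_error([0.02, 0.5, 0.8, 1.0]))    # False
print(has_seeding_point_error([0.002, 0.01, 0.8, 1.0]))  # True
```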
Second, false positive rates were extracted as the percentage reported by the instrument. A cutoff value of 15% or greater is typically used as a criterion for an unreliable result, as such rates were uncommonly seen in reliable perimetry (see Figs. 1B–D).16 Notably, this cutoff value is more stringent (lower) compared to the 33% limit used in older clinical trials.17 We extracted the absolute false positive rate for an individual test, as well as the difference in false positive rate between unreliable (>15%) and reliable tests.
Notably, Figure 1 shows three examples of elevated false positive rates exceeding the 15% cutoff (Figs. 1B, 45%; 1C, 18%; 1D, 31%). In Figure 1B, most test locations had clearly elevated sensitivity results, whereas in Figure 1C there were no instances where the sensitivity result was more than 3 dB above age-expected limits. In Figure 1D, we show an example where the false positive rate is elevated (31%), but with an arcuate pattern of loss with sensitivity values less than or equal to 19 dB (which we defined as an alternate measurement floor18,19; see more below), where measurements are not expected to be highly repeatable.
We analyzed files that met only one of the above criteria, as multiple sources of error may confound each other. In addition, we did not analyze gaze tracker errors20 in the present study, as these add another confounding layer of uncertainty, and also because gaze error is a scalar, rather than vector, measurement, rendering it difficult to elucidate its effect on the visual field result. Although current recommendations for assessing gaze tracker errors are largely qualitative,21 for the purposes of the present study, we excluded visual field results where over 20% of gaze tracker deviations exceeded 6 degrees, as per our previous studies.6,22
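A minimal sketch of this exclusion rule, assuming per-stimulus gaze deviations (in degrees) have been read from the gaze tracker trace (the function and variable names are illustrative, not from the study's pipeline):

```python
import numpy as np

def fails_gaze_criterion(gaze_deviations_deg, max_deviation=6.0,
                         max_fraction=0.20):
    """Return True if more than 20% of gaze tracker deviations exceed
    6 degrees, per the exclusion rule described above (illustrative)."""
    deviations = np.asarray(gaze_deviations_deg, dtype=float)
    return np.mean(deviations > max_deviation) > max_fraction
```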
The group of patients whose visual field results did not meet either criterion on both tests (i.e. deemed “passed criteria”) served a two-fold role. First, it served as the reference group for intrasession retest variability of global and pointwise measurements of interest, against which the unreliable results could be compared (see more below). Second, given that the false positive unreliability criterion uses a cutoff of 15%, this group enabled analysis of a spectrum of lower false positive rates up to the cutoff point.
As per current clinical protocols at the Centre for Eye Health, all patients underwent visual field testing twice for each eye within the same test session. The order of testing was at the discretion of the administering technician, with rest breaks between each test as requested by the patient. All testing was performed using the Humphrey Field Analyzer 3 instrument, using the 24-2 test grid and SITA-Faster algorithm (Carl Zeiss Meditec, Dublin, CA).
Visual field data of interest were the right and left eye results (or only one eye in cases where the patient was monocular) collected within the same clinical visit. A custom-written MATLAB program (The MathWorks, Natick, MA) was used to extract the following parameters of interest from each visual field printout: pointwise visual field sensitivity, mean deviation, pattern standard deviation, test duration, and false positive rate.
We examined the role of reliability metrics on visual field outputs using the following three approaches.
Approach 2: Analysis of Pointwise Sensitivity Results: Difference in Sensitivity and Cluster Analysis
Aside from examining the extent to which current reliability metrics can be used to identify altered visual field test results, we also sought to determine if methods for correcting for erroneous sensitivity measurements might provide more useful clinical information.
The relationships identified in approaches 1 and 2 might yield a correction factor that could be applied to unreliable results. To analyze this, we used mean sensitivity and measured changes in the model using the coefficient of determination and the width of the 95% prediction interval.
Because the calculation of mean deviation involves an instrument-specific, proprietary modulation on top of the individual's sensitivity result and is scaled across different visual field locations, we also calculated and compared mean sensitivity. The calculation of mean sensitivity has been detailed previously.27–29 First, the decibel value (dB) returned by the instrument was converted to a linear luminance threshold (in cd·m−2; Equation 1). Then, the linear contrast values were averaged to represent the linearized sensitivity. Finally, the average linear sensitivity was converted back into a decibel value (Equation 2).
\begin{equation}\Delta L = \frac{3183}{10^{\frac{dB}{10}}}\end{equation}
\begin{equation}Mean\;sensitivity = 10 \times \log_{10}\left( \frac{3183}{Average\;luminance} \right)\end{equation}
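A minimal sketch of this calculation (Equations 1 and 2), assuming the pointwise sensitivities are available as decibel values (the function name is illustrative):

```python
import numpy as np

def mean_sensitivity_db(sensitivities_db):
    """Linearized mean sensitivity per Equations 1 and 2: convert each
    decibel value to a linear luminance threshold (cd/m^2), average in
    the linear domain, then convert back to decibels."""
    db = np.asarray(sensitivities_db, dtype=float)
    delta_l = 3183.0 / (10.0 ** (db / 10.0))         # Equation 1
    return 10.0 * np.log10(3183.0 / delta_l.mean())  # Equation 2
```

Because the averaging occurs in the linear luminance domain, locations with low decibel sensitivity (large luminance thresholds) dominate the average and pull the mean sensitivity down.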
Of the 1575 eyes of 779 patients seen within the data extraction period, 913 (57.3%) had both fields “passed,” 183 (11.5%) had only the first “passed,” 293 (18.4%) had only the second “passed,” and 186 had both “failed.” Of the 476 eyes with a “passed”–“failed” pair, 138 (29.0% of the “failed” group) had a false positive rate >15% alone and 109 (22.9%) had seeding point errors alone; these were used for analysis (1826 results with both tests “passed,” 276 with a false positive rate >15%, and 218 with seeding point errors, totaling 2320 visual field results analyzed).
The characteristics of the 1160 eyes analyzed in the present study are shown in the Table. There were no differences in the distributions of age and diagnoses between groups. There were more women in the seeding point error group, and more left eyes showing both results having “passed criteria.” The latter result was expected, most likely due to a combination of learning, practice, and instruction effects, as we have previously discussed.8 In brief, although perimetrists were permitted to test the eyes in either order at their discretion, the right eye was tested first >90% of the time (724/779 patients contributing to the “pass” and “fail” pairs noted above); after initial experience with the test, the left eye, tested second, returned more instances of “passing” the reliability criteria. The overall mean deviation value was lower in the seeding point error group, but the range was narrow, with no instances of patients with more advanced loss. This was likely due to the definition of seeding point errors, in which a prominent defect at a seeding point may be attributable to pathological loss. There were also differences in the distributions of ethnicities across the reliability categories.
Table. Demographic and Diagnostic Parameters of the Patients Whose Eyes Were Used for the Present Study, Categorized by Their Reliability Output
Approach 2A: Analysis of Pointwise Sensitivity Results: Difference in Sensitivity
Approach 2B: Cluster Analysis Applied to Pointwise Sensitivity Differences Across the Test Grid
The present study sought to systematically determine the impact of two clinically used “reliability” parameters found in SITA-Faster visual field results: elevated false positive rates and seeding point errors. Absolute and relative false positive rates in excess of 12% to 13% were associated with significantly higher sensitivity measurements and with a greater number of test locations with sensitivity values more than 3 dB above the reference result. Seeding point errors predictably led to lower sensitivity measurements at the four primary seeding locations by approximately 2 to 3 dB. Despite the systematic characteristics of the sensitivity changes arising from these two parameters, the differences were small in magnitude and thus unlikely to be clinically significant. These results therefore raise the question of whether perimetric results that fail these manufacturer-defined reliability criteria need to be excluded from clinical interpretation or progression analysis, and whether the criteria themselves are antiquated. Irrespective of the error type, censorship of sensitivity results at or below 19 dB improved the correlation between intrasession results.
Practical Recommendation 1 for Interpreting Reliability Metrics: False Positive Rates
Recently, Heijl and colleagues11 examined the effect of false positive rates on intrasession perimetric results using the SITA algorithms, including SITA-Faster, in a cross-sectional setting. As expected, our results were similar to theirs, in that higher false positive rates were associated with higher mean deviation scores. Our rate of increase in mean deviation score per percentage point of false positive rate was slightly higher than that of Heijl and colleagues11 (0.9 dB per 10% increase in relative false positive rates above 13%, or 1 dB per 10% increase in absolute false positive rates above 23%, compared with 0.3–0.6 dB per 10%). We note that we used a segmental linear regression, which identified a larger change in mean deviation in excess of a false positive rate of 12% to 13%. Our “cutoff” point for when the difference in mean deviation increased significantly was at an absolute false positive rate of 23%, with the rate of increase in mean deviation difference remaining small. We postulate that this discrepancy may also reflect the ranges of false positive rates studied: our sample had a larger range of false positive values (up to 45%), potentially leading to a more pronounced sensitivity elevation.
The rate of increase in mean deviation was slightly lower than that reported by Yohannan and colleagues.33 However, our methods differed in several ways. We did not set the inflection point for a significant change a priori, allowing us to identify the point at which the change in slope of the score as a function of false positive rate became statistically significantly different from 0. We also compared intrasession results using SITA-Faster (which has been documented to return higher false positive rates6,8), and our sample had a higher proportion of results with a false positive rate >20% (approximately 5.5%).
Additionally, we found a statistically significant, but overall small and poorly explained, relationship between the false positive rate and the number of test locations returning sensitivity more than 3 dB higher on the high false positive result, again with an inflection at approximately 13%. Similar to the effect on mean deviation, the effect of elevated false positive rates on the number of test locations with elevated sensitivity was small, at 0.25 to 0.28 locations per percentage point in excess of an absolute false positive rate of 13%. When comparing intrasession visual field tests, the slope describing the number of test locations as a function of the difference in false positive rates was steeper (0.40–0.43 per 1%). Although this has implications for assessing the repeatability of cluster criteria in intrasession visual field tests, the difference in false positive rate between tests again needs to be in excess of 12%. Both the mean deviation and sensitivity change data were fit using a segmental linear regression, rather than an exponential function, owing to the relatively sparse sample of subjects exhibiting large magnitudes of mean deviation and sensitivity differences at the upper limits of false positive rates. A sample with more diverse visual field artifacts may serve to further explore this potential exponential relationship.
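For readers unfamiliar with segmental (“broken-stick”) regression, the sketch below shows one common way to fit such a model with a free breakpoint. It is illustrative only, using synthetic data and assumed parameter values, and is not the study's actual fitting code:

```python
import numpy as np
from scipy.optimize import curve_fit

def segmented_linear(x, breakpoint, value_at_break, slope_low, slope_high):
    """Two-segment linear model, continuous at the breakpoint:
    slope_low applies below the breakpoint, slope_high above it."""
    return np.where(x < breakpoint,
                    value_at_break + slope_low * (x - breakpoint),
                    value_at_break + slope_high * (x - breakpoint))

# Synthetic example: a flat mean deviation difference below a ~12%
# false positive rate, then a shallow positive slope above it.
rng = np.random.default_rng(0)
fp_rate = rng.uniform(0.0, 45.0, 400)
md_diff = segmented_linear(fp_rate, 12.0, 0.0, 0.0, 0.09)
md_diff += rng.normal(0.0, 0.4, fp_rate.size)  # measurement noise

# Fit all four parameters; p0 seeds the optimizer near plausible values.
params, _ = curve_fit(segmented_linear, fp_rate, md_diff,
                      p0=[15.0, 0.0, 0.0, 0.05])
print("estimated breakpoint: %.1f%%" % params[0])
```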
The slight tendency for more peripheral test locations to exhibit elevated sensitivity on cluster analysis was unsurprising, given the known greater effects of spatial uncertainty in those test regions.34,35 However, the overall frequency of elevated sensitivity results was low. At the individual level, the small magnitude of the effect of false positives was evident in the lack of improvement in correlation following correction of sensitivity results.
In combination, the small effect size on the commonly used mean deviation and on pointwise sensitivity across most of the visual field suggests that false positive rates should be regarded along a continuum. Therefore, based on our results and those of Heijl and colleagues (who notably had a different study design),11 there is evidence across different testing modalities that the historical precedent of a 15% false positive rate as a cutoff for reliability should be reconsidered, as useful information can still be obtained from results with apparently high false positive rates. For example, if we expect the difference in mean deviation score between frontloaded tests to be within an illustrative magnitude of ±2 dB 95% of the time, a false positive rate of up to 32% (20% above the 12% level) might still be within the range of expected variability. When combined with the relationship found between mean deviation difference and absolute false positive rate, the results suggest that one of the results would need a false positive rate of approximately 43% (20% above the 23% level) to produce a 2 dB difference in mean deviation. At this false positive rate, one might also expect to see 8 to 10 test locations with falsely elevated sensitivity readings. Careful clinical examination of the sensitivity maps would provide further insight.
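The arithmetic behind these illustrative cutoffs follows directly from the fitted slopes reported above (the ±2 dB tolerance is an assumed, illustrative figure, not a recommended limit):
\begin{equation}\Delta MD \approx 0.09\;\mathrm{dB/\%} \times \left( 32\% - 12\% \right) \approx 1.8\;\mathrm{dB} \approx 2\;\mathrm{dB}\end{equation}
\begin{equation}\Delta MD \approx 0.10\;\mathrm{dB/\%} \times \left( 43\% - 23\% \right) = 2\;\mathrm{dB}\end{equation}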
However, the converse may also occur, whereby low false positive rates “within” the current cutoff of 15% may also be accompanied by artificial elevations in mean deviation. Such false positive rates also do not preclude falsely elevated sensitivity measurements. Thus, when assessing visual field results, false positive rates should be used as a guide rather than as a dogmatic, binarized pass-fail criterion.
Practical Recommendation 2 for Interpreting Reliability Metrics: Seeding Point Errors
Practical Recommendation 3 for Interpreting Reliability Metrics: Censorship of Points Reaching the Measurement Floor
Practical Implications for Frontloading and Obtaining “Reliable” Visual Field Data
Our consecutive sampling approach was used to minimize selection bias and to reflect the real-world probability of error identification in visual field testing. As such, our sample did not have a high prevalence of more advanced cases of glaucoma and vision loss, and so our comments and recommendations are not directly applicable to cases of advanced visual field loss, nor do we provide information on the impact of these metrics as a function of the magnitude of defect. In relation to the cohort tested, there were some differences in the distributions of self-reported gender and ethnicity across the reliability outcomes, but the reasons for this were not explored in the present study. Importantly, significant false positive errors may mask specific scotomata, confounding interpretation and progression analysis. Patterns of falsely elevated sensitivity results that are inconsistent with age-expected normative values, historical data, or structural parameters, even independent of the false positive metric, should alert clinicians to the possibility of a false positive result.
We also focused on two distinct reliability metrics: elevated false positive rates and seeding point errors. These error types were predicted to have opposing effects on sensitivity, with false positive errors causing higher sensitivity and seeding point errors leading to lower sensitivity. There may be interactions between these metrics, as well as contributions from other metrics, such as gaze deviations. Such an analysis would require a more complex approach and would benefit from future study.
Elevated false positive rates and seeding point errors are common occurrences in SITA-Faster, with worsened correlations between results driven significantly by sensitivity readings less than or equal to 19 dB. However, injudicious application of the historical criteria for elevated false positive rates and seeding point errors may erroneously exclude usable clinical data. We therefore provide the following three recommendations:
- Recommendation 1: Reconsider the historical precedent of 15% false positive rate as a cutoff for reliability, as higher rates of false positives may still produce useful results. Because the converse may also be true (low false positive rates may be accompanied by falsely high mean deviation results), a dogmatic approach to false positive cutoffs is not recommended.
- Recommendation 2: The presence of seeding point errors should prompt clinicians to disregard erroneously seeded points, but not necessarily to disregard results at other test locations. Repeating the test is specifically recommended when erroneously seeded points are relevant to cross-sectional or longitudinal interpretation of potential scotomata.
- Recommendation 3: Censorship of sensitivity values at or below 19 dB improves intrasession pointwise sensitivity correlations between visual field results (a minimal sketch of this censoring step follows this list). However, results at or below 19 dB should still be integrated into pointwise progression analysis and global metrics such as mean deviation.
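As a hedged illustration of Recommendation 3, assuming two same-visit tests are stored as arrays of the 52 non-blind-spot 24-2 sensitivities (the study's exact correlation procedure may differ):

```python
import numpy as np

def censored_pointwise_correlation(first_test_db, second_test_db,
                                   floor_db=19.0):
    """Pearson correlation of intrasession pointwise sensitivities after
    censoring any location at or below the 19 dB measurement floor on
    either test (illustrative implementation of Recommendation 3)."""
    a = np.asarray(first_test_db, dtype=float)
    b = np.asarray(second_test_db, dtype=float)
    keep = (a > floor_db) & (b > floor_db)  # censor floor-level points
    return np.corrcoef(a[keep], b[keep])[0, 1]
```

Note that the censoring applies only to the correlation analysis; per Recommendation 3, the censored values would still contribute to progression analysis and global indices.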
In situations where metrics, such as elevated false positive rates or seeding point errors, are found to confound accurate visual field interpretation cross-sectionally or longitudinally, the recommendation remains to repeat the test to attempt to overcome these artifacts. Although, in many instances, their effects may be small, recognition of these artifacts, especially in the context of the patient's aggregate clinical findings, remains critical for both automated and manual methods of visual field interpretation.
Supported in part by an NHMRC Ideas Grant to M.K. and J.P. (1186915), and a University of New South Wales Science Early Career Academic Network Seeding Grant to J.P. Guide Dogs NSW/ACT provided funding for the clinical services enabling data collection for this study. Guide Dogs NSW/ACT also provides salary support for J.P. and M.K. and support for clinical service delivery at Centre for Eye Health, from which the clinical data was derived. The funding body had no role in the conception or design of the study.
Disclosure: J. Phu, None; M. Kalloniatis, None