**Purpose**:
The purpose of this study was to accurately forecast future reliable visual field (VF) mean deviation (MD) values by correcting for poor reliability.

**Methods**:
Four linear regression techniques (standard, unfiltered, corrected, and weighted) were fit to VF data from 5939 eyes with a final reliable VF. For each eye, all VFs, except the final one, were used to fit the models. Then, the difference between the final VF MD value and each model's estimate for the final VF MD value was used to calculate model error. We aggregated the error for each model across all eyes to compare model performance. The results were further broken down into eye-level reliability subgroups to track performance as reliability levels fluctuate.

**Results**:
The standard method, used in the Humphrey Field Analyzer (HFA), was the worst performing model with an average residual that was 0.69 dB higher than the average from the unfiltered method, and 0.79 dB higher than that of the weighted and corrected methods. The weighted method was the best performing model, beating the standard model by as much as 1.75 dB in the 40% to 50% eye-level reliability subgroup. However, its average 95% prediction interval was relatively large at 7.67 dB.

**Conclusions**:
Including all VFs in the trend estimation has more predictive power for future reliable VFs than excluding unreliable VFs. Correcting for VF reliability further improves model accuracy.

**Translational Relevance**:
The VF correction methods described in this paper may allow clinicians to catch VF worsening at an earlier stage.

^{1–4}Whereas VF tests help clinicians monitor longitudinal changes in glaucoma disease trajectory, they are associated with known variability, making interpretation of changes difficult.

^{5}There is growing evidence that mean deviation (MD) values depend on reliability indices, such as false positive (FP) percentages, false negative (FN) percentages, and test duration. Worse levels of these reliability metrics make the determination of change over time more challenging.

^{6}By correcting for the effects of these reliability indices on algorithms used to determine change, it may be possible to more accurately identify disease worsening and improve patient care accordingly.

^{7}In general, as unreliability increases, the measured MD value deviates further from the true MD value. False positives have the largest effect on this MD error, followed by FNs and test duration. The Guided Progression Analysis (GPA) used in the Humphrey Field Analyzer (HFA) displays a linear regression for MD values over time that does not “correct” for poor reliability.

^{8}Rather, the GPA model – henceforth called the “standard model” – simply excludes “unreliable” VF MD values, defined as VFs with more than 20% fixation losses or 15% FPs, from the MD values over time regression in order to better predict future MD values from past reliable VFs.
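As a concrete illustration, the standard model's filtering step can be sketched as follows; the dictionary keys in this snippet are illustrative, not taken from the HFA software:

```python
def standard_model_points(vfs):
    """Keep only the VFs the GPA 'standard' model would regress on:
    tests with at most 20% fixation losses and at most 15% FPs.
    Each VF is a dict; the keys used here are illustrative."""
    return [vf for vf in vfs
            if vf["fixation_loss_pct"] <= 20 and vf["fp_pct"] <= 15]
```

The filtered points would then be fit with an ordinary linear regression of MD over time.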

^{9}We evaluate three alternative approaches: (1) including all VFs in the regression without filtering, (2) correcting MD values for the effects of unreliability,^{7}and (3) weighting MD values by their reliability. We compare the performance of these three models – henceforth referred to as “unfiltered,” “corrected,” and “weighted,” respectively – to the performance of the “standard” model (used in GPA) by predicting MD values of future reliable VFs. Because we wish to forecast MD values which are as close to a true MD value as possible, and in order to compare our results to the standard model and other models used for MD forecasting,^{10–14}we restrict our analysis to forecasting future reliable VFs.

^{15}Specifically, we included eyes which had five or more VFs obtained with the Humphrey Field Analyzer (HFA II; Carl Zeiss Meditec Inc., Dublin, CA) using the Swedish Interactive Threshold Algorithm (SITA) Standard test protocol and the 24-2 pattern. Patients could have either one or both eyes included in the analyses.

^{16}

^{7}Table 1 provides the average effect that each reliability index has on ΔMD = *measured*MD − *true*MD. To predict “true MD” from measured MD values, we used Table 1 to sum the effects of FP, FN, and test duration. Fixation loss percentages were also available, but they were not used to estimate the level of unreliability, as they have been shown not to significantly affect MD values.

^{7}For example, an eye with moderate glaucoma that had 10% FPs, 0% FNs, and a test duration 30 seconds longer than the average for moderate eyes would have an expected ΔMD = 10 × 0.073 + 0.5 × (−0.35) = 0.55 dB. Assuming the measured MD value was −8.25 dB, we would expect the true value to be *true*MD = *measured*MD − ΔMD = −8.25 − 0.55 = −8.80 dB.
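The worked example above can be written as a small helper. The FP and duration coefficients (0.073 dB per percentage point of FPs; −0.35 dB per extra minute) are taken from the example itself; the FN coefficient is a placeholder, since Table 1 is not reproduced here:

```python
# Coefficients from the worked example; FN_COEF is a placeholder, as
# the actual per-index effects come from Table 1 of the paper.
FP_COEF = 0.073    # dB of MD error per percentage point of FPs
FN_COEF = 0.0      # placeholder value; see Table 1
DUR_COEF = -0.35   # dB of MD error per minute beyond the group average

def delta_md(fp_pct, fn_pct, extra_minutes):
    """Estimated MD error for one VF: dMD = measured MD - true MD."""
    return fp_pct * FP_COEF + fn_pct * FN_COEF + extra_minutes * DUR_COEF

def true_md(measured_md, fp_pct, fn_pct, extra_minutes):
    """Correct a measured MD value toward the estimated true MD."""
    return measured_md - delta_md(fp_pct, fn_pct, extra_minutes)
```

For the eye in the example (10% FPs, 0% FNs, 30 seconds = 0.5 minutes longer than average), `delta_md(10, 0, 0.5)` gives approximately 0.55 dB and `true_md(-8.25, 10, 0, 0.5)` approximately −8.80 dB.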

^{7}Because we are primarily interested in predicting future reliable VFs, we applied an even stricter error threshold to define reliability. Thus, we only included eyes in the analysis whose final VF was a “gold standard VF,” defined as having an error of at most 0.25 dB (i.e. |ΔMD| ≤ 0.25 dB). For each eye, we calculated the percentage of VFs labeled as unreliable (|ΔMD| > 0.25 dB). Then, we divided eyes into subgroups based on the percentage of visits with unreliable VFs. The subgroups were: 0% unreliable, 0% to 10%, 10% to 20%, 20% to 30%, 30% to 40%, 40% to 50%, and eyes with more than 50% unreliable VFs.
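A simple implementation of the subgroup assignment is sketched below. The handling of exact boundary values (e.g. exactly 10%) is an assumption, since the text lists adjoining range labels without stating which side is inclusive:

```python
def unreliability_subgroup(pct_unreliable):
    """Bin an eye by its percentage of unreliable VFs (|dMD| > 0.25 dB).

    Boundary values are assigned to the lower bin; this convention is
    an assumption, as the paper does not specify it.
    """
    if pct_unreliable == 0:
        return "0% unreliable"
    for upper in (10, 20, 30, 40, 50):
        if pct_unreliable <= upper:
            return f"{upper - 10}% to {upper}%"
    return "more than 50%"
```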

^{9}The unfiltered model also used measured MD values as the dependent variables but did not exclude any points from the regression. The corrected model used “corrected MD” (i.e. *measured*MD − ΔMD) as the dependent variable; individual MD values were first corrected and then fit with a regression line. The weighted model used measured MD values as the dependent variables but weighted each MD value by its reliability. The precise form of the weight for the *i*th MD value is \(weight\ of\ \mathrm{MD}_i = \frac{\gamma}{\gamma + |\Delta \mathrm{MD}_i|^{\rho}}\), where γ and ρ were optimized to fit the data using a 2D grid search over the entire dataset. That is, we searched over a large range of non-negative values and chose the γ and ρ which minimized the average difference between the measured MD values for the final VF and the estimated MD values from the weighted model, across all eyes.

^{17}This procedure was repeated for each reliability subgroup as well as for the overall dataset. Last, we performed the same regressions and residual calculations/comparisons for subsets of the dataset with varying upper bounds of percentage unreliability. All analysis was done using Python version 3.7.
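A minimal sketch of the weighted regression, using the closed-form solution for weighted simple linear regression; the γ and ρ defaults in the signature are illustrative placeholders, not the optimized values from the grid search:

```python
def reliability_weight(delta_md, gamma, rho):
    """Weight for one VF: gamma / (gamma + |dMD|**rho)."""
    return gamma / (gamma + abs(delta_md) ** rho)

def weighted_md_trend(times, md_values, delta_mds, gamma=1.0, rho=1.0):
    """Weighted linear regression of MD on time (closed form).

    Fully reliable VFs (dMD = 0) get weight 1; increasingly unreliable
    VFs are down-weighted toward 0.
    """
    w = [reliability_weight(d, gamma, rho) for d in delta_mds]
    sw = sum(w)
    t_bar = sum(wi * t for wi, t in zip(w, times)) / sw
    y_bar = sum(wi * y for wi, y in zip(w, md_values)) / sw
    sxy = sum(wi * (t - t_bar) * (y - y_bar)
              for wi, t, y in zip(w, times, md_values))
    sxx = sum(wi * (t - t_bar) ** 2 for wi, t in zip(w, times))
    slope = sxy / sxx
    intercept = y_bar - slope * t_bar
    return slope, intercept
```

With all weights equal to 1 this reduces to ordinary least squares; a grossly unreliable VF (large |ΔMD|) contributes almost nothing to the fitted trend.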

*P*< 10^{−16}), whereas both were roughly 0.13 dB better (*P*< 10^{−16} in both cases) than the unfiltered model. However, a shortcoming of the weighted model is the need to optimize the γ and ρ parameters to derive the weights of the weighted regression.

**P** on the x-axis includes all eyes where at most **P**% of their VFs were marked as unreliable. Note that as the percentage of maximum unreliability increased, the differences between the regression methods widened. For eyes with no unreliability, the corrected regression performed slightly better than the weighted and unfiltered regressions. Yet, as more unreliability is introduced, the weighted regression begins to outperform the corrected and unfiltered regressions. The standard deviation graph shows that as more unreliable VFs are included in the analysis, the residual distribution of the weighted regression has the smallest variance. This signifies that as unreliability increases, the residuals grow in magnitude and vary more across eyes. Even so, the weighted regression has stronger predictive power than both the corrected and unfiltered regression methods. The weighted model also exhibited greater precision in its estimates: the average size of the 95% prediction interval for the weighted model was 7.67 dB, compared to 8.70 dB for the corrected model and 8.80 dB for unfiltered linear regression.
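The text does not state how the prediction intervals were computed; assuming the usual least-squares formula for a new observation, the 95% prediction interval at a future time \(t_0\) would be

```latex
\mathrm{PI}_{95\%}(t_0) \;=\; \hat{y}(t_0) \;\pm\; t_{0.975,\,n-2}\, s \,\sqrt{1 + \frac{1}{n} + \frac{(t_0 - \bar{t}\,)^2}{\sum_{i=1}^{n}(t_i - \bar{t}\,)^2}}
```

where \(\hat{y}(t_0)\) is the fitted MD at \(t_0\), \(s\) is the residual standard error, and \(n\) is the number of VFs in the fit. The \(1\) under the square root reflects the variability of a single new measurement, which is why prediction intervals remain wide even when the trend itself is well estimated.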

^{18}Their experimental design differed in that their models used the entire VF as input with the goal of predicting a future VF. After testing hundreds of models, their best performing model achieved a mean MD difference of +0.41 dB. Note that this value appears to reflect the average of the raw residuals in the forecasting model. Our reported results use the magnitudes of the residuals, which is likely a better measure of model error: errors could be large in magnitude, yet if they were perfectly symmetrical, the average raw residual would be 0 dB, which is uninformative. If we were to compare mean raw residuals, our corrected and weighted models achieve mean MD differences of +0.01 dB and −0.02 dB, respectively. Garcia et al. (2019) used Kalman filtering to predict MD values 5 years into the future and were able to predict within 2.5 dB for the majority of eyes.

^{12}Although the prediction time interval is much wider than in our study (most of the MD values we predict in our analysis were 1 year into the future), our model performance was better in terms of MD error.
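The earlier point about raw versus absolute residuals can be illustrated with a toy example (the residual values below are invented for illustration):

```python
# Symmetric errors cancel in the mean raw residual but not in the
# mean absolute residual, so a near-zero raw mean can hide large errors.
residuals = [2.0, -2.0, 1.5, -1.5]  # invented example values, in dB

mean_raw = sum(residuals) / len(residuals)
mean_abs = sum(abs(r) for r in residuals) / len(residuals)

print(mean_raw)  # 0.0 dB: looks perfect despite sizeable errors
print(mean_abs)  # 1.75 dB: reflects the actual error magnitude
```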

^{9}as there is practically no difference in trend estimation (VFI versus MD over time) for most eyes.

^{19}Finally, we note that the 95% prediction intervals (PIs) for all models were relatively large, which limits how useful these models may be for predicting individual future measurements.

**G.A. Villasana,** None;

**C. Bradley,** None;

**T. Elze,** None;

**J.S. Myers,** None;

**L. Pasquale,** None;

**C.G. De Moraes,** None;

**S. Wellik,** None;

**M.V. Boland,** Carl Zeiss Meditec (C);

**P. Ramulu,** None;

**G. Hager,** None;

**M. Unberath,** None;

**J. Yohannan,** None

*Arch Ophthalmol*. 2002; 120(6): 701–713; discussion 829–830.

*Ophthalmology*. 2008; 115(9): 1557–1565.

*Ophthalmology*. 1999; 106(4): 653–662.

*Arch Ophthalmol*. 2002; 120(10): 1268–1279.

*Community Eye Health*. 2012; 25(79-80): 66–70.

*Ophthalmology*. 2017; 124(11): 1612–1620.

*JAMA Ophthalmol*. 2015; 133(1): 40.

*The Field Analyzer Primer: Effective Perimetry*, Fifth Edition. Dublin, CA: Carl Zeiss Meditec USA, Inc.; 2012.

*Ophthalmology*. 2004; 111(9): 1627–1635.

*Sci Rep*. 2019; 9(1): 8385.

*JAMA Ophthalmol*. 2019; 137(12): 1416–1423.

*Jpn J Ophthalmol*. 2016; 60(5): 383–387.

*Ophthalmology*. 2020; 127(3): 346–356.

*Br J Ophthalmol*. 2008; 92(4): 569–573.

*Clinical Decisions in Glaucoma*. New York, NY: Elsevier Mosby; 1993.

*Biometrics Bulletin*. 1945; 1(6): 80–83.

*PLoS One*. 2019; 14(4): e0214875.

*Am J Ophthalmol*. 2008; 145(2): 343–353.