**Purpose**:
The number of subjects needed to establish the normative limits for visual field (VF) testing is not known. Using bootstrap resampling, we determined whether the ground truth mean, distribution limits, and standard deviation (SD) could be approximated using different set size (*x*) levels, in order to provide guidance for the number of healthy subjects required to obtain robust VF normative data.

**Methods**:
We analyzed the 500 Humphrey Field Analyzer (HFA) SITA-Standard results of 116 healthy subjects and 100 HFA full threshold results of 100 psychophysically experienced healthy subjects. These VFs were resampled (bootstrapped) to determine mean sensitivity, distribution limits (5th and 95th percentiles), and SD for different ‘*x*’ and numbers of resamples. We also used the VF results of 122 glaucoma patients to determine the performance of ground truth and bootstrapped results in identifying and quantifying VF defects.

**Results**:
An *x* of 150 (for SITA-Standard) and 60 (for full threshold) produced bootstrapped descriptive statistics that were no longer different to the original distribution limits and SD. Removing outliers produced similar results. Differences between original and bootstrapped limits in detecting glaucomatous defects were minimized at *x* = 250.

**Conclusions**:
Ground truth statistics of VF sensitivities could be approximated using set sizes that are significantly smaller than the original cohort. Outlier removal facilitates the use of Gaussian statistics and does not significantly affect the distribution limits.

**Translational Relevance**:
We provide guidance for choosing the cohort size for different levels of error when performing normative comparisons with glaucoma patients.

^{1,2}Normative data are typically generated by recruiting healthy subjects, and the distribution limits of the normative data are empirically determined using conventional statistics.

^{3–10}Thus, identification of VF defects is contingent upon the characteristics of the normative distribution, and may differ across instruments and various research- and clinically based populations.

^{11}A sample size that is too large may represent an unwise use of resources (time, personnel, and cost) where a smaller cohort may provide effectively the same result, and “big data” may confound analysis through the introduction of unexpected biases.^{12}

^{10,13}Thus, we determined whether the presence of outliers affects the distribution of normal VF sensitivities. Finally, we considered whether the normative distribution limits obtained from both the original cohort and bootstrapped data sets returned similar numbers and depths of VF defects, ‘events', in a cohort of patients with glaucoma.

^{4,13}

^{3,4,10,13–15}We continue to use this method in the present study to pool data within our cohorts by converting all sensitivity results in a point-wise manner into a 50-year-old equivalent patient.

^{14,16}Here, the null hypothesis was that there is no significant age-related effect, meaning that sensitivities may simply be grouped and pooled together. When sensitivities (in dB) were plotted as a function of age (in years), linear regression analysis found slopes that were significantly different to zero at all locations, indicating an age-related effect on sensitivities. Thus, to mitigate the effect of age, we age-corrected all sensitivities to that of a 50-year-old equivalent patient, as per previously published methods and using the slopes of change found using the above regression analysis (Supplementary Fig. S1).
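The age correction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `fit_slope` and `age_correct` are hypothetical, an ordinary least-squares fit is assumed for the linear regression, and the slope is assumed to be expressed in dB per year at a single test location.

```python
def fit_slope(ages, sensitivities):
    """Ordinary least-squares slope (dB per year) of sensitivity against age.

    Assumption: one slope is fitted per test location from the healthy cohort.
    """
    n = len(ages)
    mean_age = sum(ages) / n
    mean_sens = sum(sensitivities) / n
    sxy = sum((a - mean_age) * (s - mean_sens) for a, s in zip(ages, sensitivities))
    sxx = sum((a - mean_age) ** 2 for a in ages)
    return sxy / sxx


def age_correct(sensitivity_db, age_years, slope_db_per_year, reference_age=50.0):
    """Shift a measured sensitivity to its 50-year-old equivalent.

    With a negative slope (sensitivity falling with age), an older subject's
    value is raised and a younger subject's value is lowered.
    """
    return sensitivity_db + slope_db_per_year * (reference_age - age_years)


# Illustrative values only (not data from the study).
ages = [30.0, 50.0, 70.0]
sens = [31.0, 30.0, 29.0]
slope = fit_slope(ages, sens)
corrected = age_correct(29.0, 70.0, slope)
```

After this step every value is expressed on a common 50-year-old scale, which is what allows results from subjects of different ages to be pooled.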

^{3,10,13–15}These slopes were slightly higher (by 0.11 dB per decade, *t* = 3.306, df = 53, *P* = 0.0017) than those found by Heijl et al.^{10} using the full threshold procedure, and may reflect the extra modulation factor used in SITA.

^{17}Thus, we also examined the sensitivities of a prospectively recruited cohort of 100 healthy subjects (mean age: 36.5 ± 15.7 years; 46 males) who had undergone VF testing on the HFA using the full threshold paradigm and the 30-2 test grid. Each subject contributed only one test result from one eye, that is, a single average sensitivity at each of the 75 test locations within the 30-2 test grid (including the fovea and excluding the 2 points near the blind spot), following extensive prior perimetric testing experience. These results were also age-corrected into a 50-year-old equivalent subject for pooling and analysis.

^{13}

^{18}From the original cohort, we resampled a subset of the data (set size *x*). Because sampling was performed with replacement, a resample could contain the same subject more than once. This process was repeated *k* (number of resamples) times.

*x*). For example, *x* = 6 indicates a set of six sensitivity values resampled from the original total cohort (i.e., either the retrospective or prospective cohort). Using the resampled sensitivities, we determined the mean (i.e., the central tendency), the 95th and 5th percentiles (i.e., the distribution limits), and the standard deviation (SD). We tested a range of values for *x* (for the retrospective cohort: *x* = 6, 12, 24, 36, 48, 60, 75, 100, 150, 200, 250, 300, 350, 400, 450, and 500; for the prospective cohort: *x* = 6, 12, 18, 24, 30, 36, 48, 60, 72, 85, and 100), which was capped at the total number of subjects in the cohort. To determine the confidence limits for the descriptive parameters from these set sizes, we tested two levels of *k* (the number of resamples drawn from the total cohort): *k* = 100 and *k* = 200. We also tested whether the value of *k* affected the resultant descriptive statistics independently of the set size. Thus, for set sizes of *x* = 6, 30, 60, and 500, we tested different levels of *k*: 1, 4, 8, 12, 16, 20, 24, 28, 36, 48, 60, 72, 84, 96, 120, 150, 200, 250, 300, 400, 500, and 750. This bootstrap procedure was performed using a custom-written macro in the Visual Basic Editor of Microsoft Excel 2010 (Microsoft Corporation, Redmond, WA).
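The resampling procedure can be illustrated with a short Python sketch. It is not the Excel macro used in the study. The function names, the linear-interpolation percentile rule, and the choice to summarize the *k* resamples by averaging each statistic are all assumptions made for illustration.

```python
import random
import statistics


def percentile(values, p):
    """Linearly interpolated percentile (p in 0..100); the rule is an assumption."""
    ordered = sorted(values)
    if len(ordered) == 1:
        return ordered[0]
    rank = (len(ordered) - 1) * p / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def describe(values):
    """Mean, distribution limits (5th and 95th percentiles), and SD of one set."""
    return {
        "mean": statistics.fmean(values),
        "p5": percentile(values, 5),
        "p95": percentile(values, 95),
        "sd": statistics.stdev(values),
    }


def bootstrap(cohort, x, k, seed=0):
    """Draw k resamples of size x with replacement and average each statistic.

    Averaging across resamples is an illustrative assumption; x must be >= 2
    so that the SD of each resample is defined.
    """
    rng = random.Random(seed)
    resamples = [describe(rng.choices(cohort, k=x)) for _ in range(k)]
    return {name: statistics.fmean([r[name] for r in resamples])
            for name in resamples[0]}


# Illustrative sensitivities in dB (not study data).
cohort = [float(v) for v in range(20, 36)]
ground_truth = describe(cohort)
estimate = bootstrap(cohort, x=6, k=100, seed=1)
```

Comparing `estimate` with `ground_truth` for increasing `x` reproduces the kind of comparison reported in the Results, where small set sizes pull the distribution limits toward the mean.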

*x* and *k*. The difference (in dB) was plotted as a function of *x* to determine the limit at which there ceased to be a significant change (i.e., when the difference between ground truth and bootstrapped parameters was minimized). A positive difference indicates that the ground truth parameter was higher than the bootstrapped parameter, while a negative difference indicates the converse.
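The sign convention can be made explicit with a small sketch. The function name and dictionary layout are hypothetical, and the numbers are illustrative rather than study results.

```python
def signed_difference(ground_truth, bootstrapped):
    """Ground truth minus bootstrap (dB) for each descriptive parameter.

    Positive: the ground truth value is higher than the bootstrapped value.
    Negative: the bootstrapped value is higher than the ground truth value.
    """
    return {name: ground_truth[name] - bootstrapped[name] for name in ground_truth}


# Illustrative values only.
diff = signed_difference({"mean": 30.0, "p5": 26.0}, {"mean": 29.5, "p5": 27.0})
```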

^{19}In this method of robust nonlinear regression, variation around the curve is assumed to follow a Lorentzian rather than a Gaussian distribution; because the Lorentzian distribution has wide tails, the fit is less affected by outliers. Unlike least squares fitting, which quantifies the variance around the curve using S_{y.x} (the SD of the residuals), robust nonlinear regression determines the 68.27th percentile of the absolute values of the residuals (equivalent to 1 SD from the mean in a Gaussian distribution), and this is called the robust SD of the residuals (RSDR). Each residual is divided by the RSDR, and this ratio approximates a *t* distribution, from which a *P* value can be determined. Points identified as outliers at the chosen value of Q (which controls the false discovery rate) are then removed.
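The RSDR calculation can be sketched as below. This is a simplified illustration under stated assumptions: it uses a linearly interpolated percentile, omits any degrees-of-freedom correction, and stops before the *P* value and false discovery rate steps, so it is not a full implementation of the outlier-removal method. The function names are hypothetical.

```python
def rsdr(residuals):
    """Robust SD of the residuals: 68.27th percentile of the absolute residuals.

    Simplification: no degrees-of-freedom correction is applied.
    """
    ordered = sorted(abs(r) for r in residuals)
    if len(ordered) == 1:
        return ordered[0]
    rank = (len(ordered) - 1) * 0.6827
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def scaled_residuals(residuals):
    """Each residual divided by the RSDR; large ratios mark candidate outliers."""
    scale = rsdr(residuals)
    return [r / scale for r in residuals]


# Illustrative residuals in dB; the last value is far from the rest.
residuals = [0.5, -0.5, 1.0, -1.0, 8.0]
robust_sd = rsdr(residuals)
ratios = scaled_residuals(residuals)
```

Because the RSDR is based on a percentile rather than a sum of squares, the single extreme residual barely changes the scale estimate, which is why its ratio stands out.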

*n* = 14, 0.05%), 1% (*n* = 24, 0.09%), and 10% (*n* = 119, 0.46%), and examined the sensitivities identified as outliers. As expected, a Q of 10% removed the greatest number of outliers at all locations; over 52 test locations, this equated to approximately 1.8 more values removed per location compared with the 1% level. Central tendency results were similar, but the variance was reduced when using Q = 10%. The Q = 10% condition removed points that were at least 3.3 SD away from the mean, equating to a *P* value of 0.05%, which is the lowest level of significance flagged on the HFA total deviation and pattern deviation maps. Thus, in order to obtain data with the most likely outliers removed, we continue to report results using Q = 10%.

*x* and *k*. Furthermore, the mean and distribution limits were also compared between the complete data set (i.e., including those points deemed to be outliers) and the cleaned data set. Here, we tested the hypothesis that the cohort inclusive of all points and the cohort trimmed of outliers return different results in descriptive statistics and also have different levels of *k* at which no change from the ground truth parameter is obtained.

^{3,4}Intraocular pressures were not used as part of the diagnostic criteria. Inclusion criteria for analysis of their VF results were as per the healthy cohort described above.

^{3–5}This offered a practical method for assessing the normative cohort sizes required to result in the same level of performance in terms of defect identification as when using the ground truth parameters. We used 390 SITA-Standard VF results of 112 patients with open-angle glaucoma (mean age: 62.0 ± 12.6 years; 189 right eyes; 74 males; average mean deviation: −3.56 ± 3.89 dB), and determined the number and depth of ‘events' (difference in sensitivity from the mean in dB) flagged when using the 5th percentiles obtained from the ground truth and from the different levels of *k*. We determined the level of *k* at which there was no longer a significant change from the ground truth value. This analysis was performed when using both the complete data set and when outliers were removed.
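Counting ‘events' and their depth can be sketched as follows. The sketch assumes point-wise lists of patient sensitivities, normative means, and normative 5th percentile limits (all in dB, all age-corrected in the same way), a strict "below the limit" rule, and the mean depth as the summary of depth. These choices and the function name are illustrative assumptions.

```python
def flag_events(patient_db, normal_mean_db, normal_p5_db):
    """Return (number of events, mean depth in dB) for one visual field.

    An event is a location whose sensitivity falls below the normative 5th
    percentile; its depth is the normative mean minus the patient sensitivity.
    """
    depths = [
        mean - sens
        for sens, mean, limit in zip(patient_db, normal_mean_db, normal_p5_db)
        if sens < limit
    ]
    count = len(depths)
    mean_depth = sum(depths) / count if count else 0.0
    return count, mean_depth


# Illustrative three-location field (not patient data).
count, depth = flag_events([30.0, 25.0, 20.0], [31.0, 31.0, 31.0], [28.0, 28.0, 28.0])
```

Running the same function with limits derived from the ground truth and from each bootstrapped condition gives the paired counts and depths that are compared in the analysis.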

*P* < 0.05 was considered significant. Differences between the values obtained using bootstrapping and the ground truth parameters were plotted as a function of set size (*x*) or number of resamples (*k*). The asymptotic point was determined by one-way ANOVA and multiple comparisons. When the multiple comparisons showed no more significant differences across adjacent and successive conditions (e.g., *x* = 60, *x* = 72, *x* = 96 are considered successive conditions), then the asymptotic point was reached.
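The asymptote rule can be expressed as a short sketch. It assumes the multiple-comparison *P* values for each pair of adjacent conditions have already been computed elsewhere; the function name, the input layout, and the example *P* values are hypothetical.

```python
def asymptotic_point(set_sizes, adjacent_p_values, alpha=0.05):
    """Smallest set size after which no adjacent comparison is significant.

    adjacent_p_values[i] is the P value comparing set_sizes[i] with
    set_sizes[i + 1]. Returns None if the last comparison is still significant.
    """
    for i, x in enumerate(set_sizes[:-1]):
        if all(p >= alpha for p in adjacent_p_values[i:]):
            return x
    return None


# Hypothetical P values for illustration only.
sizes = [6, 12, 24, 36, 48]
p_values = [0.001, 0.02, 0.30, 0.80]
point = asymptotic_point(sizes, p_values)
```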

*P* < 0.05 level, at least one of which is reduced below the *P* < 0.01 level. However, as we were assessing different levels of percentile cut-offs as a surrogate specificity value (e.g., the lower 5th percentile of the healthy cohort is effectively a 95% true negative rate, or specificity), for the purpose of this analysis, we regarded three or more points of sensitivity reduction below the cut-off as a criterion for a glaucomatous VF defect. This allowed determination of the true positive rate (i.e., test “sensitivity,” which was defined as the number of patients meeting the VF cut-off criterion of 3 or more points divided by the total number of glaucoma patients) for different true negative rates (i.e., “specificity,” set at each percentile cut-off level), and thus, ROC curves.
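A minimal sketch of this criterion and of the resulting ROC summary is given below. It assumes point-wise cut-off values per location, a strict "below the cut-off" rule, and trapezoidal integration for the area under the curve; the function names and the example numbers are illustrative, not taken from the study.

```python
def meets_defect_criterion(patient_db, cutoffs_db, min_points=3):
    """True if at least min_points locations fall below their cut-off."""
    below = sum(sens < cut for sens, cut in zip(patient_db, cutoffs_db))
    return below >= min_points


def true_positive_rate(patients, cutoffs_db, min_points=3):
    """Fraction of glaucoma patients meeting the defect criterion."""
    hits = sum(meets_defect_criterion(p, cutoffs_db, min_points) for p in patients)
    return hits / len(patients)


def auroc(points):
    """Trapezoidal area under (1 - specificity, sensitivity) points."""
    pts = sorted(points)
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(pts, pts[1:]))


# Illustrative four-location fields and a single cut-off level.
patients = [[20, 20, 20, 30], [30, 30, 30, 30], [20, 20, 30, 30]]
tpr = true_positive_rate(patients, [25, 25, 25, 25])
area = auroc([(0.0, 0.0), (0.5, 1.0), (1.0, 1.0)])
```

Repeating `true_positive_rate` at each percentile cut-off yields one ROC point per specificity level, and `auroc` then summarizes the curve.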

*P* = 0.2685), 95th percentile (*P* = 0.2971), 5th percentile (*P* = 0.2017), or SD (*P* = 0.9282) compared with when the complete cohort was used (Fig. 2B).

*F*_{3.14,47.1} = 1.12, *P* = 0.3521). Therefore, we grouped the results of all test locations together for further analysis for different levels of *x* and *k*.

*k* = 100 condition, one-way ANOVA showed no significant effect of *x* on the difference between ground truth and bootstrapped means (*P* = 0.3521), but showed a significant difference in the 95th percentile (*P* < 0.0001), 5th percentile (*P* = 0.0001), and SD (*P* = 0.0049) parameters (Fig. 3A). A similar tendency was found for *k* = 200 (Fig. 3B). When outliers were removed, *k* = 100 and *k* = 200 showed the same effects as when the complete cohort was used (Figs. 3C, 3D).

*x* = 150. This was also examined using one-way ANOVAs and multiple comparisons, whereby the level of *x* at which there was no significant difference across all further adjacent levels (e.g., *x* = 60 to *x* = 72, and so forth having *P* > 0.05) was taken as the asymptotic point. These results were generally consistent with this estimation (Table 1).

*P* < 0.0001), 5th percentile (mean difference: 0.13 ± 0.21, *P* < 0.0001), and SD (mean difference: −0.14 ± 0.19, *P* < 0.0001) values when comparing the complete cohort and the results when outlier sensitivities were removed (Fig. 4). The differences in the 95th percentile value were borderline in terms of statistical significance (mean difference −0.03 ± 0.11, *P* = 0.0504). Despite the statistically significant differences, these were unlikely to be of any clinical significance, as the mean differences were well within instrument test-retest variability.^{20}

*x* on the difference between 95th percentile (*P* < 0.0001), 5th percentile (*P* < 0.0001), and SD (*P* < 0.0001) values, but not on the mean (*P* = 0.1400) for *k* = 100 (Fig. 5A). A similar tendency was found for *k* = 200, and when outliers were removed (Figs. 5B–D).

*x* = 60 across all conditions. One-way ANOVA and multiple comparisons also generally agreed with this estimation (Table 1).

*k*) for four set size conditions: *x* = 6, 30, 60, and 500 using data from the retrospective cohort (Fig. 6). There was no effect of *k* on mean (mean *P* value = 0.6086), 95th percentile (mean *P* value = 0.4488), 5th percentile (mean *P* value = 0.6697), or SD (mean *P* value = 0.6296), except for SD at the *x* = 30 condition (*P* = 0.0328). These results suggest that set size *x* is more important than the number of resamples *k* when attempting to minimize the difference to the ground truth statistic.

*x* = 6 to 300 and 400, respectively, and with *k* = 100. We then compared the level of *x* at which the multiple comparisons did not significantly change further, as described in the above analysis. For both *n* = 300 and *n* = 400, we found a level of *x* that was similar to when *n* = 500 (Table 1) was used: *x* = 150 for 95th percentile, *x* = 150 for 5th percentile, and *x* = 60 for SD (Fig. 7). Thus, this suggests that differences are stable at this level of *n*, rather than being proportional to the base “population” size used.

*k* = 200). One-way ANOVA showed a significant effect of set size *x* (*P* < 0.0001 across all conditions) on the number of ‘events' detected and depth of defect (Fig. 8). When using the 5th percentile from the original retrospective data as the ‘ground truth', smaller set sizes tended to overestimate the number of ‘events' and underestimate their depth, corresponding to higher 5th percentile and lower mean values (as per Fig. 3). Multiple comparisons showed significant differences across all bootstrapped conditions compared with the ground truth for ‘events' detected when the complete data set including outliers was used (*P* < 0.0001), but showed an asymptote at *x* = 250 when outliers were removed. There were similar asymptotes at *x* = 250 and *x* = 300 for the depth of defect for the complete data set and when outliers were removed, respectively. When setting the criterion to be a difference of one ‘event' flagged (as ‘events' are reported in integer values), *x* = 60 (complete data set) and *x* = 48 (outliers removed) were the minimum set sizes for which the difference between ground truth and bootstrapped values was less than one ‘event'.

*F*_{1,160} = 0.0258, *P* = 0.8727; Fig. 9). As expected, the AUROC was slightly greater when using a smaller set size, *x* = 6, in comparison to the other conditions, as the resultant percentile cut-off values were higher under conditions of low specificity. However, there was no significant difference between ground truth and set sizes *x* = 6, 200 and 500 (*F*_{3,160} = 0.1307, *P* = 0.9417).

^{21}This technique has been used widely in VF research.

^{22–25}There are two main variables in our approach: the set size (*x*) and the number of times a set is drawn (i.e., resampled [*k*]). Therefore, the number of resamples, *k*, could potentially affect the bootstrapped statistics. For example, a large *k* could mask the differences found with low levels of *x*. However, we found no such tendency, and the results were similar across all levels of *k*, suggesting that *x* produces the differences seen between bootstrapped and ground truth values.

^{26}For determination of the magnitude of difference between two groups, such as a treated and control group, conventional statistics and power analyses are available, but these do not necessarily provide guidance as to the number of samples required to generate normative data to serve as a comparison group.

^{3–5}The addition of 20 subjects at a time would only add one more subject with which to define the 5th percentile. The results of the present study suggest that beyond approximately 150 to 200 subjects for SITA-Standard and 60 subjects for full-threshold VF results, estimates of the distribution limits are similar to those of the ground truth, and the addition of more subjects does not provide further information. This number may vary slightly depending on the composition of the cohort (e.g., perimetrically naïve or experienced observers), but the results show that only a smaller group of subjects may be required.
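The arithmetic behind the "20 subjects" remark is simply 5% of the cohort size. A small sketch, with a hypothetical function name, shows how slowly the number of subjects defining the lower limit grows:

```python
def expected_tail_count(n_subjects, percentile=5):
    """Expected number of subjects at or below the given lower percentile."""
    return n_subjects * percentile / 100


# 20 extra subjects contribute one more value to the 5th percentile tail.
per_20 = expected_tail_count(20)
at_150 = expected_tail_count(150)
at_500 = expected_tail_count(500)
```

At 150 subjects roughly 7 to 8 values define the 5th percentile, compared with 25 at 500 subjects.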

^{10}reported significant skew in the sensitivity data of healthy subjects; in comparison, recent work by Phu et al.,^{13} having considered outliers, instead reported normally distributed sensitivity values. Outlier removal is expected to affect the lower limit of the normative distribution in particular, but the question is by how much this affects the result of a normative comparison with a group of patients with disease. Our results show that although there is a statistically significant difference in the number of ‘events' found when using normative data with and without outliers included, this difference was unlikely to be clinically significant as it was smaller than the lowest integer value. This was also consistent with the minimal change in set size required to minimize the difference between ground truth and bootstrapped values. Therefore, the removal of outliers facilitates the use of conventional, Gaussian statistics, while not affecting the rate of VF defect detection.

^{9,10}However, the prospective phase of the study, where only one VF result from each subject was used, arrived at similar conclusions. Although we sourced a large number of VF results, the true population norms are not known. The sensitivity values were also age-corrected to a 50-year-old equivalent, which may be subtly different to deriving a normative database with age-matched healthy subjects. However, the change in sensitivity per decade at each test location was small, and the amount of age correction performed on our normative cohort was within the instrument's test-retest variability range.^{20}

*n* = 300 and *n* = 400 with the total “population” of *n* = 500). While we showed that the set size *x* was relatively unchanged at approximately 150 (instead of being a proportion relative to the total cohort size), this was still based on the assumption that *n* = 500 provided a reasonable estimate of the total normal population. True norms across a range of ages would require a much larger study with a more diverse representation of the population.

^{27}Population characteristics would also be specific to the research question being asked. However, we also compared the VF sensitivity parameters of our cohorts with that of other published studies from different geographic locations, and found no significant difference (Table 2, one-way ANOVA, excluding the sets where SD was not reported: *F*_{1,1} = 0.646, *P* = 0.5690).

^{5,28–30}This indicated that our cohort was likely to be robust and representative of a general, diverse population.

**J. Phu**, None; **B.V. Bui**, None; **M. Kalloniatis**, None; **S.K. Khuu**, None

1. *Proceedings of the 6th International Perimetric Society Meeting*. Dordrecht: Junk Publishers; 1985: 77–84.
2. *Am J Ophthalmol*. 2006; 141: 24–30.
3. *PLoS One*. 2016; 11: e0158263.
4. *Ophthalmic Physiol Opt*. 2017; 37: 160–176.
5. *Invest Ophthalmol Vis Sci*. 2013; 54: 1345–1351.
6. *Invest Ophthalmol Vis Sci*. 1997; 38: 426–435.
7. *Arch Ophthalmol*. 2010; 128: 570–576.
8. *Br J Ophthalmol*. 1990; 74: 289–293.
9. *Arch Ophthalmol*. 1992; 110: 812–819.
10. *Arch Ophthalmol*. 1987; 105: 1544–1549.
11. *Diabetes Educ*. 2010; 36: 701–707.
12. *Clin Transl Sci*. 2014; 7: 342–346.
13. *Invest Ophthalmol Vis Sci*. 2017; 58: 4863–4876.
14. *Acta Ophthalmol Scand*. 1997; 75: 368–375.
15. *Invest Ophthalmol Vis Sci*. 2000; 41: 1774–1782.
16. *Acta Ophthalmol Scand*. 1998; 76: 165–169.
17. *Eur J Ophthalmol*. 2005; 15: 209–212.
18. *Bootstrap Methods: A Practitioner's Guide*. New York: Wiley; 1999.
19. *BMC Bioinformatics*. 2006; 7: 123.
20. *Invest Ophthalmol Vis Sci*. 2002; 43: 2654–2659.
21. *Ophthalmic Physiol Opt*. 2011; 31: 123–136.
22. *Transl Vis Sci Technol*. 2015; 4 (2): 10.
23. *Transl Vis Sci Technol*. 2014; 3 (5): 6.
24. *Invest Ophthalmol Vis Sci*. 2013; 54: 756–761.
25. *PLoS One*. 2014; 9: e98525.
26. *Invest Ophthalmol Vis Sci*. 2001; 42: 1411–1413.
27. *Fam Med*. 1989; 21: 58–60.
28. *Acta Ophthalmol Scand*. 1999; 77: 125–129.
29. *Graefes Arch Clin Exp Ophthalmol*. 1999; 237: 29–34.
30. *Invest Ophthalmol Vis Sci*. 1999; 40: 1152–1161.