**Purpose**:
To describe a new stereotest in the form of a game on an autostereoscopic tablet computer designed to be suitable for use in the eye clinic and present data on its reliability and the distribution of stereo thresholds in adults.

**Methods**:
Test stimuli were four dynamic random-dot stereograms, one of which contained a disparate target. Feedback was given after each trial presentation. A Bayesian adaptive staircase adjusted target disparity. Threshold was estimated from the mean of the posterior distribution after 20 responses. Viewing distance was monitored via a forehead sticker viewed by the tablet's front camera, and screen parallax was adjusted dynamically so as to achieve the desired retinal disparity.

**Results**:
The tablet must be viewed at a distance of greater than ∼35 cm to produce a good depth percept. Log thresholds were roughly normally distributed with a mean of 1.75 log_{10} arcsec = 56 arcsec and SD of 0.34 log_{10} arcsec = a factor of 2.2. The standard deviation agrees with previous studies, but ASTEROID thresholds are approximately 1.5 times higher than a similar stereotest on stereoscopic 3D TV or on Randot Preschool stereotests. Pearson correlation between successive tests in the same observer was 0.80. Bland-Altman 95% limits of agreement were ±0.64 log_{10} arcsec = a factor of 4.3, corresponding to an SD of 0.32 log_{10} arcsec on individual threshold estimates. This is similar to other stereotests and close to the statistical limit for 20 responses.

**Conclusions**:
ASTEROID is reliable, easy, and portable and thus well-suited for clinical stereoacuity measurements.

**Translational Relevance**:
New 3D digital technology means that research-quality psychophysical measurement of stereoacuity is now feasible in the clinic.

^{1} Several clinical stereotests exist, including the Randot, Randot Preschool, Frisby, TNO, Titmus, and Lang stereotests.^{2,3} All of these share certain disadvantages: (1) consisting of cards or plates, they offer only a number of discrete levels; (2) they admit monocular cues, especially if the head is moved or tilted^{4,5}; (3) there is a nonnegligible chance of passing a level by guessing.^{3} Given these limitations, these tests are not used to assess stereoacuity in nonclinical vision research. Instead, for many decades it has been standard to use computers to present arbitrary stimuli, run adaptive techniques such as Bayesian staircases, and/or fit psychometric functions to data.^{6–8,9} For these reasons, over the last few years several groups have proposed computerized stereotests that aim to bring laboratory-quality psychophysics to the clinic.

^{10} uses 3D shutter glasses to present a stereotest consisting of a two-alternative disparity-discrimination task of a line stereogram. They used a PEST (parameter estimation by sequential testing)^{11} staircase to estimate stereo threshold from 100 presentations. Hwang and colleagues,^{12–14} from Seoul National University College of Medicine, also used shutter glasses to present a stereotest consisting of a four-alternative disparity-detection task in a random-dot pattern. They used a classic clinical paradigm in which at least two out of three presentations at a given level must be answered correctly in order to progress. Breyer et al.^{15} used an autostereoscopic monitor along with eye tracking to measure whether children successfully fixated a disparate random-dot target at one of four possible locations. This was intended as a test for the presence of stereovision and did not obtain a threshold measurement; the target had a constant disparity of 1200 arcsec. Hess et al.^{16} present a stereotest on an iPod using red-green anaglyph glasses, using a two-alternative disparity-discrimination task with a random-dot pattern. Threshold estimates were obtained with a staircase procedure. None of these tests are as yet commercially available, and although code for some of them is publicly available, to our knowledge none have yet been used by a laboratory other than their originators' in a scientific publication. This may be because none of the tests combine ease of use for patients (e.g., in game format, not requiring glasses) and clinicians (easy to run, no specialist knowledge of displays or set up procedure required). Thus, while there is widespread awareness of the shortcomings of current clinical stereotests, the same old-style stereotests continue to be used in the clinic.

^{17} The computer controls the disparity via a Bayesian adaptive staircase, enabling a threshold to be obtained with high statistical efficiency. The front camera is used to monitor the viewing distance and correct the stimulus for changes in viewing distance, meaning that patients are free to hold the device how they wish rather than requiring clinician intervention to maintain the correct distance. Most importantly, the stereotest task is embedded in a fun game that uses colors, sounds, and animation to keep children engaged and responsive.^{18} We believe that this combination of features offers unique advantages, which are not achieved either by current clinical stereotests or the research tests described in the previous paragraphs.

^{4}, where *D*_{I} refers to the parallax in physical pixels of the screen and *D*_{H} refers to parallax in pixels of the half-images we seek to depict, as discussed by Serrano-Pedraza et al.^{4} These are related by Equation 1 (also, equation 1 of Serrano-Pedraza et al.^{4})

*D*_{H} = 0, the parallax is *D*_{I} = 1 physical pixel (Fig. 2A).

*p* is the size of a physical pixel in centimeters (Fig. 2). The geometrical distance from the viewer of a virtual object with parallax *P* is

*I* is the interocular distance and *V* is the viewing distance, that is, the distance from the viewer to the screen, marked in Figure 2B.

*Δ* arcsec, the H parallax must be

*D*_{H} is screen parallax in H-pixels, 2*p* is the width of one H-pixel (i.e., twice the width *p* of one physical I-pixel on the screen), and *V* is the viewing distance in the same units as *p*. The term 1/3600 converts from arcseconds to degrees and π/180 converts from degrees to radians.

*D*_{H} changes by 1, *D*_{I} changes by 2). This corresponds to an angular disparity of 2*p*/*V* radians. This is the minimum relative disparity between the target and the background that can be depicted without special techniques to achieve subpixel disparity (compare Fig. 2A versus Fig. 2B, where disparity differs by a single H-pixel). On the Commander 3D, each pixel has a physical width of *p* = 0.114 mm and therefore subtends an angle of 0.0164° = 59 arcsec at a viewing distance of 40 cm. The minimum whole-pixel relative disparity is therefore 118 arcsec at 40 cm. Note that this is twice what one might imagine based on the screen resolution without the additional complication of column interleaving.
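The conversion between angular disparity and H-pixel parallax can be checked numerically. The following is a minimal sketch, assuming the small-angle relation implied by the definitions above (angular disparity in radians = *D*_{H} × 2*p*/*V*); the function names are illustrative and are not taken from the ASTEROID software.

```python
import math

# Arcseconds in one radian: 3600 arcsec per degree, 180/pi degrees per radian.
ARCSEC_PER_RADIAN = 3600 * 180 / math.pi


def h_parallax_pixels(disparity_arcsec, viewing_distance_cm, pixel_width_cm):
    """H-pixel parallax needed to depict a given disparity (small-angle assumption)."""
    disparity_rad = disparity_arcsec / ARCSEC_PER_RADIAN
    # One H-pixel is two physical pixels wide, hence the factor 2 * p.
    return disparity_rad * viewing_distance_cm / (2 * pixel_width_cm)


def min_whole_pixel_disparity_arcsec(viewing_distance_cm, pixel_width_cm):
    """Angular disparity corresponding to a change of one H-pixel in parallax."""
    return (2 * pixel_width_cm / viewing_distance_cm) * ARCSEC_PER_RADIAN


P_CM = 0.0114  # physical pixel width quoted in the text (0.114 mm)
one_pixel = min_whole_pixel_disparity_arcsec(40, P_CM)
print(round(one_pixel))  # 118, matching the value quoted for 40 cm
```

Because the minimum step scales as 1/*V*, doubling the viewing distance halves the smallest whole-pixel disparity under these assumptions.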

^{19} The dots were made colored rather than white simply to make the stimulus more attractive. Since the luminance pattern alone unambiguously defined disparity, there is no reason to expect the color to affect stereo thresholds.^{20}

*Δ* between the target and the background. In order to avoid monocular cues to the location of the target, this was always applied symmetrically; that is, the background had disparity *Δ*/2 and the target had disparity −*Δ*/2, as indicated in Figure 3. Serrano-Pedraza et al.^{4} provide an in-depth discussion of this and other techniques we used to eliminate monocular artifacts in this stimulus.

^{4}The stimulus continued until a response was made.

*Δ* between the target and background, we first compute the desired screen parallax in pixels of each eye's half-image (H-pixels), *D*_{H}, according to Equation 4. We scatter dots uniformly across each patch. To apply the background parallax of +*D*_{H}/2, we add *D*_{H}/4 to the *x*-coordinate of each dot in the right eye and subtract *D*_{H}/4 from the *x*-coordinate of each dot in the left eye. In the same way, we generate a patch of dots for the target, but this time we subtract *D*_{H}/4 from the *x*-coordinate of each dot in the right eye and add (*D*_{H}/4 − 1) to the *x*-coordinate of each target dot in the left eye. The reason for the additional 1-pixel leftward shift in the left eye is to avoid monocular artifacts due to the column interleaving, as explained by Serrano-Pedraza et al.^{4} Before we draw the dots, we remove any background dots that would be occluded by the target, taking into account the shift of the target in each eye.
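The per-eye shifts described above can be tabulated as follows. This is only an illustrative sketch: the function names are hypothetical, and it assumes the convention that parallax is the right-eye *x*-offset minus the left-eye *x*-offset, in H-pixels.

```python
def dot_offsets(d_h):
    """x-offsets (in H-pixels) applied to dots, keyed by (surface, eye)."""
    quarter = d_h / 4
    return {
        ("background", "right"): +quarter,
        ("background", "left"): -quarter,
        ("target", "right"): -quarter,
        # Extra 1-pixel leftward shift in the left eye (column interleaving).
        ("target", "left"): quarter - 1,
    }


def parallax(offsets, surface):
    """Right-eye offset minus left-eye offset for one surface (assumed convention)."""
    return offsets[(surface, "right")] - offsets[(surface, "left")]


offsets = dot_offsets(8)
print(parallax(offsets, "background"))  # 4.0, i.e. +D_H/2
print(parallax(offsets, "target"))      # -3.0, i.e. -D_H/2 plus the 1-pixel shift
```

The nominal target parallax differs from −*D*_{H}/2 by exactly the one-pixel compensation described in the text.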

*whole* dot from a half-image whenever its center falls within the occluded region. Geometrically, this means that each individual dot is either entirely on the front surface or the back surface, and thus the edges of our target surface are slightly ragged, varying by up to one dot width.

^{4} Antialiasing was used in ASTEROID versions up to 0.933 and again from version 1.00. Both approaches have their pros and cons. Antialiasing is more accurate when the point-spread function of the eye is large enough relative to a pixel, since it then becomes indistinguishable from the desired stimulus.^{21} However, this requires accurate luminance linearization. This would be hard to achieve on individual tablets, especially given that users might alter the display brightness and contrast, and we did not attempt it. Dithering is less accurate but potentially more robust. In practice, we did not find a significant difference between these two approaches, given the other sources of error discussed in the results. We have therefore combined results from all versions in this paper.

*a trial*. We refer to a sequence of several trials, resulting in a threshold estimate, as *a test*.

^{25}of the target was reduced, making it more transparent, until after four correct answers it vanished and the target could be detected only via its stereoscopic disparity. Usually, all subsequent trials would be stereo only. However, as described below, the nonstereo cue could reappear in subsequent trials if the threshold estimate suggested that the participant was stereoblind. The aim was to keep performance at 75% correct, regardless of the participant's stereoacuity.

*ε* is the noise on that trial, a random variable drawn from *P*_{noise}, and *b* is the inverse of the noise amplitude. Higher values of *b* indicate lower noise and therefore greater sensitivity. Our model assumes that the target is detected if and only if this noisy signal exceeds an internal detection threshold *A*,^{22,23} that is, the noise *ε* exceeds *b*(*A* − log_{10} *Δ*). The probability that this occurs is

*or* when they cannot perceive the target but guess correctly. Thus, for an ideal observer, the probability of a correct answer would be Ψ = *P*_{det} + (1 − *P*_{det})*g*, where *g* is the probability of answering correctly by guessing (*g* = 0.25 on our four-alternative task). However, humans are not ideal, and therefore even if the signal is well above threshold, there is a finite probability *λ* that the observer will give the wrong answer anyway, for example, because their attention wandered or their finger slipped. Thus, we follow established practice and alter the previous expression to Ψ = (1 − *λ*)*P*_{det} + (1 − *P*_{det})*g* = *g* + (1 − *λ* − *g*)*P*_{det}.
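The expression for Ψ can be sketched in code. The detection probability below is written as a logistic function of log disparity; this specific form is an assumption made for illustration (the text refers to a logistic function elsewhere, but the equation for *P*_{det} is not reproduced here), and the function names are hypothetical.

```python
import math


def p_detect(log10_disparity, A, b):
    """Assumed logistic detection probability: rises from 0 to 1 around log10 disparity A."""
    return 1.0 / (1.0 + math.exp(-b * (log10_disparity - A)))


def prob_correct(log10_disparity, A, b, g=0.25, lam=0.03):
    """Psi = g + (1 - lam - g) * P_det, with guess rate g and lapse rate lam."""
    return g + (1.0 - lam - g) * p_detect(log10_disparity, A, b)


# Far below A performance sits at the guess rate; far above it saturates at 1 - lam.
print(round(prob_correct(-3.0, A=2.0, b=4.885), 3))  # 0.25
print(round(prob_correct(7.0, A=2.0, b=4.885), 3))   # 0.97
```

At a log disparity equal to *A*, this sketch gives Ψ = *g* + (1 − *λ* − *g*)/2, that is, 0.61 for the default values.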

*θ*, to be the value of disparity *Δ* needed to obtain a particular level of performance, Θ, on our task:

*θ*.

*θ* as

*Δ* is the disparity of the target relative to the background and *θ* is the participant's task-specific stereo threshold, in arcseconds, defined as the value of *Δ* for which the probability of selecting the target is Θ (so by definition Ψ(*θ*) = Θ). In ASTEROID, Θ is chosen to be 0.75, that is, 75% correct. We discuss and justify this choice in the Results: Simulation 1. It is important to bear this in mind when comparing ASTEROID thresholds with those obtained on other tests.

*g* (for guessing), is 0.25 in our four-alternative task. The probability of giving the wrong answer even when the stimulus is well above threshold is *λ*, conventionally called the *lapse rate*.^{24} Performance thus asymptotes at (1 − *λ*). Realistically, lapse rates might vary from as low as 0.001 to as high as 0.1 for different participants, but the precise value of the lapse rate within this range does not usually affect threshold estimates.^{24} We used *λ* = 0.03, chosen to be in the middle of the realistic range.

*b* controls how rapidly performance improves as disparity is increased. We could attempt to fit *b* for individual subjects, but this would require more trials than are clinically feasible. Thus, we fixed *b* at a value we know to be reasonable from a previous study^{25}: *b* = 4.885/log_{10} arcsec.

*Δ*= 1000 arcsec relative to the background. This disparity is highly visible to most people with normal stereo vision.

^{26}) of the cue frame is reduced by 0.14285, making it gradually more transparent. After each incorrect answer, the alpha value is increased by the same amount (capped at 1). After four trials have been answered correctly, the test moves out of the introduction and into the stereotest proper. If after 10 trials fewer than four trials have been answered correctly, we conclude that the participant is not capable of doing the task (whether because of poor understanding, motivation, or low vision), and the test terminates.

*Q*_{n}(*q*).

*Q*_{n}(*q*)d*q* represents our estimate, after *n* trials, of the probability that the participant's log threshold is *q* (or more properly, that the log threshold lies between *q* and *q* + d*q* log_{10} arcsec). We update this after each trial based on whether the participant's response on that trial was correct or incorrect.

Pr{response|*q*} represents the probability of the observed response (correct or incorrect) on the *n*th trial, given that the threshold is *q*, Pr{*q*} the probability that the log threshold is *q*, Pr{*q*|response} the probability that the log threshold is *q* given the observed response, and Pr{response} the probability of the observed response.

Ψ(*Δ*; *θ*), tells us the probability of a correct response, while (1 − Ψ(*Δ*; *θ*)) is the probability of an incorrect response, both for a threshold of *θ* = 10^{*q*}. At the start of the *n*th trial, our current estimate of Pr{*q*} is *Q*_{n−1}(*q*). After the *n*th trial, it has been updated to *Q*_{n}(*q*) = Pr{*q*|response}. Thus, Equation 10 tells us that

*Q*_{n}(*q*), is updated on successive trials. We began with a flat prior: the distribution *Q*_{0}(*q*) was set to the same value for all values of log threshold. Numerically, *Q*_{n}(*q*) was evaluated at 1000 equally spaced values from *q*_{min} = 0 to *q*_{max} = 3.56 log_{10} arcsec.
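The update of *Q*_{n}(*q*) on a discrete grid can be sketched as below. This is an illustration under stated assumptions rather than the ASTEROID implementation: the psychometric function is taken to be logistic in log disparity and parameterized so that Ψ equals Θ = 0.75 when the disparity equals the threshold, and all names are hypothetical.

```python
import math

B, G, LAM, THETA_PERF = 4.885, 0.25, 0.03, 0.75
N_GRID, Q_MIN, Q_MAX = 1000, 0.0, 3.56

# 1000 equally spaced candidate log thresholds, as described in the text.
GRID = [Q_MIN + i * (Q_MAX - Q_MIN) / (N_GRID - 1) for i in range(N_GRID)]

# Detection probability required at threshold, and the matching logistic offset.
_P_STAR = (THETA_PERF - G) / (1 - LAM - G)
_OFFSET = math.log(_P_STAR / (1 - _P_STAR)) / B


def psi(log_disparity, q):
    """Assumed probability correct at log_disparity for an observer with log threshold q."""
    p_det = 1.0 / (1.0 + math.exp(-B * (log_disparity - (q - _OFFSET))))
    return G + (1 - LAM - G) * p_det


def update(posterior, log_disparity, correct):
    """One Bayesian step: multiply by the likelihood of the response, then renormalize."""
    unnorm = []
    for q, prob in zip(GRID, posterior):
        p_correct = psi(log_disparity, q)
        unnorm.append(prob * (p_correct if correct else 1 - p_correct))
    total = sum(unnorm)
    return [u / total for u in unnorm]


def posterior_mean(posterior):
    return sum(q * p for q, p in zip(GRID, posterior))


flat_prior = [1.0 / N_GRID] * N_GRID
after_correct = update(flat_prior, 1.5, correct=True)
after_wrong = update(flat_prior, 1.5, correct=False)
print(posterior_mean(after_correct) < posterior_mean(flat_prior) < posterior_mean(after_wrong))
```

A correct response shifts probability mass toward lower thresholds and an incorrect response toward higher ones, so the comparison printed at the end is true.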

*Q*_{n}(*q*) as an estimate of the standard deviation of the sampling distribution of the threshold, that is, the standard error on the threshold estimate.

*Q*_{n}(*q*) as described above (Equation 11).

*Q*_{n}(*q*), as recommended by King-Smith et al.^{27} The staircase terminated once 20 trials had been completed, not including trials during the initial introductory phase.

^{28–32} and the proportion will typically be higher in eye clinics. It is therefore important to ensure that the stereotest does not discourage stereoblind participants. For this reason, if the current threshold estimate exceeded 1200 arcsec, not only was the disparity capped at this value, but the nonstereo cue used in the initial practice trials was also restored at full opacity. This served two purposes. First, it ensured that stereoblind participants could also perform well on the task. Second, it provided us with “catch trials,” enabling us to distinguish issues with cooperation or understanding from problems with stereovision; that is, if a participant scores at chance on the stereo trials but perfectly on trials with a nonstereo cue, then we conclude that they are having particular problems detecting the disparity. But if a participant is at chance on the cued trials also, we conclude that they have not understood the task or are not motivated to perform it (e.g., they are a small child who is enjoying tapping the screen at random). ASTEROID thresholds above 1000 arcsec are not meaningful but indicate a participant performing at chance. In the figures, all thresholds above 1000 arcsec were replaced with a notional value of 1000 arcsec.

^{33}

*θ* corresponding to the stereo threshold of the model observer. In general, the parameters of the model observer (their lapse rate *λ*, the steepness of their psychometric function as governed by *b*) could be different from the values *λ* = 0.03, *b* = 4.885/log_{10} arcsec assumed by ASTEROID; values are specified in the Results section.

_{10} stereo threshold. The Bland-Altman 95% limits of agreement are defined as the 95% confidence interval on this difference. Since the differences are close to normally distributed, this spans ±1.96 times the standard deviation of the differences, *s*. That is, one can be 95% sure that the absolute difference between two log threshold estimates will be smaller than 1.96*s*. This corresponds to a factor of 10^{1.96*s*} in the stereo thresholds themselves.

^{9} Both defined *threshold* as the disparity needed to reach a performance of 75% correct. The problem is that for a 4AFC task, more signal is required to reach 75% than for a 2AFC task. The 2AFC thresholds will therefore be lower, even if everything else is identical. To quantify this, we rearrange Equation 8 into

*θ*_{2}, obtained with *g*_{2} = 0.5, and *θ*_{4}, obtained with *g*_{4} = 0.25, for the same value of Θ = 75%. We assume that *A* and *b* remain the same in both cases. This assumption differs from a common way of modeling *m*-alternatives in the literature.^{34,35} That model assumes that an ideal observer selects the target if and only if the noisy signal, drawn from *ϕ*(*x* − *d*′), exceeds all (*m* − 1) samples drawn from noise distributions *ϕ*(*x*). This model also generates a psychometric function that rises from *g* = 1/*m* to optimal, but the slope of the psychometric function is nonzero at *d*′ = 0. This is because with no internal threshold even vanishingly small amounts of signal help push performance above chance. In contrast, our model assumes that all subthreshold signals are equivalent and undetectable. This implies a psychometric function that is initially flat before rising from *g* = 1/*m* to optimal, which accords better with the psychometric functions actually observed, at least on a disparity-detection task. Additionally, the threshold-free model assumes that observers compare all *m* values and pick the largest, whereas our model assumes that the target is detected if and only if the noisy signal exceeds the internal threshold, without the need for comparison. This better agrees with empirical evidence that the time taken to complete a trial grows only slowly with the number of alternatives, as if the target usually “pops out.”^{17} While these details are not critical to our argument, this explains why we assume here that *A* and *b* in Equation 8 are independent of the number of alternatives, which would not be the case in a model without a threshold.

*λ* does change. Recall that *λ* is the probability of answering incorrectly even for a stimulus well above threshold. If we assume a fixed probability *λ*^{*} of ignoring the signal, then *λ* = *λ*^{*}(1 − *g*).

*λ* = 0.03 for *g*_{4} = 0.25, which implies *λ*^{*} = 0.04. With Θ = 75% and *b* = 4.885/log_{10} arcsec, the right-hand side of this equation evaluates to 0.151. That is, with our assumptions about the observer, log thresholds on a 2AFC task should be lower by 0.151 log_{10} arcsec than log thresholds on the equivalent 4AFC task, corresponding to a factor of around 1.4 for thresholds in arcseconds.

*threshold*. The sweat factor or sweet point is the stimulus strength that minimizes the variability of the psychometric function for a given number of trials.^{36–38} From this stimulus level, we can obtain the performance or proportion of correct responses of the psychometric function. For our 4AFC and the logistic function, this performance is 68% (without lapses) and a little lower, 66%, with *λ* = 0.03 (values estimated by minimizing Equation A1 from Shen and Richards^{39}). This is a low performance level for a clinical test aimed at children. For comparison, the Randot stereotest targets performance around 78% correct, whereas the TNO stereotest targets around 85%.^{40} If we chose Θ = 66%, the staircase would be aiming to present disparities where the patient is wrong about a third of the time. We were concerned that this would be demotivating in a clinical test aimed at children. Yet, we were also keen to maximize precision and thus reliability. We therefore carried out simulations to assess the effect of different choices for Θ, the performance level defined as threshold.

*b*_{M}. Unsurprisingly, thresholds are obtained with greatest precision (error bars are lower) for the observer with the highest slope (C: *b*_{M} = 14.654). As a function of Θ, regardless of observer slope, the lowest biases and standard deviations are obtained near Θ = 66%. The standard deviation becomes large for very low or very high Θ, and when the model slope is not as assumed by the staircase (A, C), biases are also possible for these extremes. However, across the range from Θ = 50% to 80%, there is in fact very little difference in the quality of the threshold estimates. We therefore chose a definition of threshold performance level toward the upper end of this range, Θ = 75%.^{41} This is close to threshold performance levels targeted by current stereotests. It means that patients using ASTEROID will find they are correct around three times out of four, even toward the end of the test, helping to prevent discouragement. All subsequent results in this paper are for Θ = 75%.

^{42}Therefore, in a clinical context, thresholds must be estimated from very few trials. ASTEROID uses four practice trials, with a nonstereo cue, and stereo thresholds are estimated from the subsequent 20 stereo trials. This small number of trials strongly limits the test-retest reliability achievable.

^{43,44} for the same model observers. This is defined such that one can be 95% confident that the absolute difference between two observations will be less than the 95% limit of agreement (see Methods). This means that if the first observation is *q*, then the second could be as high as *q* + limit of agreement or as low as *q* − limit of agreement. For stereoacuity expressed in log arcseconds, the limit of agreement has units log_{10} arcseconds. If we express stereoacuity in arcseconds, the limit of agreement becomes a factor,^{44} as shown on the right-hand vertical axis. Limits of agreement are slightly smaller for observers with larger thresholds, reflecting the fact that ASTEROID will not present disparities larger than 1200 arcsec (see Methods, Disparity Cap). The black line shows results for an observer matching the ASTEROID assumptions; that is, their lapse rate *λ*_{M} = 0.03 and slope parameter *b*_{M} = 4.885/log_{10} arcsec. Better reliability, that is, tighter limits, is obtained for the model observer shown in red, who has a steeper psychometric function, *b*_{M} = 14.654/log_{10} arcsec. This is because a threshold is more tightly defined when the psychometric function rises more steeply from chance to optimal, as we saw in Figure 7. Conversely, reliability is worse for the model observer shown in blue, where *b*_{M} = 2.931/log_{10} arcsec. The green curve shows an observer with *b*_{M} = 4.885/log_{10} arcsec, but a high lapse rate of 0.1. This might describe a small child who is regularly distracted. Here, limits of agreement are unsurprisingly wider.

_{10} arcsec or a factor of 3 (∼10^{0.5}). That is, with only 20 trials, a second threshold estimate may be a factor of 3 higher or lower than the first threshold estimate, simply due to the stochastic nature of psychophysical judgments. This places a fundamental limit on the test-retest reliability we can expect from ASTEROID.

_{10} arcsec (103 arcsec) and the SD was 0.21 log_{10} arcsec. In each of the individual threshold measurements, the posterior distribution for the log threshold after 20 trials generally resembles a Gaussian. We therefore estimated the standard deviation of each individual threshold as half the distance between the 84th and 16th percentiles of the posterior. The mean SD of the 24 posterior distributions for ZC is 0.17 log_{10} arcsec, with rather little variation (SD of the SD estimates = 0.02 log_{10} arcsec), slightly smaller than the overall standard deviation of the thresholds (0.21 log_{10} arcsec). For observer HA, these numbers were *n* = 55 thresholds; mean = 1.2 log_{10} arcsec (16 arcsec), SD = 0.43 log_{10} arcsec; the standard deviation estimated from staircases was 0.19 ± 0.05 log_{10} arcsec. Thus, observer ZC was nearly as consistent as possible given the staircase; observer HA showed more variability.

*P* < 10^{−7}) and Spearman coefficient is 0.63 (*P* < 10^{−4}). For comparison, Bosten et al.^{31} report a Spearman test-retest correlation coefficient of 0.67 on a laboratory stereotest using 50 trials. Thus, ASTEROID's value of 0.63 for 20 stereo trials stands up well.

^{43} The vertical axis shows the difference between the results of two tests on the same observer. The horizontal axis shows the mean result. The horizontal dotted line shows the mean of all 40 differences. This is not significantly different from zero (paired *t*-test on log thresholds, *P* = 0.57), meaning that there is no evidence for systematic changes, for example, due to practice or fatigue. The horizontal dashed lines show the Bland-Altman 95% limits of agreement, equal to ±1.96 times the SD of the differences. This is equivalent to the colored lines in Figure 8B, except there we folded the plot and showed only the upper value. The 95% limit of agreement is 0.64, corresponding to a factor of 4.3 (that is, the range is ±0.64 log_{10} arcsec, which corresponds to multiplying or dividing the threshold in arcseconds by a factor of 4.3). This is comparable to the values obtained from our simulations (Fig. 8B), confirming that with real observers the test achieves close to the maximum reliability permitted by the staircase procedure given the small number of trials.

_{10} arcsec for these 35 participants, very close to the values obtained for observers ZC and HA in the previous section. Observers ZC and HA each performed >20 threshold estimates, so we were able to compare the theoretical standard deviation estimated from the staircase with the empirical standard deviation from multiple measurements in the same observer. In this section, each of the 35 observers carried out only two measurements, which does not enable an accurate measurement of standard deviation in individuals. However, if we assume that all observers have the same standard deviation, we can compare this empirical population standard deviation with the staircase estimate. Assume that each observer has a unique threshold, *q*, and that each time we try to measure this, we get a value drawn from a normal distribution with mean *q* and SD σ (all in log_{10} arcseconds). Then, the difference between two measurements for the same observer is also drawn from a normal distribution, with mean 0 and SD σ√2. Our staircase estimates that σ = 0.18 log_{10} arcsec, so σ√2 = 0.25 log_{10} arcsec. Empirically, the standard deviation of the difference between two threshold measurements in the same observer (Fig. 10B) was 0.32 log_{10} arcsec, slightly larger but close to the value predicted by the staircase. This provides further confirmation that the statistics observed with human participants are as expected from simulations.
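The arithmetic in this paragraph, together with the Bland-Altman limits used throughout, can be sketched as follows. The helper name and the four pairs of log thresholds are made up purely for illustration and are not data from the study.

```python
import math
import statistics


def limits_of_agreement(first, second):
    """Mean difference, 95% half-width (1.96 * SD of differences), and equivalent factor."""
    diffs = [a - b for a, b in zip(first, second)]
    mean_diff = statistics.mean(diffs)
    half_width = 1.96 * statistics.stdev(diffs)
    # A half-width in log10 units corresponds to a multiplicative factor in arcsec.
    return mean_diff, half_width, 10 ** half_width


# SD of the difference of two independent measurements, each with SD sigma.
sigma = 0.18
print(round(sigma * math.sqrt(2), 2))  # 0.25

# Hypothetical log10 thresholds (test and retest) for four observers.
test = [1.5, 1.8, 2.0, 1.6]
retest = [1.7, 1.6, 2.1, 1.5]
mean_diff, half_width, factor = limits_of_agreement(test, retest)
print(mean_diff, half_width, factor)
```

Expressing the half-width as a factor makes it directly comparable with thresholds quoted in arcseconds.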

*P* = 0.03, paired sample *t*-test on log thresholds). The correlation between scores on the two tests was not significant, partly because of the small number of participants and partly because their thresholds were all very similar (all under 100 arcsec).

_{10} arcsec or a factor of 1.7. This is for a comparison between two different tests, so the test-retest agreement of ASTEROID with itself must be at least as good. Thus, while we found previously that the limits of agreement corresponded to a factor of 4.3 for a single ASTEROID test (20 presentations), this is reduced to 1.7 if one takes the geometric mean of three ASTEROID thresholds (60 presentations).

^{9} The stereotest used in the previous study was similar to ASTEROID but had two major differences: (1) it was presented on the same stereoscopic 3D TV, viewed at 200 cm, instead of on an autostereo tablet, and (2) it used a Bayesian staircase with 35 trials instead of 20. Figure 12 compares the distribution of stereo thresholds from 91 adult participants in that study (SP2016, blue) with that from a different group of 74 adult participants tested on ASTEROID (red). Participant demographics were SP2016: *n* = 91, age range 18 to 73 years, mean 31, SD 17; ASTEROID: *n* = 74; 49 female, 25 male; age range 18 to 79 years, mean 26, SD 12.

^{9}used a two-alternative version of the task but still defined threshold as a performance of 75% correct. On a two-alternative task, chance is 50% correct, so less signal is required to reach 75% correct. We showed in the Methods (Correcting Threshold for Tasks with Different Numbers of Alternatives) that this means 2AFC thresholds should be lower by a factor of around 1.4: an observer who scored 100 arcsec in SP2016 should score 140 arcsec on ASTEROID. We therefore multiplied all the SP2016 thresholds by 1.4 before plotting them in Figure 12.

_{10} arcsec for SP2016 and 0.34 log_{10} arcsec for ASTEROID, corresponding to a factor of 2.2 (that is, a range of ±1 SD corresponds to multiplying or dividing the threshold in arcseconds by a factor of 2.2). The fact that the between-subjects standard deviations of the two samples are so similar, and in fact slightly smaller for ASTEROID, also implies that the within-subject “noise” on the measurement must be similar between the two tests, that is, the decrease from 35 to 20 presentations has not impacted reliability substantially.

*p* < 10^{−6}, *t* = −5.4, Welch two-sample *t*-test on log thresholds below 1000 arcsec), even after the correction for 2AFC versus 4AFC. In SP2016, the mean was 1.44 log_{10} arcsec, corresponding to 27 arcsec (median 26 arcsec), whereas on ASTEROID it was 1.75 log_{10} arcsec, corresponding to 56 arcsec (median 47 arcsec). That is, stereo thresholds were around twice as high for the sample tested on the 3D tablet as compared to a different sample tested on the stereoscopic 3D TV.

*r* = 0.37, *P* = 0.01, Pearson product-moment correlation on log thresholds). The mean score for non-stereoblind participants is higher on ASTEROID: 70 arcsec compared to 51 arcsec for the Randot Preschool stereotest (mean of log thresholds), a ratio of 1.38. This is similar to the ratio estimated in experiment 4 comparing ASTEROID to the 3D TV test.

^{9,16,31,46} with standard deviation variously reported as 0.23 log_{10} arcsec^{46} or 0.37 log_{10} arcsec^{9} for non-stereoblind observers. The standard deviation for non-stereoblind observers with ASTEROID in Figure 13 is 0.35 log_{10} arcsec, in line with these estimates. Thus, it is likely ASTEROID has correctly captured the variation within our sample of observers. In the Randot Preschool stereotest, most of this variation is collapsed to a single score of 40 arcsec, obscuring genuine differences in stereoacuity between observers.

_{10}arcsec, corresponding to a factor of 4.3. This includes other sources of error, plus variations in the observer's state (concentration, etc.) as well as stochastics. However, the fact that the number agrees so well with the simulations indicates that stochastics are the major contributor.

*b* of 4.885/log_{10} arcsec (see Equation 9). This means that a 4.3-fold change in disparity around threshold changes performance from 37% to 85% correct (Fig. 14). Thus, in 20 presentations ASTEROID is able to find the disparity where performance is in this range, but more presentations would be required to narrow this down further.
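The relationship between slope and performance range can be illustrated with a short Python sketch. It rests on assumptions that are not stated in this section: a logistic psychometric function in log disparity, a guess rate of 0.25 (chance with four alternatives), and a lapse rate of 0.03. Equation 9 is not reproduced here and may differ in form; only the slope *b* = 4.885 per log_{10} arcsec and the 0.64 log_{10} arcsec span (a factor of about 4.3) come from the text.

```python
import math

SLOPE_B = 4.885     # per log10 arcsec, as quoted in the text
GUESS_RATE = 0.25   # assumption: chance level for a four-alternative task
LAPSE_RATE = 0.03   # assumption: illustrative value, not taken from the text


def prob_correct(log_disparity: float, log_threshold: float) -> float:
    """Assumed logistic psychometric function of log10 disparity."""
    core = 1.0 / (1.0 + math.exp(-SLOPE_B * (log_disparity - log_threshold)))
    return GUESS_RATE + (1.0 - GUESS_RATE - LAPSE_RATE) * core


log_threshold = math.log10(57.0)  # mean ASTEROID threshold in arcsec
half_span = 0.64 / 2.0            # half of the 0.64 log10 arcsec span

p_low = prob_correct(log_threshold - half_span, log_threshold)
p_high = prob_correct(log_threshold + half_span, log_threshold)
print(f"performance ranges from {p_low:.0%} to {p_high:.0%} correct")
```

Under these assumptions, the two ends of the span land close to the 37% and 85% correct quoted above, which is why 20 presentations can locate this range but not resolve the threshold much more finely.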

Adams et al.^{44} report that the 95% limits of agreement with adults are 0.57 log arcsec for the Randot Preschool stereotest, corresponding to a factor of 3.7. However, these figures cannot be directly compared, since current tests return only discrete scores. Successive scores of 80 and 140 arcsec on ASTEROID would be classed as different, whereas on the Randot Preschool stereotest these would both be 100 arcsec and thus agree perfectly. If we quantize ASTEROID threshold estimates to the closest (in log space) available score on the Randot Preschool stereotest and then repeat the Bland-Altman analysis of Figure 10, we find that the 95% limits of agreement for ASTEROID are now 0.56 log_{10} arcsec or a factor of 3.6, almost exactly the same as Adams et al.^{44} found for the Randot Preschool stereotest. Ma et al.^{12} reported high test-retest reliability for their computerized distance stereoacuity test: 95% limits of agreement 0.29 log_{10} arcsec, or a factor of 1.9. However, this is because their participants were drawn from an eye clinic and many were strabismic, so very few could perform the test at all: of the 81 participants, 79 scored “nil” (stereoblind) on the first test and 69 on the retest. In adults with normal binocular vision, this group reports the 95% limits of agreement as 0.47 log arcsec or a factor of 3.0.^{13} From their staircase statistics, Hess et al.^{16} estimate 90% limits of agreement as a factor of 1.9 (they use *Z*-score boundaries of 1.65, not 1.96 as for the 95% limit). However, this tight predicted agreement is not borne out empirically. In their Bland-Altman plot, the 95% limits of agreement are 0.75 log_{10} arcsec or a factor of 5.6. This is a little higher than ASTEROID, even though each of their measurements represents at least twice as many trials (40–60 trials, compared to 20 for ASTEROID). The greater number of trials required to achieve similar reliability probably reflects their two-alternative task, since each presentation of a two-alternative task conveys half as much information as for a four-alternative task. Like us, Hess et al.^{16} find a 0.79 correlation between two tests on the same observer (Fig. 9A).
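The quantization step described above can be sketched in Python as follows. The list of Randot Preschool score levels (40 to 800 arcsec) is an assumption based on the commonly listed levels of that test, the paired thresholds are invented purely for illustration, and the function names are hypothetical. The limits of agreement are taken here as 1.96 times the standard deviation of the test-retest differences in log_{10} arcsec, which is the usual Bland-Altman convention and may not match the exact analysis of Figure 10.

```python
import math
import statistics

# Assumption: discrete score levels of the Randot Preschool stereotest, in arcsec.
RANDOT_LEVELS_ARCSEC = (40, 60, 100, 200, 400, 800)


def quantize_log(threshold_arcsec: float, levels=RANDOT_LEVELS_ARCSEC) -> int:
    """Snap a threshold to the nearest available level, measured in log space."""
    target = math.log10(threshold_arcsec)
    return min(levels, key=lambda level: abs(math.log10(level) - target))


def limits_of_agreement(first, second) -> float:
    """Half-width of the 95% limits of agreement, in log10 arcsec."""
    diffs = [math.log10(a) - math.log10(b) for a, b in zip(first, second)]
    return 1.96 * statistics.stdev(diffs)


# Hypothetical test and retest thresholds (arcsec) for five observers.
test = [80, 35, 120, 60, 250]
retest = [140, 50, 90, 45, 400]

raw_limits = limits_of_agreement(test, retest)
quantized_limits = limits_of_agreement(
    [quantize_log(t) for t in test], [quantize_log(t) for t in retest]
)
print(f"raw: {raw_limits:.2f}, quantized: {quantized_limits:.2f} log10 arcsec")
```

With these assumed levels, thresholds of 80 and 140 arcsec both snap to 100 arcsec, matching the example given in the text.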

^{9} (Fig. 11) or obtained with the Randot Preschool stereotest (Fig. 12). Hess et al.^{16} report a modal non-stereoblind threshold of around 1.4 log_{10} arcsec or 25 arcsec, again about a factor of 2 lower than ASTEROID. The difference with the Randot Preschool stereotest is not surprising given the very different stimuli. The Randot Preschool stereotest is static and potentially contains monocular cues if participants tilt their head.^{4}

^{9} Although the large dots do not impose a Nyquist limit on spatial stereoresolution,^{47} since the stimulus is dynamic, we consider it plausible that they could contribute to higher thresholds. These reasons for differences compared with other tests are not concerning. Clinically, stereo thresholds need to be compared to population norms or to thresholds obtained from the same patient with the same test at a different time, for example before and after amblyopia treatment. Thus, differences due to stimulus properties are not important provided that the test norms are known.

^{48,49} As discussed in experiment 1, cross talk is severe when the device is held too close. Even at appropriate viewing distances, cross talk can still occur if the observer tilts the device to left or right. Thus, with the tablet, we depended on participants to hold the tablet correctly so as to eliminate cross talk, whereas in Serrano-Pedraza et al.,^{9} adult participants used a headrest to ensure that their viewing position was correct. Other differences include the pixel resolution of the Commander 3D tablet used for ASTEROID. The minimum relative disparity depictable in whole pixels on the tablet is 118 arcsec at a viewing distance of 40 cm compared to 54 arcsec in Serrano-Pedraza et al.,^{9} yet the mean threshold of adult participants on ASTEROID was 57 arcsec (Fig. 12). Thus, threshold estimates are dependent upon our techniques for obtaining subpixel disparities. Inaccuracies in our measurement of viewing distance could also contribute. If the actual viewing distance was larger than that recorded by the device, then a given screen parallax would be recorded as an erroneously high retinal disparity, leading to an erroneously high estimate of the participant's stereo threshold. These reasons would be unsatisfactory because they could introduce sources of variability. However, some reassurance is provided by the fact that the error estimate from simulations generally aligns well with the variability observed over repeated measurements in the same observer. The main source of variability in ASTEROID thresholds is simply that imposed by the statistics of 20 trials.
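The effect of a viewing-distance error on the recorded disparity can be illustrated with a small Python sketch. It assumes the standard small-angle geometry in which the angle subtended by a screen parallax is approximately the parallax divided by the viewing distance. The parallax and distances below are hypothetical values chosen for illustration, not measurements from the study, and the function name is likewise hypothetical.

```python
import math

ARCSEC_PER_RADIAN = 3600.0 * 180.0 / math.pi


def disparity_arcsec(parallax_mm: float, distance_mm: float) -> float:
    """Angle subtended by a screen parallax at a given viewing distance, in arcsec."""
    return math.atan(parallax_mm / distance_mm) * ARCSEC_PER_RADIAN


# Hypothetical example: the device records 40 cm, but the observer sits at 45 cm.
parallax_mm = 0.1
recorded = disparity_arcsec(parallax_mm, 400.0)  # what the device believes
actual = disparity_arcsec(parallax_mm, 450.0)    # what the observer receives

print(f"recorded {recorded:.1f} arcsec, actual {actual:.1f} arcsec, "
      f"overestimate by a factor of {recorded / actual:.3f}")
```

In this geometry the overestimate scales with the ratio of actual to recorded distance, so a 12.5% underestimate of distance inflates the recorded disparity, and hence the threshold estimate, by about the same proportion.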

**K. Vancleef**, None;

**I. Serrano-Pedraza**, None;

**C. Sharp**, None;

**G. Slack**, None;

**C. Black**, None;

**T. Casanova**, None;

**J. Hugill**, None;

**S. Rafiq**, None;

**J. Burridge**, None;

**V. Puyat**, None;

**J.E. Enongue**, None;

**H. Gale**, None;

**H. Akotei**, None;

**Z. Collier**, None;

**H. Haggerty**, None;

**K. Smart**, None;

**C. Powell**, None;

**K. Taylor**, None;

**M.P. Clarke**, None;

**G. Morgan**, None;

**J.C.A. Read**, Magic Leap (C), Huawei (F), Philosophical Transactions of Royal Society B (S)

*Cochrane Database Syst Rev*. 2013; 7: CD004917.

*Eye*. 1996; 10: 282–285.

*Eye (Lond)*. 2015; 29: 214–224.

*J Vis*. 2016; 16: 13.

*Can J Ophthalmol*. 1991; 26: 12–17.

*Perception*. 1997; 26: 977–994.

*J Physiol*. 1970; 211: 599–622.

*Invest Ophthalmol Vis Sci*. 1979; 18: 614–621.

*Invest Ophthalmol Vis Sci*. 2016; 57: 960–970.

*Graefes Arch Clin Exp Ophthalmol*. 2001; 239: 562–566.

*Behav Res Methods Instrum*. 1982; 14: 21–25.

*Am J Ophthalmol*. 2013; 156: 195–201.

*Am J Ophthalmol*. 2011; 151: 1081–1086.

*PLoS One*. 2015; 10: e0116626.

*Invest Ophthalmol Vis Sci*. 2006; 47: 4842–4846.

*Invest Ophthalmol Vis Sci*. 2016; 57: 798–804.

*PLoS One*. 2018; 13: e0201366.

*ACM International Conference Proceeding Series*. Part F1286. New York, NY: ACM. 2017: 216–220.

*PLoS One*. 2018; 13: e0201366.

*Vision Res*. 1990; 30: 1955–1970.

*J Optom*. 2009; 2: 3–18.

*J Opt Soc Am*. 1963; 53: 129–160.

*Science*. 1961; 134: 168–177.

*Percept Psychophys*. 2001; 63: 1293–1313.

*J Vis*. 2016; 16: 838.

*Proceedings of the 11th annual conference on Computer graphics and interactive techniques - SIGGRAPH '84*. New York, NY: ACM Press; 1984; 18: 253–259.

*Vision Res*. 1994; 34: 885–912.

*Exp Brain Res*. 1970; 10: 380–388.

*PLoS One*. 2013; 8: e82999.

*Br J Ophthalmol*. 2006; 90: 91–95.

*Vision Res*. 2015; 110: 34–50.

*Invest Ophthalmol Vis Sci*. 1997; 38: 557–568.

*Perception*(ECVP Abstract Supplement). 2007; 36: 1–16.

*J Vis*. 2015; 15: 2.

*Percept Psychophys*. 1979; 26: 168–170.

*J Acoust Soc Am*. 1971; 49: 505–508.

*J Acoust Soc Am*. 1967; 41: 782–787.

*J Acoust Soc Am*. 1993; 93: 2096–2105.

*J Acoust Soc Am*. 2012; 132: 957–967.

*Ophthalmic Physiol Opt*. 2017; 37: 507–520.

*J Acoust Soc Am*. 1990; 87: 2662–2674.

*XVI Meeting of the Child Vision Research Society*. 2017: 8.

*Lancet*. 1986; 1: 307–310.

*Ophthalmology*. 2009; 116: 281–285.

*Iperception*. 2015; 6: 204166951559302.

*Invest Ophthalmol Vis Sci*. 2003; 44: 891–900.

*J Neurosci*. 2014; 34: 1397–1408.

*International Society for Optics and Photonics*. 2011; 7863: 786313.

*IEEE Trans Broadcast*. 2011; 57: 445–453.

*Int J Ophthalmol*. 2015; 8: 374–381.