**Purpose**:
To develop a class of new metrics, robust to issues that can arise with AI-based methods, for evaluating the performance of intraocular lens power calculation formulas.

**Methods**:
The dataset consists of surgical information and biometry measurements of 6893 eyes of 5016 cataract patients who received Alcon SN60WF lenses at University of Michigan's Kellogg Eye Center. We designed two new types of metrics, the MAEPI (Mean Absolute Error in Prediction of Intraocular Lens [IOL]) and the CIR (Correct IOL Rate), and compared them with traditional metrics including the mean absolute error (MAE), median absolute error, and standard deviation. We evaluated the new metrics with simulation analysis, machine learning (ML) methods, and existing IOL formulas (Barrett Universal II, Haigis, Hoffer Q, Holladay 1, PearlDGS, and SRK/T).

**Results**:
Results of traditional metrics did not accurately reflect the performance of overfitted ML formulas. By contrast, MAEPI and CIR discriminated between accurate and inaccurate formulas. The standard IOL formulas received low MAEPI and high CIR, which were consistent with the results of the traditional metrics.

**Conclusions**:
MAEPI and CIR provide a more accurate reflection of the real-life performance of AI-based IOL formulas than traditional metrics. They should be computed in conjunction with conventional metrics when evaluating the performance of new and existing IOL formulas.

**Translational Relevance**:
The proposed new metrics would help cataract patients avoid the risks caused by inaccurate AI-based formulas, whose true performance cannot be determined by traditional metrics.

^{1–3} These are standard evaluation metrics commonly used for regression problems in which the target value is a scalar. The MAE summarizes the average distance between the prediction and the true value. The MedAE evaluates the median deviation and is less sensitive to outliers and extreme values. The standard deviation (SD) measures the extent of scattering of the PE. Aside from these standard metrics, ophthalmologists also calculate the percentage of PEs within a certain range (e.g., ±0.25 *D*, ±0.5 *D*) and the performance in different axial length (AL) groups (short, medium, and long). The former is a convenient way of investigating the distribution of PEs. The latter aids in determining whether a formula has consistent performance among myopic, hyperopic, and regular eyes. Recently, Hoffer and Savini^{2} demonstrated a new evaluation metric, the IOL Formula Performance Index, which combines multiple metrics into one: (1) the SD, (2) the MedAE, (3) the AL bias, and (4) the percentage of eyes with PE within ±0.5 *D*. Holladay et al.^{4} reviewed IOL calculation evaluation metrics and recommended the SD as the single best measurement, because the SD allows the use of heteroscedastic statistical methods and predicts the percentage of cases within a given interval, the mean absolute error, and the median absolute error. However, this conclusion was drawn from the results of 11 optics-based IOL formulas (Barrett, Olsen, Haigis, Haigis WK, Holladay 1, Holladay 1 WK, Holladay 2, SRK/T, SRK/T WK, Hoffer Q, and Hoffer Q WK), which have been validated extensively with real-world datasets. For ML-based formulas, the algorithm is oftentimes a black box whose exact behavior is not known a priori. When evaluating or developing novel ML-based IOL formulas, it is important that the evaluation metric be appropriately selected and robust enough that the trained model generalizes to unseen data. In addition, there is evidence from the study of Gatinel et al.^{5} indicating that the lens constant value that serves to cancel the systematic bias is likely to unpredictably vary the SD.

^{6,7} because they measure the average behavior of the most frequent cases, although rare cases may be of greater interest. In contrast to the numerous publications focused on imbalanced classification, little research has been conducted on metrics for imbalanced regression. As summarized in Table 1, previously proposed metrics for imbalanced regression problems include weighted errors and asymmetric loss functions,^{8} precision-recall evaluation framework-based metrics,^{9,10} receiver operating characteristic curves for regression,^{11–13} and ranking-based evaluation.^{14} Unfortunately, none of these metrics targets the underlying optimization goal within the context of IOL power prediction.

^{15} we identified multiple weaknesses of conventional metrics that, to our knowledge, have not been discussed in the literature. Traditional metrics can generate misleading information for formulas developed solely from historical data, as is commonly the case with machine learning-based methods. In this work, we demonstrate a series of new IOL formula accuracy evaluation metrics, which should be used alongside traditional metrics when evaluating the performance of IOL formulas.

^{16–20} We included the manifest refractions measured by trained technicians employed by University of Michigan's Kellogg Eye Center at or closest to one month after surgery. The postoperative refraction was computed with the following equation, using an adjustment for the lane length at Kellogg Eye Center (10 feet, 3.048 meters): spherical equivalent (SE) refraction = (spherical component − 0.1614) + 0.5 × cylindrical component. The adjustment factor was determined according to Simpson and Charman's recommendation.^{21}
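The adjustment above can be expressed as a small helper (the function name and default are illustrative; 0.1614 D is the lane-length correction quoted in the text):

```python
def adjusted_se(sphere, cylinder, lane_adjustment=0.1614):
    """Spherical equivalent (SE) refraction in diopters, adjusted for a
    10-foot (3.048 m) refraction lane: SE = (sphere - adj) + 0.5 * cylinder."""
    return (sphere - lane_adjustment) + 0.5 * cylinder
```

For example, a manifest refraction of −0.25 −0.50 × 180 yields an adjusted SE of (−0.25 − 0.1614) + 0.5 × (−0.50) = −0.6614 D.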

*n* eyes, for the *i*^{th} eye, we shall denote the actual postoperative refraction as \(y_i\) and the predicted postoperative refraction with a given prediction method for the implanted IOL as \({\hat{y}_i}\). Further, we define \(e_i\) as the refraction prediction error (PE) of the *i*^{th} eye; the PE equals the actual refraction minus the predicted refraction: \({e_i} = {y_i} - {\hat{y}_i}\). The mean absolute error can then be calculated as follows: \(\mathrm{MAE} = \frac{1}{n}\sum_{i = 1}^{n} |{e_i}|\).

^{2} (3) the AL bias, represented as *m*, calculated as the slope of the correlation between the prediction error and the AL; and (4) the percentage of patients with prediction errors within ±0.5 D, represented as *n*.

For the *i*^{th} eye, we define its implanted IOL power as \(p_i\) and the predicted IOL power as \({\hat{p}_i}\). For IOL power *p* (for example, 6 *D* ≤ *p* ≤ 30 *D*, *step* = 0.5 *D*), we defined the corresponding predicted postoperative refraction as \(\hat{y}_i^p\). The predicted IOL power \({\hat{p}_i}\) is found by minimizing the absolute difference between the actual postoperative refraction \(y_i\) and the predicted postoperative refraction \(\hat{y}_i^p\) while altering the value of *p*. Therefore the relationship between \({\hat{p}_i}\) and \(\hat{y}_i^p\) can be defined with the following equation (Table 2): \({\hat{p}_i} = \mathop{\arg\min}_{p} |{y_i} - \hat{y}_i^p|\). Specific examples for calculating the MAEPI and CIR are shown in Supplementary Materials Appendix A. Lower MAEPI and higher CIR mean better prediction performance.
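A minimal sketch of these definitions follows. The callable `predict_refraction(p)` stands in for an arbitrary IOL formula, and the CIR is computed here as the fraction of eyes whose predicted power falls within a tolerance of the implanted power; the paper's exact CIR definition may differ:

```python
def predict_iol_power(predict_refraction, y_actual, p_min=6.0, p_max=30.0, step=0.5):
    """p-hat: the grid power whose predicted postoperative refraction is
    closest to the actual refraction (the arg-min described in the text)."""
    n = int(round((p_max - p_min) / step))
    powers = [p_min + step * k for k in range(n + 1)]
    return min(powers, key=lambda p: abs(y_actual - predict_refraction(p)))

def maepi_and_cir(implanted, predicted, tolerance=0.0):
    """MAEPI: mean |p_i - p_hat_i|.  CIR: fraction with |p_i - p_hat_i| <= tolerance."""
    errors = [abs(p - ph) for p, ph in zip(implanted, predicted)]
    maepi = sum(errors) / len(errors)
    cir = sum(e <= tolerance for e in errors) / len(errors)
    return maepi, cir
```

For a toy linear formula y(p) = −0.7 × (p − 20), a patient with an actual refraction of −0.3 D is assigned a predicted power of 20.5 D, the grid power whose predicted refraction (−0.35 D) lies closest to the actual value.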

We use \(e_i\) (\(e \in \mathbb{R}\)) to represent the refraction PE of the implanted IOL. We denote by *d* the increment of the predicted refraction when the IOL power is increased by 0.5 D. Because the spherical equivalent refraction of the eye should decrease with increasing power of the IOL, we assume here that *d* is always negative. Scenario (1): The IOL power and the predicted refraction have a linear relationship, meaning that *d* < 0 is a constant for all patients. Scenario (2): The IOL power and predicted refraction have a nonlinear relationship, meaning that *d* < 0 is not a constant. Scenario (3): Predictions are random.

When *d* is a negative constant, it can be proven that the MAEPI and the refraction MAE are always consistent, meaning that they always have a non-negative correlation. Consider a fictitious case *A*, wherein the PE of the implanted IOL is \(e_a\) (\(e_a \in \mathbb{R}\)). We can represent the predicted IOL power as (*implanted IOL power* + *step of the IOL power* × \(n_a\)), where \({n_a} \in \mathbb{Z}\). The difference between the actual postoperative refraction and the predicted refraction for the predicted IOL power is then \(e_{pa} = e_a - n_a d\). Similarly, for case *B*, the PE of the implanted IOL is \(e_b\) (\(e_b \in \mathbb{R}\)), and the difference between the actual refraction and the predicted refraction for the predicted IOL power is \(e_{pb} = e_b - n_b d\), where \({n_b} \in \mathbb{Z}\). Based on the mathematical derivation in Supplementary Materials Appendix B, we see that \(|e_a| > |e_b|\) implies \(|n_a| \geq |n_b|\) for arbitrary eyes *A* and *B*, and thus the MAEPI and the refraction MAE have a non-negative correlation under the aforementioned conditions.

*A*, *B*, *C*, and *D* are functions of preoperative biometry measurements. The first derivative of this function is *d* < 0, and *d* is not a constant. In scenario (3), there is no assumed relationship between *e* and *d*. This scenario helps to demonstrate the general behavior of the MAEPI and the refraction prediction MAE, with no assumptions on the characteristics of the IOL formula.

*e* was randomly generated for each simulated patient. A fixed value of *d* was used across all cases and all IOL powers for scenario (1). A random value of *d* was selected for each simulated patient and each IOL power for scenario (2). Schematics of each scenario are illustrated in Figure 1A, and specific examples of simulated patients are shown in Supplementary Materials Appendix E.
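The three scenarios can be sketched as follows (the value ranges, power grid, and monotone construction are illustrative choices, not the paper's exact simulation parameters):

```python
import random

POWERS = [6.0 + 0.5 * k for k in range(49)]  # 6.0 D to 30.0 D in 0.5 D steps

def simulated_refractions(scenario, rng):
    """Map each grid IOL power to a predicted refraction for one simulated eye.
    Scenario 1: a fixed d < 0 per 0.5 D step (linear in power).
    Scenario 2: d < 0 redrawn at every step (nonlinear but still decreasing).
    Scenario 3: predictions drawn at random, unrelated to power."""
    if scenario == 3:
        return {p: rng.uniform(-3.0, 3.0) for p in POWERS}
    preds = {}
    y = rng.uniform(1.0, 3.0)        # refraction at the lowest grid power
    d = rng.uniform(-0.7, -0.1)      # per-step refraction change
    for p in POWERS:
        preds[p] = y
        if scenario == 2:            # nonlinear: redraw d at every step
            d = rng.uniform(-0.7, -0.1)
        y += d
    return preds
```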

^{22}) and XGBoost^{23} (an ML framework for gradient-boosted tree algorithms) with the training set. SVM and XGBoost are both state-of-the-art ML frameworks commonly used for a wide variety of applications, including ophthalmology. The preoperative biometry measurements and patient information used as features were as follows: the SN60WF IOL power (D), patient gender, patient age at surgery (years), eye laterality, axial length (mm), central corneal thickness (µm), aqueous depth (mm), anterior chamber depth (mm), lens thickness (mm), flat and steep keratometry (D), astigmatism (D), and white-to-white distance (mm). The hyperparameters of the machine learning models were optimized with fivefold cross-validation by minimizing the cross-validation MAE. The values of the hyperparameters are shown in Supplementary Materials Appendix G.
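The fivefold cross-validation selection can be sketched in pure Python (the toy model and hyperparameter grid are illustrative; the study itself used scikit-learn's SVM and XGBoost):

```python
import statistics

def kfold_indices(n, k=5):
    """Split indices 0..n-1 into k contiguous folds (shuffling omitted for brevity)."""
    size = n // k
    return [list(range(i * size, (i + 1) * size if i < k - 1 else n)) for i in range(k)]

def cv_mae(train_fn, predict_fn, X, y, hyper, k=5):
    """Mean MAE over k held-out validation folds for one hyperparameter setting."""
    fold_maes = []
    for val in kfold_indices(len(X), k):
        val_set = set(val)
        tr = [i for i in range(len(X)) if i not in val_set]
        model = train_fn([X[i] for i in tr], [y[i] for i in tr], hyper)
        fold_maes.append(statistics.mean(abs(y[i] - predict_fn(model, X[i])) for i in val))
    return statistics.mean(fold_maes)

def select_hyper(train_fn, predict_fn, X, y, grid, k=5):
    """Pick the hyperparameter minimizing the cross-validation MAE."""
    return min(grid, key=lambda h: cv_mae(train_fn, predict_fn, X, y, h, k))
```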

^{24,25} The Haigis, Hoffer Q, Holladay 1, and SRK/T formulas were implemented in Python based on their publications.^{26–33} The formula constants were optimized by zeroing out the mean prediction error on the training data (Table 4). We plotted and calculated the correlation between the IOL power prediction errors and the refraction prediction errors for the above-mentioned formulas.
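Zeroing out the mean prediction error can be sketched as a one-dimensional root search over the lens constant. The callable, bounds, and the toy formula in the test are illustrative; each formula's constant of course enters through its own equations:

```python
def optimize_lens_constant(predict_refraction, eyes, lo=110.0, hi=125.0, tol=1e-6):
    """Bisection for the constant c at which the mean prediction error,
    mean(y - predict_refraction(eye, c)), crosses zero. Assumes the mean PE
    changes sign monotonically between lo and hi."""
    def mean_pe(c):
        return sum(y - predict_refraction(x, c) for x, y in eyes) / len(eyes)
    f_lo = mean_pe(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = mean_pe(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```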

Statistical significance was defined as a *P* value < 0.05.

Varying the values of *e*, *d*, or the refraction prediction in Table 3 did not change the general characteristics of the simulation results (results not shown).

The Wilcoxon test *P* value for the refraction MAEs of Holladay and XGBoost was not significant (*P* > 0.05) (complete results shown in Supplementary Table S2). The Wilcoxon test *P* value for the MAEPIs was significant (*P* < 0.05) (complete results shown in Supplementary Table S3).

*D*, and for Holladay 1 it was (−0.0346) − (−0.347) = 0.3124 *D*. On the other hand, the IOL power PE for XGBoost can be calculated as *actual IOL power* − *predicted IOL power* = 20.5 − 19 = 1.5 *D*, and for Holladay 1 it was 20.5 − 20 = 0.5 *D*.

- (1) The basic mechanism of ML models is to learn patterns from historical data, which sometimes makes them vulnerable to noise and biases in the training data and thus prone to overfitting the historical data if not properly trained. We trained machine learning models with the SVM algorithm and the XGBoost framework and showed that both can be overfitted. It is possible to mitigate overfitting through machine learning techniques such as data augmentation or resampling, or by integrating theoretical components into the ML model. Although of potential interest, a review of these methods is beyond the scope of this work.
- (2) The IOL powers in the dataset were chosen through clinical decision making, and the postoperative refractions were influenced by the IOL powers. Both quantities have a unimodal distribution (Fig. 3), meaning that the distribution has one clear peak. For this reason, a calculator that simply predicts random values around the mean of the historical data achieved a reasonable MAE and SD of refraction prediction errors (Table 6, "Random" method). This way of fooling the refraction metrics can be easily "learned" by machine learning algorithms. As mentioned before, a model can overfit to meaningless information or information that is specific to the dataset. In the case of IOL prediction, the postoperative refractions are not randomly drawn from a distribution but are influenced by the IOL power, which is manually selected by a surgeon based on discussions of the refractive target with the patient (typically between −3.0 and 0.0 D). The fact that the postoperative refraction in historical datasets follows a certain distribution can be misleading for ML models. In practice, ML algorithms tend to take advantage of such properties during training and inadvertently develop models that are not representative of the true system underlying the observations. This is a generalization issue caused by the mismatch between historical datasets and unseen patients.
- (3) When evaluating formulas on a historical dataset, the conventional metrics assume that the implanted IOL powers corresponding to the postoperative refractions are known, which is not the case in real clinical settings. In contrast, the IOL metrics make no assumption about which IOL power was implanted. Suppose the patient in Table 8 were an incoming patient in a clinical trial testing the performance of XGBoost and Holladay 1, with a target refraction of −0.0346 D: the surgeon using XGBoost would end up picking a 19.0 D lens, whereas the surgeon using Holladay 1 would pick a 20.0 D lens. It is likely that the clinical trial results would agree with the MAEPI rather than with the refraction MAE calculated from Table 8. Therefore the IOL metrics are more intuitive and better represent the real-life performance of IOL formulas developed with modern empirical methods such as ML.
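Point (2) can be illustrated with a quick sketch: a biometry-blind "formula" that samples around the historical mean of a unimodal refraction distribution attains a deceptively reasonable refraction MAE (the distributions and their parameters are purely illustrative):

```python
import random
import statistics

rng = random.Random(42)
# Unimodal historical postoperative refractions, peaking near -0.3 D:
actual = [rng.gauss(-0.3, 0.4) for _ in range(1000)]
# A "formula" that ignores biometry and guesses near the historical mean:
guessed = [rng.gauss(-0.3, 0.1) for _ in range(1000)]
mae = statistics.mean(abs(a - g) for a, g in zip(actual, guessed))
# mae lands in the same ballpark as genuine formulas' refraction MAE,
# even though the predictions carry no patient-specific information.
```

With independent draws, the expected MAE here is σ√(2/π) ≈ 0.33 D for σ = √(0.4² + 0.1²), comparable to typical published refraction MAEs despite the predictor being uninformative.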

^{6,7,34} However, the existing evaluation frameworks for imbalanced regression do not provide an immediate solution for evaluating the performance of IOL formulas. The regression error characteristic curve, which plots the absolute deviation tolerance versus accuracy, can be viewed as an expanded version of the proportions of patients with errors within different intervals.^{11} The regression error characteristic curve describes the cumulative distribution function of the errors; however, its limitations are also recognized in previous research.^{12} The standard deviation of the prediction error was recently put forth as the "single best parameter" to characterize the performance of an IOL formula.^{4} However, that article investigated only geometric optics-based formulas that have been tested extensively over the years. From the point of view of a surgeon, a critical question is whether reported evaluation metrics can predict the real-life performance of a new formula. In this study, we demonstrated that the SD, as well as the other traditional metrics such as the MAE, the MedAE, and the percentage of patients in different error intervals, can be easily fooled by ML methods (Table 6). The Random, SVM, and XGBoost models achieved abnormally high FPI values (Table 6), which implies that the FPI is not a reliable evaluation metric.

**T. Li**, None; **J.D. Stein**, None; **N. Nallasamy**, None

*J Cataract Refract Surg*. 2017; 43: 999–1002.

*Ophthalmology*. 2021; 128(11): e115–e120.

*Am J Ophthalmol*. 2015; 160: 403–405.e1.

*J Cataract Refract Surg*. 2021; 47: 65–77.

*Transl Vis Sci Technol*. 2022; 11: 5.

*Imbalanced Learning: Foundations, Algorithms, and Applications*. 2013: 187–206.

*First International Workshop on Learning with Imbalanced Domains: Theory and Applications*. PMLR; 2017: 129–140.

*Econ Theory*. 1997; 13: 808–817.

*European Conference on Principles of Data Mining and Knowledge Discovery*. Springer; 2007: 597–604.

*Utility-Based Regression*. Dissertation. Porto, Portugal: University of Porto; 2011.

*Proceedings of the 20th International Conference on Machine Learning (ICML-03)*. 2003: 43–50.

*Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining*. 2005: 697–702.

*Knowl Inf Syst*. 2007; 12: 331–353.

*Br J Ophthalmol*. https://doi.org/10.1136/bjophthalmol-2021-320599.

*JAMA Ophthalmol*. 2019; 137: 491–497.

*JAMA Ophthalmol*. 2020; 138: 974–980.

*Br J Ophthalmol*. 2022; 106: 1222–1226.

*Transl Vis Sci Technol*. 2020; 9: 38.

*BMC Ophthalmol*. 2021; 21: 183.

*J Refract Surg*. 2014; 30: 726.

*J Mach Learn Res*. 2011; 12: 2825–2830.

*Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*. New York: Association for Computing Machinery; 2016: 785–794.

*J Cataract Refract Surg*. 1990; 16: 333–340.

*J Cataract Refract Surg*. 1993; 19: 700–712.

*Graefe's Arch Clin Exp Ophthalmol*. 2000; 238: 765–773.

*J Cataract Refract Surg*. 1988; 14: 17–24.

*J Cataract Refract Surg*. 1990; 16: 528.

*J Cataract Refract Surg*. 2007; 33: 2.

*J Cataract Refract Surg*. 2007; 33: 2–3.

*International Conference on Enterprise Information Systems*. Berlin: Springer; 2011: 35–50.