March 2023
Volume 12, Issue 3
Open Access
Artificial Intelligence
MAEPI and CIR: New Metrics for Robust Evaluation of the Prediction Performance of AI-Based IOL Formulas
Author Affiliations & Notes
  • Tingyang Li
    Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
  • Joshua D. Stein
    Kellogg Eye Center, Department of Ophthalmology and Visual Sciences, University of Michigan, Ann Arbor, MI, USA
    Center for Eye Policy and Innovation, University of Michigan, Ann Arbor, MI, USA
    Department of Health Management and Policy, University of Michigan School of Public Health, Ann Arbor, MI, USA
  • Nambi Nallasamy
    Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
    Kellogg Eye Center, Department of Ophthalmology and Visual Sciences, University of Michigan, Ann Arbor, MI, USA
  • Correspondence: Nambi Nallasamy, Kellogg Eye Center, University of Michigan, 1000 Wall St., Ann Arbor, MI 48105, USA. e-mail: nnallasa@med.umich.edu 
Translational Vision Science & Technology March 2023, Vol.12, 29. doi:https://doi.org/10.1167/tvst.12.3.29
      Tingyang Li, Joshua D. Stein, Nambi Nallasamy; MAEPI and CIR: New Metrics for Robust Evaluation of the Prediction Performance of AI-Based IOL Formulas. Trans. Vis. Sci. Tech. 2023;12(3):29. https://doi.org/10.1167/tvst.12.3.29.

Abstract

Purpose: To develop a class of new metrics for evaluating the performance of intraocular lens power calculation formulas robust to issues that can arise with AI-based methods.

Methods: The dataset consists of surgical information and biometry measurements of 6893 eyes of 5016 cataract patients who received Alcon SN60WF lenses at University of Michigan's Kellogg Eye Center. We designed two types of new metrics: the MAEPI (Mean Absolute Error in Prediction of Intraocular Lens [IOL]) and the CIR (Correct IOL Rate) and compared the new metrics with traditional metrics including the mean absolute error (MAE), median absolute error, and standard deviation. We evaluated the new metrics with simulation analysis, machine learning (ML) methods, as well as existing IOL formulas (Barrett Universal II, Haigis, Hoffer Q, Holladay 1, PearlDGS, and SRK/T).

Results: Results of traditional metrics did not accurately reflect the performance of overfitted ML formulas. By contrast, MAEPI and CIR discriminated between accurate and inaccurate formulas. The standard IOL formulas received low MAEPI and high CIR, which were consistent with the results of the traditional metrics.

Conclusions: MAEPI and CIR provide a more accurate reflection of the real-life performance of AI-based IOL formulas than traditional metrics. They should be computed in conjunction with conventional metrics when evaluating the performance of new and existing IOL formulas.

Translational Relevance: The proposed new metrics would help cataract patients avoid the risks caused by inaccurate AI-based formulas, whose true performance cannot be determined by traditional metrics.

Introduction
The prediction performance of intraocular lens (IOL) formulas for cataract patients is usually evaluated with the following metrics: the mean prediction error (ME), the mean absolute error (MAE), the median absolute error (MedAE), and the standard deviation (SD) of the prediction error (PE), as recommended in multiple publications.1–3 These are standard evaluation metrics commonly used for regression problems in which the target value is a scalar. The MAE summarizes the average distance between the prediction and the true value. The MedAE evaluates the median deviation and is less sensitive to outliers and extreme values. The SD measures the extent of scattering of the PEs. Aside from these standard metrics, ophthalmologists also calculate the percentage of PEs within a certain range (e.g., ±0.25 D, ±0.5 D) and the performance in different axial length (AL) groups (short, medium, and long). The former is a convenient way of investigating the distribution of PEs. The latter aids in determining whether a formula has consistent performance among myopic, hyperopic, and regular eyes. Recently, Hoffer and Savini2 demonstrated a new evaluation metric, the IOL Formula Performance Index, which combines multiple metrics into one: (1) the SD, (2) the MedAE, (3) the AL bias, and (4) the percentage of eyes with PE within ±0.5 D. Holladay et al.4 reviewed IOL calculation evaluation metrics and recommended the SD as the single best measurement, because the SD allows the use of heteroscedastic statistical methods and predicts the percentage of cases within a given interval, the mean absolute error, and the median absolute error. However, this conclusion was drawn from the results of 11 optics-based IOL formulas (Barrett, Olsen, Haigis, Haigis WK, Holladay 1, Holladay 1 WK, Holladay 2, SRK/T, SRK/T WK, Hoffer Q, and Hoffer Q WK), which have been validated extensively with real-world datasets. 
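As an illustration, the conventional metrics above can be computed directly from a list of prediction errors. The sketch below is ours (function and key names are not from the study) and uses only the definitions stated in the text:

```python
import statistics

def conventional_metrics(pe):
    """Standard IOL-formula metrics over a list of prediction errors (D)."""
    abs_err = [abs(e) for e in pe]
    return {
        "ME": statistics.mean(pe),            # mean prediction error
        "MAE": statistics.mean(abs_err),      # mean absolute error
        "MedAE": statistics.median(abs_err),  # median absolute error
        "SD": statistics.stdev(pe),           # sample SD of the PE (n - 1)
        "pct_within_0.5": 100.0 * sum(a <= 0.5 for a in abs_err) / len(pe),
    }
```

For example, `conventional_metrics([0.1, -0.2, 0.3, -0.4])` yields an MAE and MedAE of 0.25 D with every error inside ±0.5 D.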
For ML-based formulas, the algorithm is oftentimes a black box whose exact behavior is not known a priori. When evaluating or developing novel ML-based IOL formulas, it is important that the evaluation metric be appropriately selected and robust enough that the trained model can generalize to unseen data. In addition, there is evidence from the study of Gatinel et al.5 indicating that the lens constant value that serves to cancel the systematic bias is likely to vary the SD unpredictably. 
A special characteristic of cataract patient datasets is that the data are highly imbalanced, because the IOL powers were manually selected rather than randomly drawn, and some powers were selected more often than others. The postoperative refractions therefore represent a highly biased view of the expected outcome when the IOL power is not specified. When trained on such imbalanced datasets without additional regularization and thoughtful algorithm design, ML predictions are likely to be dominated by the over-represented domains; in other words, the algorithm will tend to always predict the most common numbers (for regression) or labels (for classification). 
For imbalanced regression problems, standard evaluation metrics (such as the MAE) are known to frequently provide misleading conclusions6,7 because they measure the average behavior of the most frequent cases, although rare cases may be of greater interest. In contrast to the numerous publications focused on imbalanced classification, little research has been conducted on metrics for imbalanced regression. As summarized in Table 1, previously proposed metrics for imbalanced regression problems include weighted errors, asymmetric loss functions,8 precision-recall evaluation framework-based metrics,9,10 receiver operating characteristic curves for regression,11–13 and ranking-based evaluation.14 Unfortunately, none of these metrics targets the underlying optimization goal within the context of IOL power prediction. 
Table 1.
 
Metrics for Imbalanced Regression
Choosing the correct scoring metric is a critical first step for developing or evaluating IOL formulas. Through our previous work developing machine learning-based IOL formulas,15 we identified multiple weaknesses of conventional metrics that, to our knowledge, have not been discussed in the literature. Traditional metrics can generate misleading information for formulas developed solely based on historical data, as is commonly the case with machine learning-based methods. In this work, we demonstrate a series of new IOL formula accuracy evaluation metrics, which should be used alongside traditional metrics when evaluating the performance of IOL formulas. 
Methods
Data Collection
We collected medical records of patients receiving care at the University of Michigan between August 25, 2015, and June 27, 2019. A total of 49 surgeons performed the surgeries included in the dataset. All patients were measured before surgery with Lenstar LS 900 optical biometers (Haag-Streit USA Inc., EyeSuite software version i9.1.0.0) at University of Michigan's Kellogg Eye Center. Patient demographics (including patient age, gender, and ethnicity), the implanted IOL powers, and the postoperative refractions were obtained from the Sight Outcomes Research Collaborative (SOURCE) Ophthalmology Data Repository. The data in SOURCE have been described and used in various studies.16–20 We included the manifest refractions measured by trained technicians employed by University of Michigan's Kellogg Eye Center at or closest to one month after surgery. The postoperative refraction was computed with the following equation, using an adjustment for the lane length at Kellogg Eye Center (10 feet, 3.048 meters): spherical equivalent (SE) refraction = (spherical component − 0.1614) + 0.5 × cylindrical component. The adjustment factor was determined according to Simpson and Charman's recommendation.21 
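The lane-length adjustment amounts to a one-line calculation; the sketch below uses the constants stated in the text (the function name is ours):

```python
def adjusted_se(sphere, cylinder):
    """Spherical equivalent refraction adjusted for a 10-foot (3.048 m)
    lane, per the equation stated in the text:
    SE = (spherical component - 0.1614) + 0.5 * cylindrical component."""
    return (sphere - 0.1614) + 0.5 * cylinder

# Example: sphere 0.00 D, cylinder -0.50 D gives roughly -0.41 D
se = adjusted_se(0.0, -0.5)
```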
Patients who received uneventful phacoemulsification cataract surgery (Current Procedural Terminology code = 66982 or 66984) and implantation of Alcon SN60WF one-piece acrylic monofocal lenses were included in this study. We excluded (1) patients who received previous refractive surgery or additional procedures during cataract surgery; (2) patients with postoperative visual acuity worse than 20/40; (3) records that were incomplete or out of bounds for any of the six IOL formulas analyzed in this study (Barrett Universal II, Haigis, Hoffer Q, Holladay 1, PearlDGS, and SRK/T). 
This research was conducted in compliance with the Institutional Review Board (IRB) at the University of Michigan. Informed consent was not applicable because this is a retrospective study, and all cases were fully anonymized. The study was carried out in accordance with the tenets of the Declaration of Helsinki. 
Conventional Metrics
Among a total of n eyes, for the ith eye, we shall denote the actual postoperative refraction as yi, and the predicted postoperative refraction with a given prediction method for the implanted IOL as \({\hat{y}_i}\). Further we define ei as the refraction prediction error (PE) of the ith eye, and the PE equals the actual refraction minus the predicted refraction: \({e_i} = {y_i} - {\hat{y}_i}\). The mean absolute error can then be calculated as follows:  
\begin{eqnarray*}refraction\ MAE = \frac{{\mathop \sum \nolimits_{i = 1}^n \left| {{y_i} - {{\hat{y}}_i}} \right|}}{n}\end{eqnarray*}
 
The standard deviation of the PE can be calculated as:  
\begin{eqnarray*}SD = \sqrt {\frac{{\mathop \sum \nolimits_{i = 1}^n {{\left| {{e_i} - \bar{e}} \right|}^2}}}{{n - 1}}}, \bar{e} = \frac{{\mathop \sum \nolimits_{i = 1}^n {e_i}}}{n} \end{eqnarray*}
 
The formula performance index (FPI) is calculated as demonstrated by Hoffer et al.2 
\begin{eqnarray*}FPI = \frac{1}{{SD + MedAE + 10*abs\left( m \right) + 10*{{\left( {\frac{n}{{10}}} \right)}^{ - 1}}}}\end{eqnarray*}
 
Four key elements are involved in the FPI: (1) the SD; (2) the MedAE; (3) the axial length (AL) bias m, calculated as the slope of the correlation between the prediction error and the AL; and (4) the percentage of patients with prediction errors within ±0.5 D, represented as n. 
The MAEPI and CIR
In this section, we give the specific definitions of the MAEPI (Mean Absolute Error of the Prediction of the IOL) and the Correct IOL Rate (CIR). In addition to the previously defined notations, for the ith eye, we define the implanted IOL power as pi and the predicted IOL power as \({\hat{p}_i}\). For an IOL power p (for example, 6 D ≤ p ≤ 30 D, step = 0.5 D), we define the corresponding predicted postoperative refraction as \(\hat{y}_i^p\). The predicted IOL power \({\hat{p}_i}\) is found by minimizing the absolute difference between the actual postoperative refraction yi and the predicted postoperative refraction \(\hat{y}_i^p\) while altering the value of p. Therefore the relationship between \({\hat{p}_i}\) and \(\hat{y}_i^p\) can be defined with the following equation (Table 2):  
\begin{eqnarray*}{\hat{p}_i} = \mathop {{\rm{argmin}}}\limits_{6 \le p \le 30,step = 0.5} \left| {{y_i} - \hat{y}_i^p} \right|\end{eqnarray*}
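The argmin above can be implemented as a simple grid search over the candidate powers. In this sketch, `predict_refraction` is a stand-in for any formula's predicted refraction at a given IOL power; the function name and the toy linear formula in the usage line are ours:

```python
def predicted_iol_power(y_actual, predict_refraction,
                        p_min=6.0, p_max=30.0, step=0.5):
    """Grid-search the IOL power whose predicted refraction is closest
    to the actual postoperative refraction (the argmin in the text)."""
    n_steps = int(round((p_max - p_min) / step))
    candidates = [p_min + k * step for k in range(n_steps + 1)]
    return min(candidates, key=lambda p: abs(y_actual - predict_refraction(p)))
```

For a toy formula `refraction(p) = 10 - 0.5 * p` and an actual refraction of 0 D, the search returns 20.0 D.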
 
Table 2.
 
A Summary of the Variables Used in This Study
As a comparison, the predicted postoperative refraction \({\hat{y}_i}\) is defined as: 
\begin{eqnarray*}{\hat{y}_i} = \hat{y}_i^{p = {p_i}}\end{eqnarray*}
 
Similar to the definition of refraction PE, we define the IOL power prediction error as follows:  
\begin{eqnarray*} && IOL\ power\ PE = \nonumber \\ && actual\ IOL\ power - predicted\ IOL\ power\end{eqnarray*}
 
The MAEPI is therefore defined as:  
\begin{eqnarray*}MAEPI = \frac{{\mathop \sum \nolimits_{i = 1}^n \left| {{p_i} - {{\hat{p}}_i}} \right|}}{n}\end{eqnarray*}
 
In addition to the MAEPI, we define the CIR as the proportion of predicted IOL powers \({\hat{p}_i}\) with a deviation within 0.0 D, ±0.5 D, or ±1.0 D of the implanted IOL power pi. Specific examples of calculating the MAEPI and CIR are shown in Supplementary Materials Appendix A. A lower MAEPI and a higher CIR mean better prediction performance.  
\begin{eqnarray*}CIR\left( 0 \right)\ = \ \frac{{\mathop \sum \nolimits_{i = 1}^n I\left( {\left| {{p_i} - {{\hat{p}}_i}} \right| = 0.0} \right)}}{n} \times \ 100\% \end{eqnarray*}
 
\begin{eqnarray*}CIR\left( {0.5} \right)\ = \ \frac{{\mathop \sum \nolimits_{i = 1}^n I\left( {\left| {{p_i} - {{\hat{p}}_i}} \right| \le 0.5} \right)}}{n} \times \ 100\% \end{eqnarray*}
 
\begin{eqnarray*}CIR\left( 1 \right)\ = \ \frac{{\mathop \sum \nolimits_{i = 1}^n I\left( {\left| {{p_i} - {{\hat{p}}_i}} \right| \le 1.0} \right)}}{n} \times \ 100\% \end{eqnarray*}
 
The above definitions of the MAEPI and CIR assume that the formula predicts the postoperative refraction as the response variable and uses the given IOL power as an explanatory variable. If a predictive model is instead designed to predict the IOL power as the response variable according to designated target refractions, \({\hat{p}_i}\) is simply the direct output of the model. The MAEPI and CIR can then be calculated with the same equations shown above. 
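The equations above translate directly into code. The sketch below is an illustrative implementation of the MAEPI and CIR definitions (function names are ours):

```python
def maepi(actual_powers, predicted_powers):
    """Mean absolute error of the predicted IOL power (MAEPI), in D."""
    n = len(actual_powers)
    return sum(abs(p - ph) for p, ph in zip(actual_powers, predicted_powers)) / n

def cir(actual_powers, predicted_powers, tol):
    """Correct IOL Rate: percentage of predicted powers within `tol` D
    of the implanted power (tol = 0.0, 0.5, or 1.0 in the text)."""
    n = len(actual_powers)
    hits = sum(abs(p - ph) <= tol + 1e-9        # small slack for float noise
               for p, ph in zip(actual_powers, predicted_powers))
    return 100.0 * hits / n
```

For example, with implanted powers [20.0, 21.0, 18.5, 22.0] and predictions [20.0, 21.5, 19.5, 24.0], the MAEPI is 0.875 D and CIR(0.5) is 50%.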
Simulation Analysis: MAEPI vs. Refraction MAE
As shown in Figure 1, for a given eye and a given IOL power prediction formula, we use e (\(e \in \mathbb{R}\)) to represent the refraction PE of the implanted IOL. We denote by d the increment of the prediction value when the IOL power is increased by 0.5 D. Because the spherical equivalent refraction of the eye should decrease with increasing power of the IOL, we assume here that d is always negative. 
Figure 1.
 
Schematics of the three simulation scenarios and of the case in which the IOL powers are continuous. Variable pi is the implanted IOL power for the ith eye among all cases. Variable \(\hat{y}_i^p\) is the predicted postoperative refraction corresponding to different IOL powers (p) for the ith eye. Variable d is the increment of the predicted refraction.
To characterize the behavior of the conventional metrics and the new metrics, we performed data simulations under different conditions and restrictions. Scenario (1): The IOL power and predicted refraction have a linear relationship, meaning that d < 0 is a constant for all patients. Scenario (2): The IOL power and predicted refraction have a nonlinear relationship, meaning that d < 0 is not a constant. Scenario (3): Predictions are random. 
In scenario (1), when d is a negative constant, it can be proven that the MAEPI and refraction MAE are always consistent, meaning that they always have a non-negative correlation. Consider a fictitious case A, wherein the PE of the implanted IOL is \({e_a}\) (\({e_a} \in \mathbb{R}\)). We can represent the predicted IOL power as (implanted IOL power + IOL power step × \({n_a}\)), where \({n_a} \in \mathbb{Z}\). The difference between the actual postoperative refraction and the predicted refraction for the predicted IOL power is then \(e_a^p = {e_a} - {n_a}d\). Similarly, for case B, the PE of the implanted IOL is \({e_b}\) (\({e_b} \in \mathbb{R}\)), and the difference between the actual refraction and the predicted refraction for the predicted IOL power is \(e_b^p = {e_b} - {n_b}d\), where \({n_b} \in \mathbb{Z}\). Based on the mathematical derivation in Supplementary Materials Appendix B, we see that \(|{e_a}| > |{e_b}|\) implies \(|{n_a}| \ge |{n_b}|\) for arbitrary eyes A and B, and thus the MAEPI and refraction MAE have a non-negative correlation under the aforementioned conditions. 
In scenario (2), the MAEPI and refraction MAE are not always consistent. The corresponding counterexamples are shown in Supplementary Materials Appendix C. In conventional vergence formulas such as the Haigis, SRK/T, Holladay 1, and Hoffer Q formulas, the postoperative refraction is represented as a reciprocal function of the IOL power: 
\begin{eqnarray*}refraction = A + \frac{B}{{C + D*IOL\ power}}\end{eqnarray*}
where A, B, C, and D are functions of preoperative biometry measurements. The first derivative of this function is 
\begin{eqnarray*}\frac{{d\left( {refraction} \right)}}{{d\left( {IOL\ power} \right)}} = - \frac{{DB}}{{{{\left( {C + D*IOL\ power} \right)}^2}}}\end{eqnarray*}
which is not a constant, because it changes with the IOL power. In the context of IOL power calculation, the first derivative is always negative, because the predicted postoperative refraction should always decrease with increasing IOL power. This fits the assumptions of scenario (2): d < 0 and d is not a constant. 
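A quick numeric check of this derivative, using made-up placeholder coefficients A through D (the real coefficients depend on preoperative biometry and are not reproduced here), confirms that the slope is negative and non-constant:

```python
def refraction(p, A=1.0, B=50.0, C=10.0, D=1.5):
    """Reciprocal refraction model from the text; A-D are illustrative
    placeholder coefficients, not outputs of any real formula."""
    return A + B / (C + D * p)

def d_refraction(p, A=1.0, B=50.0, C=10.0, D=1.5):
    """Analytic first derivative: -D*B / (C + D*p)**2."""
    return -D * B / (C + D * p) ** 2
```

A central finite difference on `refraction` matches `d_refraction`, and the slope at 6 D differs from the slope at 30 D while remaining negative at both.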
In scenario (3), the predictions did not depend on the input data, but were completely random. We generated the predictions for each case and each IOL power randomly based on a uniform distribution, without assuming the values of e and d. This scenario helps to demonstrate the general behavior of the MAEPI and refraction prediction MAE, with no assumptions on the characteristics of the IOL formula. 
In the above analysis we assumed a step of 0.5 D for the IOL powers. The general behavior of the IOL metrics is unaffected when the IOL power has a different increment step (see also Fig. 1A). To facilitate comprehension, we have provided examples in Supplementary Materials Appendix D of situations in which the new metrics are consistent or inconsistent with the refraction metrics, assuming continuous IOL powers. 
For the above-described three scenarios, we simulated the refraction predictions for different patients and IOL powers. The parameters used for the simulation are shown in Table 3. For scenarios (1) and (2), the value of e was randomly generated for each simulated patient. A fixed value of d was used across all cases and all IOL powers for scenario (1). A random value of d was selected for each simulated patient and each IOL power for scenario (2). Schematics of each scenario are illustrated in Figure 1A, and specific examples of simulated patients are shown in Supplementary Materials Appendix E. 
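A minimal sketch of the three scenarios follows; the parameter ranges are illustrative assumptions of ours, not the values of Table 3:

```python
import random

def simulate_case(scenario, powers=None, rng=random):
    """Simulate predicted refractions y_hat[p] for one eye under the
    three scenarios described in the text."""
    if powers is None:
        powers = [6.0 + 0.5 * k for k in range(49)]   # 6.0 .. 30.0 D
    e = rng.uniform(-1.0, 1.0)          # refraction PE of the implanted IOL
    preds = {}
    if scenario == 1:                   # constant negative slope d
        d = -0.35
        for k, p in enumerate(powers):
            preds[p] = e + k * d
    elif scenario == 2:                 # negative but varying slope
        cur = e
        for p in powers:
            preds[p] = cur
            cur += rng.uniform(-0.7, -0.1)
    else:                               # scenario 3: fully random predictions
        for p in powers:
            preds[p] = rng.uniform(-10.0, 10.0)
    return preds
```

Under scenarios (1) and (2) the simulated curve decreases monotonically with IOL power, as the text requires; under scenario (3) no such structure exists.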
Table 3.
 
The Simulation Parameters
Patient Data Analysis
In addition to the simulation analysis, we investigated the relationship between the MAEPI and refraction MAE using real patient data collected at University of Michigan's Kellogg Eye Center. We analyzed the performance of the following models on the dataset: (1) the baseline formula; (2) the overfitted formulas; and (3) the standard formulas. Details of the calculation of predicted IOL powers are included in Supplementary Materials Appendix F. 
To build these models, we randomly separated the patients in the dataset into a training set (80%, 4013 patients, 5890 eyes) and a testing set (20%, 1003 patients, 1003 eyes) (Fig. 1B). One random eye was kept for patients with both eyes available in the testing set. 
The baseline formula randomly samples from a normal distribution centered on the training dataset's mean postoperative refraction. It simulates a method with poor prediction accuracy that nevertheless generates plausible results, given the tendency to target postoperative refractions within a narrow range. Specifically, we generated the predictions from a normal distribution whose mean equaled the mean of the postoperative refractions in the training dataset and whose standard deviation was 0.01 (to simulate a tight grouping around the mean). 
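The baseline formula can be sketched in a few lines; the function name and seed handling are ours, while the distribution parameters (mean of the training refractions, SD 0.01) are those stated in the text:

```python
import random

def baseline_predictions(train_refractions, n, sd=0.01, seed=0):
    """'Baseline' formula: sample n predictions from a normal distribution
    centered on the training set's mean postoperative refraction, with a
    small SD to mimic the tight clustering of real refraction targets."""
    rng = random.Random(seed)
    mu = sum(train_refractions) / len(train_refractions)
    return [rng.gauss(mu, sd) for _ in range(n)]
```

Because every prediction lands near the training mean, such a model can look deceptively good on refraction metrics while carrying no information about the individual eye.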
Overfitting is a term in machine learning describing the situation in which an algorithm memorizes not only the underlying patterns but also the noise and biases present in the training dataset. With the overfitted formulas, we simulated prediction models that have appealing prediction accuracies for the implanted IOLs in the historical dataset but fail to make accurate predictions for new, unseen patients, because they cannot make accurate predictions for IOL powers that are not included in the historical dataset. Based on our experience, formulas developed solely or mostly on historical datasets are especially vulnerable to overfitting. We trained support vector machines (SVM) (implemented with scikit-learn 0.24.2)22 and XGBoost23 (an ML framework for gradient-boosted tree-based algorithms) on the training set. SVM and XGBoost are both state-of-the-art ML frameworks, commonly used for a wide variety of applications, including ophthalmology. The preoperative biometry measurements and patient information used as features were: the SN60WF IOL power (D), patient gender, patient age at surgery (years), eye laterality, axial length (mm), central corneal thickness (µm), aqueous depth (mm), anterior chamber depth (mm), lens thickness (mm), flat and steep keratometry (D), astigmatism (D), and white-to-white distance (mm). The hyperparameters of the machine learning models were optimized with fivefold cross-validation, minimizing the cross-validation MAE. The values of the hyperparameters are shown in Supplementary Materials Appendix G. 
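The fivefold cross-validation used for hyperparameter selection can be sketched generically. The fold construction and function names below are ours (the study itself used scikit-learn and XGBoost tooling); `fit` and `predict` are stand-ins for any model:

```python
def kfold_cv_mae(fit, predict, X, y, k=5):
    """Estimate a model's MAE by k-fold cross-validation (k = 5 in the
    text). fit(X, y) returns a model; predict(model, X) returns
    predictions; both are caller-supplied stand-ins."""
    n = len(X)
    fold_maes = []
    for f in range(k):
        test_idx = set(range(f, n, k))          # simple interleaved folds
        Xtr = [x for i, x in enumerate(X) if i not in test_idx]
        ytr = [v for i, v in enumerate(y) if i not in test_idx]
        Xte = [x for i, x in enumerate(X) if i in test_idx]
        yte = [v for i, v in enumerate(y) if i in test_idx]
        model = fit(Xtr, ytr)
        preds = predict(model, Xte)
        fold_maes.append(sum(abs(a - b) for a, b in zip(yte, preds)) / len(yte))
    return sum(fold_maes) / k
```

Hyperparameter selection then reduces to calling this for each candidate configuration and keeping the one with the lowest cross-validation MAE.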
For the standard formulas, we computed the refraction MAE, MedAE, SD, and FPI, as well as the MAEPI and CIRs, for five well-established IOL formulas: Barrett Universal II, Haigis, Hoffer Q, Holladay 1, and SRK/T, in addition to one recently published thick lens formula, PearlDGS, which uses linear regression to predict the theoretical internal lens position and estimates the refraction based on optics. A Friedman test followed by a post hoc Wilcoxon signed-rank test with Bonferroni correction was used to compare the differences in the MAEPI and refraction MAE. The predictions of Barrett Universal II and PearlDGS were retrieved from the online calculators.24,25 The Haigis, Hoffer Q, Holladay 1, and SRK/T formulas were implemented in Python based on their publications.26–33 The formula constants were optimized by zeroing out the mean prediction error in the training data (Table 4). We plotted and calculated the correlation between the IOL power prediction errors and the refraction prediction errors for the above-mentioned formulas. 
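Zeroing out the mean prediction error can be done, for example, by bisection on the lens constant. This is a generic sketch of ours (not the study's implementation) and assumes the mean PE is monotonic in the constant over the bracketing interval:

```python
def optimize_constant(mean_pe_for, lo, hi, tol=1e-6):
    """Find the lens constant at which the training-set mean prediction
    error is zero, by bisection. mean_pe_for(const) is a caller-supplied
    function that recomputes the mean PE for a candidate constant."""
    pe_lo, pe_hi = mean_pe_for(lo), mean_pe_for(hi)
    assert pe_lo * pe_hi <= 0, "bracket must straddle zero mean PE"
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean_pe_for(mid) * pe_lo <= 0:
            hi = mid                      # zero crossing is in [lo, mid]
        else:
            lo, pe_lo = mid, mean_pe_for(mid)
    return 0.5 * (lo + hi)
```

With a toy monotone relationship `mean PE = 118.9 - constant`, the search converges to 118.9 (a made-up value for illustration only).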
Table 4.
 
The Optimized Formula Constants
We generated partial dependence plots (PDP) for the above formulas to visualize the effect of the IOL powers on the predicted refractions. The PDP calculates the predicted refraction while altering the IOL power and keeping all the other features unchanged. The start of each predicted refraction curve is centered at zero for easier comparison. An average curve is computed by averaging the predicted refraction curves. Barrett Universal II and PearlDGS were excluded from this analysis because of technical difficulties in implementing it with their web-based calculators. 
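The per-eye PDP computation reduces to sweeping the IOL power while freezing the other features, then shifting the curve so that it starts at zero. The sketch below is ours; `predict` stands in for any formula:

```python
def partial_dependence_curve(predict, features, powers):
    """Partial-dependence sketch: vary only the IOL power while holding
    the other features fixed, then center the curve's start at zero, as
    described in the text. predict(features, p) is a stand-in."""
    curve = [predict(features, p) for p in powers]
    return [v - curve[0] for v in curve]
```

For a toy formula `5.0 - 0.2 * p`, the centered curve over powers 6.0, 6.5, 7.0 D is approximately [0, -0.1, -0.2], the smooth downward trend expected of an optics-based formula.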
To demonstrate that the step of the IOL powers will not influence the conclusions, we computed the IOL metrics assuming the IOL power step = 1.0 D and 0.01 D. The former represents a step larger than the common 0.5 D step, and the latter resembles a continuous variable. The Barrett Universal II and PearlDGS calculators do not allow altering the IOL power step; therefore they were excluded from this analysis. 
All statistical analyses were performed with Python 3.9.5. The criterion for statistical significance was P value < 0.05. 
Results
Simulation Analysis Results
We simulated three main situations to compare the properties of the IOL prediction errors and the refraction prediction errors. The corresponding results are shown in Figure 2. Consistent with the theoretical derivation in the Methods section above, Figure 2A shows no overlap between the horizontal error intervals, which implies that when a case A has a lower absolute IOL power PE than a case B, the refraction absolute error of case A can never be higher than that of case B, and vice versa. By contrast, Figures 2B and 2C show instances where a case A had a lower IOL absolute error than a case B, but the refraction absolute error of case A was higher than that of case B, meaning that the refraction accuracy and IOL accuracy suggest contradictory conclusions about cases A and B. Increasing the number of simulated cases beyond 500 or altering the value/range of e, d, or the refraction prediction in Table 3 did not change the general characteristics of the simulation results (results not shown). 
Figure 2.
 
Scatter plots of simulation results under three conditions. (A) Scatter plot of refraction absolute PE versus IOL power absolute PE when d < 0 and d is a constant. (B) Scatter plot of refraction absolute PE versus IOL power absolute PE when d < 0 and d is not a constant. (C) Scatter plot of refraction absolute PE versus IOL power absolute PE when the predictions are random. Each dot in the scatter plot represents a simulated case. Variable d is the signed increment of the refraction prediction when the IOL power increases by 0.5 D.
Patient Data Analysis Results
To investigate the behavior of the MAEPI and CIR in clinical data, we utilized the aforementioned dataset of 5016 patients (6893 eyes) at the Kellogg Eye Center and compared the performance of different methods using a testing subset of 1003 patients (1003 eyes). A summary of the dataset is shown in Table 5. The distributions of IOL powers and postoperative refractions in the training and testing datasets are shown in Figure 3. Distributions of other measurements are shown in Supplementary Figure S1. The prediction results of the different methods are shown in Table 6. 
Table 5.
 
The Distribution of the Dataset
Figure 3.
 
Distribution of the IOL power and postoperative refraction in the training and testing datasets. The number of bins for the bar plots was set to 30. The curve in each plot represents a Gaussian kernel density estimate of the distribution.
Table 6.
 
Performance of Individual Methods in the Testing Set
The formula that provided predictions by sampling from the aforementioned normal distribution centered on the training dataset mean yielded a high MAEPI (13.719 D) and extremely low IOL accuracies. By contrast, the refraction metrics were closer to the normal range: the MAE was 0.630 D; the MedAE (0.394 D) was less than 0.5 D; the SD (0.935 D) was less than 1.0 D; the percentage of errors within ±0.5 D was 55.6%; and the FPI (0.163) was higher than those of Hoffer Q, Holladay 1, and SRK/T. 
The overfitted formulas (SVM and XGBoost) produced appealing refraction prediction performance but poor IOL power prediction performance. The refraction MAE (0.329 D) and SD (0.455 D) of the SVM method were numerically close to those of the Barrett Universal II formula (0.328 D and 0.437 D). However, the MAEPI (0.562 D) of the SVM method was worse than that of SRK/T (0.545 D). 
The Pearson correlation coefficients between the IOL prediction errors and the refraction prediction errors are shown in Table 7. Scatter plots of each eye's refraction absolute error against its IOL absolute error for each method are shown in Figure 4. The "Random" formula and the "Overfitted" formulas had lower correlation coefficients than the existing formulas. To demonstrate that the same conclusions still hold when the step of IOL powers does not equal 0.5 D, we calculated the IOL power metrics assuming IOL power steps of 1 D and 0.01 D (Supplementary Table S1). 
Table 7.
 
The Pearson Correlation Coefficient and P Value Between the IOL Power Prediction Error and the Refraction Prediction Error
Figure 4.
 
The scatter plots of the IOL power PE and the refraction PE for each method.
Analysis of Overfitted Formulas
The PDPs of the "Random" formula, the XGBoost method, and the Holladay 1 formula are shown in Figure 5. Partial dependence plots of the other methods are shown in Supplementary Figure S2. These plots demonstrate how the predicted refractions vary with different IOL powers. Each curve was centered at zero at its start. The PDPs of Holladay 1 (Fig. 5) and the other optics-based formulas (Supplementary Fig. S2) have an overall smooth, downward trend. XGBoost's PDP, however, has a markedly different appearance, suggesting that its relationship between IOL powers and predicted refractions differs substantially from that of the optics-based formulas. We selected the Holladay 1 formula for the comparison with the XGBoost method because the refraction ME, MAE, and SD of Holladay 1 were similar to those of the XGBoost method, and yet the MAEPIs were substantially different (see Table 6 and Fig. 5). The Wilcoxon test P value for the refraction MAEs of Holladay 1 and XGBoost was not significant (P > 0.05) (complete results shown in Supplementary Table S2). The Wilcoxon test P value for the MAEPIs was significant (P < 0.05) (complete results shown in Supplementary Table S3). 
Figure 5.
 
The PDP for IOL power. IOL powers range from 6 D to 30 D. The blue lines depict how the predicted refractions change with the IOL powers for individual patients. To facilitate comparisons and visualization, the heads of blue lines are centered at zero. The thick black line with yellow highlight represents the average of all blue lines. The refraction prediction MAE (“Ref. MAE”), the standard deviation of the refraction prediction error (“Ref. SD”), and the MAEPI of the corresponding formulas are marked on the plots for the convenience of comparison (they are also listed in Table 6).
Prediction results of XGBoost and Holladay 1 for a specific patient in the testing dataset are shown in Table 8. The actual implanted IOL power for this patient was 20.5 D, and the actual postoperative refraction was −0.0346 D. The refraction PE for XGBoost was (−0.0346) − (−0.312) = 0.2774 D, and for Holladay 1 it was (−0.0346) − (−0.347) = 0.3124 D. On the other hand, the IOL power PE (actual IOL power − predicted IOL power) was 20.5 − 19.0 = 1.5 D for XGBoost and 20.5 − 20.0 = 0.5 D for Holladay 1. 
Table 8.
 
Prediction Results of One Patient From the Testing Set
Discussion
In this study, we identified potential problems with the traditional metrics for IOL formulas. These metrics focus on the deviation of a formula's predicted refraction for the implanted IOL power from the true postoperative refraction. We have presented here two new metrics, the MAEPI and CIR, that are not susceptible to biases related to the clustering of real-world refraction targets in historical data. 
Based on the simulation analysis results, the IOL metrics and the refraction metrics are guaranteed to be consistent only under special circumstances; as such, the IOL metrics (MAEPI and CIR) should be considered essential for the assessment and optimization of IOL formula performance. As shown in the patient data analysis results, the conventional metrics can generate misleading information when applied to ML prediction results. We believe this issue is the result of the following factors: 
  • (1) The basic mechanism of ML models is to learn patterns from historical data, which can make them vulnerable to noise and biases in the training data and thus prone to overfitting the historical data if not properly trained. We trained machine learning models with the SVM algorithm and the XGBoost framework and showed that both can be overfitted. It is possible to mitigate overfitting through machine learning techniques such as data augmentation and resampling, or by integrating theoretical components into the ML model. Although of potential interest, a review of these methods is beyond the scope of this work.
  • (2) The IOL powers in the dataset were chosen through clinical decision making, and the postoperative refractions were influenced by the IOL powers. Both quantities have a unimodal distribution (Fig. 3), meaning that the distribution has one clear peak. For this reason, a calculator that simply predicts random values around the mean of the historical data achieved a reasonable MAE and SD of refraction prediction errors (Table 6, “Random” method). This way of fooling the refraction metrics can easily be “learned” by machine learning algorithms. As mentioned before, a model can overfit to information that is meaningless or specific to the dataset. In the case of IOL prediction, the postoperative refractions are not randomly drawn from a distribution but are influenced by the IOL power, which is manually selected by a surgeon based on discussions of the refractive target with the patient (typically between −3.0 and 0.0 D). The fact that the postoperative refraction in historical datasets follows a particular distribution can be misleading for ML models: in practice, ML algorithms tend to exploit such properties during training and inadvertently produce models that do not represent the true system underlying the observations. This is a generalization issue caused by the mismatch between historical datasets and unseen patients.
  • (3) When evaluating formulas on a historical dataset, the conventional metrics assume that the implanted IOL power corresponding to each postoperative refraction is known, which is not the case in real clinical settings. In contrast, the IOL metrics make no assumption about which IOL power was implanted. Suppose the patient in Table 8 were an incoming patient in a clinical trial comparing XGBoost and Holladay 1, with a target refraction of −0.0346 D: the surgeon using XGBoost would end up picking a 19.0 D lens, whereas the surgeon using Holladay 1 would pick a 20.0 D lens. Clinical trial results would therefore likely agree with the MAEPI rather than with the refraction MAE calculated from Table 8. The IOL metrics are thus more intuitive and better represent the real-life performance of IOL formulas built with modern empirical methods such as ML.
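The fooling effect described in factor (2) can be reproduced with a few lines of synthetic data. The distribution parameters below are illustrative, not the paper's; the point is only that a calculator with zero patient-specific information achieves a deceptively competitive refraction MAE when outcomes cluster around a common value.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic unimodal postoperative refractions clustered near a common target.
true_refractions = rng.normal(loc=-0.3, scale=0.4, size=5000)

# A "Random"-style calculator that ignores the patient entirely and
# guesses values near the historical mean.
random_preds = rng.normal(loc=true_refractions.mean(), scale=0.1, size=5000)

errors = true_refractions - random_preds
mae = np.mean(np.abs(errors))  # lands near the data's own mean absolute
sd = np.std(errors)            # deviation (~0.33 D here), which looks
                               # plausible despite being uninformative
```

An MAE of roughly 0.33 D from a model that never looks at the patient illustrates why refraction-based metrics alone cannot certify an ML formula.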
As mentioned in the Introduction, previous research has shown that standard metrics are not adequate for ML with imbalanced data, in both regression and classification.6,7,34 However, the existing evaluation frameworks for imbalanced regression do not provide an immediate solution for evaluating the performance of IOL formulas. The regression error characteristic curve, which plots the absolute deviation tolerance against accuracy, can be viewed as an expanded version of the proportions of patients with errors within different intervals.11 It provides a description of the cumulative distribution function of the errors; however, its limitations have also been recognized in previous research.12 The standard deviation of the prediction error was recently put forth as the “single best parameter” for characterizing the performance of an IOL formula.4 However, that article investigated only geometric optics-based formulas that have been tested extensively over the years. From the point of view of a surgeon, a critical question is whether reported evaluation metrics can predict the real-life performance of a new formula. In this study, we have demonstrated that the SD, as well as other traditional metrics such as the MAE, MedAE, and the percentage of patients in different error intervals, can easily be fooled by ML methods (Table 6). The Random, SVM, and XGBoost models achieved abnormally high FPI values (Table 6), which implies that the FPI is not a reliable evaluation metric. 
Well-performing IOL formulas should have a low MAEPI as well as a low refraction MAE and SD. As previously shown, SVM's MAEPI was worse than SRK/T's, yet its refraction MAE was similar to Barrett's. Discrepancies between the IOL power-based metrics and the refraction-based metrics indicate a model's limitations in prediction. When comparing existing formulas, it is common for one formula to outperform another on multiple metrics. However, this does not eliminate the need to take each of these metrics into consideration, because such results are not guaranteed to generalize to all relevant metrics. We believe the traditional metrics and the new IOL power metrics describe complementary, important aspects of a formula's performance, and we therefore recommend using both. 
In this study, we assumed that the IOL formula predicts the postoperative refraction as a function of the IOL power. If, instead, the formula is formulated as predicting the appropriate IOL power for a specified refraction, then the MAEPI can simply be calculated as the mean absolute prediction error using the standard equation. However, the imbalance in the data persists, so an ML model will tend to predict common IOL powers (e.g., near 20 D) more frequently, regardless of the specified refraction. For example, a model that outputs random IOL powers in a narrow interval around 20 D can achieve a low MAEPI, similar to the results we obtained with the Random model in Table 6. We therefore recommend that in this scenario, both the IOL power-based metrics and the refraction-based metrics still be calculated to obtain a more accurate assessment of the formula's performance. 
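The same fooling effect in the direct IOL-prediction formulation can be demonstrated with synthetic data; the distribution parameters here are illustrative and are not taken from the paper's dataset.

```python
import numpy as np

rng = np.random.default_rng(1)

def round_half(x):
    """Snap values to the 0.5 D steps in which IOL powers are manufactured."""
    return np.round(x * 2) / 2

# Implanted IOL powers in historical data cluster near 20 D.
implanted = round_half(rng.normal(20.0, 2.0, size=5000))

# A model that ignores its refraction input and outputs powers near 20 D.
guesses = round_half(rng.normal(20.0, 0.5, size=5000))

naive_maepi = np.mean(np.abs(implanted - guesses))
# naive_maepi stays bounded (~1.6 D here) even though the model carries
# no patient-specific information, so the refraction-based metrics must
# still be checked alongside it.
```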
In our dataset, the spherical and cylindrical components had a step of 0.25 D, and the IOL powers had a step of 0.5 D. The presented analysis did not assume a fixed step for the measured refraction; therefore, if the refraction had a different step or were continuous, the results and conclusions would be unchanged. We have also demonstrated in the Methods and Results sections that the conclusions drawn do not depend on the step of the IOL powers. 
Despite the benefits of the IOL accuracy metrics, we have identified the following limitations: (1) Calculating the IOL accuracy metrics requires computing multiple refraction predictions around the true postoperative refraction. It may also require programming knowledge and consume more computing resources than the traditional metrics, because predictions for multiple IOL powers must be computed for each patient to determine the predicted IOL power. This issue can be especially prominent for machine learning models, because the scoring metrics may need to be computed repeatedly during model selection. (2) The refraction MAE can be used directly as the loss function for optimization algorithms in machine learning; the MAEPI, in contrast, cannot. 
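The per-patient grid search described in limitation (1) can be sketched as follows. Consistent with the Abstract, the sketch assumes MAEPI is the mean absolute difference between the predicted and actually implanted IOL powers, and CIR the fraction of patients for whom the two coincide; `predict_refraction` and the toy formula are hypothetical placeholders.

```python
import numpy as np

IOL_GRID = np.arange(6.0, 30.5, 0.5)  # available powers, 0.5 D steps

def predicted_iol_power(predict_refraction, patient, target_refraction):
    """Sweep the available IOL powers and return the one whose predicted
    refraction is closest to the target (the true postoperative refraction)."""
    preds = np.array([predict_refraction(patient, p) for p in IOL_GRID])
    return IOL_GRID[np.argmin(np.abs(preds - target_refraction))]

def maepi_and_cir(predict_refraction, patients, implanted, true_refractions):
    predicted = np.array([
        predicted_iol_power(predict_refraction, pt, y)
        for pt, y in zip(patients, true_refractions)
    ])
    implanted = np.asarray(implanted)
    maepi = np.mean(np.abs(implanted - predicted))
    cir = np.mean(predicted == implanted)
    return maepi, cir

# Sanity check: a formula that is exactly invertible on the grid should
# recover every implanted power, giving MAEPI = 0 and CIR = 1.
toy = lambda patient, p: patient - 0.7 * p
patients = [14.0, 14.35, 14.7]
implanted = [20.0, 20.5, 21.0]
true_ref = [toy(pt, p) for pt, p in zip(patients, implanted)]
maepi, cir = maepi_and_cir(toy, patients, implanted, true_ref)
```

The nested sweep makes the cost explicit: one formula evaluation per grid point per patient, repeated at every scoring step during model selection.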
In sum, we have demonstrated the potential pitfalls of using traditional metrics, including the refraction MAE and SD, for the performance evaluation of IOL formulas. The purpose of this article is to elucidate the erroneous conclusions that traditional metrics can produce when applied to ML-based methods. We recommend using the newly proposed IOL metrics, MAEPI and CIR, in addition to the conventional metrics when evaluating the performance of an IOL formula, especially for purely or largely ML-based formulas. 
Acknowledgments
Supported by The Lighthouse Guild, New York, NY (J.D.S.); National Eye Institute, Bethesda, MD, 1R01EY026641-01A1 (J.D.S.), NIH K12EY022299 (N.N.). 
Disclosure: T. Li, None; J.D. Stein, None; N. Nallasamy, None 
References
Wang L, Koch DD, Hill W, Abulafia A. Pursuing perfection in intraocular lens calculations: III. Criteria for analyzing outcomes. J Cataract Refract Surg. 2017; 43: 999–1002. [CrossRef] [PubMed]
Hoffer KJ, Savini G. Update on intraocular lens power calculation study protocols: The better way to design and report clinical trials. Ophthalmology. 2021; 128(11): e115–e120. [CrossRef] [PubMed]
Hoffer KJ, Aramberri J, Haigis W, et al. Protocols for studies of intraocular lens formula accuracy. Am J Ophthalmol. 2015; 160: 403–405.e1. [CrossRef] [PubMed]
Holladay JT, Wilcox RR, Koch DD, Wang L. Review and recommendations for univariate statistical analysis of spherical equivalent prediction error for IOL power calculations. J Cataract Refract Surg. 2021; 47: 65–77. [CrossRef] [PubMed]
Gatinel D, Debellemanière G, Saad A, Rampat R. Theoretical relationship among effective lens position, predicted refraction, and corneal and intraocular lens power in a pseudophakic eye model. Transl Vis Sci Technol. 2022; 11: 5. [CrossRef] [PubMed]
Japkowicz N. Assessment metrics for imbalanced learning. In: He H, Ma Y, eds. Imbalanced Learning: Foundations, Algorithms, and Applications. Hoboken, NJ: Wiley; 2013: 187–206.
Moniz N, Branco P, Torgo L. Evaluation of ensemble methods in imbalanced regression tasks. In: First International Workshop on Learning with Imbalanced Domains: Theory and Applications. PMLR; 2017: 129–140.
Christoffersen PF, Diebold FX. Optimal prediction under asymmetric loss. Econ Theory. 1997; 13: 808–817. [CrossRef]
Torgo L, Ribeiro R. Utility-based regression. In: European conference on principles of data mining and knowledge discovery. Springer; 2007: 597–604.
Almeida Ribeiro RP. Utility-based regression. Porto, Portugal: University of Porto. 2011. Dissertation.
Bi J, Bennett KP. Regression error characteristic curves. In: Proceedings of the 20th international conference on machine learning (ICML-03). 2003: 43–50.
Torgo L. Regression error characteristic surfaces. In: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining. 2005: 697–702.
Hernández-Orallo J. ROC curves for regression. Pattern Recognit. 2013; 46: 3395–3411. [CrossRef]
Rosset S, Perlich C, Zadrozny B. Ranking-based evaluation of regression models. Knowl Inf Syst. 2007; 12: 331–353. [CrossRef]
Li T, Stein J, Nallasamy N. Evaluation of the Nallasamy formula: A stacking ensemble machine learning method for refraction prediction in cataract surgery [published online ahead of print April 4, 2022]. Br J Ophthalmol. https://doi.org/10.1136/bjophthalmol-2021-320599.
Stein JD, Rahman M, Andrews C, et al. Evaluation of an algorithm for identifying ocular conditions in electronic health record data. JAMA Ophthalmol. 2019; 137: 491–497. [CrossRef] [PubMed]
Bommakanti NK, Zhou Y, Ehrlich JR, et al. Application of the sight outcomes research collaborative ophthalmology data repository for triaging patients with glaucoma and clinic appointments during pandemics such as COVID-19. JAMA Ophthalmol. 2020; 138: 974–980. [CrossRef] [PubMed]
Li T, Stein J, Nallasamy N. AI-powered effective lens position prediction improves the accuracy of existing lens formulas. Br J Ophthalmol. 2022; 106: 1222–1226. [CrossRef] [PubMed]
Li T, Yang K, Stein JD, Nallasamy N. Gradient boosting decision tree algorithm for the prediction of postoperative intraocular lens position in cataract surgery. Transl Vis Sci Technol. 2020; 9: 38. [CrossRef] [PubMed]
Zhang Y, Li T, Reddy A, Nallasamy N. Gender differences in refraction prediction error of five formulas for cataract surgery. BMC Ophthalmol. 2021; 21: 183. [CrossRef] [PubMed]
Simpson MJ, Charman WN. The effect of testing distance on intraocular lens power calculation. J Refract Surg. 2014; 30: 726. [CrossRef] [PubMed]
Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: Machine learning in Python. J Mach Learn Res. 2011; 12: 2825–2830.
Chen T, Guestrin C. XGBoost: A scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: Association for Computing Machinery; 2016: 785–794.
Barrett Universal II Formula V1.05. Available at: https://calc.apacrs.org/barrett_universal2105/ . Accessed August 31, 2021.
Solver. Available at: https://iolsolver.com/ . Accessed January 29, 2022.
Retzlaff JA, Sanders DR, Kraff MC. Development of the SRK/T intraocular lens implant power calculation formula. J Cataract Refract Surg. 1990; 16: 333–340. [CrossRef] [PubMed]
Hoffer KJ. The Hoffer Q formula: A comparison of theoretic and regression formulas. J Cataract Refract Surg. 1993; 19: 700–712. [CrossRef] [PubMed]
Haigis W, Lege B, Miller N, Schneider B. Comparison of immersion ultrasound biometry and partial coherence interferometry for intraocular lens calculation according to Haigis. Graefe's Arch Clin Exp Ophthalmol. 2000; 238: 765–773. [CrossRef]
Holladay JT, Musgrove KH, Prager TC, et al. A three-part system for refining intraocular lens power calculations. J Cataract Refract Surg. 1988; 14: 17–24. [CrossRef] [PubMed]
Anon. Correction. J Cataract Refract Surg. 1994; 20: 677. [CrossRef]
Anon. Erratum. J Cataract Refract Surg. 1990; 16: 528. [PubMed]
Zuberbuhler B, Morrell AJ. Errata in printed Hoffer Q formula. J Cataract Refract Surg. 2007; 33: 2. [CrossRef] [PubMed]
Hoffer KJ. Reply: Errata in printed Hoffer Q formula. J Cataract Refract Surg. 2007; 33: 2–3. [CrossRef] [PubMed]
Lemnaru C, Potolea R. Imbalanced classification problems: Systematic study, issues and best practices. In: International Conference on Enterprise Information Systems. Berlin: Springer; 2011: 35–50.
Figure 1.
 
Schematics of the three simulation scenarios and of the case in which the IOL powers are continuous. Variable pi is the implanted IOL power for the ith eye among all cases. Variable \(\hat{y}_i^p\) is the predicted postoperative refraction corresponding to IOL power p for the ith eye. Variable d is the increment of the predicted refraction.
Figure 2.
 
Scatter plots of simulation results under three conditions. (A) Scatter plot of refraction absolute PE versus IOL power absolute PE when d < 0 and d is a constant. (B) Scatter plot of refraction absolute PE versus IOL power absolute PE when d < 0 and d is not a constant. (C) Scatter plot of refraction absolute PE versus IOL power absolute PE when the predictions are random. Each dot represents a simulated case. Variable d is the signed increment of the refraction prediction when the IOL power increases by 0.5 D.
Figure 3.
 
Distribution of the IOL power and postoperative refraction in the training and testing datasets. The number of bins for the bar plots was set to 30. The curve in each plot represents a Gaussian kernel density estimate of the distribution.
Table 1.
 
Metrics for Imbalanced Regression
Table 2.
 
A Summary of the Variables Used in This Study
Table 3.
 
The Simulation Parameters
Table 4.
 
The Optimized Formula Constants
Table 5.
 
The Distribution of the Dataset
Table 6.
 
Performance of Individual Methods in the Testing Set
Table 7.
 
The Pearson Correlation Coefficient and P Value Between the IOL Power Prediction Error and the Refraction Prediction Error