June 2024
Volume 13, Issue 6
Open Access
Artificial Intelligence  |   June 2024
Predicting Glaucoma Surgical Outcomes Using Neural Networks and Machine Learning on Electronic Health Records
Author Affiliations & Notes
  • Samuel Barry
    Department of Management Science & Engineering, Stanford University, Stanford, CA, USA
  • Sophia Y. Wang
    Byers Eye Institute, Department of Ophthalmology, Stanford University, Stanford, CA, USA
  • Correspondence: Sophia Y. Wang, 2370 Watson Court, Palo Alto, CA 94303, USA. e-mail: sywang@stanford.edu 
Translational Vision Science & Technology June 2024, Vol.13, 15. doi:https://doi.org/10.1167/tvst.13.6.15

      Samuel Barry, Sophia Y. Wang; Predicting Glaucoma Surgical Outcomes Using Neural Networks and Machine Learning on Electronic Health Records. Trans. Vis. Sci. Tech. 2024;13(6):15. https://doi.org/10.1167/tvst.13.6.15.

Abstract

Purpose: To develop machine learning (ML) and deep learning (DL) models to predict glaucoma surgical outcomes, including postoperative intraocular pressure, use of ocular antihypertensive medications, and need for repeat surgery.

Methods: We identified glaucoma surgeries performed at Stanford from 2013–2024, with two or more postoperative visits with intraocular pressure (IOP) measurement. Patient features were identified from the electronic health record (EHR), including demographics, prior diagnosis and procedure codes, medications and eye exam findings. Classical ML and DL models were developed to predict which glaucoma surgeries would result in surgical failure, defined as (1) IOP not reduced by more than 20% of preoperative baseline on two consecutive postoperative visits, (2) increased classes of glaucoma medications, and (3) need for additional glaucoma surgery or revision of original surgery.

Results: A total of 2398 glaucoma surgeries in 1571 patients were included, of which 1677 surgeries met failure criteria. Random forest performed best for prediction of overall surgical failure, with accuracy of 75.5% and area under the receiver operating characteristic curve (AUROC) of 76.7%, similar to the deep learning model (accuracy 75.5%, AUROC 76.6%). Across all models, prediction performance was better for IOP outcomes (AUROC 86%) than need for an additional surgery (AUROC 76%) or need for additional glaucoma medication (AUROC 70%).

Conclusions: ML and DL algorithms can predict glaucoma surgery outcomes using structured data inputs from EHRs.

Translational Relevance: Models that predict outcomes of glaucoma surgery may one day provide the basis for clinical decision support tools supporting surgeons in personalizing glaucoma treatment plans.

Introduction
Glaucoma is the leading cause of irreversible blindness worldwide, and its prevalence is estimated to grow by more than 50% between 2020 and 2040, primarily because of the aging of the population.1 Patients facing glaucoma surgery are those with the most severe disease, who likely have already lost vision and are likely to lose more without intervention. However, the outcome of glaucoma surgery can be variable and unpredictable, with some patients experiencing surgical “failure” at early time points and others maintaining good disease control for lengthy periods of time.2 Although a number of predictors of glaucoma surgical failure or success have been demonstrated, most previous work has considered relatively few patient features.2 Because every patient and their disease comprise a unique and complex combination of many clinical factors, it can be difficult for clinicians to precisely predict their disease course and postsurgical outcomes. 
Previous work using machine learning and deep learning techniques on electronic health records (EHRs) has shown promise in predicting various ophthalmic outcomes, such as the likelihood of developing diabetic retinopathy,3 the complication rate of cataract surgeries,4 or the probability that glaucoma patients will progress to requiring surgery.5 One prior study explored various prediction model architectures to predict the success or failure of trabeculectomy surgery at one year based on postoperative IOP control in a relatively small population of 200 patients.6 A more recent article also explored the use of free-text operative notes and structured EHR data from the preoperative and early postoperative period to predict the IOP outcomes of trabeculectomy surgeries on a cohort of 1326 patients.7 
The purpose of this study was to develop and evaluate machine learning algorithms to predict the outcomes of glaucoma surgery, including trabeculectomy, tube shunts, minimally invasive glaucoma surgeries (MIGS), and cyclodestructive procedures. Because success and failure of glaucoma surgery is typically multifactorial, we developed models that predict an overall surgical failure based on composite criteria including IOP control, medication usage, and need for repeat glaucoma surgery. We also developed models that predict individual failure criteria, thus providing maximum flexibility in model use and interpretation. Ultimately, such models could be used as the basis of clinical decision support aids which could use preoperative clinical data to predict the probability of surgical success of different glaucoma surgical approaches using success criteria that the surgeon deems most relevant, thus enabling the surgeon to tailor their surgical approaches to achieve the best outcome. 
Methods
Data Source and Cohort
We identified adult patients (>18 years old at the first encounter) from STARR (the Stanford Clinical Warehouse)8 who underwent glaucoma surgery from 2013 to 2024, including trabeculectomy and ExPress shunts (Current Procedural Terminology codes 66170, 66172, 66160, 66183), tube shunts (66179, 66180), minimally invasive glaucoma surgery (MIGS, 0191T, 0192T, 66989, 66991, 0253T, 0474T, 0376T, 66174, 66175, 65820, 65850) and cyclophotocoagulation procedures (66710, 66711, 66720, 66740, 66987, 66988). Surgical patients must have had at least two postoperative visits with IOP measurement in the operative eye. Our cohort included 1571 patients who underwent a total of 2398 surgeries. This study was approved by the Stanford University Institutional Review Board and adhered to the tenets of the Declaration of Helsinki. 
Outcome Definition/Prediction Target
Surgeries were considered to be successful when postoperative intraocular pressure (IOP) was reduced by more than 20% compared to preoperative baseline, without an increase in the number of glaucoma medications and without further glaucoma surgery. If any of the following individual failure criteria were met, the surgery was considered unsuccessful: (1) IOP failure: IOP above 80% of the preoperative IOP on two consecutive postoperative visits subsequent to the initial three months of the postoperative period; (2) medication failure: patients were on more categories of glaucoma medications postoperatively than preoperatively, with categories including topical carbonic anhydrase inhibitors, beta blockers, alpha agonists, prostaglandins, miotics, oral carbonic anhydrase inhibitors, and rho kinase inhibitors; (3) follow-up glaucoma surgery failure: the patient underwent a second glaucoma surgery or a revision of the original surgery in the same eye within three months after the original surgery. In addition, we also developed models that implemented several alternative thresholds for IOP failure in accordance with the World Glaucoma Association's Guidelines for the Design and Reporting of Glaucoma Trials9: IOP > 12, 15, 18 or 21 mm Hg at two consecutive postoperative visits and IOP above 80% of preoperative IOP at two consecutive postoperative visits. 
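The composite outcome described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the study's code: the `Surgery` record and its field names are hypothetical, and the sketch assumes that `postop_iops` already contains only the visits after the initial three postoperative months, in chronological order.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Surgery:
    # All field names are hypothetical, chosen for illustration.
    preop_iop: float            # preoperative baseline IOP (mm Hg)
    postop_iops: List[float]    # IOP at visits after the first 3 months, in time order
    preop_med_classes: int      # number of glaucoma medication classes before surgery
    postop_med_classes: int     # number of glaucoma medication classes after surgery
    reoperation: bool           # repeat glaucoma surgery or revision in the same eye


def iop_failure(s: Surgery, ratio: float = 0.8) -> bool:
    """True if IOP is above ratio * baseline at two consecutive visits."""
    limit = ratio * s.preop_iop
    above = [iop > limit for iop in s.postop_iops]
    return any(a and b for a, b in zip(above, above[1:]))


def medication_failure(s: Surgery) -> bool:
    """True if more medication classes are used after surgery than before."""
    return s.postop_med_classes > s.preop_med_classes


def overall_failure(s: Surgery) -> bool:
    """Composite failure: any one of the three individual criteria is met."""
    return iop_failure(s) or medication_failure(s) or s.reoperation
```

For example, a surgery with a baseline IOP of 20 mm Hg and postoperative readings of 15, 17, 18, and 14 mm Hg would meet the IOP failure criterion in this sketch, because two consecutive readings exceed 16 mm Hg.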
Feature Engineering
Input features were extracted from the EHR and included demographic information, past ocular surgeries, diagnoses, medications, social history (health-related behaviors), and ophthalmology-specific clinical exam findings. Categorical features were transformed to dummy encoded features (0 or 1) and continuous numerical variables were standardized to have a mean of 0 and a variance of 1. All feature values were collected at baseline, from the preoperative period. 
Categorical features included surgery CPT code, race, ethnicity, gender, prior encounter diagnoses (International Classification of Disease [ICD] codes), preoperative medications, number and type of prior glaucoma surgeries, concurrent cataract surgery (CPT code 66982 or 66984 on the same day as glaucoma surgery), tobacco/alcohol/drug use, contact/glasses use, prior selective laser trabeculoplasty, and prior laser peripheral iridotomy. Medications included all prescriptions in the five years before the operation based on EHR medication records. Medications were mapped to their generic name and included ophthalmic medications and systemic medications. Medication features were represented as Boolean variables, with 1 indicating that the patient was prescribed this medication before surgery and 0 otherwise. Variance elimination was performed to keep the 100 medication features with the highest variance each for ophthalmic medications and for systemic medications. ICD codes were aggregated to the level of two digits after the decimal point (e.g., H40.1212 became H40.12). Each ICD feature was also represented as a Boolean variable, 1 if the patient had a preoperative encounter associated with this diagnosis, 0 otherwise. Variance elimination was performed to include the 100 diagnosis code variables with the highest variance. 
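The diagnosis-code processing described above (truncation, Boolean encoding, and selection by variance) might look roughly like the following sketch. The function names are hypothetical, and the sketch relies on the fact that a Boolean feature with prevalence p has variance p(1 - p); ties are broken alphabetically, which is an arbitrary choice made here.

```python
from collections import Counter


def truncate_icd(code, decimals=2):
    """Keep at most `decimals` characters after the decimal point."""
    head, sep, tail = code.partition(".")
    return head + sep + tail[:decimals] if sep else code


def top_variance_codes(code_sets, k=100):
    """Return the k codes whose Boolean indicator has the highest variance.

    code_sets: one set of (truncated) codes per surgery.
    """
    n = len(code_sets)
    counts = Counter(code for codes in code_sets for code in codes)
    variance = {c: (m / n) * (1 - m / n) for c, m in counts.items()}
    # Sort by descending variance, then alphabetically for a stable order.
    return sorted(variance, key=lambda c: (-variance[c], c))[:k]


def encode(codes, vocabulary):
    """Boolean vector: 1 if the code was recorded preoperatively, else 0."""
    return [1 if c in codes else 0 for c in vocabulary]
```

Selecting by variance favors codes whose prevalence is close to 50% and discards codes that are nearly always absent or nearly always present, since these carry little information for separating outcomes.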
Continuous variables included age at the time of surgery, latest preoperative value of eye exam findings for the surgical eye (IOP, best recorded visual acuity [VA], central corneal thickness, refraction spherical equivalent), and number of prior ophthalmic surgeries. VA was converted to the logarithm of the minimum angle of resolution (logMAR). Other continuous variables were standardized to mean 0 and standard deviation of 1. Missing values for eye examination findings were imputed using the column mean, and an indicator column was created to indicate whether the measurement was missing and thus imputed (<6% rate of missingness overall, with 0% missingness for IOP measurements). There were a total of 389 input features, including 100 features each for diagnoses, systemic medications, and ophthalmic medications. 
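As a minimal sketch of the imputation and standardization steps, assuming missing measurements are represented as `None`, the two operations could be written as follows. The function names are hypothetical; in practice the mean and standard deviation would normally be estimated on the training set and then reused for the test set, a detail not described in the text above.

```python
from statistics import mean, pstdev


def impute_with_indicator(values):
    """Replace None by the mean of the observed values.

    Returns (filled_values, missing_flags), where missing_flags[i] is 1
    if values[i] was missing and therefore imputed.
    """
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    filled = [fill if v is None else v for v in values]
    flags = [1 if v is None else 0 for v in values]
    return filled, flags


def standardize(values):
    """Rescale to mean 0 and standard deviation 1 (population SD)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]
```

Keeping the indicator column lets a model distinguish a truly average measurement from one that was never taken.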
A held-out test set was reserved using 20% of the cohort data (N = 480 surgeries), while ensuring that no patient appeared both in the training and the test set. The remaining N = 1918 surgeries were used for training and five-fold cross-validation. 
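A patient-level split of this kind can be sketched as below. This is an assumption-laden illustration rather than the study's procedure: `split_by_patient` is a hypothetical helper that shuffles patients and assigns whole patients to the test set until roughly the requested fraction of surgeries is reached. Libraries such as scikit-learn offer group-aware splitters (for example `GroupShuffleSplit`) that serve the same purpose.

```python
import random
from collections import Counter


def split_by_patient(patient_ids, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with no patient on both sides.

    patient_ids: one patient identifier per surgery.
    """
    per_patient = Counter(patient_ids)
    patients = sorted(per_patient)
    random.Random(seed).shuffle(patients)

    target = test_fraction * len(patient_ids)
    test_patients, n_test = set(), 0
    for p in patients:
        if n_test >= target:
            break
        test_patients.add(p)          # all surgeries of this patient go to test
        n_test += per_patient[p]

    train_idx = [i for i, p in enumerate(patient_ids) if p not in test_patients]
    test_idx = [i for i, p in enumerate(patient_ids) if p in test_patients]
    return train_idx, test_idx
```

Splitting by patient rather than by surgery prevents a model from being evaluated on an eye or patient it has effectively already seen during training.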
Modeling Approach
We trained several machine learning classifiers including decision trees, random forest, XGBoost, penalized logistic regression, multi-layer perceptron, k-nearest neighbors, Gaussian naive Bayes, linear discriminant analysis, and support vector machines using the Python scikit-learn library (version 1.0.2).10 Each model was trained to predict overall failure, as well as failure by IOP, medication, or need for an additional glaucoma surgery. For each model, hyperparameter tuning was performed using grid search and five-fold cross-validation on the training set (Supplementary Table S1), and final results were obtained by evaluating the best model on the test set. The classification threshold was tuned for best accuracy. 
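Tuning the classification threshold for accuracy amounts to a one-dimensional search over candidate cutoffs applied to predicted probabilities. The sketch below is illustrative: the candidate grid (0.01 to 0.99) and the helper names are assumptions, and it presumes the probabilities come from validation data rather than the held-out test set.

```python
def accuracy_at(threshold, probs, labels):
    """Accuracy when predicting failure (1) for probabilities >= threshold."""
    preds = [1 if p >= threshold else 0 for p in probs]
    return sum(int(a == b) for a, b in zip(preds, labels)) / len(labels)


def best_threshold(probs, labels, candidates=None):
    """Return the candidate threshold with the highest accuracy.

    If several thresholds tie, the smallest one is returned.
    """
    if candidates is None:
        candidates = [i / 100 for i in range(1, 100)]
    return max(candidates, key=lambda t: accuracy_at(t, probs, labels))
```

With roughly 70% of surgeries meeting the composite failure definition, the accuracy-optimal threshold is not necessarily 0.5, which is one reason to tune it explicitly.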
Several feedforward deep learning models were also benchmarked. The architectures had between one and six hidden layers, with activation functions set to ReLU, LeakyReLU, Tanh, or sigmoid. The network optionally contained an embedding layer to reduce the diagnoses, general medication, and ophthalmology medication features from three sparse 100-dimensional vectors to three dense 10-dimensional ones. Dropout layers were added to prevent overfitting, and hyperparameters such as the optimizer type, loss function (weighted for class imbalance), learning rate, dropout rate, and classification threshold were also compared. Early stopping was implemented based on validation loss with a patience of 10. 
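Early stopping with a patience of 10 can be expressed independently of any deep learning framework. The following is a minimal sketch; the class name and the `min_delta` parameter are hypothetical additions for illustration and are not taken from the study.

```python
class EarlyStopping:
    """Stop when validation loss has not improved for `patience` epochs in a row."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta      # minimum decrease that counts as improvement
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

In a training loop, `step` would be called once per epoch with the validation loss, and training would end at the first epoch for which it returns True, typically restoring the weights from the best epoch.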
Evaluation
Standard Evaluation Metrics
All models were evaluated on the held-out dataset using standard classification metrics, including accuracy, area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve, precision (positive predictive value), negative predictive value, recall (sensitivity), specificity, and F1 score (the harmonic mean of precision and recall). Metrics and their confidence intervals were derived using a clustered bootstrapping approach11,12 to account for the nonindependence within the test set resulting from rare instances where patients underwent multiple glaucoma surgeries. 
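The idea behind a clustered bootstrap is to resample patients, not individual surgeries, so that all surgeries from the same patient enter or leave a bootstrap replicate together. The sketch below illustrates this for AUROC with a percentile interval; the function names, the number of replicates, and the choice of a percentile interval are assumptions made for illustration. It also assumes the data contain both outcome classes, since replicates with a single class are redrawn.

```python
import random
from collections import defaultdict


def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def clustered_bootstrap_ci(labels, scores, patient_ids, n_boot=1000, alpha=0.05, seed=0):
    """Percentile confidence interval for AUROC, resampling whole patients."""
    clusters = defaultdict(list)
    for i, pid in enumerate(patient_ids):
        clusters[pid].append(i)
    ids = list(clusters)
    rng = random.Random(seed)

    stats = []
    while len(stats) < n_boot:
        # Draw patients with replacement and keep all of their surgeries.
        idx = [i for pid in rng.choices(ids, k=len(ids)) for i in clusters[pid]]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < len(ys):       # AUROC needs both classes
            stats.append(auroc(ys, [scores[i] for i in idx]))

    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper
```

Resampling at the patient level tends to give wider, more honest intervals than resampling surgeries independently when outcomes within a patient are correlated.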
Explainability
To understand which features were most important to our prediction models, we used SHapley Additive exPlanations (SHAP) values.13 This technique computes the Shapley values for each feature, a concept originally used in game theory to measure the contribution of each player in a cooperative game. In the context of machine learning, this method calculates the marginal additional contribution of each feature to the model prediction. We used the SHAP package to create a TreeExplainer, which estimates the SHAP values for tree-based or ensemble models, and applied it to the random forest model, the best-performing model.14 
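To make the notion of a marginal contribution concrete, the sketch below computes exact Shapley values by enumerating every feature subset for a small toy model. It is not the TreeExplainer algorithm, which uses a tree-specific method to avoid this exponential enumeration; the toy model, the all-zero baseline used for "absent" features, and all names here are assumptions for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(value, n_features):
    """Exact Shapley values for a set function value(frozenset_of_feature_indices)."""
    players = range(n_features)
    phi = [0.0] * n_features
    for i in players:
        others = [j for j in players if j != i]
        for size in range(len(others) + 1):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n_features - size - 1) / factorial(n_features)
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_model(x):
    """Hypothetical model with one additive term and one interaction term."""
    return 2 * x[0] + x[1] * x[2]


patient = [1.0, 1.0, 1.0]     # feature values to explain
baseline = [0.0, 0.0, 0.0]    # stand-in values for features treated as absent


def coalition_value(present):
    """Model output when only the features in `present` take the patient's values."""
    x = [patient[j] if j in present else baseline[j] for j in range(len(patient))]
    return toy_model(x)


phi = shapley_values(coalition_value, 3)
```

In this toy case the additive feature receives its full effect (2.0), the two interacting features split their joint effect equally (0.5 each), and the values sum to the difference between the patient's prediction and the baseline prediction.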
Results
Population Characteristics
Overall, 1571 patients were included in the cohort for a total of 2398 glaucoma surgeries. Seven hundred twenty-one surgeries (30.1%) were successful, whereas 1677 (69.9%) met composite failure criteria (Fig. 1). Among the three individual failure criteria, IOP failure was the most common (N = 1443 [86.0%]). Medication failure, whereby the patient needs more classes of ocular antihypertensive medications after surgery than before, was reported in 539 (32.1%) instances of surgery failure, and 359 (21.4%) surgeries failed due to need for additional glaucoma surgery. The failure rate varied by procedure type: tube shunt (57.1%, 373/653), trabeculectomy (60.4%, 336/556), cyclophotocoagulation procedures (72.9%, 255/350), and MIGS (85.0%, 713/839). Failure rates for the alternative IOP success criteria (IOP reduction of 20% or IOP ≤12, 15, 18, or 21 mm Hg) ranged from 38% to 66% and are reported in Supplementary Table S5.
Figure 1.
 
Causes of glaucoma surgical failure. The Venn diagram shows the number of surgeries that failed according to three separate failure criteria: (1) Failure by IOP, defined as postoperative reduction of less than 20% compared to preoperative levels; (2) Failure by medication: Need for more classes of glaucoma medications after surgery than before; and (3) Need for additional glaucoma surgery or revision of original surgery within three months.
Population characteristics are summarized in Table 1. The mean age of the population was 72.5 (SD = 15.4) years old, and 745 (47.4%) of the patients were women. The majority of the cohort were Asian (N = 515 [32.8%]) or White (N = 499 [31.8%]). The overall mean preoperative IOP was 21.5 mm Hg (SD = 10.2). Mean preoperative logMAR visual acuity was 0.72 (SD = 0.9), approximately equivalent to 20/90 Snellen visual acuity. Overall mean spherical equivalent was −1.56 D (SD = 4.3). Approximately two-thirds of the population (67.7%) had preoperative exposure to latanoprost, 57.2% to brimonidine, and 35.0% to acetazolamide. 
Table 1.
 
Population Characteristics
Machine Learning and Deep Learning Model Results
We developed several machine learning and deep learning models to predict glaucoma surgical failure. Receiver operating characteristic and precision-recall curves on the held-out test set for overall surgical failure are depicted in Figure 2, with additional classification metrics shown in Table 2. The best model was random forest, which outperformed other models on the three most important performance metrics (AUROC = 0.767, accuracy = 0.755, F1 = 0.850). The models with the next best AUROC were the neural network with feature embedding (0.766) and logistic regression (0.765), with the AUROC of the remaining models ranging from approximately 0.66 to 0.76. 
Figure 2.
 
Receiver operating characteristic and precision-recall curves for models predicting overall glaucoma surgical failure. The figures show the performance of different machine learning and deep learning models on predicting overall glaucoma surgical failure on the held-out test set. The legend indicates the model type and the area under the curve. SVM, support vector machine; MLP, multilayer perceptron; GaussianNB, Gaussian naïve Bayes.
Table 2.
 
Model Performance for Prediction of Overall Glaucoma Surgical Failure
Figure 3 displays the AUROC of the models on each of the surgical failure criteria. Logistic regression and multi-layer perceptron were the best-performing models for medication failure (AUROC = 0.696, tie), logistic regression gave the best results for failure due to the need for additional glaucoma surgery (AUROC = 0.760), and gradient boosting was the best-performing model for failure by IOP (AUROC = 0.855). Supplementary Figure S1 shows accuracy of the models on each of the surgical failure criteria. Additional classification performance metrics for each model on the individual failure criteria are reported in Supplementary Tables S2, S3, and S4. Results for IOP and overall surgery failure using the alternative IOP failure criteria are reported in Supplementary Tables S6 and S7, with AUROC varying from 0.655 to 0.724 across criteria for overall surgery failure. 
Figure 3.
 
AUROC for models predicting overall surgical failure and individual surgical failure criteria. The bars represent the test set AUROC for each model on the individual failure criteria with the best set of hyperparameters. 95% confidence intervals are depicted using error bars. Dec. Tree, decision tree; MLP, multilayer perceptron; NB, Gaussian naïve Bayes; KNN, K-nearest neighbors; LDA, linear discriminant analysis; Log. Reg., logistic regression; SVM, support vector machine; RF, random forest; NN, neural network.
Explainability
To evaluate which features were most important for model predictions of glaucoma surgical failure, we calculated Shapley values for each feature using the random forest model, which was the best-performing structured model for prediction of overall surgery failure (Fig. 4). The goal of the explainability analysis is not necessarily to identify novel risk factors, for which a traditional statistical inference model is preferred, but rather to shed light on what features the model relies upon and investigate whether they may be spurious or reasonable. Features with high absolute magnitude of Shapley values are important to the model prediction; positive values indicate importance for prediction of failure, while negative values indicate importance for prediction of surgical success. Among the top 20 features most important for prediction were clinically reasonable parameters such as IOP, visual acuity, spherical equivalent, use of glaucoma medications (such as brimonidine tartrate and acetazolamide), and the type of glaucoma surgery. 
Figure 4.
 
Most important features for model prediction using Shapley analysis. This figure displays the Shapley values for the top 20 most important features in predicting the outcome of a surgery, using the best-performing classical machine learning model (random forest). Each dot represents an individual in the cohort. Features are listed vertically on the Y-axis, and their ranking is determined by their importance in the predictive model. The impact of each feature on the model's prediction is described on the X-axis. A value near 0 indicates little to no impact on the prediction, whereas values further to the left or right indicate a negative or positive impact on the prediction, respectively. The color of each dot provides insight into the actual value of that feature for the individual data point (blue: low feature value; red: high feature value).
Discussion
In this study, we developed and evaluated several machine learning and deep learning algorithms to predict glaucoma surgery outcomes using structured EHR data. Our approach also included training models for different types of failure criteria, including intraocular pressure range, glaucoma medication use, need for additional glaucoma surgery, and a composite of all three criteria. This design choice was crucial as glaucoma surgery can be judged to be successful using different criteria whose importance may vary according to each individual surgeon's preferences; thus, models which retain flexibility in predicting different outcomes may be most useful for clinicians who use them as the basis for clinical decision support. We demonstrated that our EHR models could provide predictive value for the task of predicting glaucoma surgical outcomes as most algorithms yielded an AUROC above 0.70 using only preoperative features available in a real-world clinical setting. The best-performing model for predicting overall glaucoma surgical failure was the random forest model (AUROC = 0.767), with the neural network model with feature embedding performing nearly identically (AUROC = 0.766). 
Our work advances the field in several ways. Previous work using machine learning to predict trabeculectomy outcomes in a cohort of 230 trabeculectomies demonstrated an AUROC of 0.72 and an accuracy of 68% using random forest.6 Another previous study predicting trabeculectomy outcomes in a larger cohort of 1326 patients achieved an AUROC of 0.750, but their approach used free-text operative notes and structured EHR data from both the preoperative and early postoperative period.7 Thus, such a model could never be used to predict the outcome of surgery that had not yet been performed, limiting its clinical applicability. Our models outperformed previous work using only preoperative information to predict the surgical outcomes of multiple types of glaucoma surgeries, including filtering, minimally invasive glaucoma surgeries and ciliary body destructive procedures. In addition, we developed models predicting multiple types of surgical failure including multiple alternatives for IOP failure. Despite a more complex prediction formulation, our random forest model was still able to outperform prior models by achieving AUROC of 0.767 on overall surgical failure. For models predicting surgical failure by individual criteria, the best performing models were gradient boosting for failure by IOP (AUROC = 0.855), logistic regression for failure by follow-up surgery (AUROC = 0.760) and logistic regression and multilayer perceptron for failure by medication (AUROC = 0.696). IOP failure may have been easier to predict because preoperative IOP is likely a strong predictor of the postoperative IOP. On the other hand, failure by medication or follow-up surgeries may have been more difficult to predict because of greater class imbalance, which makes models more difficult to train. 
Improving the performance over previous work is a significant achievement and marks a milestone toward developing clinical decision support tools that could help in glaucoma surgical planning by predicting expected outcomes of glaucoma surgery. Such a tool could prove very useful as 175,000 glaucoma surgeries are performed each year in the United States alone,15 and the outcome of these surgeries can be widely variable2 and unpredictable. 
We were also able to perform explainability studies to determine which factors were most important for prediction. Our analysis showed that high preoperative IOP, low visual acuity, young age and use of second-line IOP lowering topical drugs (e.g., brimonidine) were important features for prediction, which are all clinically reasonable, because they may indicate more-severe glaucoma. Some of these factors have also been associated in previous literature with outcomes of trabeculectomy and other glaucoma surgeries, including young age, high preoperative IOP, low visual acuity, and large number of IOP lowering topical medications.16–20 Shapley values for explainability were not computed on the neural network as the embedding layer in this model greatly diminishes the ability to interpret Shapley values. This is a drawback of our study, and more generally of deep learning models which are widely considered to be less interpretable.21 In our study, random forest was found to be the best performing model, with performance slightly above that of the deep learning architecture. This is consistent with results of many prior studies demonstrating the superiority of random forest algorithms for EHR data,22–25 likely because of their ability to handle a large number of noisy input features usually associated with EHR models. With the advantages of tree-based machine learning models over deep learning models in domains such as explainability, in practice, the choice of model to deploy may depend on other factors besides performance alone, such as type of data (structured vs. unstructured), available computational resources, the aforementioned explainability, and others. 
Our work has several limitations and remaining challenges that could present opportunities for future research. First, the cohort consisted of patients from a single academic center; future studies are planned using multicenter data and investigating external generalizability. It is also possible that patients may have received previous diagnoses or ophthalmic surgeries that were not captured in our EHR if they were referred from other institutions or may seek follow-up care or surgeries from other institutions not captured in our EHR. Applying natural language processing techniques to progress notes may be able to mitigate these limitations in the future. The study also has several limitations commonly associated with use of EHR data, such as the potential for coding or medication inaccuracies, particularly because patients can be told by their physician to discontinue medications verbally, without updating the EHR to reflect this change.26 Thus it is often difficult to precisely determine the medication usage timeline. However, prescription of a new medication typically requires an order that is then captured in the EHR. Considering these limitations of EHR-based determination of medication usage, a surgery was considered to meet failure criteria if postoperative glaucoma medications exceeded preoperative glaucoma medications at any time point, which may be a conservative estimate of surgical success. Another challenge in this study was class imbalance, especially for the individual failure criteria such as failure by follow-up surgery that only occurred in 15% of the surgeries. To account for imbalance, we weighted the loss function, optimized the classification threshold, and ran each of our optimizations on both accuracy and F1 score. Because preoperative input features such as prior medication plan, diagnoses, surgeries, and ophthalmic constants were summarized, the data also lost its temporal nature. 
Incorporating the temporal nature of the EHR data in the analysis to account for the evolution of the patient condition over time remains an interesting avenue for future research. Finally, our features included only structured data and did not include any image or text data. There remains an opportunity to increase the predictive power of these algorithms by including imaging data, such as optical coherence tomography, fundus photography, or visual field results. Incorporating data from free-text clinical notes using recent NLP techniques such as transformers27 or long short-term memory (LSTM) architectures28 could also yield promising results, because several previous studies have shown that such models can have significant predictive power for ophthalmic tasks.29–31 
In conclusion, we used EHR data to develop machine and deep learning models to predict the outcomes of glaucoma surgery. We showed that neural network and random forest models were the most effective algorithms, demonstrating substantial predictive power on our task, especially in light of its complexity. We also applied our models to individual failure criteria and showed failure by medication or need for a follow-up surgery were significantly harder to predict than failure by IOP. Future research may be able to further improve these models by incorporating multimodal and temporal input data, ultimately enabling potential clinical application to aid glaucoma surgeons in personalizing treatment choices for glaucoma patients. 
Acknowledgments
Supported by the National Eye Institute 1K23EY03263501 (SYW); Career Development Award from Research to Prevent Blindness (SYW); unrestricted departmental grant from Research to Prevent Blindness (SYW); departmental grant National Eye Institute P30-EY026877 (SYW). The sponsors or funding organizations had no role in the design or conduct of this research. 
Disclosure: S. Barry, None; S.Y. Wang, None 
References
Tham Y-C, Li X, Wong TY, et al. Global prevalence of glaucoma and projections of glaucoma burden through 2040: a systematic review and meta-analysis. Ophthalmology. 2014; 121: 2081–2090. [CrossRef] [PubMed]
Wagner FM, Schuster AK, Kianusch K, et al. Long-term success after trabeculectomy in open-angle glaucoma: results of a retrospective cohort study. BMJ Open. 2023; 13: e068403. [CrossRef] [PubMed]
Saleh E, Błaszczyński J, Moreno A, et al. Learning ensemble classifiers for diabetic retinopathy assessment. Artif Intell Med. 2018; 85: 50–63. [CrossRef] [PubMed]
Gaskin GL, Pershing S, Cole TS, Shah NH. Predictive modeling of risk factors and complications of cataract surgery. Eur J Ophthalmol. 2016; 26: 328–337. [CrossRef] [PubMed]
Jalamangala Shivananjaiah SK, Kumari S, Majid I, Wang SY. Predicting near-term glaucoma progression: an artificial intelligence approach using clinical free-text notes and data from electronic health records. Front Med. 2023; 10: 1157016. [CrossRef]
Banna HU, Zanabli A, McMillan B, et al. Evaluation of machine learning algorithms for trabeculectomy outcome prediction in patients with glaucoma. Sci Rep. 2022; 12: 1–11. [PubMed]
Lin W-C, Chen A, Song X, Weiskopf NG, Chiang MF, Hribar MR. Prediction of multiclass surgical outcomes in glaucoma using multimodal deep learning based on free-text operative notes and structured EHR data. J Am Med Inform Assoc. 2024; 31: 456–464. [CrossRef] [PubMed]
Lowe HJ, Ferris TA, Hernandez PM, Weber SC. STRIDE—an integrated standards-based translational research informatics platform. AMIA Annu Symp Proc. 2009; 2009: 391–395. [PubMed]
World Glaucoma Association: The Global Glaucoma Network. Guidelines on Design & Reporting Glaucoma Trial. Available at: https://wga.one/wpfd_file/guidelines-on-design-reporting-glaucoma-trials/. Accessed March 21, 2024.
Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011; 12: 2825–2830.
Ying GS, Maguire MG, Glynn RJ, Rosner B. Tutorial on biostatistics: receiver-operating characteristic (ROC) analysis for correlated eye data. Ophthalmic Epidemiol. 2022; 29: 117–127. [CrossRef] [PubMed]
Huang FL. Using cluster bootstrapping to analyze nested data with a few clusters. Educ Psychol Meas. 2018; 78: 297–318. [CrossRef] [PubMed]
Lundberg SM, Lee S-I. A unified approach to interpreting model predictions. Adv Neural Inf Process Syst. 2017; 30.
Lundberg S. shap. GitHub. Available at: https://github.com/slundberg/shap. Accessed May 16, 2023.
Ma AK, Lee JH, Warren JL, Teng CC. GlaucoMap - Distribution of Glaucoma Surgical Procedures in the United States. Clin Ophthalmol. 2020; 14: 2551–2560. [CrossRef] [PubMed]
Landers J, Martin K, Sarkies N, et al. A twenty-year follow-up study of trabeculectomy: risk factors and outcomes. Ophthalmology. 2012; 119: 694–702. [CrossRef] [PubMed]
Edmunds B, Bunce CV, Thompson JR, et al. Factors associated with success in first-time trabeculectomy for patients at low risk of failure with chronic open-angle glaucoma. Ophthalmology. 2004; 111: 97–103. [CrossRef] [PubMed]
Fontana H, Nouri-Mahdavi K, Lumba J, et al. Trabeculectomy with mitomycin C: outcomes and risk factors for failure in phakic open-angle glaucoma. Ophthalmology. 2006; 113: 930–936. [CrossRef] [PubMed]
Chiu H-I, Su H-I, Ko Y-C, Liu CJ-L. Outcomes and risk factors for failure after trabeculectomy in Taiwanese patients: medical chart reviews from 2006 to 2017. Br J Ophthalmol. 2022; 106: 362–367. [CrossRef] [PubMed]
Issa de Fendi L, Cena de Oliveira T, Bigheti Pereira C, et al. Additive effect of risk factors for trabeculectomy failure in glaucoma patients: a risk-group from a cohort study. J Glaucoma. 2016; 25: e879–e883. [CrossRef] [PubMed]
Ribeiro MT, Singh S, Guestrin C. “Why should I trust you?”: explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016: 1135–1144.
Tseng P-Y, Chen Y-T, Wang C-H, et al. Prediction of the development of acute kidney injury following cardiac surgery by machine learning. Crit Care. 2020; 24: 478. [CrossRef] [PubMed]
Kambakamba P, Mannil M, Herrera PE, et al. The potential of machine learning to predict postoperative pancreatic fistula based on preoperative, non-contrast-enhanced CT: A proof-of-principle study. Surgery. 2020; 167: 448–454. [CrossRef] [PubMed]
Stam WT, Goedknegt LK, Ingwersen EW, et al. The prediction of surgical complications using artificial intelligence in patients undergoing major abdominal surgery: A systematic review. Surgery. 2022; 171: 1014–1021. [CrossRef] [PubMed]
Grinsztajn L, Oyallon E, Varoquaux G. Why do tree-based models still outperform deep learning on typical tabular data? Adv Neural Inf Process Syst. 2022; 35: 507–520.
Hersh WR, Weiner MG, Embi PJ, et al. Caveats for the use of operational electronic health record data in comparative effectiveness research. Med Care. 2013; 51: S30–7. [CrossRef] [PubMed]
Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need. Adv Neural Inf Process Syst. 2017; 30.
Hochreiter S, Schmidhuber J. LSTM can solve hard long time lag problems. Adv Neural Inf Process Syst. 1997; 9: 473–479.
Wang SY, Huang J, Hwang H, et al. Leveraging weak supervision to perform named entity recognition in electronic health records progress notes to identify the ophthalmology exam. Int J Med Inform. 2022; 167: 104864. [CrossRef] [PubMed]
Hu W, Wang SY. Predicting Glaucoma Progression Requiring Surgery Using Clinical Free-Text Notes and Transfer Learning With Transformers. Transl Vis Sci Technol. 2022; 11: 37. [CrossRef] [PubMed]
Peissig PL, Rasmussen LV, Berg RL, et al. Importance of multi-modal approaches to effectively identify cataract cases from electronic health records. J Am Med Inform Assoc. 2012; 19: 225–234. [CrossRef] [PubMed]
Figure 1.
 
Causes of glaucoma surgical failure. The Venn diagram shows the number of surgeries that failed according to three separate failure criteria: (1) failure by IOP, defined as a postoperative IOP reduction of less than 20% compared to preoperative levels; (2) failure by medication, defined as a need for more classes of glaucoma medications after surgery than before; and (3) need for additional glaucoma surgery or revision of the original surgery within three months.
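The three failure criteria in this caption can be expressed as a small labeling routine. The following is a minimal sketch, not the authors' code: the `SurgeryRecord` fields and the function names are hypothetical, the IOP criterion is assumed to require two consecutive postoperative visits with insufficient reduction (as stated in the Abstract), and overall failure is assumed to be the union of the three criteria.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SurgeryRecord:
    # All field names are hypothetical; they are not taken from the study's data model.
    preop_iop: float            # preoperative baseline IOP (mm Hg)
    postop_iops: List[float]    # postoperative IOP measurements, in chronological order
    preop_med_classes: int      # number of glaucoma medication classes before surgery
    postop_med_classes: int     # number of glaucoma medication classes after surgery
    had_reoperation: bool       # additional glaucoma surgery or revision in the follow-up window


def failed_by_iop(rec: SurgeryRecord, min_reduction: float = 0.20) -> bool:
    """True if IOP reduction was below `min_reduction` on two consecutive visits (assumed rule)."""
    threshold = rec.preop_iop * (1.0 - min_reduction)
    insufficient = [iop > threshold for iop in rec.postop_iops]
    return any(a and b for a, b in zip(insufficient, insufficient[1:]))


def failure_criteria(rec: SurgeryRecord) -> Dict[str, bool]:
    """Evaluate each criterion separately, then combine them (assumed: any criterion = failure)."""
    result = {
        "iop": failed_by_iop(rec),
        "medication": rec.postop_med_classes > rec.preop_med_classes,
        "reoperation": rec.had_reoperation,
    }
    result["overall"] = any(result.values())
    return result
```

Keeping the per-criterion flags alongside the overall label mirrors the Venn diagram: the same record can then be counted in any of the overlapping regions.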
Figure 2.
 
Receiver operating characteristic and precision-recall curves for models predicting overall glaucoma surgical failure. The figures show the performance of different machine learning and deep learning models on predicting overall glaucoma surgical failure on the held-out test set. The legend indicates the model type and the area under the curve. SVM, support vector machine; MLP, multilayer perceptron; GaussianNB, Gaussian naïve Bayes.
Figure 3.
 
AUROC for models predicting overall surgical failure and individual surgical failure criteria. The bars represent the test set AUROC for each model on the individual failure criteria with the best set of hyperparameters. 95% confidence intervals are depicted using error bars. Dec. Tree, decision tree; MLP, multilayer perceptron; NB, Gaussian naïve Bayes; KNN, K-nearest neighbors; LDA, linear discriminant analysis; Log. Reg., logistic regression; SVM, support vector machine; RF, random forest; NN, neural network.
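This caption does not state how the 95% confidence intervals were obtained. One common choice for data in which a patient can contribute more than one surgery is a percentile bootstrap that resamples whole patients rather than individual surgeries. The sketch below illustrates that idea under this assumption; the function names, the number of resamples, and the percentile method are illustrative choices and are not taken from the study.

```python
import random
from typing import Dict, Hashable, List, Sequence, Tuple


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Rank-based AUROC: chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def cluster_bootstrap_ci(
    patient_ids: Sequence[Hashable],
    labels: Sequence[int],
    scores: Sequence[float],
    n_boot: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI for AUROC, resampling patients (clusters) with replacement."""
    if len(set(labels)) < 2:
        raise ValueError("both classes must be present")
    rng = random.Random(seed)
    by_patient: Dict[Hashable, List[Tuple[int, float]]] = {}
    for pid, y, s in zip(patient_ids, labels, scores):
        by_patient.setdefault(pid, []).append((y, s))
    ids = list(by_patient)
    stats: List[float] = []
    while len(stats) < n_boot:
        # Draw patients with replacement; every surgery of a drawn patient is kept together.
        sample = [pair for pid in rng.choices(ids, k=len(ids)) for pair in by_patient[pid]]
        ys = [y for y, _ in sample]
        if len(set(ys)) < 2:
            continue  # resample contains a single class; draw again
        stats.append(auroc(ys, [s for _, s in sample]))
    stats.sort()
    lo_idx = int(round((alpha / 2) * n_boot))
    hi_idx = min(n_boot - 1, int(round((1 - alpha / 2) * n_boot)) - 1)
    return stats[lo_idx], stats[hi_idx]
```

Resampling at the patient level keeps correlated observations from the same patient together, which tends to give wider and more realistic intervals than resampling individual surgeries.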
Figure 4.
 
Most important features for model prediction using Shapley analysis. This figure displays the Shapley values for the top 20 most important features in predicting the outcome of a surgery, using the best-performing classical machine learning model (random forest). Each dot represents an individual in the cohort. Features are listed vertically on the Y-axis, and their ranking is determined by their importance in the predictive model. The impact of each feature on the model's prediction is described on the X-axis. A value near 0 indicates little to no impact on the prediction, whereas values further to the left or right indicate a negative or positive impact on the prediction, respectively. The color of each dot provides insight into the actual value of that feature for the individual data point (blue: low feature value; red: high feature value).
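To make the quantity plotted on the X-axis concrete, the sketch below computes exact Shapley values for a single prediction by enumerating every feature subset. This is an illustration of the definition only: it is not the algorithm used by the shap library, its cost grows exponentially with the number of features, and the linear toy model, background rows, and function name are hypothetical.

```python
from itertools import combinations
from math import factorial
from typing import Callable, FrozenSet, List, Sequence


def shapley_values(
    model: Callable[[Sequence[float]], float],
    x: Sequence[float],
    background: Sequence[Sequence[float]],
) -> List[float]:
    """Exact Shapley values of `model` at `x`, averaging absent features over `background` rows."""
    n = len(x)

    def value(subset: FrozenSet[int]) -> float:
        # Expected output when features in `subset` take their values from x
        # and all remaining features take their values from each background row.
        total = 0.0
        for row in background:
            mixed = [x[j] if j in subset else row[j] for j in range(n)]
            total += model(mixed)
        return total / len(background)

    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi += weight * (value(s | {i}) - value(s))
        phis.append(phi)
    return phis


# Hypothetical toy model and data, for illustration only.
toy_model = lambda z: 2.0 * z[0] - 1.0 * z[1] + 0.5 * z[2]
toy_background = [[0.0, 0.0, 0.0], [2.0, 2.0, 2.0]]
toy_x = [3.0, 1.0, 4.0]
toy_phi = shapley_values(toy_model, toy_x, toy_background)
```

The values sum to the difference between the prediction at `toy_x` and the mean prediction over the background rows, which is why a value near 0 corresponds to a feature with little effect on that prediction.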
Table 1.
 
Population Characteristics
Table 2.
 
Model Performance for Prediction of Overall Glaucoma Surgical Failure