Artificial Intelligence  |   March 2023
PhacoTrainer: Deep Learning for Cataract Surgical Videos to Track Surgical Tools
Author Affiliations & Notes
  • Hsu-Hang Yeh
    Department of Biomedical Data Science, Stanford University, Palo Alto, CA, USA
  • Anjal M. Jain
    Department of Ophthalmology, Byers Eye Institute, Stanford University, Palo Alto, CA, USA
  • Olivia Fox
    Krieger School of Arts and Sciences, Johns Hopkins University, Baltimore, MD, USA
  • Kostya Sebov
    Department of Biomedical Data Science, Stanford University, Palo Alto, CA, USA
  • Sophia Y. Wang
    Department of Biomedical Data Science, Stanford University, Palo Alto, CA, USA
    Department of Ophthalmology, Byers Eye Institute, Stanford University, Palo Alto, CA, USA
  • Correspondence: Sophia Y. Wang, 2370 Watson Court, Palo Alto, CA 94303, USA. e-mail: sywang@stanford.edu
Translational Vision Science & Technology March 2023, Vol.12, 23. doi:https://doi.org/10.1167/tvst.12.3.23
Abstract

Purpose: The purpose of this study was to build a deep-learning model that automatically analyzes cataract surgical videos for the locations of surgical landmarks, and to derive skill-related motion metrics.

Methods: The locations of the pupil, limbus, and 8 classes of surgical instruments were identified by a 2-step algorithm: (1) mask segmentation and (2) landmark identification from the masks. To perform mask segmentation, we trained the YOLACT model on 1156 frames sampled from 268 videos and the public Cataract Dataset for Image Segmentation (CaDIS) dataset. Landmark identification was performed by fitting ellipses or lines to the contours of the masks and deriving locations of interest, including surgical tooltips and the pupil center. Landmark identification was evaluated by the distance between the predicted and true positions in 5853 frames of 10 phacoemulsification video clips. We derived the total path length, maximal speed, and covered area using the tip positions and examined the correlation with human-rated surgical performance.

Results: The mean average precision score and intersection-over-union for mask detection were 0.78 and 0.82, respectively. The average distances between the predicted and true positions of the pupil center, phaco tip, and second instrument tip were 5.8, 9.1, and 17.1 pixels, respectively. The total path length and covered area of these landmarks were negatively correlated with surgical performance.

Conclusions: We developed a deep-learning method to localize key anatomical portions of the eye and cataract surgical tools, which can be used to automatically derive metrics correlated with surgical skill.

Translational Relevance: Our system could form the basis of an automated feedback system that helps cataract surgeons evaluate their performance.

Introduction
Cataract, a clouding of the eye's naturally clear lens whose incidence increases with age, is the leading cause of blindness worldwide.1 Cataract surgery restores clear vision by replacing the cloudy lens with an artificial intraocular lens (IOL) implant. With the increased life expectancy of the general population, cataract surgery has become one of the most commonly performed surgeries in the United States.2 This surgery requires delicate and precise control of surgical instruments inside the patient's eye and a considerable amount of surgical training. When poorly performed, cataract surgery can lead to serious postoperative complications and may require secondary surgical interventions.3 
To improve surgical skills, surgical trainees perform surgeries under the supervision of a senior surgeon and receive one-on-one, real-time feedback from supervising surgeons. However, this feedback is mainly qualitative and in the moment, and it is limited by the number of cases a trainee has the opportunity to perform, and the number of preceptors they can perform them with. In a survey about residents' perception of cataract surgery training, two-thirds felt they needed more training on surgical skills.4 Thus, a system that generates automatic, timely, objective, and quantitative feedback metrics on surgical performance could be an extremely valuable adjunct to live real-time feedback for improving ophthalmic surgical training and patients’ outcomes. 
A valuable and rich source of data suitable for automatic performance rating systems is recorded surgical video. Cataract surgeries during residency training are often automatically recorded by the surgical microscope. From the video, one can observe the flow of surgery, the proficiency of instrument handling, and the presence of complications. The use of recorded surgical video to assist evaluation of surgical skills has been adopted in many medical fields.5,6 Some previous studies have also investigated the possibility of grading cataract surgical performance by applying computer vision to videos.7–12 However, these models either focused only on a single step or generated an overall grade that lacked granular interpretability. Knowledge of the precise locations of key anatomic landmarks and surgical tools would enable the calculation of more useful motion metrics and more granular feedback on surgical techniques. Although advances in deep learning have enabled recognition of surgical steps13 and detection and segmentation of objects in ophthalmology surgical videos,14–16 these results have not been linked to motion analysis of surgical landmarks. 
In this study, we aimed to build a novel algorithm that identifies surgical landmarks from segmentation models and generates multiple motion metrics related to tool tip movement, tool axes, and speed. To fulfill these objectives, we trained and evaluated deep learning models that identify and segment ocular anatomic structures and surgical instruments in real time from recorded cataract surgical videos, and we developed and evaluated an automated algorithm that identifies surgical landmarks from the segmentation masks. The locations of surgical tools and anatomic landmarks were used to construct objective metrics describing tool usage patterns, fluency, and potentially harmful movements during the surgery. We examined the association between these derived metrics and human expert ratings of surgical performance. The model can form the basis of context-aware computer-assisted surgery in the future, for example, by automatically generating real-time surgical performance feedback or by sending alerts when surgical instruments are in proximity to critical structures.17 
Methods
Data Source and Preprocessing
We collected 298 resident cataract surgical videos routinely recorded during the residency training of 12 surgeons across 6 different sites. All data were de-identified, and this research was deemed exempt by the Stanford University Institutional Review Board. The original videos were downsampled to a resolution of 456 × 256 pixels. After removing incompletely recorded videos, duplicated recordings, and videos of combined surgeries, a total of 268 videos were included in the final dataset. 
A team of four trained annotators used VGG Image Annotator software18 to manually label the start and end times for nine core steps (create wounds, inject anterior chamber, continuous curvilinear capsulorhexis, hydrodissection, phacoemulsification, irrigation/aspiration, inject lens, remove viscoelastic, and close wounds), as previously described.13 Individual frames were extracted from the video using OpenCV (version 4.1.2)19 at a frame capture rate of 1 frame per second, yielding a total of 457,171 extracted frames. A total of 1156 frames were randomly sampled from these 9 core steps of cataract surgery, and the masks of the 8 different classes of surgical tools (blade, forceps, second instrument, Weck sponge, needle or cannula, phaco, irrigation/aspiration handpiece, and lens injector), the pupil border, and the limbus were manually annotated. 
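For illustration, frame extraction at this capture rate could be performed with OpenCV roughly as follows. This is a minimal sketch with a hypothetical function name and paths, not the authors' preprocessing code:

```python
# Illustrative sketch: sample frames from a surgical video at ~1 frame per second.
import cv2

def sample_frames(video_path, out_dir, target_fps=1.0):
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if metadata is missing
    step = max(int(round(native_fps / target_fps)), 1)
    idx, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:                          # keep roughly one frame per second
            cv2.imwrite(f"{out_dir}/frame_{saved:06d}.jpg", frame)
            saved += 1
        idx += 1
    cap.release()
    return saved
```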
The labeled frames were separated by video into train, validation, and test sets, resulting in 943, 108, and 105 images, respectively. Because of the relatively small size of our training set, we pretrained our model on a public cataract surgery dataset, the Cataract Dataset for Image Segmentation (CaDIS),20 which contains 3550, 534, and 586 frames in the train, validation, and test sets, respectively. Among the 29 instrument classes in the CaDIS dataset, we kept only the 8 relevant instrument classes. For anatomic structure annotation, we used only the pupil and limbus. 
We prepared an additional test dataset consisting of the phacoemulsification portions of 10 random videos from our test set for the downstream landmark identification algorithm and surgical skill rating, because surgical skill rating and tool usage metrics (path length, velocity, and area covered, detailed below) require evaluation on continuous video clips rather than the randomly sampled frames labeled previously. The video segments were downsampled to 1 frame per second, resulting in a total of 5853 frames on which to evaluate landmark identification performance. Four trained annotators manually labeled the coordinates of the phacoemulsification probe tips, second instrument tips, and pupil centers. 
Segmentation Model Development and Landmark Identification
An overview of the computer vision algorithm to identify surgical tools, anatomic structures of the eye, and their landmarks is provided in Figure 1. We utilized a real-time object detection and segmentation model, YOLACT (You Only Look At CoefficienTs), to perform instance segmentation.21 The model creates “prototype” masks from a fully convolutional network and predicts mask coefficients for each prototype. These two branches are then “assembled” via a simple matrix multiplication and sigmoid activation. The model architecture is flexible, with multiple choices of backbone. This structure allows the model to compute at 30 frames per second, making inference nearly real-time as the video plays. In both the pretraining and fine-tuning stages, the models were trained for 120,000 iterations with an initial learning rate of 0.001 and two exponential decays of 0.9 at 60,000 and 100,000 iterations. DarkNet22 was used as the backbone for feature extraction. The output of object detection was evaluated by the mean average precision (mAP) score, calculated as the area under the precision-recall curve using 0.5 as the true-positive threshold. The output of segmentation was evaluated on the test set of annotated frames from the nine core steps of surgery using intersection-over-union (IoU), calculated as the intersection of the predicted and true masks divided by their union. The model was trained and run on a virtual machine with 8 vCPUs, 52 GB RAM, and an NVIDIA Tesla P100 GPU. 
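As a concrete illustration of the IoU metric described above, the following minimal sketch computes IoU for a pair of binary masks; the function name is ours, and this is not the authors' evaluation code:

```python
# Minimal sketch: intersection-over-union (IoU) of two binary masks of equal shape.
import numpy as np

def mask_iou(pred_mask: np.ndarray, true_mask: np.ndarray) -> float:
    pred = pred_mask.astype(bool)
    true = true_mask.astype(bool)
    union = np.logical_or(pred, true).sum()
    if union == 0:
        return float("nan")  # both masks empty; IoU is undefined
    return float(np.logical_and(pred, true).sum()) / float(union)
```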
Figure 1.
Overview of the two-step deep learning and computer vision algorithm for identification of eye anatomy, surgical tools, and their landmarks. Surgical video frames are input into the trained YOLACT model, which generates masks for surgical instruments and pupils. Contours of the masks are identified, and either ellipses or lines are fitted according to the type of object. The pupil centers, the tool tips, and the orientation of surgical tools can be determined from the mask contours. By combining this information across sequential frames, performance-related motion metrics can be generated automatically.
In the second step, we built a computer vision algorithm to process the segmentation masks output by YOLACT and pinpoint the locations of important landmarks, including the center of the pupil and the tip positions of instruments. The algorithm was implemented with OpenCV (version 4.1.2). For each mask, we applied the following sequence of steps: (1) compute the outer contours, (2) compute a common convex hull from the outer contours, (3) if the mask is predicted to be the pupil, fit an ellipse and take its center as the pupil center, and (4) otherwise, fit a line and predict the tool tip location as the intersection between the fitted line and the convex hull nearest the pupil center. The steps are also illustrated in Figure 1, and a simplified sketch follows below. The performance of the landmark identification algorithm was evaluated by average deviation, sensitivity, and precision, counting predictions within 10 pixels of the true position as true positives and those farther away as false positives. 
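A minimal sketch of this landmark identification step, assuming binary masks from the segmentation model and using standard OpenCV routines, might look as follows. The helper function names are ours, and handling of degenerate cases is simplified; this is not the authors' exact implementation:

```python
# Simplified sketch of landmark identification from segmentation masks.
import cv2
import numpy as np

def _cross2(u, v):
    # z-component of the 2D cross product
    return u[0] * v[1] - u[1] * v[0]

def _mask_points(mask):
    # Outer contours of a binary mask, stacked into an (N, 2) point array.
    contours, _ = cv2.findContours(mask.astype(np.uint8),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    return np.vstack([c.reshape(-1, 2) for c in contours])

def pupil_center(pupil_mask):
    # Fit an ellipse to the pupil contour and return its center (x, y).
    points = _mask_points(pupil_mask)
    if len(points) < 5:                        # cv2.fitEllipse needs >= 5 points
        return None
    (cx, cy), _, _ = cv2.fitEllipse(points)
    return np.array([cx, cy])

def tool_tip(tool_mask, pupil_xy):
    # Fit a line to the tool mask; the tip is the intersection of that line
    # with the convex hull that lies nearest the pupil center.
    points = _mask_points(tool_mask)
    hull = cv2.convexHull(points).reshape(-1, 2).astype(float)
    vx, vy, x0, y0 = cv2.fitLine(points.astype(np.float32),
                                 cv2.DIST_L2, 0, 0.01, 0.01).flatten()
    p0, d = np.array([x0, y0]), np.array([vx, vy])

    candidates = []
    for a, b in zip(hull, np.roll(hull, -1, axis=0)):   # edges of the hull
        denom = _cross2(b - a, d)
        if abs(denom) < 1e-9:                           # edge parallel to the line
            continue
        s = _cross2(p0 - a, d) / denom
        if 0.0 <= s <= 1.0:                             # line crosses this edge
            candidates.append(a + s * (b - a))
    if not candidates:
        return None
    return min(candidates, key=lambda p: np.linalg.norm(p - pupil_xy))
```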
Tool Usage Metrics
Using the predicted landmarks of tool tip locations and pupil centers, we calculated the following three objective metrics: (1) total path length: the cumulative path length of the landmark location, measured in pixels; (2) maximal speed: the maximal instantaneous speed of the landmark location, measured in pixels per frame; and (3) covered area: the percentage of area traveled through by the landmark location. The landmark location could be the center of the pupil or the tip of an instrument. 
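A minimal sketch of these metrics, assuming per-frame (x, y) landmark coordinates sampled at 1 frame per second, is given below. The grid-based approximation of the covered area and the function name are our assumptions, not the authors' definition:

```python
# Illustrative sketch of the three tool usage metrics (not the authors' code).
import numpy as np

def motion_metrics(positions, frame_shape=(256, 456), grid=32):
    """positions: (T, 2) array of per-frame (x, y) landmark coordinates."""
    positions = np.asarray(positions, dtype=float)
    steps = np.linalg.norm(np.diff(positions, axis=0), axis=1)
    total_path_length = steps.sum()                    # pixels
    max_speed = steps.max() if steps.size else 0.0     # pixels per frame

    # Covered area, approximated here as the fraction of coarse grid cells visited.
    h, w = frame_shape
    rows = np.clip((positions[:, 1] / h * grid).astype(int), 0, grid - 1)
    cols = np.clip((positions[:, 0] / w * grid).astype(int), 0, grid - 1)
    covered_area = len(set(zip(rows.tolist(), cols.tolist()))) / float(grid * grid)
    return total_path_length, max_speed, covered_area
```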
Surgical Performance Correlation
For a proof-of-concept study correlating automated metrics with human expert ratings of surgical skill, we focused on the six items of a subjective surgical skill assessment tool, the Objective Structured Assessment of Cataract Surgical Skill (OSACSS),23 that are relevant to phacoemulsification: (1) insertion into the eye of the probe and second instrument, (2) effective use and stability of the probe and second instrument, (3) nucleus sculpting or primary chop, (4) nucleus rotation and manipulation, (5) cracking or chopping with safe phacoemulsification of segments, and (6) eye positioned centrally within the microscopic view. Using these 6 items, 3 independent board-certified, anterior segment fellowship-trained, attending-level ophthalmologists rated the 10 phacoemulsification video clips on a scale from 1 to 5, with 5 indicating the highest surgical skill. The scores for each video were averaged over the 6 items and the 3 raters, with an overall intraclass correlation of 0.79. The correlation of the human expert scores with the three derived motion metrics calculated on the phacoemulsification video clips was examined using Spearman rank correlation coefficients. 
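For illustration, the correlation analysis for a single metric could be computed as in the following sketch; the numeric values are hypothetical placeholders, not study data:

```python
# Illustrative sketch: Spearman rank correlation between a per-clip motion metric
# and the averaged expert rating (all values below are hypothetical placeholders).
from scipy.stats import spearmanr

path_lengths = [4200.0, 3100.0, 5150.0, 2800.0, 3600.0,
                4900.0, 2500.0, 3900.0, 4500.0, 3300.0]   # one value per clip
expert_scores = [2.5, 3.8, 2.0, 4.2, 3.3, 2.2, 4.5, 3.0, 2.7, 3.6]

rho, p_value = spearmanr(path_lengths, expert_scores)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
```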
Results
Examples of the predicted masks and the identified landmark for each class are illustrated in Figure 2. The model performance on object detection and segmentation is summarized in Table 1. The model achieved mAP and IoU across different classes of objects of 0.78 and 0.82, respectively, in the test set. The segmentation performed best for the blade, Weck sponge, and phacoemulsification probe classes of instruments, whereas performance for the needle or cannula class was the worst. The sensitivity, precision, and average deviation of the estimated phaco tip position, second instrument tip position, and pupil center position from ground-truth positions are summarized in Table 2. In the phacoemulsification test videos, the accuracy of the landmark identification algorithm was highest for the pupil center, followed by the phacoemulsification probe and the second instrument. We found that the average size of the instruments had Pearson's correlation coefficients of 0.63 with IoU and 0.75 with mAP. 
Figure 2.
Examples of predicted segmentation masks, tip positions for cataract surgical tools, and pupil centers. The yellow regions represent masks predicted by the YOLACT model for the object class. Red crosses indicate the point localized by the landmark identification algorithm.
Table 1.
 
Per-Class Mean Average Precision of the Bounding Boxes and Intersection-Over-Union of the Segmentation Masks in the Held-Out Test Images
Table 2.
 
Sensitivity, Precision, and Average Deviation of Predicted Tip Locations in Held-Out Phacoemulsification Test Videos* and Their Correlation of Derived Motion Metrics With Human-Rated Performance Scores
The Spearman correlation coefficients between the three derived motion metrics and human expert-rated performance are also shown in Table 2. All metrics, including maximal speed, total path length, and total area covered, demonstrated the expected negative correlation with human-rated performance scores, with total path length and covered area having larger magnitudes of correlation. Maximal speed had a weaker correlation with performance. 
Figure 3 depicts an overlay of the predicted and true trajectories of the phaco probe tip in a randomly selected phacoemulsification segment from a test video. The predicted and true trajectories closely followed each other, demonstrating the possibility of using the predicted trajectory as a proxy for the true trajectory. An example video showing overlaid masks and model predictions is included in the Supplementary Material. 
Figure 3.
 
An example of predicted trajectory and the true trajectory of the phacoemulsification probe tip. Green and red lines indicate the predicted and true trajectory of the phacoemulsification probe tip from a randomly selected 50-second clip from the test videos. Coordinates are plotted every 0.5 seconds and lighter colors represent earlier frames.
Discussion
In this study, we present a deep-learning-based method to generate automated multi-tool, multi-step, and multi-metric assessment of phacoemulsification surgical skills. Our results demonstrate high accuracy in tool detection and segmentation, and we showed that the segmentation model outputs provide a valuable basis for deriving useful motion metrics about surgeons’ skills. We developed ellipse- and line-fitting algorithms, suited to ophthalmic instruments, to estimate pupil and tip positions, respectively. Furthermore, the fast computational speed of the model offers an advantage over other segmentation models. This real-time prediction enables more powerful future applications, such as context-aware computer-assisted surgery. 
Semantic segmentation performed more poorly on thinner instruments, such as second instruments and needles or cannulas. Similar patterns have been observed in previous studies.15,24 A likely cause is the greater difficulty for a predicted mask to overlap with a ground truth mask when an instrument covers so few pixels. Nonetheless, the overall IoU of our model exceeds that reported for Mask R-CNN.15 Ni et al. used a model that learns pyramid attention features to address the illumination issues that cause prediction errors and reported better semantic segmentation accuracy.24 However, their dataset was collected from a single center and potentially contains less heterogeneous images. 
A previous surgical tool tip identification algorithm was developed for da Vinci robotic thyroid surgery.25 Its authors applied a skeletonization algorithm that draws a skeleton of the mask using connected lines and detects the tip as the point on the skeleton with the longest cumulative distance from a starting point near the edge. We developed a simpler algorithm that directly fits a line to the overall shape of the mask and finds the point nearest the pupil center where the fitted line intersects the enclosing edge. This is possible because of the thin and straight nature of cataract surgical instruments, for which a line is presumably enough to represent the “skeleton.” Wider instruments, such as Weck sponges, might not achieve high accuracy with this approach, but it may also be less important to detect their tip locations accurately, because sponges are soft tools that do not enter the eye and would not damage it the way sharp instruments can. On the other hand, forceps, whose two arms usually yield two segmentation masks, would require two fitted lines to identify the two tips. 
Using segmentation masks to extract tip locations has advantages over previous studies that obtained tip locations either by crowdsourcing8 or by directly predicting the locations of a single type of instrument.10,26 With the segmentation masks, our model can estimate the locations of multiple types of instruments that may be simultaneously present in a frame, and it can also infer the rotational movement of instruments by considering the orientation of the masks. We can also leverage the information from the pupils' masks to discover the interaction patterns between surgical instruments and pupils. 
The ability of our system to simultaneously identify different tools across multiple steps and rate surgical performance provides greater generalizability and easier use in actual surgical training. Previously, Balal et al. used motion tracking software that tracks stable feature points to identify the movement of surgical instruments.27 However, each step of surgery needed separate analysis because the software does not automatically identify which instrument is being used. Similar limitations exist in other systems developed to track cataract surgical instruments. Morita et al. built a deep learning system that identifies the tip of the forceps during capsulorhexis in order to detect certain surgical complications,26 and Tian et al. built a phacoemulsification probe tracking system using a tracking-learning-detection scheme.28 In both studies, only the tip of a single instrument in a single step of surgery was labeled and tracked, making generalization to other steps more difficult and likely to require modifications to the model architecture and additional training. In contrast, our model requires only one pass of training for object detection and landmark identification for multiple instruments across all of the core steps of surgery. 
The strength of correlation with OSACSS scores varied across metrics and tools. The covered area demonstrated more consistent correlation than the other metrics, suggesting that stability may be weighted heavily in experts’ ratings. The total path length of the instruments showed a significant correlation with surgical performance ratings, consistent with previous findings in a wet laboratory setting.29 Longer path lengths could represent repeated failure to perform a step, longer time to complete a single step, or inappropriate centering of the pupil leading to a wider range of tool movement. This can be further analyzed at a more granular level using other motion metrics in future studies. For example, a long path length with a small covered area could mean the surgeon spent more time completing a task but kept the pupil relatively fixed around the center. 
Our results have broad translational relevance for surgical training, surgery monitoring, and computer-assisted surgery. For example, surgical trainees could more easily track the progress of relevant motion metrics during their training with the help of an artificial intelligence-powered automated surgical metrics system. Such automated feedback can help residents decide which surgical skills need more practice. This model could also form the backbone of a surgical monitoring system, in which real-time calculation of the metrics could detect abnormal behaviors and alert the operating surgeon. A real-time system developed to guide surgical instruments in cranial surgery and alert surgeons when instruments were close to critical structures was found to reduce the mental demand, effort, and frustration related to surgery.17 
Our study has several limitations. First, the tool tip identification algorithm may be less accurate in certain circumstances, such as for tools with a wide tip or with multiple tips. Customizing the tip identification algorithm for the particular tool type could increase prediction accuracy in these situations. Second, although some videos in our set included complex surgical techniques, such as trypan blue injection and iris manipulation, these were in the minority, and thus the model was focused on commonly used instruments that appear in many frames of the standard steps of surgery; future work could incorporate additional instruments or investigate model generalizability to advanced steps of surgery. Owing to the anonymous data collection process, additional limitations include lack of information on the training year of the surgeon and on patients’ clinical outcomes. Further validation is needed to evaluate the correlation between automated surgical skill assessments and visual outcomes. Finally, a robust evaluation of model performance across racial/ethnic groups with varying iris colors and skin textures would be ideal to avoid the bias of artificial intelligence that has been widely documented previously.30 
In conclusion, deep learning models are capable of accurately predicting the presence and localization of important components of eye anatomy and surgical tools in cataract surgical videos, enabling downstream analysis that utilizes motion metrics to infer a surgeon's surgical performance. 
Acknowledgments
The authors thank Karen Christopher and Ann Shue for their help in rating the surgical performance of the test videos. 
Funded by the McCormick Gabilan Fellowship from Stanford University, NEI 1K23EY03263501, a Career Development Award from Research to Prevent Blindness, unrestricted departmental grants from Research to Prevent Blindness, and the National Eye Institute P30-EY026877. This project utilized credits for Google Cloud Platform for computing. The funders had no role in the study design, conduct, or decision to publish. 
Disclosure: H.-H. Yeh, None; A.M. Jain, None; O. Fox, None; K. Sebov, None; S.Y. Wang, None 
References
Hashemi H, Pakzad R, Yekta A, et al. Global and regional prevalence of age-related cataract: A comprehensive systematic review and meta-analysis. Eye. 2020; 34: 1357–1370. [CrossRef] [PubMed]
Cullen KA, Hall MJ, Golosinskiy A. Ambulatory surgery in the United States, 2006. Natl Health Stat Report. 2009; 11: 1–25.
Terveen D, Berdahl J, Dhariwal M, Meng Q. Real-world cataract surgery complications and secondary interventions incidence rates: An analysis of US medicare claims database. J Ophthalmol. 2022; 2022: 8653476. [CrossRef] [PubMed]
McDonnell PJ, Kirwan TJ, Brinton GS, et al. Perceptions of recent ophthalmology residency graduates regarding preparation for practice. Ophthalmology. 2007; 114: 387–391. [CrossRef] [PubMed]
Funke I, Mees ST, Weitz J, Speidel S. Video-based surgical skill assessment using 3D convolutional neural networks. Int J Comput Assist Radiol Surg. 2019; 14: 1217–1225. [CrossRef] [PubMed]
Ghasemloonia A, Maddahi Y, Zareinia K, Lama S, Dort JC, Sutherland GR. Surgical skill assessment using motion quality and smoothness. J Surg Educ. 2017; 74: 295–305. [CrossRef] [PubMed]
Zhu J, Luo J, Soh JM, Khalifa YM. A computer vision-based approach to grade simulated cataract surgeries. Mach Vis Appl. 2015; 26: 115–125. [CrossRef]
Kim TS, O'Brien M, Zafar S, Hager GD, Sikder S, Vedula SS. Objective assessment of intraoperative technical skill in capsulorhexis using videos of cataract surgery. Int J Comput Assist Radiol Surg. 2019; 14: 1097–1105. [CrossRef] [PubMed]
Zafar S, Vedula S, Sikder S. Objective assessment of technical skill targeted to time in cataract surgery. J Cataract Refract Surg. 2020; 46: 705–709. [CrossRef] [PubMed]
Hira S, Singh D, Kim TS, et al. Video-based assessment of intraoperative surgical skill. Int J Comput Assist Radiol Surg. 2022; 17(10): 1801–1811. [CrossRef] [PubMed]
Wang T, Xia J, Li R, et al. Intelligent cataract surgery supervision and evaluation via deep learning. Int J Surg. 2022; 104: 106740. [CrossRef] [PubMed]
Tabuchi H, Morita S, Miki M, Deguchi H, Kamiura N. Real-time artificial intelligence evaluation of cataract surgery: A preliminary study on demonstration experiment. Taiwan J Ophthalmol. 2022; 12: 147–154. [CrossRef] [PubMed]
Yeh H-H, Jain AM, Fox O, Wang SY. PhacoTrainer: A multicenter study of deep learning for activity recognition in cataract surgical videos. Transl Vis Sci Technol. 2021; 10: 23. [CrossRef]
Hajj HA, Lamard M, Charrière K, Cochener B, Quellec G. Surgical tool detection in cataract surgery videos through multi-image fusion inside a convolutional neural network. Presented at the 2017 39th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), 2017. pp. 2002–2005.
Fox M, Taschwer M, Schoeffmann K. Pixel-based tool segmentation in cataract surgery videos with mask R-CNN. Presented at the 2020 IEEE 33rd International Symposium on Computer-Based Medical Systems (CBMS), 2020. pp. 565–568.
Matton N, Qalieh A, Zhang Y, et al. Analysis of cataract surgery instrument identification performance of convolutional and recurrent neural network ensembles leveraging BigCat. Transl Vis Sci Technol. 2022; 11: 1. [CrossRef] [PubMed]
Dixon BJ, Daly MJ, Chan H, Vescan A, Witterick IJ, Irish JC. Augmented real-time navigation with critical structure proximity alerts for endoscopic skull base surgery. Laryngoscope. 2014; 124: 853–859. [CrossRef] [PubMed]
Dutta AA, Zisserman A. The VIA annotation software for images, audio and video. Presented at the Proceedings of the 27th ACM International Conference on Multimedia, 2019. p. 4.
Grammatikopoulou M, Flouty E, Kadkhodamohammadi A, et al. CaDIS: Cataract dataset for surgical RGB-image segmentation. Med Image Anal. 2021; 71: 102053. [CrossRef] [PubMed]
Grammatikopoulou M, Flouty E, Kadkhodamohammadi A, et al. CaDIS: Cataract dataset for image segmentation. arXiv. 2019. Available at: https://arxiv.org/abs/1906.11586.
Bolya D, Zhou C, Xiao F, Lee YJ. YOLACT: Real-time instance segmentation. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019. pp. 9157–9166. Available at: https://arxiv.org/abs/1904.02689.
Saleh GM, Gauba V, Mitra A, Litwin AS, Chung AK, Benjamin L. Objective structured assessment of cataract surgical skill. Arch Ophthalmol. 2007; 125: 363–366. [CrossRef] [PubMed]
Ni Z-L, Zhou X-H, Wang G-A, et al. SurgiNet: Pyramid attention aggregation and class-wise self-distillation for surgical instrument segmentation. Med Image Anal. 2022; 76: 102310. [CrossRef] [PubMed]
Lee D, Yu HW, Kwon H, Kong H-J, Lee KE, Kim HC. Evaluation of surgical skills during robotic surgery by deep learning-based multiple surgical instrument tracking in training and actual operations. J Clin Med. 2020; 9: 1964. [CrossRef] [PubMed]
Morita S, Tabuchi H, Masumoto H, Tanabe H, Kamiura N. Real-time surgical problem detection and instrument tracking in cataract surgery. J Clin Med. 2020; 9: 3896.
Balal S, Smith P, Bader T, et al. Computer analysis of individual cataract surgery segments in the operating room. Eye (Lond). 2019; 33: 313–319. [CrossRef] [PubMed]
Tian S, Yin X-C, Wang Z-B, Zhou F, Hao H-W. A video-based intelligent recognition and decision system for the phacoemulsification cataract surgery. Comput Math Methods Med. 2015; 2015: 202934. [CrossRef]
Saleh GM, Gauba V, Sim D, Lindfield D, Borhani M, Ghoussayni S. Motion analysis as a tool for the evaluation of oculoplastic surgical skill. Arch Ophthalmol. 2008; 126: 213–216. [CrossRef] [PubMed]
Guo LN, Lee MS, Kassamali B, Mita C , Nambudiri VE. Bias in, bias out: Underreporting and underrepresentation of diverse skin types in machine learning research for skin cancer detection-A scoping review. J Am Acad Dermatol. 2022; 87: 157–159. [CrossRef] [PubMed]