We collected 298 resident cataract surgical videos routinely recorded during the residency training of 12 surgeons across 6 different sites. All data were de-identified, and this research was deemed exempt by the Stanford University Institutional Review Board. The original videos were downsampled to a resolution of 456 × 256. After removing incompletely recorded videos, duplicate recordings, and videos of combined surgeries, a total of 268 videos were included in the final dataset.
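As an illustration of this preprocessing step, the sketch below shows one way the 456 × 256 downsampling could be performed with OpenCV; the file paths, codec choice, and use of cv2 for re-encoding are assumptions for demonstration, not the exact pipeline used in this study.

```python
import cv2

def downsample_video(src_path: str, dst_path: str, size=(456, 256)) -> None:
    """Re-encode a surgical video at a reduced resolution (illustrative sketch)."""
    cap = cv2.VideoCapture(src_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    # mp4v is a commonly available codec; the codec actually used is not specified in the text.
    writer = cv2.VideoWriter(dst_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, size)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        writer.write(cv2.resize(frame, size, interpolation=cv2.INTER_AREA))
    cap.release()
    writer.release()
```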
A team of four trained annotators used VGG Image Annotator software18 to manually label the start and end times for nine core steps (create wounds, inject anterior chamber, continuous curvilinear capsulorhexis, hydrodissection, phacoemulsification, irrigation/aspiration, inject lens, remove viscoelastic, and close wounds), as previously described.13 Individual frames were extracted from the videos using OpenCV (version 4.1.2)19 at a frame capture rate of 1 frame per second, yielding a total of 457,171 extracted frames. A total of 1156 frames were randomly sampled from these nine core steps of cataract surgery, and the masks of the 8 different classes of surgical tools (blade, forceps, second instrument, Weck sponge, needle or cannula, phaco, irrigation/aspiration handpiece, and lens injector), the pupil border, and the limbus were manually annotated.
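For reference, a minimal sketch of 1-frame-per-second extraction with OpenCV is shown below; the sampling-by-FPS logic and output file naming are assumptions for illustration rather than the exact script used.

```python
import cv2
from pathlib import Path

def extract_frames(video_path: str, out_dir: str) -> int:
    """Save roughly one frame per second of a video as PNG images (illustrative sketch)."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if FPS metadata is missing
    step = max(int(round(fps)), 1)           # keep every `step`-th frame => ~1 frame per second
    saved, index = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            cv2.imwrite(f"{out_dir}/frame_{saved:06d}.png", frame)
            saved += 1
        index += 1
    cap.release()
    return saved
```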
The labeled frames were separated by video into train, validation, and test sets, resulting in 943, 108, and 105 images, respectively. Because of the relatively small size of our training set, we pretrained our model on a public cataract surgery dataset, the Cataract Dataset for Image Segmentation (CaDIS),20 which contains 3550, 534, and 586 frames in the train, validation, and test sets. Among the 29 instrument classes in the CaDIS dataset, we kept only the relevant 8 instrument classes. For anatomic structure annotation, we used only the pupil and limbus.
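One way to restrict the CaDIS annotations to the classes used here is a simple label remapping applied before pretraining. The sketch below is illustrative only; the CaDIS label IDs and the class ordering shown are placeholders, not the mapping actually used.

```python
import numpy as np

# Hypothetical mapping from CaDIS label IDs to this study's reduced label set.
# The real CaDIS IDs and the 8 retained instrument classes must be taken from
# the CaDIS label definitions; the IDs below are placeholders.
CADIS_TO_STUDY = {
    4: 1,   # e.g. pupil            -> pupil
    5: 2,   # e.g. limbus           -> limbus
    10: 3,  # e.g. phaco handpiece  -> phaco
    # ... remaining relevant instrument classes ...
}
BACKGROUND = 0  # everything else treated as background

def remap_mask(cadis_mask: np.ndarray) -> np.ndarray:
    """Map a CaDIS segmentation mask onto the reduced label set (illustrative sketch)."""
    out = np.full_like(cadis_mask, BACKGROUND)
    for cadis_id, study_id in CADIS_TO_STUDY.items():
        out[cadis_mask == cadis_id] = study_id
    return out
```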
We prepared an additional test dataset consisting of the phacoemulsification portions of 10 randomly selected videos from our test set for the downstream landmark identification algorithms and surgical skill rating, because surgical skill rating and tool usage metrics (path length, velocity, and area covered, detailed below) require evaluation on continuous video clips rather than the randomly sampled frames labeled previously. The video segments were downsampled to 1 frame per second, resulting in a total of 5853 frames on which to evaluate landmark identification performance. Four trained annotators manually labeled the coordinates of the phacoemulsifier tips, second instrument tips, and pupil centers.
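To make these downstream metrics concrete, the sketch below shows one plausible way path length, velocity, and area covered could be computed from the annotated tip coordinates at 1 frame per second. The exact definitions are given later in the text, so the formulas here (summed Euclidean displacement, mean per-second displacement, and convex hull area) are assumptions for illustration only.

```python
import numpy as np
from scipy.spatial import ConvexHull

def tool_usage_metrics(points: np.ndarray, fps: float = 1.0) -> dict:
    """Illustrative tool-usage metrics from an (N, 2) array of tip coordinates in pixels.

    Assumes path length = summed Euclidean displacement, velocity = mean
    displacement per second, and area covered = convex hull area of the track.
    """
    steps = np.linalg.norm(np.diff(points, axis=0), axis=1)  # frame-to-frame displacement
    path_length = float(steps.sum())
    mean_velocity = float(steps.mean() * fps)                # pixels per second at 1 fps
    # ConvexHull.volume is the enclosed area for 2-D input (".area" would be the perimeter).
    area_covered = float(ConvexHull(points).volume) if len(points) >= 3 else 0.0
    return {"path_length": path_length,
            "mean_velocity": mean_velocity,
            "area_covered": area_covered}
```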