To date, most published annotated datasets of free-text notes have focused on de-identification. The largest publicly available corpora of EHR notes are annotated for de-identification of PHI and include the i2b2,
28,33 MIMIC-III,
34 and OpenDeID
35 datasets. These datasets contain 1304 office visit notes across many specialties, visit notes for nearly 40,000 patients admitted to the critical care unit, and 2100 pathology reports, respectively. Other large clinical note datasets include a corpus of 3503 visit notes annotated for PHI published by Deleger et al.,
36 though this dataset is not publicly available, and the TREC Medical Records corpora from 2011 and 2012,
37 consisting of 93,351 unannotated visit notes across several specialties. Similarly, annotated datasets exist for family history of disease,
38 words useful in inflammatory bowel disease evaluation,
39 part-of-speech tagging,
40 and comprehensive annotation (e.g., anatomy, drugs and dosages, signs and symptoms)
41; however, none of these datasets is publicly available. Several of the aforementioned datasets have most notably been used to develop de-identification algorithms, particularly for PHI.
27,33,42–44 Additional use cases for these datasets include big data retrospective analyses for risk factors,
45,46 analysis of documentation similarity,
47 and text extraction for fungal endophthalmitis,
48 among others. Our dataset complements existing de-identified datasets and offers a new avenue of exploration with annotated ophthalmic medication data.