Properly curated data sets are extremely valuable. Why? Because much work must be done to create a data set that can serve as a “laboratory” for hypothesis testing and outcome prediction. Raw data from an electronic health record, for example, must be categorized according to relevant attributes (e.g. height, weight, age, gender, blood pressure, visual acuity, intraocular pressure, cup/disc ratio, central foveal thickness, sensitivity to the size III test target within 10 degrees of fixation). Once the data are so structured, inconsistencies, such as missing values and incorrect entries (e.g. a physiologically impossible temperature of 4000 degrees Fahrenheit or an intraocular pressure of -100 mm Hg), must be identified and rectified (“data cleaning”) as a prelude to data analysis. These curated data sets are valuable because they can be interrogated for a variety of purposes, not just those intended by the scientists who assembled and curated them.4,5 The results of phase III randomized clinical trials comprise a valuable data set that, if made publicly available, can accelerate hypothesis testing regarding disease pathogenesis (e.g. by analyzing the genetic background of enrolled patients)6 or can be used to identify subgroups of patients who are exceptionally resistant or responsive to therapy7 or who are at high risk for severe complications of therapy (e.g. cerebrovascular accidents).8,9 The availability of phase III trial data also affords independent investigators an opportunity to reproduce the results reported by the original investigative team.
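The data-cleaning step described above, flagging physiologically impossible values such as a 4000-degree temperature or a negative intraocular pressure, can be sketched as a simple range check. This is a minimal illustration rather than any particular group's pipeline; the field names and the plausible ranges are assumptions chosen for the example.

```python
# Minimal sketch of range-based data cleaning.
# Field names and plausible ranges below are hypothetical, for illustration only.
PLAUSIBLE_RANGES = {
    "temperature_f": (90.0, 110.0),   # body temperature, degrees Fahrenheit
    "iop_mmhg": (0.0, 70.0),          # intraocular pressure, mm Hg
    "cup_disc_ratio": (0.0, 1.0),     # dimensionless ratio
}

def clean_record(record):
    """Replace missing or physiologically impossible values with None."""
    cleaned = {}
    for field, value in record.items():
        low, high = PLAUSIBLE_RANGES.get(field, (float("-inf"), float("inf")))
        if value is None or not (low <= value <= high):
            cleaned[field] = None   # flag for review rather than silently keep
        else:
            cleaned[field] = value
    return cleaned

record = {"temperature_f": 4000.0, "iop_mmhg": -100.0, "cup_disc_ratio": 0.4}
print(clean_record(record))
# the impossible temperature and pressure are flagged; the cup/disc ratio is kept
```

In practice, flagged values would be routed to manual review or imputation rather than simply discarded, but the principle is the same: structure the data first, then test each attribute against its physiologic limits.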
Unfortunately, investigators often cannot obtain access to high-quality data sets in their area of interest. In part, this obstacle may arise because the scientists who generate data sets may feel there is little incentive to share them. Furthermore, when data sets are available, they may be poorly annotated or inconsistently organized. If data sets for similar disease processes are structured consistently, however, they can be combined for subsequent analysis. We believe that vision science will benefit from harmonized methods of data representation. Properly annotated data sets can also be used to develop and validate new artificial intelligence algorithms.
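The value of consistent structure can be illustrated with a small sketch: two hypothetical data sets from different sites that conform to a shared schema can be pooled directly for joint analysis. The schema and field names here are assumptions for illustration, not a proposed standard.

```python
# Hypothetical shared schema; any harmonized set of attribute names would do.
SCHEMA = {"patient_id", "age", "iop_mmhg", "cup_disc_ratio"}

site_a = [
    {"patient_id": "A-001", "age": 61, "iop_mmhg": 24.0, "cup_disc_ratio": 0.7},
]
site_b = [
    {"patient_id": "B-001", "age": 58, "iop_mmhg": 16.0, "cup_disc_ratio": 0.4},
]

def combine(*datasets):
    """Pool data sets after verifying each row conforms to the shared schema."""
    pooled = []
    for dataset in datasets:
        for row in dataset:
            if set(row) != SCHEMA:
                raise ValueError(f"row does not match schema: {row}")
            pooled.append(row)
    return pooled

pooled = combine(site_a, site_b)
print(len(pooled))  # 2 records, now analyzable as one data set
```

Without a shared schema, each pairwise combination of data sets requires bespoke mapping work; with one, pooling reduces to a conformance check.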