Reliability might be the most important prerequisite for defining a surrogate marker for patient assessment and future interventional clinical trials. However, low interreader agreement was found for the detection of EZ loss and RPE loss. Reliability of the size and location of both feature annotations was distinctly higher, although the ICC did not reach the levels of previously published data (0.75 for RPE loss).45 However, the latter study used another OCT device (Spectralis HRA-OCT; Heidelberg Engineering, Heidelberg, Germany) that might have yielded better image quality. Some of the differences between readers might be due to inaccurate delineation of lesion borders, since loss and attenuation of the RPE and/or EZ might merge (Fig. 1). Interestingly, the average relative difference between two readers for RPE loss was reported as 72.4, significantly higher than the CV (44.8) in our study, although both measures are thought to be independent of lesion size. Concerning HRF, the variable counts might derive from the size of the feature: readers might simply have overlooked small features, leading to no more than moderate reliability (Fig. 1). As features with low interrater agreement might be inherently difficult for humans to detect and quantify on OCT images, their utility as surrogate markers in clinical studies is limited. In this context, automated artificial intelligence–based feature detection is likely to be more consistent and precise than human graders.24,46,47
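To illustrate the two agreement measures compared above, the following minimal Python sketch computes the coefficient of variation (CV) and the relative difference between two readers for paired lesion measurements. The reader values are hypothetical and the exact definitions used in the cited study may differ; this is only a sketch under the common definitions (both normalized by the pair mean, hence independent of lesion size).

```python
import numpy as np

# Hypothetical RPE-loss area measurements (mm^2) from two readers
# for the same set of eyes; values are illustrative only.
reader_a = np.array([1.20, 0.85, 2.40, 0.50, 1.75])
reader_b = np.array([1.05, 1.10, 2.10, 0.80, 1.60])

pair_means = (reader_a + reader_b) / 2

# CV per eye: sample SD of the two readings divided by their mean,
# expressed as a percentage.
cv = np.std([reader_a, reader_b], axis=0, ddof=1) / pair_means * 100

# Relative difference per eye: absolute difference divided by the mean.
rel_diff = np.abs(reader_a - reader_b) / pair_means * 100

print(f"mean CV: {cv.mean():.1f}%")
print(f"mean relative difference: {rel_diff.mean():.1f}%")
```

Note that for exactly two readings these definitions differ only by a constant factor (the relative difference equals √2 times the CV), so a direct numeric comparison between studies presupposes knowing which definition each study used.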
The application of deep learning and, more broadly, ML might be a way forward in realizing the utility of these potential surrogate markers. However, ML algorithms are trained, and their performance is judged, against the human "gold standard,"48 which, if unreliable, may be problematic. Several approaches attempt to address this problem: (1) Prerequisites for reliable gradings are precise definitions and grading protocols, as well as proper annotation platforms (or software environments). (2) Training an ML algorithm on gradings from multiple graders could converge these gradings toward an average grader, which would mitigate part of the subjectivity (a minimal sketch of approaches 2 and 3 follows this list).49
(3) A consensus grading (e.g., from a consensus meeting, by averaging gradings, or by adjudicating inconsistencies) might be considered "superhuman" (i.e., better than any single grader). This superhuman grading could be used to develop a model that produces results of the same quality.50
(4) The use of additional data (e.g., other modalities or follow-up images) may allow for improved grading.51
(5) By using super-quality imaging (e.g., higher-resolution OCT), more reliable gradings might be obtained, which could then be transferred to standard-quality imaging for model development.52
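The sketch below makes approaches (2) and (3) concrete. The data layout and mask shapes are our assumptions, not taken from the cited works: per-pixel binary annotations from several graders are averaged into a soft training target (approach 2), and a simple majority vote yields an adjudicated consensus mask (approach 3).

```python
import numpy as np

# Hypothetical binary lesion masks (1 = RPE loss) from three graders
# for one OCT B-scan; shapes and values are illustrative only.
rng = np.random.default_rng(0)
grader_masks = rng.integers(0, 2, size=(3, 496, 512))  # (graders, H, W)

# Approach (2): average the gradings into a soft label in [0, 1].
# Training a model on these soft targets pulls it toward an
# "average grader" and dampens individual subjectivity.
soft_label = grader_masks.mean(axis=0)

# Approach (3): consensus by majority vote (at least 2 of 3 graders),
# which can serve as an adjudicated reference standard.
consensus = (grader_masks.sum(axis=0) >= 2).astype(np.uint8)

print(soft_label.shape, consensus.shape)
```

In practice, a consensus meeting or formal adjudication would replace the simple majority vote shown here, but the principle of merging multiple gradings into a single reference standard is the same.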
Moreover, ML is likely to be the only way to quantitate the large volumes of dense OCT raster scans being generated in clinical trial reading centers, busy clinical practices, and emerging home/remote OCT devices.53