Previous studies using photography to grade eyelids for trachoma have shown reasonably good correlation with field grading [4,7–9]. However, some studies have had difficulties with high rates of ungradable images, ranging from 11% to 78% [5,9,10]. Commonly reported factors contributing to images being ungradable included improper focus on the grading area, inadequate coverage of the grading area, excess light reflection, or shadows obscuring the grading area. Quality assessment of photographs at an early stage in photographer training and surveying may help prevent data loss and wasted effort in the field, but there is no standardized assessment for quality of images of the tarsal conjunctiva. In this study, we defined a series of quality metrics to use when grading the quality of images of the everted upper eyelid. This detailed assessment was time-consuming, requiring almost 5 minutes per image to determine the degree of quality for each metric. A simplified overall quality grading scheme was then developed, which was much easier to use and had reasonable inter-grader agreement. Such a scheme could be rapidly deployed to measure the quality of images and provide feedback to photographers. The more detailed scheme might be used where feedback on the reasons for poor quality is needed. If the grade of images is the primary end point for a survey or a research study, then quality assessment can be built in at the outset through review of images by the photograph graders. The simple method for quality assessment could also be taught to photography supervisors, who could review a sample of images each day during a survey to provide feedback. The timing of review is critical for the training utility of assessing quality: a review performed near the end of the survey that finds the quality lacking comes too late to be helpful.
We applied the detailed metric of quality assessment in the context of a previous survey where we had both field and image grades for TF to determine if there was a particular feature of the image that might explain the mismatch in grades between the field and image grade for TF. The rate of ungradable images was very low, 0.2% of eyes. Overall, the rate of mismatch eyes was very low as well, 177 (3.3%) of 5417 images. There was generally good agreement on the absence of TF by both field and image graders. However, in the possible presence of TF, we found that the rate of assigning a grade of TF was not equal between the field grader and the image graders in the same eyes. As Figure 2 shows, in the 333 total eyes where the field grader found TF, 39% (130) of those images were not called TF by photograph graders. In contrast, of the 250 total eyes where the photograph graders found TF, only 19% (47) were not also called TF in the field.
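The arithmetic behind these rates can be checked with a short sketch. The counts are those reported above; the variable names and the `pct` helper are illustrative assumptions, not part of the study's analysis code.

```python
# Counts as reported in the text; names are hypothetical.
total_images = 5417
mismatched = 177
field_tf = 333         # eyes called TF by the field grader
field_only_tf = 130    # of those, not called TF by the photograph graders
image_tf = 250         # eyes called TF by the photograph graders
image_only_tf = 47     # of those, not called TF in the field


def pct(part, whole):
    """Percentage of `whole` represented by `part`, to one decimal place."""
    return round(100 * part / whole, 1)


mismatch_rate = pct(mismatched, total_images)
field_only_rate = pct(field_only_tf, field_tf)
image_only_rate = pct(image_only_tf, image_tf)

# Internal consistency: eyes called TF by both graders should be the same
# number whichever grader's total we start from, and the two one-sided
# groups should add up to the total number of mismatched eyes.
both_from_field = field_tf - field_only_tf
both_from_image = image_tf - image_only_tf

print(mismatch_rate, field_only_rate, image_only_rate)
print(both_from_field, both_from_image, field_only_tf + image_only_tf)
```

The two routes to the number of eyes called TF by both graders agree, and the two one-sided groups sum to the 177 mismatched eyes, so the reported figures are mutually consistent.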
We sought to determine the role of image quality as a reason for the mismatch between field and image grades. Because the possible presence of TF may be a confounder in assessing quality, we stratified the random sample of comparator eyes (i.e., eyes where the field and photograph grades for TF matched) by the presence of TF to be certain that close to half of the sample had TF. In fact, the overall image quality was high in both the sample of eyes where the field grades agreed with image grades and the full sample of mismatched eyes. The most common problem overall was blanching, caused by prolonged eversion of the eyelid, which makes ascertainment of follicles difficult. Blanching of greater than 10% of the grading area of the upper eyelid occurred in 7% of the mismatched eyes compared to 5% of the matched eyes, a difference that was not significant. We note that with the low rate of image quality issues, coupled with the small sample of mismatched eyes, we had limited power to detect significant differences in quality. We argue that even had the sample size been larger, it is not clear from our data that quality issues would explain much of the difference found between field and image grades.
However, the presence of inflammation, as graded on images, does appear to explain some of the differences between the matched and mismatched eyes. The matched eyes were more likely to have no or mild inflammation on image review, whereas the mismatched eyes were more likely to have higher-grade inflammation. Inflammation can be severe enough to obscure follicles, which would impact the assessment of the number of follicles, as well as cause encroachment of tissue around the follicle, leading to apparent diminution in size. An analysis of the impact of inflammation on the reasons for the mismatch suggested that, at the least, an effect on the apparent size of the follicles might be an issue. An example of such a problem may be seen in Figure 5, which was called TF by the field grader but not TF by the photograph graders.
Currently, the sign of TI is not included in the assessment of active trachoma, which relies solely on the sign of TF. It is not clear if graders compensate for the presence of TI by lowering the threshold for the size of follicles or by presuming that if three or four follicles are visible, more may be hidden under the inflamed tissue. The impact of inflammation on field grading and image grading needs further discussion because, although inflammation may be a relatively rare sign in general, it was present in close to half of the eyes where mismatch of TF grades occurred. The argument that TI cannot be graded reliably was not borne out in this study, at least for images, and agreement on grading TI in the field has also reportedly been very good [4].
If the field grader had been more likely to compensate for the presence of inflammation than the image graders, we would have expected much higher rates of inflammation in the 130 eyes where only the field grader called TF than in the eyes where only the image graders called TF. However, the rates of grade three inflammation were not different between these two groups. The rate of inflammation in the 130 eyes called TF by only the field grader was 19.2%, compared to 14.9% in the eyes called TF by only the image graders. Although there was some indication of a higher rate in mismatch eyes with field grades of TF, the difference was not large and not statistically significant.
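A comparison of this kind can be sketched as a test on a 2×2 table. The text does not state which test was used, so the two-sided Fisher's exact test below is an assumption. The counts of 25 and 7 inflamed eyes are also assumptions, back-calculated from the reported 19.2% of 130 eyes and 14.9% of 47 eyes.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability that x of the col1 "positive" eyes fall in row 1.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))


# ASSUMED counts, back-calculated from the reported percentages.
field_only_n, field_only_inflamed = 130, 25   # 25/130 is about 19.2%
image_only_n, image_only_inflamed = 47, 7     # 7/47 is about 14.9%

p_value = fisher_exact_two_sided(
    field_only_inflamed, field_only_n - field_only_inflamed,
    image_only_inflamed, image_only_n - image_only_inflamed,
)
print(f"two-sided p = {p_value:.3f}")
```

Under these assumed counts the p-value is well above 0.05, in line with the statement that the difference was not statistically significant.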
Without knowledge of the direction of mismatch (i.e., whether it was the field grader or the photograph graders who had called TF in a mismatched eye), the photograph graders were asked to re-review the 177 mismatched images and speculate on the possible reasons for mismatch. In particular, was it likely that the size of the follicles was an issue, or the number of follicles, or both? If they could not discern an obvious reason, this was also noted. There are obvious limitations to this approach. The determination of the reason for mismatch was made solely by the photograph graders, without access to the thought process of the field grader, and a re-examination of the eye in the field, which might have allowed for reconciliation of the image and field grades, was not possible. The determination was based solely on the image, which, for example, could have had small areas of glare that obscured a follicle. Similarly, variations in follicle shape may have contributed to some ambiguity. Mismatched field and image TF grades may have also represented recording errors in the field. Our data suggested that the problem of determining the number of follicles present was the primary reason the field grader called TF when the image graders did not. It is tempting to assume that the field grade should be the “gold standard,” as it represented review of the actual everted lid that could be assessed from multiple angles, whereas the photograph graders had only the image to assess. However, there is a risk of overcalling TF in situations where the rate of TF is low, as was the case in this survey, so we cannot entirely rule out field grader error [11,12]. The data suggest that a difference in determination of the size of the follicles was likely a more common reason for the photograph graders to call TF when the field grader did not. For instance, the lid flipper in the survey did not have a thumb marker to assist in determining size, whereas photograph graders can standardize size and account for magnification with a ruler. Overall, the exercise of determining the potential reason for mismatch was useful for at least two reasons. First, it points to the difficulty of categorizing borderline cases in a survey, which must be included in any live training of field graders and in the training of image graders [13]. Second, the findings again highlight that when comparing field and photograph grades of the same eye, we should not assume the field grade is the gold standard. In our data, we had photographic evidence of eyes with five follicles of the correct size that were not called TF by the field grader.
There are some limitations to this study, in addition to those noted above. The quality of the images in the survey was overall very good, and very few were ungradable. We have historically used a well-trained photographer using an SLR camera for our surveys and recognize that comparisons using other camera systems in other surveys may not yield the same result. Thus, quality of images may be a more important factor in producing grading mismatches than we were able to discern from our dataset. However, monitoring the quality of images using the simple quality assessment scheme, for which we demonstrated reliable agreement, should enhance image quality in general. The higher rate of mismatches where the field grader called TF may also be a function of the overall low TF prevalence in the survey, where the cases may not be as severe. Where TF prevalence is high and the cases are more florid, there may be less mismatch. Ideally, we would have had multiple field graders for each eye in this study to help clarify the field grade, as we did for the image grade.
In summary, we developed a useful tool to provide a rapid and reliable assessment of the quality of images of the upper tarsal conjunctiva that can be used to monitor photographers in the field. If the image quality is substandard, a more detailed assessment to determine the precise issue and institute re-training can be undertaken. Whereas we initially thought that the quality of the image, or metrics of quality, might be lower in eyes that had mismatched field and image TF grades, that was in fact not the case. A more significant problem was the presence of inflammation in the eye, a physical sign which affects both field and photograph grading. Training of both image and field graders needs to be standardized to either ignore inflammation or provide some accommodation in terms of follicle presence and size. Ideally, reconsideration of including the sign of TI in the determination of active trachoma might be valuable, as it does provide some information on ocular disease in the presence of TF [14] and can be graded reliably.