Open Access
Artificial Intelligence  |   January 2025
Conjunctival Bulbar Redness Extraction Pipeline for High-Resolution Ocular Surface Photography
Author Affiliations & Notes
  • Philipp Ostheimer
    Institute of Electrical and Biomedical Engineering, UMIT TIROL - Private University for Health Sciences and Health Technology, Hall in Tyrol, Austria
  • Arno Lins
    Department of Research and Development, Occyo GmbH, Innsbruck, Austria
  • Lars Albert Helle
    Institute of Electrical and Biomedical Engineering, UMIT TIROL - Private University for Health Sciences and Health Technology, Hall in Tyrol, Austria
  • Vito Romano
    Department of Medical and Surgical Specialities, Radiological Sciences and Public Health, University of Brescia, Brescia, Italy
  • Bernhard Steger
    Department of Ophthalmology and Optometry, Medical University of Innsbruck, Innsbruck, Austria
  • Marco Augustin
    Department of Research and Development, Occyo GmbH, Innsbruck, Austria
  • Daniel Baumgarten
    Institute of Electrical and Biomedical Engineering, UMIT TIROL - Private University for Health Sciences and Health Technology, Hall in Tyrol, Austria
    Department of Mechatronics, University of Innsbruck, Innsbruck, Austria
  • Correspondence: Philipp Ostheimer, Department of Biomedical Computer Science and Mechatronics, UMIT TIROL, Eduard-Wallnöfer-Zentrum 1, 6060 Hall in Tyrol, Austria. e-mail: [email protected] 
  • Daniel Baumgarten, Department of Mechatronics, University of Innsbruck, Technikerstraße 13, 6020 Innsbruck, Austria. e-mail: [email protected]
Translational Vision Science & Technology January 2025, Vol.14, 6. doi:https://doi.org/10.1167/tvst.14.1.6
Abstract

Purpose: To extract conjunctival bulbar redness from standardized high-resolution ocular surface photographs of a novel imaging system by implementing an image analysis pipeline.

Methods: Data from two trials (healthy subjects; ophthalmic outpatient clinic) were collected, processed, and used to train a machine learning model for ocular surface segmentation. Various regions of interest were defined to globally and locally extract a redness biomarker based on color intensity. The image-based redness scores were correlated with clinical gradings (Efron) for validation.

Results: The model used to determine the regions of interest was verified for segmentation performance, yielding mean intersections over union of 0.9639 (iris) and 0.9731 (ocular surface). All trial data were analyzed and a digital grading scale for the novel imaging system was established. Photographs and redness scores from visits weeks apart showed good feasibility and reproducibility. For scores within the same session, a mean coefficient of variation of 4.09% was observed. A moderate positive Spearman correlation (0.599) was found with clinical grading.

Conclusions: The proposed conjunctival bulbar redness extraction pipeline demonstrates that, by using standardized imaging, a segmentation model, and image-based redness scores, external eye photography can be classified and evaluated. Therefore, it shows the potential to provide eye care professionals with an objective tool to grade ocular redness and facilitate clinical decision-making in a high-throughput manner.

Translational Relevance: This work empowers clinicians and researchers with a high-throughput workflow that combines standardized imaging with an analysis tool based on artificial intelligence to objectively determine an image-based redness score.

Introduction
Red eyes, also known as bulbar conjunctival hyperemia, are a typical indicator of various ocular disorders or conditions. The enlargement of blood vessels leads to increased conjunctival bulbar redness and can be caused by either systemic disease or acute and chronic ocular inflammatory or infectious conditions, including dry eye disease, ocular allergies, infectious keratitis, or increased eye pressure.1,2 In clinical practice and clinical research, various image-based reference scales have been used to grade eye redness since the 1980s. The use of these scales comes with different challenges and pitfalls, such as significant interobserver and intraobserver variation, because of the diversity and interchangeability of the many scales used in the field.3,4 Examples of this diversity range from using artist-rendered illustrations, as in the Efron scale,4 or photographs, as in the digital bulbar redness (DBR) scale,5 to a large variety in the number of grades.3,6,7
In most cases, the grading is done in person or from recordings generated by imaging devices, such as slit-lamp camera systems.8 The problem with grading the recordings is the variation in image quality owing to the vast number of configurations of slit-lamp systems, as well as their operator dependency. Producing standardized, high-quality recordings in terms of reproducibility and operator independence is crucial for digital services driving the transformation of modern health care. These data serve to empower cutting-edge medical image analysis tools, facilitating the training of state-of-the-art artificial intelligence (AI) models leveraging deep convolutional neural networks. The tools can support various aspects of health care, including diagnosis, triage, and detection, ultimately enabling enhanced clinical decision-making. Since 2019, many large review papers have highlighted the recent advances of AI in ophthalmology,9–14 whereas the main focus of the applications has often been on diseases with high incidences. Examples are diabetic retinopathy and age-related macular degeneration, which are clinically assessed using optical coherence tomography and fundus photography.13 These modalities are nowadays clinically standardized, which contributed to the emergence of various tools around them. The lack of standardization for anterior eye imaging technologies may be the cause of the less widespread application of AI for the anterior eye. Nevertheless, there has recently also been a boost in AI algorithms addressing anterior segment diseases and imaging, which are summarized in three recent review articles.11,12,14
Methods described to assess bulbar redness evaluate intensity features, such as color, as well as texture-based features, such as blood vessels.15–18 For this purpose, traditional image processing methods like Canny edge detection filters can be used to detect the edges of blood vessels.18 In contrast, AI applications can be trained for this task. Brea et al. discussed the challenges of feature engineering for bulbar redness, such as class imbalance and correlated features, and trained a multilayer perceptron for this task.16 They also pointed out that relating the redness to the clinical grading scales is challenging owing to their diversity. Another possibility to evaluate the bulbar redness is image classification, as shown in Li et al.'s work.19 They developed a deep learning framework consisting of a segmentation and a classification network for classifying bulbar conjunctival hyperemia severity into four grades by using an approach called mask distillation, which uses prior knowledge of the area of interest. Another example is the work from Wang et al., which recently proposed DeepORedNet, a contrastive learning-based attention-weighted dual channel residual network to classify the redness into four levels.20 In one of our previous works, we developed a baseline pipeline for healthy eye data, where a random forest classifier was used on subsampled images to identify the areas for the extraction of ocular surface redness.21
This work aims to establish an advanced pipeline for the extraction of conjunctival bulbar redness from high-resolution ocular surface photographs of a novel device based on healthy and pathological cases. This new pipeline includes a deep learning model to segment the region of the visible bulbar conjunctiva, with the major advantage of eliminating the need for feature engineering and making it easier to process more complex data, such as eyes with present pathologies. Using the resulting segmentations, different regions can be defined to extract the conjunctival bulbar redness not only globally, but also locally (e.g., nasally and temporally). Additionally, it enables examination of the previously used subsampling approach to obtain even more location-based redness information, such as the variation of redness within one eye. The proposed pipeline can help clinical experts with their decision-making by providing objective redness scores and moving toward a high-throughput assessment of this crucial ocular surface biomarker. The main contributions of this work are: (i) establishing a data analysis pipeline and a dataset comprising more than 500 high-quality images of healthy and pathological cases, recorded with a novel ocular surface photography system22,23; (ii) establishing a segmentation model based on the dataset capable of accurately extracting the ocular surface and the iris from the photographs; (iii) extracting intensity-based redness from the conjunctival regions globally or locally, therewith completing the pipeline; and (iv) thorough validation of the proposed pipeline against clinical redness scores and potential clinical settings for high-throughput redness grading.
Methods
Pipeline for Conjunctival Bulbar Redness Extraction
Our proposed pipeline to extract conjunctival bulbar redness is concisely visualized in Figure 1. For a selected photograph, the steps are as follows: 
  • Predict the segmentation of the iris and the ocular surface using a pretrained image segmentation model (e.g., YOLOv8-seg).
  • Use the resulting prediction information to determine a region of interest (ROI) and extract the redness in this region.
  • Evaluate the resulting redness scores with consideration of clinical interpretability.
Figure 1. Proposed pipeline for extracting the conjunctival bulbar redness, where an image-based redness is determined for different ROIs, which result from using a segmentation model. The redness is evaluated and related to clinical grading.
In the upcoming subsections, the individual parts of the pipeline are described in greater detail. 
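For illustration, the three steps can also be sketched compactly in code. The following is a minimal sketch only; the model weights, file names, and class indices (0 = iris, 1 = ocular surface) are assumptions and not part of the released implementation.

```python
# Minimal sketch of steps 1 and 2: predict iris and ocular surface masks with a
# YOLOv8-seg model and combine them into the scleral ROI (assumed names/classes).
import cv2
import numpy as np
from ultralytics import YOLO

model = YOLO("ocular_surface_yolov8seg.pt")           # hypothetical trained weights
result = model.predict("eye_photo.png")[0]            # single high-resolution photograph

h, w = result.orig_shape                              # original image size (pixels)
masks = result.masks.data.cpu().numpy()               # (n, Hm, Wm) predicted masks
classes = result.boxes.cls.cpu().numpy().astype(int)  # class index per mask

def to_full_size(mask):
    """Resize a low-resolution predicted mask back to the full image size."""
    return cv2.resize(mask.astype(np.uint8), (w, h),
                      interpolation=cv2.INTER_NEAREST).astype(bool)

iris = np.any([to_full_size(m) for m, c in zip(masks, classes) if c == 0], axis=0)
ocular = np.any([to_full_size(m) for m, c in zip(masks, classes) if c == 1], axis=0)
scleral_roi = ocular & ~iris                          # boolean combination of the masks

# Step 3: redness extraction for this ROI follows Equation (4) in Redness Extraction.
print("scleral ROI covers", int(scleral_roi.sum()), "pixels")
```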
Data Collection, Preparation, and Selection
For recording images of the ocular surface, the first-in-human prototype of the Occyo One system (Occyo GmbH, Innsbruck, Austria), shown in the recording step of Figure 1, was used. This device is capable of recording the ocular surface in high resolution with a single shot by matching the eye's curvature to the flat imaging sensor. The visible ocular surface is hereby imaged with a field of view of 21.3 mm in width and 16 mm in height to capture the cornea, sclera, limbus, and tear film entirely in focus.22 For recording in a standardized manner, an eye tracker including a fixation target, a single mode illumination, and an autofocus are implemented in the device.23 Using this device, data were generated in two studies.
Study I (Ophthalmic Outpatient Clinic Trial)
The main aim of this trial was to determine the safety and feasibility of using the prototype while recording multiple images of pathological eyes. The study was approved by the ethics committee of the federal state of Salzburg and complies with the Declaration of Helsinki. Informed written consent was obtained from all subjects. Within 1 year, the pathological eyes of 100 subjects were recorded at the Paracelsus Medical University in Salzburg. This resulted in 1130 recorded images of pathological eyes (approximately 12 recordings for each of the 100 eyes), which were checked for integrity (image artifacts, metadata, case report form, etc.). Eight subjects had to be excluded for different reasons (e.g., withdrawal of the declaration of consent), and 958 recordings of the remaining 92 subjects passed the integrity checks. This includes data from follow-up visits, which were available for 15 of the subjects; two of these subjects were even imaged three times (total: 17 follow-up visits). The mean time to follow-up imaging was 16.3 weeks with a standard deviation of 15.4 weeks (range, 1 day to 47.6 weeks). For the 50 female and 42 male subjects, the average age was 68.24 years with a standard deviation of 14.86 years (range, 27–93 years).
Study II (In-House Trial)
This trial aimed to image healthy eyes (subjects without known prior ocular surface diseases) to compare them with data from the first clinical trial, where only pathological cases were imaged. For this study, a positive vote was obtained from the Research Committee for Scientific Ethical Questions of UMIT TIROL. Informed written consent was obtained from all participants. Both eyes of 30 healthy volunteers were imaged, leading to a total of 369 recordings (approximately 6 recordings for each of the 60 eyes). Only one recording did not pass the integrity check and was excluded. The subjects in this case had an average age of 37 years with a standard deviation of 8.85 years (range, 22–56 years). Among them, 10 were female and 20 were male.
Ocular Surface Diseases
The primary diseases of the imaged eyes from the above-mentioned studies are listed in Table 1 in order of decreasing occurrence and result from image-based grading by a clinical expert. Although 95 eyes showed evidence of an underlying disease, this was not the case for the remaining 57 eyes. Systemic diseases were hereby not used as a screening criterion.
Table 1. Number of Primary Diseases Per Study
Image Selection Approach
Because multiple images are available for each subject, the best images were selected by evaluating the focus in specific areas of interest defined by clinical experts. For more information, see Appendix A.
Define Datasets
From the study datasets, 505 images were selected as described in the Image Selection Approach and randomly split in a person-wise manner to avoid prediction bias. The number of images for each respective study can be seen in Table 2, where a traditional split ratio of 80% to 10% to 10% was considered for the training, validation, and test sets, respectively. To investigate the trade-off between the training effort (number of data, annotation effort, etc.) and the achieved performance, three differently sized datasets (small, 60 images; medium, 110 images; large, 457 images) for training machine learning models and a test dataset were defined from these splits. They are described in greater detail in the following (a code sketch of such a person-wise split follows the dataset descriptions below).
Table 2. Number of Image Splits Per Dataset and Study
  • Small Dataset: From the training sets mentioned in Table 2, 60 images were randomly chosen (30 images per study) as the initial dataset to train a YOLOv8-seg model (described in Segmentation Model - YOLOv8), which was assumed to be a reasonable amount for one day of annotation work by one person. These data were further split into a training and validation dataset with a ratio of 80% to 20% (48 and 12 images) and manually labeled by two observers as described in Image Annotation. The interobserver reliability of the annotations was checked by calculating the intersection over union (IoU) for the iris (0.9489) and the ocular surface (0.9847). Owing to the high agreement between the observers, it was decided to use the labels from one observer and to have upcoming annotation tasks done only by the observer with more expertise in the field.
  • Medium Dataset: The YOLOv8-seg model resulting from training on the small dataset was used to make predictions for the rest of the training and validation data (study I, 265 images; study II, 132 images). The results were visually inspected, and the predictions that were not sufficient (42 images of study I and 8 images of study II) were manually labeled as mentioned before. For the subsequent training, the split of the small dataset was kept, whereas the new data resulting from the visual inspection were split again in a ratio of 80% to 20% per study and added to the training and validation sets (86 and 24 images).
  • Large Dataset: In addition to the data from the medium dataset, this dataset included predictions, also called weak labels, for the rest of the training and validation data. The data split for training a machine learning model was hereby set as shown in Table 2, where 409 images were used to train and 48 images to validate the model.
  • Test Dataset: The test dataset consisted of 48 images as described in Table 2, which were manually annotated to be able to evaluate the performance of the trained models.
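The person-wise split mentioned above can be reproduced, for example, with a grouped split from scikit-learn. The subject identifiers, file names, and ratios below are illustrative assumptions, not the exact procedure used for Table 2.

```python
# Sketch of a person-wise 80/10/10 split: all images of one subject stay together.
from sklearn.model_selection import GroupShuffleSplit

image_paths = [f"img_{i:03d}.png" for i in range(505)]   # illustrative image list
subject_ids = [i // 4 for i in range(505)]               # illustrative subject per image

outer = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, rest_idx = next(outer.split(image_paths, groups=subject_ids))

# Split the held-out 20% evenly into validation and test, again grouped by subject.
rest_groups = [subject_ids[i] for i in rest_idx]
inner = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
val_rel, test_rel = next(inner.split(rest_idx, groups=rest_groups))
val_idx, test_idx = rest_idx[val_rel], rest_idx[test_rel]

print(len(train_idx), len(val_idx), len(test_idx))
```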
Image Segmentation
Image Annotation
The software Label-Studio24 was used as a tool for image annotation because it is open source and capable of exporting the annotations to the format needed for training a YOLOv8-seg model. An example of such an annotation task is shown in the pipeline (pretrained segmentation model step of Fig. 1), where the output is an annotated polygon for each class (iris, ocular surface). Here, the full-sized images were annotated in a simplified manner by following the contour of the visible ocular surface, where structures like eyelashes were also partly included. This was done because the annotated data were subsequently resized (256 pixels in width and 192 pixels in height) for the training process, and excluding the lashes in the full-sized images would have led to small artifacts in the resized data. In addition, redness information could be extracted for the vessels beneath the eyelashes by imaging with the novel device, because only the ocular surface and not the lashes were in focus.
Segmentation Model - YOLOv8
For image segmentation, the well-known eighth version of You Only Look Once (YOLOv8)25 was used, which is a convolutional neural network for multiple applications such as object detection and image segmentation.
The architecture is shown schematically in Figure 2. Pretrained YOLOv8 segmentation models (YOLOv8-seg) are trained on the Common Objects in Context dataset,26 which is a large-scale object detection, segmentation, and captioning dataset. State-of-the-art results could be achieved with the YOLOv8 model, while it is well suited for real-time applications owing to its speed.27
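A fine-tuning run with the ultralytics package could look as follows; the dataset YAML, image size, epoch count, and batch size are illustrative assumptions rather than the exact training settings used in this work.

```python
# Sketch: fine-tune a COCO-pretrained YOLOv8 segmentation model on the two
# classes (iris, ocular surface); hyperparameters are assumed for illustration.
from ultralytics import YOLO

model = YOLO("yolov8n-seg.pt")            # COCO-pretrained segmentation weights
model.train(
    data="ocular_surface.yaml",           # hypothetical dataset file (splits, class names)
    imgsz=256,                            # training images were resized to 256 x 192
    epochs=100,
    batch=16,
)
metrics = model.val()                     # evaluate on the validation split
result = model.predict("test_image.png")  # predict masks for a new photograph
```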
Figure 2. YOLOv8-seg architecture: The implementation of the new C2f module in the feature pyramid network (FPN) is shown enlarged in the top right. The decoupled head makes anchor-free predictions and returns detection and mask coefficients separately.27,28
Metrics for Segmentation Evaluation
Common metrics to evaluate the performance of a segmentation task are the IoU shown in Equation (1), the accuracy (Acc) shown in Equation (2), and the F1 score shown in Equation (3).29,30 Here, true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) correspond to the entries of the confusion matrix. The metric scores lie within a range of 0 to 1, where 1 indicates a perfect overlap, correctness, or similarity, respectively.
\begin{equation} IoU = \frac{TP}{TP + FP + FN} \tag{1} \end{equation}

\begin{equation} Acc = \frac{TP + TN}{TP + TN + FP + FN} \tag{2} \end{equation}

\begin{equation} F1 = \frac{2 \cdot TP}{2 \cdot TP + FP + FN} \tag{3} \end{equation}
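As a small worked example, the three scores can be computed per class from a ground-truth and a predicted binary mask; the masks below are synthetic placeholders.

```python
# Sketch: IoU, accuracy, and F1 score for one class from two binary masks.
import numpy as np

def segmentation_scores(gt, pred):
    """gt, pred: boolean arrays of the same shape for a single class."""
    tp = np.sum(gt & pred)
    tn = np.sum(~gt & ~pred)
    fp = np.sum(~gt & pred)
    fn = np.sum(gt & ~pred)
    iou = tp / (tp + fp + fn)              # Equation (1)
    acc = (tp + tn) / (tp + tn + fp + fn)  # Equation (2)
    f1 = 2 * tp / (2 * tp + fp + fn)       # Equation (3)
    return iou, acc, f1

gt = np.zeros((192, 256), dtype=bool);   gt[50:150, 60:200] = True
pred = np.zeros((192, 256), dtype=bool); pred[55:150, 65:200] = True
print(segmentation_scores(gt, pred))
```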
 
Redness Extraction
Image-Based Redness
The image-based redness fr is calculated based on Fieguth et al.31 by Equation (4), where S is a subimage in RGB color space with its respective color channels (SR, SG, SB) without black pixels. This redness score lies in a range of −0.5 to 1.0, where a score of 1.0 represents a completely red image, while a white image yields a score of 0.0 and an image only consisting of green or blue yields a score of −0.5.  
\begin{equation} f_r(S) = \frac{1}{|S|} \sum_{i \in S} \frac{2 \cdot (S_R)_i - (S_G)_i - (S_B)_i}{2 \cdot [(S_R)_i + (S_G)_i + (S_B)_i]} \tag{4} \end{equation}
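A direct numpy translation of Equation (4), evaluated over the non-black pixels of a subimage, might look as follows; the variable names are illustrative.

```python
# Sketch: image-based redness f_r after Fieguth et al. (Equation 4),
# computed over all non-black pixels of an RGB subimage.
import numpy as np

def image_based_redness(subimage):
    """subimage: (H, W, 3) RGB array; black pixels are excluded from the score."""
    px = subimage.reshape(-1, 3).astype(float)
    px = px[px.sum(axis=1) > 0]                        # drop black pixels
    r, g, b = px[:, 0], px[:, 1], px[:, 2]
    return float(np.mean((2 * r - g - b) / (2 * (r + g + b))))

red = np.zeros((10, 10, 3), dtype=np.uint8); red[..., 0] = 255
white = np.full((10, 10, 3), 255, dtype=np.uint8)
print(image_based_redness(red), image_based_redness(white))   # 1.0 and 0.0
```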
 
ROIs for Redness Extraction
Various approaches to selecting the visible bulbar conjunctiva as the ROI for redness extraction can be found in the literature, as described in our previous work.21 In this work, redness is calculated for the whole segmented ROI (global), for the segmented ROI differentiated into nasal and temporal regions (side-wise), and with a subsampling (tiling) approach. Based on the results, we evaluate whether a redness extraction for the whole ROI is necessary or whether a tiling approach is sufficient for a meaningful output, as indicated in the literature.32,33 An example for all three approaches, described in greater detail in the following, can be seen in the redness extraction step of the pipeline (Fig. 1).
  • Global: The predicted masks for the iris and the ocular surface were combined by a boolean operation to obtain the segmentation of the ROI for the redness extraction. These combined masks were subsequently resized to the full image size to extract the redness. Out-of-focus structures like eyelashes were hereby not filtered, as described in Image Annotation, because vessels were visible beneath them and, therefore, useful information could be extracted. In addition, the method to extract the redness (see Image-Based Redness) neglects black pixels. Hereafter, this region containing the pixels of interest for the redness extraction is referred to as the scleral segmentation.
  • Side-Wise: The ocular surface is often distinguished by side (nasal, temporal) to point out local changes. The position of the iris center was used to distinguish the segmentation in this side-wise manner.
  • Tiling: A tiling approach similar to our previous work was employed, and the scleral image was divided into smaller subimages (tiles) with a size of 200 × 200 pixels. Again, the location of the tiles was defined based on the iris center. The tiles used for the redness extraction were selected based on their coverage by the scleral segmentation from the first approach; only tiles that lay completely within this segmentation were used for the extraction (see the sketch after this list).
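A minimal sketch of the tile selection is given below, assuming a 200 × 200 pixel grid anchored at the iris center and the scleral segmentation from the global approach; the grid construction and names are illustrative assumptions.

```python
# Sketch: keep only 200 x 200 pixel tiles that lie completely inside the
# scleral segmentation, with the tile grid anchored at the iris center.
import numpy as np

def select_tiles(scleral_mask, iris_center, tile=200):
    """scleral_mask: (H, W) boolean array; iris_center: (cx, cy) in pixels."""
    h, w = scleral_mask.shape
    cx, cy = iris_center
    xs = np.arange(cx % tile, w - tile + 1, tile)      # grid aligned with the center
    ys = np.arange(cy % tile, h - tile + 1, tile)
    kept = []
    for y in ys:
        for x in xs:
            if scleral_mask[y:y + tile, x:x + tile].all():   # fully covered tiles only
                kept.append((int(x), int(y)))                # top-left corner of the tile
    return kept

mask = np.zeros((1600, 2000), dtype=bool)
mask[200:1400, 100:1900] = True                        # synthetic scleral segmentation
print(len(select_tiles(mask, iris_center=(1000, 800))))
```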
Iris Center Determination
The iris center was used for the above-mentioned tiling approach and for the side-wise differentiation (nasal, temporal) when extracting the redness. The center coordinates are hereby determined from the prediction of the iris segmentation. For more information, see Appendix B.
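The contour-based center determination itself is detailed in Appendix B; as a rough, hypothetical stand-in, a center estimate can be obtained from the predicted binary iris mask, for example via the minimum enclosing circle of its largest contour.

```python
# Rough sketch: estimate an iris center from a binary iris mask via its largest
# contour (a stand-in for the contour-based method described in Appendix B).
import cv2
import numpy as np

def iris_center_from_mask(iris_mask):
    """iris_mask: (H, W) boolean array of the predicted iris segmentation."""
    contours, _ = cv2.findContours(iris_mask.astype(np.uint8),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    contour = max(contours, key=cv2.contourArea)            # keep the largest contour
    (cx, cy), _radius = cv2.minEnclosingCircle(contour)
    return int(round(cx)), int(round(cy))

mask = np.zeros((192, 256), dtype=np.uint8)
cv2.circle(mask, (128, 96), 40, 1, -1)                      # synthetic circular iris
print(iris_center_from_mask(mask.astype(bool)))             # approximately (128, 96)
```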
Redness Evaluation
Repeatability Evaluation
For evaluating the repeatability of the redness extraction, the coefficient of variation (CV) was used.34 This coefficient is defined as the ratio of the standard deviation σ to the mean µ, as shown in Equation (5), and is given in percent. The CV was extracted for each subject from images obtained a few minutes apart in the same session.
\begin{equation} CV = \frac{\sigma}{\mu} \cdot 100\,\% \tag{5} \end{equation}
 
In addition, a Bland-Altman plot was used to give a visual representation of the variability. This plot illustrates the agreement between two measurements or methods by plotting the difference against the average.35 Typically, the bias (mean of differences) and the limits of agreement (commonly 1.96 times the standard deviation) are highlighted in the plot. The limits of agreement are hereby not fixed and have to be evaluated within the context of the specific application’s acceptable range of variability. 
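Both repeatability measures can be computed directly from the per-session redness scores; the values below are illustrative placeholders.

```python
# Sketch: coefficient of variation per subject (Equation 5) and Bland-Altman
# bias / limits of agreement for the first versus the last image of a session.
import numpy as np

def coefficient_of_variation(scores):
    scores = np.asarray(scores, dtype=float)
    return scores.std() / scores.mean() * 100.0        # in percent

def bland_altman(first, last):
    diff = np.asarray(first, dtype=float) - np.asarray(last, dtype=float)
    bias = diff.mean()
    half_width = 1.96 * diff.std()                     # half width of the limits
    return bias, (bias - half_width, bias + half_width)

session_scores = [0.081, 0.084, 0.079, 0.082]          # illustrative redness scores
print(coefficient_of_variation(session_scores))
print(bland_altman([0.081, 0.090, 0.075], [0.082, 0.088, 0.077]))
```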
Clinical Relation
All eyes of the subjects in both studies were graded by clinical experts using the Efron scale (Fig. 7) to obtain a reference for our algorithm. In this scale, five levels of severity of conjunctival bulbar redness are distinguished. The results of this grading are listed in Table 3; the study I data were graded throughout the trial in person by an expert of the ophthalmic outpatient clinic. Initially, no clinical gradings were assessed for study II; therefore, an image-based redness grading was performed by another clinical expert for each of the eyes included in this study.
Table 3. Efron Gradings of All Eyes
Figure 3. Iris detection by using the contour of (A) a ground truth mask and (B) a predicted mask. The used contour points for determining the iris center are shown on the left-hand side drawn in orange on the white contour, while the resulting iris center drawn on the original image is shown on the right-hand side.
Figure 4. Extracted redness scores for (A) the least red eye, (B) an eye with moderate redness, and (C) the reddest eye from the used datasets. Displayed are, from left to right, the original image, the overall extracted redness, the side-wise extracted redness, and the redness resulting from the tiling approach.
Figure 5. Extracted redness for a subject (A) with follow-up imaging graded with Efron 4 and (B) performed after 20 weeks graded as Efron 3. The region with the greatest redness change is highlighted by being shown enlarged.
Figure 6. Bland-Altman plot for determining the repeatability of the imaging procedure by evaluating the redness of the first and last image per session and subject (169 eyes of 122 subjects). Here, 93.5% of the differences lie within the limits, while the limits of agreement and the bias are visualized by dotted blue lines and a continuous red line, respectively.
Figure 7. (A) Efron scale4 and (B) example images listed in the same order as the respective grading from the recorded data.
Statistical Evaluation
A paired t-test was used to measure the difference between the mean redness values resulting from the manually annotated ground truth and the predicted masks, to evaluate the use of the predicted masks for the redness extraction. This test yields a P value in the range of 0 to 1, where the common threshold of P < 0.05 was considered statistically significant, indicating that the observed difference in means is unlikely to have occurred by random chance alone. To statistically measure the relationship between the extracted redness values and the gradings, the Spearman correlation was determined.36 This nonparametric measure captures the strength and direction of monotonic relationships, yielding correlation coefficients ranging from −1 (perfect negative correlation) to 1 (perfect positive correlation), with 0 indicating no correlation.
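Both tests are available in SciPy; the arrays below are illustrative placeholders for the paired redness scores and for the redness/Efron pairs.

```python
# Sketch: paired t-test (manual vs. predicted masks) and Spearman correlation
# (redness scores vs. Efron gradings); data are illustrative.
import numpy as np
from scipy.stats import spearmanr, ttest_rel

redness_manual = np.array([0.071, 0.055, 0.102, 0.064, 0.088])
redness_predicted = np.array([0.070, 0.056, 0.101, 0.066, 0.087])
t_stat, p_value = ttest_rel(redness_manual, redness_predicted)
print("paired t-test P value:", p_value)

redness_scores = np.array([0.01, 0.03, 0.05, 0.08, 0.12, 0.16])
efron_grades = np.array([0, 1, 1, 2, 3, 4])
rho, p = spearmanr(redness_scores, efron_grades)
print("Spearman correlation:", rho)
```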
Results
Evaluate Iris Detection
The agreement of the iris center coordinates resulting from the ground truth and the predicted masks was evaluated. For this purpose, the Euclidean distance between the image origin and the center coordinates was calculated in both cases and normalized by the image size for the 48 images of the test set. The resulting mean distance difference was 35 pixels (0.93% of image size) with a standard deviation of 20 pixels (0.55% of image size). An example of detected center coordinates based on the ground truth and the prediction is shown in Figure 3. Additionally, the detected center coordinates of all images were verified by visual inspection.
Segmentation Model Performance Evaluation
Three different multiclass (iris, ocular surface) YOLOv8-seg models were trained using the defined datasets (small to large) to investigate the trade-off between effort and performance. The trained models were evaluated using the test dataset. The evaluation results are listed in Table 4.
Table 4. Evaluation of Segmentation Models on Test Data
The model trained on the initial 60 images (small dataset) already shows a high performance of more than 96% for both classes. The best model performance was achieved by adding the additionally annotated data resulting from the visual inspection to the training (medium dataset). However, adding the predicted labels, also called weak labels, which showed no problems during the visual inspection, did not lead to further improvement.
Redness Extraction
The conjunctival bulbar redness was extracted using the proposed pipeline (shown in Fig. 1) for the three different approaches to defining the ROIs, as mentioned in ROIs for Redness Extraction. An example is shown in Figure 4, where three eyes with large differences in their conjunctival bulbar redness scores are visualized.
Additionally, the follow-up data for 15 study I subjects, as described in Data Collection, Preparation, and Selection, were investigated. Figure 5 illustrates one of these cases, where the redness for the same subject was extracted 20 weeks after the initial visit. Here, the Efron grading decreased from severe (4) to moderate (3) redness. In this figure, the redness of the tiling approach is shown to give a better overview of local changes in individual sections. It shows that the eye was aligned very similarly in both cases, because the compared areas and the number of tiles match well, while the redness score decreases from one visit to the next. The global and side-wise extracted redness scores also reflected the decreasing redness: the redness dropped globally from 0.0867 to 0.0703, nasally from 0.09277 to 0.0740, and temporally from 0.0808 to 0.0668. Further examples of follow-up data are included in Appendix C (Figs. C1–C3).
Figure 8. Comparison of extracted redness values of the test dataset resulting from the use of (top) the manual and (bottom) the predicted masks grouped by Efron grading. Redness scores resulting from the three mentioned approaches defining the redness extraction ROIs (global, side-wise, and tiling approach) are shown next to each other.
Figure 9. Redness scores of all images grouped by clinical grading. The redness scores resulting from the three mentioned approaches defining the ROIs for redness extraction (global, side-wise, tiling) are shown next to each other.
Figure 10. Digital scale for conjunctival bulbar redness of the nasal and temporal eye regions resulting from the proposed pipeline.
To evaluate the repeatability of the redness extraction from images of the same subject within one session, the CV as defined in Equation (5) was determined for the global ROI. This resulted in a mean of 4.09% and a standard deviation of 3.75%. The time interval between these recordings ranged from 4 seconds to 18 minutes, with a mean interval of 3.2 minutes. In addition, a Bland-Altman plot (described in Repeatability Evaluation) was used to compare the redness scores of the first and the last image of a session per subject to evaluate the repeatability of the procedure (cf. Fig. 6). A narrow range of agreement (−0.0093 to 0.0105) encompasses 93.5% of the 169 differences (study I, 92 eyes from 92 subjects plus 17 follow-ups; study II, 60 eyes from 30 subjects), whereas a small systematic bias of 0.0006 can be observed. Furthermore, the relative variability (half width of the limits of agreement divided by the maximum value) was determined to evaluate the clinical acceptability, which yielded 4.8%.
Relation to Clinical Grading
All resulting extracted redness scores were additionally grouped by their Efron gradings to investigate their relation to this reference. Example images sorted according to their grading are shown next to the Efron scale in Figure 7 for illustration. 
Additionally, the influence of using the manual annotations compared with the predicted ones for the redness extraction was evaluated on the test dataset by a t-test, with the hypothesis that the extracted redness values are not significantly different for both. This was assumed owing to the high performance of the segmentation model of more than 96%. The paired t-test yielded P values of 0.1955 for the global approach and 0.1948 (temporal) to 0.5694 (nasal) for the side-wise approach, indicating no statistically significant difference between the means of the redness scores. Hence, the null hypothesis is not rejected, suggesting that using the manual over the predicted segmentation leads to no significant difference in redness. For the tiling approach, the paired t-test showed P values of less than 0.05, while the number of tiles and, therefore, of extracted redness values also differed for some of the cases. Investigations showed that this difference in tile number is caused by the slightly different tile positions resulting from the segmentations, which leads to the removal or addition of tiles near the border. Figure 8 shows all redness values for this comparison grouped by Efron grading and extraction approach. Afterward, the redness values extracted by the proposed pipeline for all 505 images of the 122 subjects were grouped by clinical grading and visualized in Figure 9.
A more detailed statistical description of the values is provided in Appendix C. The analysis of the results shown in the box plot revealed a consistently positive trend across the five gradings, with Spearman correlation coefficients ranging from 0.520 (nasal) and 0.524 (temporal) for the tiling approach to 0.599 for the global examination. The P values for these correlations are 2.06e-36, 5.30e-37, and 1.54e-50, respectively.
Comparison to Image-Based Grading Scale
The DBR scale provides redness scores in linear increments from 10 to 100.5 Because its approach to determining the redness is similar to the proposed one, our results were made comparable by assigning the images to 10 bins based on their maximum side-wise redness score. This resulted in a linear scale with the same number of increments, comparable with the DBR scale. Additionally, one image from each bin was selected to illustrate the different levels of our scale, as shown in Figure 10, where for the last two bins only the nasal or the temporal side was available owing to the lack of data from subjects with severe redness.
Additionally, the number of Efron grading occurrences per bin was counted by assessing the image with the highest side-wise redness for each of the 152 eyes and the 17 follow-up visits. The results of this investigation are shown in Table 5. If the digital scale is related to the number of gradings and divided into five levels, as is the case for the Efron scale (Efron 0, 0–10; Efron 1, 10–30; Efron 2, 30–40; Efron 3, 40–60; Efron 4, 60–100), based on the most frequent grading per bin as indicated in the table, it can be viewed as a classification problem. Accuracies of 0.75, 0.60, 0.41, 0.38, and 0.33 then result for the respective levels (a sketch of this binning is given below).
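The binning into a 10-level digital scale and the subsequent mapping to five Efron-like levels can be sketched as follows; the equally spaced bin edges over the observed redness range are an assumption for illustration, while the level boundaries are those given above.

```python
# Sketch: bin maximum side-wise redness scores into a 10-level digital scale
# (10-100) and map the scale to five Efron-like levels (boundaries from the text).
import numpy as np

redness = np.array([0.005, 0.020, 0.045, 0.060, 0.090, 0.130, 0.180])  # illustrative
edges = np.linspace(redness.min(), redness.max(), 11)     # 10 equally spaced bins (assumed)
digital = (np.digitize(redness, edges[1:-1]) + 1) * 10    # digital-scale values 10 ... 100

def to_efron_level(score):
    """Map a digital-scale value (10-100) to an Efron-like level 0-4."""
    if score <= 10: return 0
    if score <= 30: return 1
    if score <= 40: return 2
    if score <= 60: return 3
    return 4

print(digital.tolist(), [to_efron_level(s) for s in digital])
```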
Table 5. Number of Efron Gradings for the Digital Scale of All 152 Eyes and the 17 Follow-Up Visits Per Bin
Conclusion and Discussion
In this work, a data analysis pipeline for conjunctival bulbar redness extraction and a dataset comprising more than 500 high-quality images of healthy and pathological cases, recorded with a novel ocular surface photography system, were established. As part of the pipeline, a state-of-the-art model (YOLOv8-seg) for the accurate segmentation of the iris and ocular surface was trained to define ROIs based on the iris center. For these regions, the redness was extracted by calculating image-based scores to complete the pipeline. Finally, a thorough validation was carried out by relating the resulting redness scores to clinical gradings and a similar digital scale. In the following, the key findings and achievements of this work are described and discussed. 
An approach for determining the iris center based on the contour of its segmentation was proposed. These center coordinates were used to define the positions of the focus ROIs, to make a side-wise (nasal, temporal) differentiation for the redness extraction, and for positioning in the tiling approach. Because of its many use cases, it was investigated whether detecting these center coordinates from the predicted rather than the manual annotation had a noticeable influence. Similar results were observed in both cases. Based on visual inspection, it was assumed that the use of the ground truth annotations for the center detection provides good results. Hence, this evaluation showed that the centers determined from the predictions delivered a similar output, with a mean center distance difference of 35 pixels (0.93% of image size) and a standard deviation of 20 pixels (0.55% of image size). In the future, this approach has to be further verified by comparing the results with a manually defined iris center ground truth. This verification would justify the use of the center detection without visually inspecting the images in the first place.
Three YOLOv8-seg models were trained, as described in Segmentation Model Performance Evaluation, using three different dataset sizes to investigate the trade-off between effort (number of images, annotation time, etc.) and prediction performance. It was observed that, already by training a model on the initial 60 images, a very high segmentation performance was achieved, with metric scores above 96% (cf. Table 4). The best-performing model was obtained by adding the data that failed the visual inspection of the initial model's predictions for the rest of the training and validation data. However, it could be observed that adding weak labels to the training led to a slight performance drop compared with the initial model. We assume that no gain is achieved by their use during training because no new features were learned from the weak labels. This shows that, by using a well-established segmentation model, the effort can be kept rather small (about a day) to already achieve a very good performance, which can be further improved by including training data with new features. Furthermore, the high performance with a small sample size, especially for pathological cases, suggests that the model benefits from the standardized image modality.
A direct comparison of the trained segmentation models with other state-of-the-art models was not the focus of this work, because it could be shown in Relation to Clinical Grading that the resulting model Acc had no significant influence on the redness extraction when compared with the ground truth. Still, a comparison with a state-of-the-art U-Net segmentation model, which was trained on the small dataset and evaluated on the test dataset, was made as described in Appendix D. The U-Net performed very similarly for the segmentation task, resulting in IoUs of 0.9636 and 0.9793, accuracies of 0.9903 and 0.9899, and F1 scores of 0.9814 and 0.9889 for the iris and the ocular surface, respectively. In addition, the performance of models from related works was investigated for comparison and is listed in the following. Owing to the increasing popularity of eye-tracking applications in the last few years, the number of works dealing with the segmentation of the iris has increased.37–39 Approaches aiming at segmenting the ocular surface or sclera were also found in the literature.19,40–42 For example, in the related work of Sardar et al.,38 a mean TP rate of 0.983 with a mean error rate of 0.261 could be achieved for iris segmentation from publicly available iris datasets by using an interactive deep learning approach. A residual encoder and decoder network was developed in the work of Naqvi et al.,41 which achieved an equal error rate and mean F1 score of 0.009 and 96.242, respectively, for segmenting the sclera. A mobile monitoring application for ocular redness was implemented by Li et al.,29 where an Acc of 0.992, an IoU of 0.977, and an F1 score of 0.982 were achieved for the segmentation of the sclera using a U-Net architecture. These related works show similar performance scores for differing imaging conditions, with image resolutions ranging from 400 × 300 pixels to 3,000 × 1,700 pixels and different gaze directions and distances to the eye. In addition, the number of images used for training in these works exceeded one thousand, going up to several thousand images, while the subjects were either healthy or showed ocular surface pathologies such as conjunctivitis and subconjunctival hemorrhage.
The conjunctival bulbar redness for 505 images of 122 subjects was extracted by determining image-based redness scores (see Redness Extraction), which can potentially be seen as an imaging biomarker. As shown in Table 1, the primary disease was determined by a clinical expert, ranging from no ocular disease to pathological cases like pinguecula, pterygium, and dry eye disease. Three different ROIs were evaluated for each image: first, a global score for the whole segmentation of the bulbar conjunctiva was determined; second, two scores based on eye side (nasal, temporal) were determined from the segmentation; and third, the nasal and temporal redness was determined using a tiling approach. The repeatability of the extracted redness values resulting from multiple images of the same subject within an imaging session was demonstrated by a mean CV of 4.09% with a standard deviation of 3.75%. Furthermore, this repeatability of the imaging procedure was investigated in more detail by employing a Bland-Altman plot, where the redness scores from the first and the last images of each session were compared. Here, the time that passed between these recordings ranged from a few seconds to a few minutes. This plot showed a small bias (0.0006), while 93.5% of the differences lay within the limits of agreement. We are confident that the low bias and high agreement show the potential for clinical acceptance, indicated also by a small relative variability of 4.8%. Additionally, the redness of the test dataset was extracted for both the predicted segmentation and the manually annotated ground truth to determine whether a significant difference between these two approaches is detectable. Using a paired t-test, no significant difference in the extracted redness could be observed for the global and side-wise approaches. This is assumed to be due to the high performance of the segmentation models. The tiling approach showed P values of less than 0.05, while the number of tiles also differed, reflecting the limitation arising from the strict requirement of using only tiles that lie completely within the segmentation. Using the proposed solution, clinicians are able to image the eye in high resolution with a single shot and can determine the redness from the whole visible ocular surface. A limitation of ocular surface photographs of subjects looking straight into the camera is that areas on the ocular surface that can be relevant for redness extraction could potentially be covered by the eyelids. Here, it could be advantageous to image the eye with the subject looking sideways, as pointed out in different works found in the literature.15,33 For this work, straight-looking eyes are sufficient to grade the ocular redness, because the Efron scale is also generally based on these regions. In addition, this work focused on evaluating the redness from standardized photographs of subjects with a line of sight directed toward the camera, as intended by the used prototype, rather than on the imaging procedure itself. Still, for future evaluation, the additional feature to record images with different gaze directions would be interesting, also enabling a better comparison with other reference image-based grading scales, where these areas are visualized as well.
When the image-based redness scores were validated against the clinical gradings from the Efron scale (see Relation to Clinical Grading), a clear positive trend was observed for most levels of redness severity, yielding a moderate Spearman correlation coefficient of 0.599 for the global redness assessment. The local extractions (side-wise, tiling) yielded slightly weaker correlations, which might be attributed to the nature of the used clinical grading scale, which also assesses the grade in a global manner. When the globally extracted values for the higher severities (3 and 4) were compared, the positive trend was visually not as clear as for the local extraction approaches (side-wise, tiling). Nevertheless, in these cases larger standard deviations were observed for grading 4, especially for the temporal side, which led to the conclusion that these eyes were graded higher owing to a higher local redness in these areas. This is not reflected in the Efron scale, because its grading accounts only for the global redness and only one clinical expert used it for grading the cases shown in this work. Additionally, it was observed that the trace (1) grading shows the most outliers, which was attributed to structures on the surface (e.g., conjunctival melanosis). Therefore, removing such areas from the extraction process is a necessity in the future. Furthermore, reproducibility between data from initial visits and follow-up visits was demonstrated by multiple illustrated cases.
Similar to our work, the DBR scale was created by extracting the redness after Fieguth et al.31 Hence, in Comparison to Image-Based Grading Scale, all recordings were grouped into 10 bins based on their redness scores to create a scale with the same number of severity levels and thereby make our results comparable with this scale. A different range of approximately −0.01 to 0.20 for our approach compared with 0.35 to 0.60 of the DBR scale was observed. Because the extracted redness is an intensity-based feature, we assumed that this range difference was caused by using different imaging devices with different color calibration settings. This assumption is reinforced by the fact that pixels with similarly large amplitudes in all three color channels, like the sclera, would end up with a redness value of approximately zero, which was the case in our work, for example, for the whitest eye (Fig. 4A), but not in the DBR scale. Hence, device settings and their calibration play a major role in image-based analysis to obtain reproducible and comprehensible results. This section also revealed one limitation of our work in terms of redness grading, namely the lack of data with severe conjunctival bulbar redness. This was also reflected in Figure 10, because not every category of our digital image scale was present. Another limitation became obvious when correlating our digital scale with the number of Efron gradings per bin. This resulted in classification accuracies ranging from 0.33 (severe) to 0.75 (normal), which showed that the lack of severe cases also affected these results, additionally reflecting the limited grading by only one clinical expert. Therefore, more data, especially data with expanded disease diversity and severe redness, are needed in the future, while the correlation between in-person and image-based grading also has to be investigated. The division into the same number of levels based on the number of gradings per bin also points out that the Efron scale is not linear, as mentioned by Baudouin et al.3
When further investigating recent related works in the literature, image classification models for bulbar redness grading become more and more evident. They can be used to obtain the grading right away, instead of segmenting the areas and extracting the redness scores. A good example is the work from Wang et al.20 Their proposed DeepORedNet achieved classification accuracies ranging from 94.69 to 99.93 for four levels of redness, outperforming various well-established models like the ResNet on a dataset recorded using the slit lamp, comprising 2411 images with a resolution of 1,624 × 1,232 pixels. Although such approaches are useful to support clinicians in deciding on a global grading, they lack location-based feedback to evaluate where the resulting change in redness occurs.
In conclusion, the proposed image-based conjunctival bulbar redness extraction pipeline demonstrates that, by employing standardized imaging, a segmentation model, and image-based redness assessment, external eye photography can be classified and evaluated. Therefore, it shows the potential to provide eye care professionals with an objective tool to grade ocular redness and facilitate clinical decision-making. Essential hereby is a high-quality, standardized, and reproducible ocular surface overview photograph, as captured by the prototype employed in this work. For further improvements and investigations, additional image data are of major importance, especially data with more severe eye redness, which are planned to be included by performing additional clinical trials with the novel device. The segmentation model used to define the ROIs for these data showed promising results, yielding values similar to the interobserver agreement of the manual annotation. Still, this model has to be re-evaluated and retrained as new data become accessible, because the dataset used in this work was limited in size and, to some extent, also in diversity. The aim here is to achieve a performance level high enough to reliably extract the redness, as shown. Other segmentation models could also be used, as pointed out earlier with the trained U-Net, where the standardization of the images and the fact that segmenting the two classes (iris, ocular surface) is not too complex led to similar performance scores. In the future, we want to further improve the pipeline, for example by including additional features like blood vessel density or by incorporating alternative approaches like the use of classification models found in the literature,19,20 to enable image-based diagnosis for external eye photography and move toward a telemedicine setup.
Acknowledgments
Supported by the federal state of Tyrol (Austria) as part of the ImplEYE K-Regio project. 
Disclosure: P. Ostheimer, None; A. Lins, Occyo (E); L.A. Helle, None; V. Romano, Occyo (O); B. Steger, Occyo (O); M. Augustin, Occyo (O); D. Baumgarten, None
References
Hessen M, Akpek EK. Dry eye: an inflammatory ocular disease. J Ophthalmic Vis Res. 2014; 9(2): 240. [PubMed]
Romano V, Steger B, Ahmad M, et al. Imaging of vascular abnormalities in ocular surface disease. Surv Ophthalmol 2022; 67(1): 31–51. [CrossRef] [PubMed]
Baudouin C, Barton K, Cucherat M, Traverso C. The measurement of bulbar hyperemia: challenges and pitfalls. Eur J Ophthalmol. 2015; 25(4): 273–279. [CrossRef] [PubMed]
Efron N, Morgan PB, Katsara SS. Validation of grading scales for contact lens complications. Ophthalmic Physiol Opt. 2001; 21(1): 17–29. [CrossRef] [PubMed]
Macchi I, Bunya VY, Massaro-Giordano M, et al. A new scale for the assessment of conjunctival bulbar redness. Ocular Surf. 2018; 16(4): 436–440. [CrossRef]
Begley C, Caffery B, Chalmers R, Situ P, Simpson T, Nelson JD. Review and analysis of grading scales for ocular surface staining. Ocular Surf. 2019; 17(2): 208–220. [CrossRef]
Singh RB, Liu L, Anchouche S, et al. Ocular redness–I: etiology, pathogenesis, and assessment of conjunctival hyperemia. Ocular Surf. 2021; 21: 134–144. [CrossRef]
Mártonyi CL, Bahn CF, Meyer RF. Slit lamp: examination and photography. Third Edition. Time One Ink; 2007.
Kapoor R, Walters SP, Al-Aswad LA. The current state of artificial intelligence in ophthalmology. Surv Ophthalmol 2019; 64(2): 233–240. [CrossRef] [PubMed]
Schmidl D, Schlatter A, Chua J, et al. Novel approaches for imaging-based diagnosis of ocular surface disease. Diagnostics. 2020; 10(8): 589. [CrossRef]
Srivastava O, Tennant M, Grewal P, et al. Artificial intelligence and machine learning in ophthalmology: a review. Indian J Ophthalmol. 2023; 71(1): 11. [CrossRef] [PubMed]
Ting DSJ, Foo VH, Yang LWY, et al. Artificial intelligence for anterior segment diseases: emerging applications in ophthalmology. Br J Ophthalmol. 2021; 105(2): 158–168. [CrossRef] [PubMed]
Tong Y, Lu W, Yu Y, Shen Y. Application of machine learning in ophthalmic imaging modalities. Eye Vis. 2020; 7(1): 1–15.
Wu X, Liu L, Zhao L, et al. Application of artificial intelligence in anterior segment ophthalmic diseases: diversity and standardization. Ann Transl Med. 2020; 8(11): 714. [CrossRef] [PubMed]
Amparo F, Wang H, Emami-Naeini P, Karimian P, Dana R. The Ocular Redness Index: a novel automated method for measuring ocular injection. Invest Ophthalmol Vis Sci. 2013; 54(7): 4821–4826. [CrossRef] [PubMed]
Brea MLS, Rodríguez NB, Maroño NS, González AM, García-Resúa C, Fernández MJG. On the development of conjunctival hyperemia computer-assisted diagnosis tools: influence of feature selection and class imbalance in automatic gradings. Artificial Intelligence in Medicine. 2016; 71: 30–42. [CrossRef] [PubMed]
Papas EB. Key factors in the subjective and objective assessment of conjunctival erythema. Invest Ophthalmol Vis Sci. 2000; 41(3): 687–691. [PubMed]
Park IK, Chun YS, Kim KG, Yang HK, Hwang J-M. New clinical grading scales and objective measurement for conjunctival injection. Invest Ophthalmol Vis Sci. 2013; 54(8): 5249–5257. [CrossRef] [PubMed]
Li M, Huang K, Ma X, Wang Y, Wen F, Chen Q. Mask distillation network for conjunctival hyperemia severity classification. Mach Intell Res. 2023; 20(6): 909–922. [CrossRef]
Wang S, He J, He X, Jiaoyue H, Liu Z, Luo Z. DeepORedNet: contrastive learning-based attention-weighted dual channel residual network for ocular redness assessment. In: ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Seoul, Republic of Korea: IEEE. 2024: 3080–3084, https://ieeexplore.ieee.org/abstract/document/10447056.
Ostheimer P, Lins A, Massow B, Steger B, Baumgarten D, Augustin M. Extraction of eye redness for standardized ocular surface photography. In: Antony B, Fu H, Lee CS, MacGillivray T, Xu Y, Zheng Y. eds. Ophthalmic Medical Image Analysis. OMIA. Lecture Notes in Computer Science, vol 13576. Cham: Springer. 2022: 193–202.
Augustin M, Ostheimer P, Hausmann U, Baumgarten D, Romano V, Steger B. An imaging system for standardized and enhanced photographs of the ocular surface. Invest Ophthalmol Vis Sci. 2023; 64(9): PB0095–PB0095.
Augustin M, Ostheimer P, Baumgarten D, Hausmann U, Romano V, Steger B. Standardized imaging of the ocular surface using a novel external eye photography system. Invest Ophthalmol Vis Sci. 2023; 64(8): 3405–3405.
Tkachenko M, Malyuk M, Holmanyuk A, Liubimov N. Label Studio: data labeling software. Open source software. 2020–2024. Available from: https://github.com/HumanSignal/label-studio.
Jocher G, Chaurasia A, Qiu J. Ultralytics YOLOv8. Version 8.0.0. 2023. Available at: https://github.com/ultralytics/ultralytics.
Lin T-Y, Maire M, Belongie S, et al. Microsoft COCO: common objects in context. In: Fleet D, Pajdla T, Schiele B, Tuytelaars T. eds. Computer Vision – ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. Springer. 2014: 740–755.
Terven J, Córdova-Esparza D-M, Romero-González J-A. A comprehensive review of YOLO architectures in computer vision: from YOLOv1 to YOLOv8 and YOLO-NAS. Mach Learn Knowl Extr. 2023; 5(4): 1680–1716. [CrossRef]
Zhao X, Ding W, An Y, et al. Fast Segment Anything. arXiv preprint arXiv:2306.12156. 2023.
Li Y, Tam VWL, Chiu PW, Lee A, Zhu Y, Lam EY. A deep-learning-enabled monitoring system for ocular redness assessment. 2023 IEEE Biomedical Circuits and Systems Conference (BioCAS). IEEE. 2023: 1–5, https://ieeexplore.ieee.org/abstract/document/10388997.
Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2015: 3431–3440, doi:10.1109/CVPR.2015.7298965.
Fieguth P, Simpson T. Automated measurement of bulbar redness. Invest Ophthalmol Vis Sci. 2002; 43(2): 340–347. [PubMed]
Sánchez Brea ML, Rodríguez NB, González AM, Evans K, Pena-Verdeal H. Defining the optimal region of interest for hyperemia grading in the bulbar conjunctiva. Comput Math Methods Med. 2016; 2016: 3695014. [CrossRef] [PubMed]
Rodriguez JD, Johnston PR, Ousler GW, III, Smith LM, Abelson MB. Automated grading system for evaluation of ocular redness associated with dry eye. Clin Ophthalmol. 2013; 7: 1197–1204. [PubMed]
Brown CE. Coefficient of variation. In: Applied Multivariate Statistics in Geohydrology and Related Sciences. Berlin, Heidelberg: Springer Berlin Heidelberg, 1998: 155–157, doi:10.1007/978-3-642-80328-4_13.
Altman DG, Bland JM. Measurement in medicine: the analysis of method comparison studies. J R Stat Soc Series D: Stat. 1983; 32(3): 307–317.
Spearman C. The proof and measurement of association between two things. Am J Psychol. 1904; 15(1): 72–101. [CrossRef]
Chen Y, Gan H, Chen H. Accurate iris segmentation and recognition using an end-to-end unified framework based on MADNet and DSANet. Neurocomputing. 2023; 517: 264–278. [CrossRef]
Sardar M, Banerjee S, Mitra S. Iris segmentation using interactive deep learning. IEEE Access. 2020; 8: 219322–219330. [CrossRef]
Wang C, Muhammad J, Wang Y, He Z, Sun Z. Towards complete and accurate iris segmentation using deep multi-task attention network for non-cooperative iris recognition. IEEE Trans Inf Forensics Secur. 2020; 15: 2944–2959. [CrossRef]
Alkassar S, Woo WL, Dlay SS, Chambers JA. Robust sclera recognition system with novel sclera segmentation and validation techniques. IEEE Transactions on Systems, Man, and Cybernetics: Systems. 2017; 47(3): 474–486. [CrossRef]
Naqvi RA, Loh W-K. Sclera-net: accurate sclera segmentation in various sensor images based on residual encoder and decoder network. IEEE Access. 2019; 7: 98208–98227. [CrossRef]
Rot P, Vitek M, Grm K, Emeršic Ž, Peer P, Štruc V. Deep sclera segmentation and recognition. Handb Vasc Biom. 2020: 395–432.
Sun Y, Duthaler S, Nelson BJ. Autofocusing algorithm selection in computer microscopy. 2005 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE. 2005: 70–76, doi:10.1109/IROS.2005.1545017.
Ronneberger O, Fischer P, Brox T. U-Net: convolutional networks for biomedical image segmentation. In: Navab N, Hornegger J, Wells W, Frangi A. eds. Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18. New York: Springer. 2015: 234–241.
Appendix A: Detailed Image Selection Approach
Selection Process
Because multiple images are available for each subject, the best images were selected by evaluating the focus in specific areas of interest defined by our clinical experts. The resulting areas are shown in Figure A1. For the redness extraction, regions 3, 4, 8, 9, 15, and 21 in this figure are the most relevant ones, since they lie on the blood vessel-permeated ocular surface, and therefore only these regions were considered for the image selection. For each of these regions, the focus was calculated using the Tenenbaum gradient,43 shown in Equation (6), where the squared gradient components of the Sobel operators in the x- and y-directions are summed over all pixel positions (i, j) in the respective region:
\begin{equation} F_{tenenbaum} = \sum _i \sum _j \left[ S_x(i,j)^2 + S_y(i,j)^2 \right] \end{equation}
(6)
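The following is a minimal sketch of how the Tenenbaum gradient in Equation (6) could be computed for a single focus ROI using OpenCV's Sobel operators; the function name, ROI coordinates, and file path are illustrative and not taken from the original implementation.

```python
import cv2
import numpy as np

def tenengrad_focus(roi_gray: np.ndarray) -> float:
    """Tenenbaum gradient (Equation 6): sum of squared Sobel responses over the ROI."""
    sx = cv2.Sobel(roi_gray, cv2.CV_64F, 1, 0, ksize=3)  # gradient in x-direction
    sy = cv2.Sobel(roi_gray, cv2.CV_64F, 0, 1, ksize=3)  # gradient in y-direction
    return float(np.sum(sx ** 2 + sy ** 2))

# Hypothetical usage: crop one of the focus ROIs from a grayscale photograph.
# image = cv2.imread("ocular_surface.png", cv2.IMREAD_GRAYSCALE)
# roi = image[y0:y1, x0:x1]  # ROI coordinates as laid out in Figure A1
# focus_value = tenengrad_focus(roi)
```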
 
Figure A1.
 
Layout of the clockwise numbered ROIs for the focus evaluation as defined by our clinical experts. In addition, examples of the predicted bounding box and the iris are visualized to provide a better overview.
Figure A2.
 
Tiled and overlapped scleral segmentations visualizing which ocular surface regions are present across all images. The number of occurrences per pixel was normalized by the total number of images. In addition, the ROIs used for focus determination are drawn as red rectangles to allow their positions to be evaluated.
For these values, a relative focus quality factor was defined as shown in Equation (7), where the images of each eye and visit are compared separately. This factor lies in the range of 0 to 1 and consists of the following two components, which are weighted evenly: 
  • occmax: For each focus ROI, compare the focus values across all images of the same eye and visit, and increase an image's count by 1 for every ROI in which it achieves the maximum focus value. Afterwards, divide this count by the number of focus ROIs to normalize it to the range of 0 to 1. For example, if five of the six focus ROIs of an image showed the highest focus values compared with the other images of the same eye, the factor would be approximately 0.83 (5 divided by 6).
  • sumrank: For each ROI, rank the images by focus value and assign scores based on this ranking (a higher rank yields a higher score; for example, if six images are available for the subject, the best-ranked image receives a score of 6 for that ROI, the second-best a score of 5, and so on). Calculate the sum of these scores for each image individually and scale it to the range of 0 to 1 by dividing it by the total number of focus rankings (the number of images from the same subject multiplied by the number of ROIs).
 
\begin{equation} focus_{rel} = \frac{occ_{max} + sum_{rank}}{2} \end{equation}
(7)
 
Using this relative focus factor, up to three of the best images per eye and visit were selected, resulting in a total of 505 images from the 92 clinical eyes and the 60 healthy eyes. 
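To make the two components concrete, the following is a minimal sketch of how the relative focus factor of Equation (7) could be computed, assuming the Tenenbaum gradients of all images of one eye and visit are arranged in a matrix of shape (number of images, number of focus ROIs); the function and variable names are illustrative rather than taken from the original implementation.

```python
import numpy as np

def relative_focus(focus_values: np.ndarray) -> np.ndarray:
    """focus_values: (n_images, n_rois) Tenenbaum gradients of one eye and visit.
    Returns the relative focus factor (Equation 7) of each image, in the range 0 to 1."""
    n_images, n_rois = focus_values.shape

    # occ_max: per image, the fraction of ROIs in which it reaches the maximum focus value.
    occ_max = (focus_values == focus_values.max(axis=0)).sum(axis=1) / n_rois

    # sum_rank: per ROI, the best image scores n_images, the second best n_images - 1, ...
    order = focus_values.argsort(axis=0)               # ascending order per ROI
    scores = np.zeros_like(order)
    ranks = np.arange(1, n_images + 1)[:, None]        # 1 (worst) .. n_images (best)
    np.put_along_axis(scores, order, np.broadcast_to(ranks, order.shape), axis=0)
    sum_rank = scores.sum(axis=1) / (n_images * n_rois)

    return (occ_max + sum_rank) / 2

# Hypothetical usage: keep up to the three best images of one eye and visit.
# best_indices = np.argsort(relative_focus(focus_values))[::-1][:3]
```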
Evaluation of Focus ROIs
The chosen focus ROI positions for the image selection process were evaluated to check whether they were placed well. To this end, the focus ROIs of all images were visually inspected and marked if they showed only ocular surface (no skin, eyelashes, etc.). The resulting percentages of ROIs fulfilling this condition are listed in Table A1.
Table A1.
 
Percentage of Focus ROIs Showing Only Ocular Surface, Based on Visual Inspection of All 505 Images Used for the Focus Determination
Additionally, we investigated which ocular surface regions were most frequently present across all images. To this end, the scleral segmentations were tiled with a tile size of 200 × 200 pixels, overlapped based on the iris center, and visualized in Figure A2. This showed that the chosen focus ROIs lie close to the area of maximal occurrence. 
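As an illustration, a minimal sketch of how such a normalized per-pixel occurrence map could be accumulated is given below, assuming the binary scleral segmentations have already been shifted so that their iris centers coincide; the function name and inputs are illustrative.

```python
import numpy as np

def occurrence_map(aligned_masks: list[np.ndarray]) -> np.ndarray:
    """aligned_masks: binary scleral segmentations of identical shape, shifted so that
    their iris centers coincide. Returns the per-pixel occurrence count normalized by
    the total number of images (values between 0 and 1, as visualized in Figure A2)."""
    stack = np.stack([mask.astype(np.float64) for mask in aligned_masks], axis=0)
    return stack.sum(axis=0) / len(aligned_masks)
```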
Appendix B: Detailed Iris Center Determination
Because the eye is centered and the line of sight is straight when using the above-mentioned imaging system, the iris appears circular in the images. In addition, the eyelids close in the vertical direction, and our investigations showed that the iris was displayed at its full width in the horizontal direction when detected by the deep learning model. For this reason, the following assumptions were made: 
  • The x-coordinate of the iris center is the same as the x-coordinate of the bounding box center.
  • The radius r of the iris is the same as half the width of the bounding box.
Consequently, using the circle equation shown in Equation (8), the only missing value for determining the iris center point C(xcenter, ycenter) is its y-coordinate. For this, a point P(x, y) on the contour of the circle has to be known, which in our case was obtained from the contour of the predicted iris segmentation. Additionally, to make the detection more robust against outliers (e.g., contours of pathological eyes), the y-coordinates of all candidate iris center positions were sorted by value and the median was used.  
\begin{eqnarray} (x - x_{center})^2 + (y - y_{center})^2 = r^2 \qquad \end{eqnarray}
(8)
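A minimal sketch of this median-based estimate is shown below, assuming the predicted bounding box and the contour points of the iris segmentation are available; the function signature is illustrative, and the handling of the two algebraic solutions of Equation (8) is one possible choice rather than the original implementation.

```python
import numpy as np

def iris_center(bbox: tuple[float, float, float, float],
                contour: np.ndarray) -> tuple[float, float]:
    """bbox: (x_min, y_min, x_max, y_max) of the predicted iris.
    contour: (N, 2) array of (x, y) points on the predicted iris contour.
    Returns the estimated iris center (x_center, y_center)."""
    x_min, _, x_max, _ = bbox
    x_center = (x_min + x_max) / 2   # assumption 1: center x equals the bounding box center x
    r = (x_max - x_min) / 2          # assumption 2: radius equals half the bounding box width

    # Solve (x - x_center)^2 + (y - y_center)^2 = r^2 for y_center at each contour point.
    y_candidates = []
    for x, y in contour:
        dx2 = (x - x_center) ** 2
        if dx2 > r ** 2:             # point lies outside the assumed horizontal extent
            continue
        dy = np.sqrt(r ** 2 - dx2)
        y_candidates.extend([y - dy, y + dy])  # both algebraic solutions are kept here

    # Median over the sorted candidates for robustness against outliers.
    return x_center, float(np.median(y_candidates))
```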
 
To evaluate the iris center position between different modalities (e.g., resulting from the manual or the predicted mask), the Euclidean distance normalized by the image size (width, height), as shown in Equation (9), was used, where x and y denote the differences between the respective center coordinates.  
\begin{eqnarray} d_{norm}(x,y) = \sqrt{\left( \frac{x}{width} \right)^2 + \left( \frac{y}{height} \right)^2} \qquad \end{eqnarray}
(9)
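The corresponding computation is a one-liner; the following sketch assumes two center estimates and the image dimensions as inputs, with illustrative names.

```python
import math

def normalized_center_distance(center_a: tuple[float, float],
                               center_b: tuple[float, float],
                               width: int, height: int) -> float:
    """Euclidean distance between two iris center estimates,
    normalized by the image size as in Equation (9)."""
    dx = (center_a[0] - center_b[0]) / width
    dy = (center_a[1] - center_b[1]) / height
    return math.sqrt(dx ** 2 + dy ** 2)
```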
 
Appendix C: Descriptive Statistics and Additional Follow-Up Data for the Relation to Clinical Gradings
In this section, additional material can be found resulting from the redness extraction (see Redness Extraction) and its relation to clinical grading (see Relation to Clinical Grading). 
Further follow-up data are included here (Figs. C1–C3) to further demonstrate the reproducibility of the imaging system; the respective redness scores for the shown images are listed in Table C1.
Figure C1.
 
(A) The initial visit of a study I subject and (B) the same subject imaged after 25.6 weeks. In both cases the tiling redness extraction approach is visualized. Here, the initial Efron grading was 1 and the follow-up was graded with Efron 2.
Figure C2.
 
(A) The initial visit of a study I subject and (B) the same subject imaged after 1.9 weeks. In both cases the tiling redness extraction approach is visualized. Here, the initial Efron grading was 1 and the follow-up was graded with Efron 2.
Figure C3.
 
(A) The initial visit of a study I subject and (B) the same subject imaged after 11.9 weeks. In both cases the tiling redness extraction approach is visualized. Here, the initial Efron grading was 3 and the follow-up was also graded with Efron 3.
Table C1.
 
Redness Extraction Values for Additional Follow-Ups
In addition, for a better overview, the median, the first quartile (25th percentile), and the third quartile (75th percentile) of each box shown in Figure 9 are listed in Table C2.
  
Table C2.
 
The Median, the First Quartile (Q1) and the Third Quartile (Q3) of Each Box Shown in Figure 9
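Such a per-grade summary can be obtained directly from the redness scores grouped by clinical grading. The following is a minimal sketch, assuming the scores are available in a table with hypothetical column names ("efron", "redness") and a hypothetical file name.

```python
import pandas as pd

# Hypothetical input: one row per image with its clinical grading and redness score.
df = pd.read_csv("redness_scores.csv")

summary = (
    df.groupby("efron")["redness"]
      .quantile([0.25, 0.5, 0.75])                     # Q1, median, Q3 per Efron grade
      .unstack()
      .rename(columns={0.25: "Q1", 0.5: "Median", 0.75: "Q3"})
)
print(summary)
```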
Appendix D: Comparison With a State-of-the-Art Segmentation Model
The small dataset (described in Define Datasets) was used to train a state-of-the-art U-Net segmentation model.44 Using this model, the performance can be compared to that of the YOLOv8-seg model (see Segmentation Model Performance Evaluation) by evaluating the test dataset. The U-Net performed very similarly on the segmentation task, resulting in the performance metrics shown in Table D1. For a better overview, the performance scores of the YOLOv8-seg model trained on the same dataset are also shown. 
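The segmentation performance of the two models can be summarized, for example, with the mean intersection over union of the predicted and manually annotated masks, as reported for the iris and ocular surface classes. The following is a minimal sketch of this metric, assuming binary masks; the function names are illustrative.

```python
import numpy as np

def iou(pred: np.ndarray, target: np.ndarray) -> float:
    """Intersection over union of two binary masks of identical shape."""
    pred, target = pred.astype(bool), target.astype(bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return float(np.logical_and(pred, target).sum() / union)

def mean_iou(preds: list[np.ndarray], targets: list[np.ndarray]) -> float:
    """Mean IoU over the test dataset for one class (e.g., iris or ocular surface)."""
    return float(np.mean([iou(p, t) for p, t in zip(preds, targets)]))
```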
Table D1.
 
Additional Evaluation of Segmentation Models on Test Data