The present study showed that applying the pix2pix cGAN to color-coded Scheimpflug corneal maps can efficiently generate images of plausible quality in the N, K, and E classes. To date, no previous study has reported the use of the pix2pix cGAN in this type of image translation task. The basic pix2pix framework has previously been used for medical image denoising, reconstruction, or segmentation, rather than for amplifying the original dataset.32,33
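To make the framework concrete, the following is a minimal sketch of a pix2pix training step, assuming PyTorch; the tiny convolutional stacks are placeholders for the full U-Net generator and PatchGAN discriminator of the original framework, and the loss wiring is the point of the example.

```python
# Minimal pix2pix training-step sketch (PyTorch). The two small conv stacks
# stand in for the full U-Net generator and PatchGAN discriminator; only the
# loss wiring is meant to be illustrative.
import torch
import torch.nn as nn

gen = nn.Sequential(  # placeholder for the U-Net generator
    nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh())
disc = nn.Sequential(  # placeholder for the PatchGAN discriminator
    nn.Conv2d(6, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(64, 1, 4, stride=1, padding=1))  # patch-wise real/fake logits

bce, l1 = nn.BCEWithLogitsLoss(), nn.L1Loss()
opt_g = torch.optim.Adam(gen.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(disc.parameters(), lr=2e-4, betas=(0.5, 0.999))
lam = 100.0  # weight of the L1 fidelity term, as in the original pix2pix paper

def train_step(x, y):
    """One pix2pix step: x = source map, y = its paired target map."""
    # --- discriminator: real pairs vs. generated pairs ---
    fake = gen(x)
    d_real = disc(torch.cat([x, y], dim=1))
    d_fake = disc(torch.cat([x, fake.detach()], dim=1))
    loss_d = (bce(d_real, torch.ones_like(d_real))
              + bce(d_fake, torch.zeros_like(d_fake)))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # --- generator: fool the discriminator + stay close to the target ---
    d_fake = disc(torch.cat([x, fake], dim=1))
    loss_g = bce(d_fake, torch.ones_like(d_fake)) + lam * l1(fake, y)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()

x = torch.randn(2, 3, 64, 64)  # stand-in source images
y = torch.randn(2, 3, 64, 64)  # stand-in paired targets
print(train_step(x, y))
```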
Although the background annotation noise could easily have been avoided by exporting Pentacam data directly rather than using screenshots, we preferred the more commonly used conventional annotated images, both to facilitate retrospective studies of stored images by interested researchers and to allow comparison between ground-truth annotated images and their synthesized counterparts. Fujioka et al. demonstrated that final image quality increased with the number of epochs, but postulated that overlearning may occur if training extends beyond the ideal number of learning iterations.34
Previous studies used subjective scores by experienced raters to select generator epochs producing the best image quality.6–8,35
However, this method may introduce bias due to inter- and intra-rater variation. Instead of relying on human observers, we used the Fréchet inception distance (FID) score as an objective metric to gauge generated image quality after each learning iteration, and we confirmed that continuing training beyond the optimal epoch can degrade image-generating performance. Our method could provide synthetic Pentacam corneal tomography images of plausible subjective and objective quality in the keratoconus, early keratoconus, and normal cornea domains, comparable to other studies implementing pix2pix networks on other image datasets.5,6
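As a minimal sketch of this selection criterion (NumPy/SciPy; `real_acts` and `fake_acts_by_epoch` are hypothetical stand-ins for Inception-v3 pooling features of real and generated images, which in practice are 2048-dimensional), the FID can be computed per saved epoch and the checkpoint with the lowest score retained:

```python
# FID-based epoch selection sketch (NumPy/SciPy). The activation arrays
# below are random stand-ins; in practice they would be Inception-v3
# pooling features (n_images x 2048) extracted from real images and from
# images generated by the checkpoint saved at each epoch.
import numpy as np
from scipy import linalg

def fid(acts1, acts2):
    """Frechet inception distance between two activation sets:
    ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2))."""
    mu1, mu2 = acts1.mean(axis=0), acts2.mean(axis=0)
    s1 = np.cov(acts1, rowvar=False)
    s2 = np.cov(acts2, rowvar=False)
    covmean = linalg.sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):  # discard tiny imaginary parts
        covmean = covmean.real
    diff = mu1 - mu2
    return diff @ diff + np.trace(s1 + s2 - 2.0 * covmean)

rng = np.random.default_rng(0)
real_acts = rng.normal(size=(500, 64))  # stand-in feature dimension
# toy setup in which quality peaks at epoch 100 and degrades afterwards
fake_acts_by_epoch = {e: rng.normal(loc=abs(e - 100) / 100.0, size=(500, 64))
                      for e in (50, 100, 150, 200)}

scores = {e: fid(real_acts, a) for e, a in fake_acts_by_epoch.items()}
best_epoch = min(scores, key=scores.get)  # lowest FID = best image quality
print(scores, best_epoch)
```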
Yu et al. documented better PSNR and SSIM with the pix2pix framework than with the CycleGAN.5,36
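For readers reproducing such comparisons, PSNR and SSIM for a paired original/synthesized image can be computed as in the following sketch (assuming a recent scikit-image; the image arrays are stand-ins):

```python
# PSNR/SSIM comparison sketch (assumes a recent scikit-image).
# `original` and `synthesized` stand in for a paired ground-truth map
# and its translated counterpart, both scaled to [0, 1].
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

rng = np.random.default_rng(1)
original = rng.random((256, 256, 3)).astype(np.float32)
noise = rng.normal(scale=0.05, size=original.shape).astype(np.float32)
synthesized = np.clip(original + noise, 0.0, 1.0)

psnr = peak_signal_noise_ratio(original, synthesized, data_range=1.0)
ssim = structural_similarity(original, synthesized,
                             channel_axis=-1, data_range=1.0)
print(f"PSNR = {psnr:.2f} dB, SSIM = {ssim:.3f}")  # higher is better for both
```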
We also note that unpaired training with the CycleGAN lacks a data fidelity loss term; therefore, preservation of small abnormal regions during the translation process is not guaranteed.
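For reference (these are the published objectives of the two frameworks, not a result of this study), pix2pix couples the adversarial loss with an explicit L1 term tying each output to its paired target, whereas CycleGAN, having no paired targets, enforces fidelity only indirectly through cycle consistency:

```latex
% pix2pix: adversarial term plus an explicit paired L1 fidelity term
G^{*} = \arg\min_{G}\max_{D}\;
        \mathcal{L}_{cGAN}(G, D)
        + \lambda\,\mathbb{E}_{x,y}\!\left[\lVert y - G(x) \rVert_{1}\right]

% CycleGAN: no paired target, so fidelity is only enforced indirectly
% via cycle consistency between the two mapping directions G and F
\mathcal{L}_{cyc}(G, F) =
        \mathbb{E}_{x}\!\left[\lVert F(G(x)) - x \rVert_{1}\right]
        + \mathbb{E}_{y}\!\left[\lVert G(F(y)) - y \rVert_{1}\right]
```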
Rozema et al.37 used a stochastic eye model to generate realistic random Gaussian distortions, which were superimposed on the anterior corneal surface to simulate the statistical and epidemiological properties of keratoconus. Their model can generate an unlimited number of keratoconus biometry sets; however, the parameter reduction it relies on comes at the expense of information loss, which reduces parameter variability. This modeling differs from our approach, which maps high-dimensional images into a latent space in which high-level features are extracted from individual pixels. This latent space is used to morph original images into new, analogous images under the constraints imposed by the loss functions and the source image domain, permitting unlimited synthesis of convincingly realistic and phenotypically diverse images that retain high-level feature similarity. In DCNNs, there is always a trade-off among training dataset size, model complexity, the nature of the data, and performance.7
Our results showed that increasing the training dataset size with synthesized images resulted in robust classification performance and decreased model overfitting, improving the network's generalizability to unseen test data, consistent with other studies using different datasets.7
In our dataset, traditional augmentation resulted in poor classifier performance, possibly due to the introduction of unhelpful spatial variance, which may prevent the classifier from identifying the most influential image pixels and result in model underfitting. This was inconsistent with the findings of other studies7,19,38 that used different datasets, more training iterations, and hyperparameter modulation.
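For concreteness, the sketch below shows the kind of traditional augmentation pipeline meant here, assuming TensorFlow/Keras; the transform parameters are illustrative, not those of our experiments.

```python
# Sketch of a traditional augmentation pipeline (TensorFlow/Keras;
# parameter values are illustrative only). Rotations and shifts displace
# pixels whose absolute position carries meaning in color-coded corneal
# maps, which is one way unhelpful spatial variance can enter the
# training set.
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rotation_range=15,       # degrees
    width_shift_range=0.1,   # fraction of image width
    height_shift_range=0.1,  # fraction of image height
    horizontal_flip=True,
    fill_mode="nearest")

images = np.random.rand(8, 224, 224, 3)  # stand-in training batch
labels = np.array([0, 1, 2, 0, 1, 2, 0, 1])
batches = augmenter.flow(images, labels, batch_size=4)
aug_x, aug_y = next(batches)             # one augmented mini-batch
print(aug_x.shape, aug_y.shape)
```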
Another perspective is that, with traditional augmentation strategies, the abnormality classifier may find it harder to approximate the noise function in augmented images than to approximate the image features generated by the GAN; in this respect, GANs provided a more generic solution. However, as we used a limited number of training iterations and made no substantial changes to the model architecture, further analysis is required to interpret the performance of the VGG-16 classifier, which was beyond the scope of our research. We demonstrated that the model overfitted the smaller training datasets and that implementing class weights during training was of limited value in counteracting the class imbalance. These findings strongly support the usefulness of the pix2pix cGAN for data augmentation, providing high-quality synthetic images on demand and in the required quantity. The overall subjective evaluations of the synthesized images in all image classes were promising. The artifacts present may be partially due to the small amount of data and to the transposed convolutions used in the decoder part of the generator architecture.39
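One commonly cited mitigation for such transposed-convolution ("checkerboard") artifacts, sketched below in PyTorch for a generic decoder block (not the generator used in this study), is to replace each learned upsampling layer with nearest-neighbor resizing followed by an ordinary convolution.

```python
# Sketch of a common mitigation for transposed-convolution ("checkerboard")
# artifacts: swap the learned upsampling layer for nearest-neighbor
# upsampling followed by a plain convolution (PyTorch; illustrative only).
import torch
import torch.nn as nn

# decoder block prone to checkerboard artifacts
up_transposed = nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1)

# artifact-resistant alternative: resize first, then convolve
up_resize_conv = nn.Sequential(
    nn.Upsample(scale_factor=2, mode="nearest"),
    nn.Conv2d(64, 32, kernel_size=3, padding=1))

x = torch.randn(1, 64, 32, 32)
print(up_transposed(x).shape)   # torch.Size([1, 32, 64, 64])
print(up_resize_conv(x).shape)  # torch.Size([1, 32, 64, 64])
```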
Subjective assessment of the synthesized images was satisfactory, in agreement with the objective evaluation results and with other reports.7,35
The quality of the automated annotation algorithm was also promising, giving the synthesized images a realistic appearance that made it difficult for human readers to discriminate between original and synthesized images. By simulating conventional maps, the annotation also helped human graders classify the synthesized images successfully. A refined annotation algorithm may further improve similarity with the original maps in the future.

Our study had some limitations. Generative models are always limited by the information contained within the training set and by how well it captures the variability of the underlying real-world data distribution. Because our data for both image generation and classification were sourced from a single institution, it remains an open question whether the results reported here generalize to data from other institutions, which may have different population statistics. Additionally, a drawback of the FID score is that ground-truth samples are not directly compared with synthetic samples; the score evaluates synthetic images indirectly, by comparing the statistics of a collection of synthetic images with the statistics of a collection of real images from the target domain. The absolute performance of the VGG-16 classifier could potentially be improved by additional architecture and hyperparameter searches, but in this study we focused on assessing trends in the classification metrics rather than on optimizing end performance. Finally, the reliability of the synthetic images may be improved by collecting data from more cases.