To take advantage of both labeled data (which is time-consuming to acquire) and unlabeled data (i.e., input images without any manual annotations) during training, we also used a semisupervised learning approach that incorporates the FCN architecture described above into a generative adversarial network (GAN) setup. In a traditional GAN,57 a generator subnetwork G (designed to generate realistic images from noise) is trained simultaneously with a discriminator subnetwork D (designed to differentiate between real images and those produced by the generator). Because the subnetworks are trained together (in an alternating fashion), they "compete" in a minimax game: the generator tries to fool the discriminator, and ultimately learns to generate realistic images on its own. In other words, the discriminator subnetwork serves to aid the training of the generator subnetwork, but the discriminator is no longer needed once training is complete. (Note that in other applications one can instead keep the discriminator rather than the generator, using it as a starting point for a semisupervised classifier, as in Salimans et al.,58 to take advantage of both labeled and unlabeled data in image-level classification tasks.)

In our case, however, rather than simply generating realistic-looking images, we wish to take input images and produce corresponding segmentations. We therefore follow an approach similar to previous work59,60 and use a GAN-like framework to help train a segmentation network with both labeled and unlabeled data. An illustration of our approach can be seen in Figure 6. Intuitively, instead of a generator G that maps noise to realistic-looking images, we use a subnetwork G that takes an input image and produces a segmentation, and a subnetwork D that differentiates, at the pixel level, between segmentations produced by G and reference segmentations (i.e., D must decide, for each pixel, whether the segmentation came from the generator subnetwork or from the reference ground truth). Both G and D in our semisupervised learning approach use the same network structure, following the FCN approach of the previous section. Note that after training is complete, as with a traditional GAN, we use only the trained G subnetwork in practice (e.g., during testing) and discard the discriminator subnetwork, so that the inputs and outputs are just as they would be with the FCN described in the previous section.
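To make the roles of the two subnetworks concrete, the following toy sketch illustrates how the discriminator and generator losses in such a semisupervised setup can be assembled from a labeled batch (image plus reference segmentation) and an unlabeled batch (image only). This is our own simplified illustration, not the paper's implementation: each FCN is replaced by a single-parameter pixelwise function, and the names `G`, `D`, `W`, `V`, and the specific loss weighting are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def G(image, W):
    # Toy stand-in for the segmentation subnetwork: maps an input image to a
    # per-pixel foreground probability map (a sigmoid of scaled intensities).
    return 1.0 / (1.0 + np.exp(-W * image))

def D(seg_map, V):
    # Toy stand-in for the pixel-level discriminator: for each pixel, the
    # probability that the segmentation value came from a reference
    # (ground-truth) mask rather than from G.
    return 1.0 / (1.0 + np.exp(-V * (seg_map - 0.5)))

def bce(p, target):
    # Pixelwise binary cross-entropy, averaged over all pixels.
    eps = 1e-7
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))

# Labeled batch: images with reference segmentations; unlabeled batch: images only.
x_lab = rng.random((4, 8, 8))
y_lab = (x_lab > 0.5).astype(float)   # synthetic ground-truth masks
x_unl = rng.random((4, 8, 8))

W, V = 2.0, 3.0                        # toy scalar "weights" for G and D
pred_lab = G(x_lab, W)
pred_unl = G(x_unl, W)

# Discriminator loss: reference segmentations are labeled "real" (1) at every
# pixel; G's outputs on both labeled and unlabeled images are labeled "fake" (0).
loss_D = bce(D(y_lab, V), 1.0) + bce(D(pred_lab, V), 0.0) + bce(D(pred_unl, V), 0.0)

# Generator loss: a supervised pixelwise term on the labeled batch, plus an
# adversarial term pushing D to score G's outputs as "real". The adversarial
# term needs no annotations, so it is what lets unlabeled images contribute.
loss_G = bce(pred_lab, y_lab) + bce(D(pred_unl, V), 1.0)
```

In the actual method, G and D are both FCNs updated by alternating gradient steps on losses of this general form; the sketch above shows only how the labeled and unlabeled batches enter the two objectives.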