Generate Photo-realistic image from sketch using cGAN

Samples from the pix2pix paper

In this report we study the possibility of building the neural model of human faces using cGAN.

In my last experiment, Generate Photo-realistic Avatars with DCGAN, I showed that it is possible to use DCGAN (Deep Convolutional Generative Adversarial Networks), the non-conditional variation of GAN, to synthesize photo-realistic animated facial expressions using a model trained from a limited number of images or videos of a specific person.

This report is a follow-up on the general idea, but this time we want to use the cGAN as described in the paper Image-to-Image Translation with Conditional Adversarial Networks (referred to as the pix2pix paper below), and apply it to synthesizing photo-realistic images from black-and-white sketch images (either Photoshopped or hand-drawn) of a specific person.

Overall this report is an empirical study on cGAN, with an eye towards finding practical applications for the technology (see the Motivation section below).


Motivation

Our long-term goal is to build a crowd-contributed repository of 3D models that represent the objects in our physical world. As opposed to scanning and representing physical objects using the traditional mathematical 3D model representation, we want to explore the idea of using a representation based on Artificial Neural Networks (ANN) for its ability to learn, infer, associate, and encode rich probability distributions of visual details. We call such an ANN-based representation the neural model of a physical object.

The intuition behind studying cGAN here is that if cGAN is capable of generating realistic visual details when given only scanty information, then perhaps it in fact constitutes an adequate representation for many visual aspects of a complex physical object.

We chose human faces for this study for the following reasons:

  1. Such images or videos are abundant and easy to acquire.
  2. Human facial expressions are fairly complex and a good subject for study.
  3. We are instinctively sensitive to images of human faces, thus the bar for the experiments is naturally higher than with other types of images. This allows us to spot problems in the experimental results more quickly.
  4. Human faces involve precise geometric relationships among facial features (eyes, nose, etc.). As such, they are a good candidate for studying what it takes for a generative system like cGAN to discover feature structure at the instance level, and not just a probability distribution at the population level.
  5. There are arguably more practical applications for human faces.

As a first step towards the long-term goal stated above, we choose to use cGAN for building the neural model of the face of a specific person. This differs from typical GAN applications, which tend to be applied to a wide variety of images. If successful, we will proceed with using cGAN or its extensions on other types of physical objects.

Goal of Experiments

Using human faces as the subject matter for a series of experiments, we seek to answer the following questions:

  1. How far can we push cGAN to fill in satisfactory details when only scanty information is provided in the input image, using a relatively small training dataset?
  2. Overall, is cGAN suitable as the basis for building the neural model of a specific person's face, representing the multitude of visual details of that face? For example, can cGAN be trained to accommodate artifacts in the test input image, recover from aberrant input, recover from missing parts, etc.?
  3. Is the cGAN neural model of one person transferable to another person?
  4. How useful would it be to build a universal cGAN neural model for all human faces?
The Setup

The setup for our experiments is as follows:

  1. Hardware: Amazon AWS/EC2 g2.2xlarge GPU instance (current generation), 8 vCPUs, 15GB memory, 60GB SSD.
  2. Software:
    1. Amazon AWS EC2 OS image (AMI): ami-75e3aa62 ubuntu14.04-cuda7.5-tensorflow0.9-smartmind.
    2. Torch 7, Python 2.7, CUDA 8
    3. cGAN implementation: Torch implementation of cGAN based on the paper Image-to-Image Translation with Conditional Adversarial Networks by Isola et al.
  3. Datasets: the following apply unless noted otherwise.
    1. The input images (either for training or testing) are grayscale images manually created from the ground truth color images by an artist using various Photoshop filters. A few images are hand-drawn, either copied from a ground truth photo (intended for training) or drawn free-hand without a ground truth photo (intended for testing).
    2. Images are cropped to 400x400 pixel size.
    3. Images are manually aligned to have the center point between the two eyes at a fixed point in the 400x400 frame.
  4. Training parameters. Unless noted otherwise, all training sessions use the following parameters: batch size: 1, L1 regularization, beta1: 0.5, learning rate: 0.0002, images are flipped horizontally to augment the training. All training images are scaled to 286 pixel width, then cropped to 256 pixel width.
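
The scale-then-crop step and the horizontal flip listed in the training parameters can be sketched in NumPy as follows. This is an illustration only, not the pix2pix code: the function name `jitter_and_flip`, the nearest-neighbour resize, and the use of a NumPy random generator are assumptions made to keep the example self-contained. In a paired setting the same offsets and flip decision would have to be applied to both the input image and its ground truth.

```python
import numpy as np


def jitter_and_flip(image, load_size=286, crop_size=256, rng=None):
    """Resize to load_size, take a random crop_size crop, and randomly mirror.

    `image` is an H x W x C array. Nearest-neighbour resizing is used only to
    avoid external dependencies; a real pipeline would likely interpolate.
    """
    rng = np.random.default_rng() if rng is None else rng
    height, width = image.shape[:2]

    # Nearest-neighbour resize to load_size x load_size.
    rows = np.arange(load_size) * height // load_size
    cols = np.arange(load_size) * width // load_size
    resized = image[rows][:, cols]

    # Random crop: each offset lies anywhere in 0..(load_size - crop_size).
    max_offset = load_size - crop_size
    top = int(rng.integers(0, max_offset + 1))
    left = int(rng.integers(0, max_offset + 1))
    cropped = resized[top:top + crop_size, left:left + crop_size]

    # Horizontal flip with probability 0.5.
    if rng.random() < 0.5:
        cropped = cropped[:, ::-1]
    return cropped


# A blank 400x400 color image stands in for one of the cropped face photos.
sample = np.zeros((400, 400, 3), dtype=np.uint8)
augmented = jitter_and_flip(sample, rng=np.random.default_rng(0))
print(augmented.shape)
```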
Baseline Experiment: building the AJ Model

Here we attempt to build a neural model of American actress, filmmaker, and humanitarian Angelina Jolie (referred to as the AJ Model below) using a relatively small training dataset.

The ground truth images (or target images) are color photos of Angelina Jolie, manually scraped from the Internet. The input images (for either training or testing) are manually processed by an artist using Photoshop, by converting them to black-and-white with filter effects. All input images have one particular style of effect applied, which we shall call the Style A effect. The center image in Figure 1a shows typical output from the testing phase, which is sampled from the trained model using the input image at left.


  1. Training dataset: 21 image pairs are used for training. The input images are all created from the ground truth images by an artist using a particular Photoshop effect (see Figure 1a, referred to as the Style A effect below).
  2. Test dataset: 10 images from the training set, all with the Style A effect applied (see Figure 1a).
  3. Training parameters: default. Training time: 4 hours, 2000 epochs.


Despite the small size of the training dataset, overall cGAN does a good job converting the black-and-white input images to the color version, with a great deal of convincing shading and colors that very closely match the ground truth photos. Some observations:

  • During testing cGAN is able to infer reasonable shading in the output image (e.g., center image in Figure 1a), such as the pinkish hue around the cheeks, even though most of the facial area in the input image (at left) is flat gray without any gradient.
  • The color and texture of the lips are convincing, even though they are given little detail in the input image.
  • In most of the ground truth photos the subject is wearing eye shadow. In Figure 1a, even though the input image shows little sign of eye shadow (especially under the eyes), cGAN was nonetheless able to apply convincing eye shadow.
Experiment #1: the case of too many effects

Figure 1a. Images shown from left are: input, output, ground truth. The input image shows the Style A effect, which is applied to all training and test images in the baseline experiment.

Figure 1b. The input image at left has the Style B effect applied, which creates strong contrast. This effect is not applied to any of the training samples. The output image in the center shows the problem of the woodblock printing look, where there is little gradient.

Figure 1c. This is the same test sample as in Figure 1b. The woodblock printing look in the output image has disappeared after including relevant training samples, and the output now appears photo-realistic.

While the baseline experiment above shows good results, the input images are in fact entirely uniform, with only one type of effect applied to reduce them to black-and-white. Here we want to find out what happens when the input images include several varieties of effects.

Test 1.A

The AJ Model from the baseline experiment is used, but the test samples contain some input images with Style B effect applied (see left image in Figure 1b).


  1. Figure 1b demonstrates a case where the test input image has the Style B effect applied, which is not present in any of the training images. The resulting output images (see center image in Figure 1b for an example) show a kind of woodblock printing effect that displays only a few colors with almost no gradient. This problem is studied further in Test 1.B below.

Test 1.B

Figure 1d. Tests with two additional effects (applied to the two black-and-white images) show the same problem: each effect needs to be included in the training samples in order to achieve a satisfactory result, otherwise the sampled output image tends to appear washed out.

  1. Training dataset: same as in Experiment #1, but augmented with more training samples that have the Style B effect applied to the input images. Total 68 training pairs.
  2. Training parameters: trained from the model derived in Experiment #1, 6.5 hours training time, 1000 epochs, other parameters same as Experiment #1.

Figure 1c shows the result from this test, where the same test image now appears photo-realistic without the woodblock printing effect.

Further tests with additional effects (see left images in Figure 1d) show a general pattern: test samples with a new effect (i.e., one that the model was never trained on) tend to show poor results, and including such samples in training resolves the problem.

While the result above is not entirely surprising, we do wish to find ways to make the model more tolerant to a wider variety of effects, so that we don't have to retrain cGAN on every new effect.

Experiment #2: the case of mutilated faces

Figure 2a. This demonstrates the problem of missing features, where the input images at left are missing part of the nose (top-left) or an eye (bottom-left). No sign of recovery is observed when sampled from the AJ Model.

Figure 2b. After including such samples in the training, sampling from the new model shows that it is capable of recovering the features to some degree.

Figure 2c. One view of the training process, showing an out-of-place eye in the intermediate output image.

Figure 2d. Another view of the training process, showing duplicated and out-of-place eyes in the intermediate output image.

In this experiment we want to find out whether it is possible to recover missing facial features in the input images. This is of interest here because as a neural model we would want it to be able to infer missing information from partial or altered observations.

Test 2.A

We created a set of new test input images with the Style A or Style B effect applied, then manually modified them to have certain facial features erased. These test images are then used to sample the AJ Model from Experiment #1 (which has been trained with Style A and B effects). The result is shown in Figure 2a, which demonstrates that the model is unable to recover the facial features omitted in the input images.

Test 2.B

The two samples shown in Figure 2a, which were used only as test samples earlier, are now included here for training.

Figure 2b shows the result after 4000 epochs of training. Note that in the top row of Figure 2b, the output image (at center) has been repaired by cGAN with a somewhat acceptable nose, though smaller than in the ground truth photo. The output image (at center) in the bottom row has been repaired with an eye that seems to be a copy from the ground truth photo, but it is larger and not quite in the right place.

A curious effect is observed (using pix2pix's Display UI tool) during the training phase of this experiment: successive snapshots show the missing part moving around the face and changing size, with no clear sign of convergence. Figures 2c and 2d give a glimpse of the phenomenon.

The problem was eventually resolved by turning off the random jitter operation, which this cGAN implementation applies by default. Random jitter adds a small amount of randomness to the cropping and resizing of the images, which seems to work well for other types of subject matter. Our conjecture is that it does not work well in this experiment in part because we are extremely sensitive to the precise relative positioning of facial parts: while we tolerate small shifts in other subject matter (e.g., street scenes, building facades, etc.), they become much more noticeable with faces.
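
To get a feel for the size of the jitter, here is a rough calculation of how far a fixed landmark can move under the scale-to-286, crop-to-256 scheme from the setup. The eye-center coordinates below are hypothetical (this report does not state the exact alignment point), and the helper name is made up for illustration.

```python
def landmark_range_after_jitter(x, y, source_size=400, load_size=286, crop_size=256):
    """Return ((x_min, x_max), (y_min, y_max)) of a landmark in the cropped frame.

    The landmark (x, y) is given in the source_size x source_size image.
    The crop origin can sit anywhere in 0..(load_size - crop_size) on each axis.
    """
    scale = load_size / source_size
    max_offset = load_size - crop_size
    scaled_x, scaled_y = x * scale, y * scale
    return (scaled_x - max_offset, scaled_x), (scaled_y - max_offset, scaled_y)


# Hypothetical eye-center position in the 400x400 frame.
(x_min, x_max), (y_min, y_max) = landmark_range_after_jitter(200, 160)
print(f"x in [{x_min:.1f}, {x_max:.1f}], y in [{y_min:.1f}, {y_max:.1f}]")
```

Under these assumptions the landmark can land anywhere in a 30-pixel window along each axis (the difference between 286 and 256), which is a little under 12% of the 256-pixel frame. Displacements of that size may be part of why jitter is so visible on carefully aligned faces.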

With the random jitter removed it can then be observed during training that missing parts are being repaired to near perfection. This of course does not mean much, unless it can also do so with new test images. This is further investigated below.

Test 2.C

Figure 2e. An input image with a different defect (i.e., missing the right eye) is used to sample against the model trained in Test 2.B, and the output image (at center) shows no repair made to the right eye at all. Curiously, the model chose to repair the good left eye, replacing it with a larger version.

Figure 2f. The top row shows that a missing left eye is repaired well during training. When a new test input image (bottom row, left) is given, the trained model repairs it with an eye that looks off.

Figure 2g. This example tests cGAN's ability to repair the same defect across different identities. When a model trained to repair a missing left eye is sampled with an input image of a different person, the output (at center) shows a partial repair with a very faint and mismatched eye.

The model trained in Test 2.B (referred to as model-2B below) is observed to repair the missing nose and eye satisfactorily during training. The next question is whether such repair is transferable in the following sense:

  1. Given a test input image of the same person with the same defect, whether model-2B can achieve satisfactory repair.
    The answer is: sort of. The top row of Figure 2f shows that cGAN has learned to repair a missing left eye during training. When a new input image (bottom left) with the same defect is given, the trained model repairs it with a left eye that seemingly belongs to another person.
    This brings up an interesting question: is cGAN as it is today able to learn structured relationships among features in the image, such as the mirror symmetry in 3D of the two eyes?
  2. Given the image of the same person with a slightly different defect, whether model-2B can achieve satisfactory repair.
    Figure 2e shows an input image (at left) with a missing right eye (it was a missing left eye in Figure 2b), which is used to sample against model-2B. The resulting output image (at center) shows no repair made to the right eye at all. However, it can be observed that the model chose to repair the good left eye, replacing it with a larger version.
    At this point it is a mystery how this has happened, and whether it is possible to find a solution.
    One conjecture is that this cGAN implementation's flip parameter is in play here, but this remains to be verified.
  3. Given the image of another person with the same defect, whether model-2B can achieve the same repair.
    Figure 2g shows the result of this test, where a model is first trained to repair a missing left eye, then the input image of a different person (i.e., the left image in Figure 2g) is used to sample against the trained model. The result (center image in Figure 2g) shows that the model was able to put a faint left eye in the correct position, but it does not match the right eye in shape or color.
    This result is expected, since model-2B is trained on AJ's images, it thus represents the probability distribution of her facial features alone. When the test image of a different person is used to sample against model-2B, the rendered repair will naturally yield AJ's features.
    In order to pass this test, the system must be capable of learning the constraints on the relationship among features, e.g., the fact that the two eyes must match in certain ways. This is a topic beyond the scope of this report.
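
As background for the flip conjecture in item 2 above, the following sketch shows what a horizontal flip does to the location of a defect. The box coordinates and the helper name are hypothetical, and this only illustrates the geometry; whether flipping actually explains the behavior in Figure 2e remains unverified.

```python
def mirror_box(box, width=256):
    """Mirror an (x0, y0, x1, y1) box horizontally in an image `width` pixels wide.

    x0 and x1 are half-open pixel bounds, so the mirrored box covers exactly
    the columns that the original box occupies after a left-right flip.
    """
    x0, y0, x1, y1 = box
    return (width - x1, y0, width - x0, y1)


# Hypothetical location of an erased eye in a 256-pixel-wide training image.
erased_eye = (70, 100, 110, 125)
print(mirror_box(erased_eye))
```

With flipping enabled, a defect that always sits on one side of the face in the source images would therefore also be presented to the model on the opposite side during training.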
Experiment #3: from art to photo-realism

Figure 3a. The image at the far left is a hand-drawn sketch (courtesy: Michelle Chen) based on the ground truth photo at far right.

Figure 3b. The hand-drawn black-and-white sketch at left was intentionally made to have a somewhat different expression from the ground truth photo. The output image is blurry and, feature-wise, closer to the ground truth photo that the model was trained on than to the input image.

All of the black-and-white input images used in the experiments above are processed by an artist using Photoshop. This means that each input image is a precise reduction of a ground truth photo, and thus retains a great deal of precision regarding the position and arrangement of many visual features in relation to its ground truth counterpart.

In this experiment we seek to find out whether an input image that is entirely hand-drawn, with all the imprecision of the human hand, can be converted to a photo-realistic image like the Photoshop-processed samples seen earlier. This is somewhat similar to the handbag example with hand-drawn outlines in the pix2pix paper, but here we test it on human faces.

For this experiment we asked an artist to find a photo of Angelina Jolie and draw several black-and-white sketches by hand based on the photo. These image pairs (the original photo and the sketch) are then used for additional experiments. Each sketch was made using graphite pencil on paper, then scanned and converted to a 400x400 JPEG file, with manual retouching in Photoshop as needed.

Test 3.A

Here we use the photo-sketch pairs as new test samples against the model from Experiment #2, which was trained on Style A and Style B effects in the input images, but never on imprecise hand-sketched samples (let's call this hand-drawn effect Style C). Figure 3a shows the initial result. It is not unexpected since the model has never been trained on this style.

Test 3.B

Here we include some hand-drawn samples in the training phase to derive a new model (referred to as model-3B below). When the input image in Figure 3b is sampled against model-3B, the result (center image in Figure 3b) is much improved over Figure 3a, though still somewhat blurry, possibly due to insufficient training. The output image is judged to be too similar to the ground truth photo used to train model-3B, so this experiment should be repeated with more samples.

Experiment #4: the case of mistaken identity

Figure 4a. This is a test where a male image is used to sample against the AJ Model. The black-and-white image at left is the input, far right is the ground truth photo, and the center image is the output, which appears to have picked up some features of Angelina Jolie.

Figure 4b. When sampling the AJ Model using a Beyoncé image, the output image picks up AJ's skin tone.

Given that we have built a neural model of Angelina Jolie (referred to as the AJ Model), how useful is it when applied to other people? Since a neural model trained exclusively on one person represents the probability distribution of this person's facial features, it is expected that applying the AJ Model to another person's photos will give somewhat reasonable results, but with some limits.

Figure 4a shows the result of sampling the AJ Model with an input image of American actor and producer Brad Pitt. As expected the result (center image) shows somewhat reasonable colors and shading, but it also picks up softer feminine lines, a less stubbly beard, and Jolie's brown hair color.

Similarly, in Figure 4b, sampling the AJ Model with an input image of the American singer, songwriter and actress Beyoncé results in an output image (at center) that picks up the lighter skin tone of Angelina Jolie.

From the perspective of building neural models for human faces, it would seem appropriate to have a separate model for each individual of interest. It would be interesting to see the creation of a hierarchy of such models, where the top one represents a model for all human faces, the bottom leaf models represent specific individuals, and the models in between represent groups of people (such as by race, by distinct features, etc.). With a well-designed mechanism we might be able to gain much efficiency in training and storage from such a hierarchical structure of many models.

Experiment #5: the case of decomposing faces

In this experiment we want to study how to decompose a face into parts, so that each part can be manipulated individually.

Why is this important? This is because if a neural model is composed of parts that can be learned without supervision, and that such parts can be treated as shared features across sample instances, then it is possible to achieve a kind of one-shot learning.

For example, assuming that cGAN is able to generate facial parts (e.g., eyes, noses, etc.) during its training process (just like a typical deep CNN could), and that the two noses in two photos activate the same neuron in cGAN, then you can say that this neuron now represents an anonymous concept of nose.

If we now attach a text label 'nose' to the image of a nose in photo A, then the system would know right away that the 'nose' label is likely also applicable to the noses in other photos. So here we have achieved a sort of one-shot supervised learning through the common nose neuron mentioned above.

If we use cGAN as the basis for implementing the neural model in question, then it would mean the following:

  1. High-level features created through the training process should be used as the basis for shared features across face instances.
  2. We cannot have entirely separate neural models (i.e., completely separate cGAN instances) for two individuals, since in that case there is then no way to create a generalized concept of a feature (e.g., a nose) across individuals.

This is a topic which will be explored further in a separate post.

Experiment #6: from photo to imitated artwork

Figure 6a. A trained cGAN model is used to convert an unseen test color photo to produce the black-and-white style effect (center).  The result is quite similar to the one manually created by an artist (right).

In this experiment we seek to apply cGAN in the other direction, by mapping from color photos to sketches of a certain style.

This turns out to be very easy, at least for those manually applied Photoshop effects that we have used in the previous experiments.

We use a training dataset of 48 pairs of images, where all the black-and-white images are manually created using the same Photoshop effect applied to the color photos. cGAN is then trained to map from color photos to black-and-white. After training for one hour we use the trained model on a separate set of color photos for testing. Figure 6a shows a typical test input image (left, in color), which is converted to a black-and-white output image (center) by the trained cGAN model. The result is deemed very good when the output image is compared with another image (right) converted manually by an artist using the same Photoshop effect used to create the training dataset.

With this it is then possible to use cGAN to bootstrap our own experiments as follows, in order to reduce the amount of manual work:

  1. Manually prepare a set of black-and-white photos S1 of the target effect (as seen in experiments 1-5) using a tool such as Photoshop.
  2. Use S1 as a training set, but map it in the other direction to have a cGAN model learn how to reproduce the target effect.
  3. This effect cGAN can then be used as a tool to generate more training data without manual work by an artist.
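
The bootstrapping steps above can be sketched as a small data-generation loop. This is a minimal sketch under stated assumptions: `placeholder_effect` is a plain grayscale conversion standing in for the trained photo-to-sketch cGAN (which is not reproduced here), `bootstrap_pairs` is a hypothetical helper name, and the random arrays stand in for new color photos.

```python
import numpy as np


def placeholder_effect(photo):
    """Stand-in for the trained photo-to-sketch cGAN: plain luminance grayscale."""
    weights = np.array([0.299, 0.587, 0.114])
    gray = (photo[..., :3].astype(np.float64) @ weights).round().astype(np.uint8)
    # Repeat the single channel so the sketch has the same shape as the photo.
    return np.repeat(gray[..., None], 3, axis=2)


def bootstrap_pairs(photos, effect_model):
    """Return (sketch, photo) training pairs for the sketch-to-photo direction."""
    return [(effect_model(photo), photo) for photo in photos]


# Random arrays stand in for additional color photos of the subject.
rng = np.random.default_rng(0)
new_photos = [rng.integers(0, 256, size=(256, 256, 3), dtype=np.uint8) for _ in range(3)]
pairs = bootstrap_pairs(new_photos, placeholder_effect)
print(len(pairs), pairs[0][0].shape)
```

In practice `effect_model` would be whatever function wraps sampling from the trained effect model; keeping it as a plain callable means the pairing logic does not depend on how that model is run.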

This technique should be applicable to many types of datasets that involve some sort of straightforward information reduction.

It would be interesting to see how far we can push it for generating more artistic and less faithful effects, such as caricatures, etc.


In this report we have conducted a series of empirical studies on the possibility of using cGAN as the basis for building a neural representation of human faces, with an eye towards applying the same technique to other types of physical objects in the future.

This particular flavor of the Conditional GAN allows us to map from an input image to another image, which gives us a handle to use cGAN in many ways.

Following is a summary of observations made from this study.

  1. cGAN has great potential to serve as the basis for modeling complex physical objects such as human faces. It can be used to model the visual features of either an individual or an entire population.
  2. Alignment of faces in the images turned out to be quite important. For best results all faces should be aligned and resized uniformly, with the two eyes on a horizontal line at roughly the same position. Deviation leads to several problems. This is not surprising, since in a way such alignment normalizes the location of the facial features and makes training simpler. The following problems were observed with unaligned samples:
    1. In the output images the color could appear faded out, diffused, or not very photo-realistic.
    2. For the missing feature problem studied in Experiment #2, unaligned faces (such as those highlighted in Figures 2c, 2d and 2e) are much harder to train on.
  3. Training cGAN to repair mutilated facial images (e.g., missing an eye or part of the nose), especially across identities, proved to be challenging. This is not unexpected, since this highlights the following issues:
    1. We need to find a way to learn structured features at the image instance level. For instance, repairing an eye in a particular image likely cannot be achieved by applying the average eye from all training samples.
    2. We need to find a way to manage the relationship at the feature level across different models. For example, an eye from Jolie's face model shares some features with an eye from another person's model, but at the same time they are not interchangeable.
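
The alignment recommended in observation 2 was done by hand in this study. As a sketch of how it could be computed instead, the following derives a similarity transform (uniform scale, rotation, translation) that places the two eyes on a horizontal line around a fixed center. The target center, the target eye distance, and both function names are assumptions for illustration, and the eye coordinates would have to come from manual annotation or a landmark detector.

```python
import math


def eye_alignment_transform(left_eye, right_eye,
                            target_center=(200.0, 160.0), target_distance=100.0):
    """Return (scale, angle, (tx, ty)) that levels and centers the eyes.

    `left_eye` is the eye with the smaller x coordinate in the image.
    A point (x, y) maps to scale * R(angle) * (x, y) + (tx, ty), where
    R(angle) = [[cos, -sin], [sin, cos]].
    """
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    scale = target_distance / math.hypot(dx, dy)
    angle = -math.atan2(dy, dx)  # rotate the eye-to-eye line onto the x-axis
    cos_a, sin_a = math.cos(angle), math.sin(angle)

    # Choose the translation so the midpoint of the eyes lands on target_center.
    mid_x = (left_eye[0] + right_eye[0]) / 2.0
    mid_y = (left_eye[1] + right_eye[1]) / 2.0
    tx = target_center[0] - scale * (cos_a * mid_x - sin_a * mid_y)
    ty = target_center[1] - scale * (sin_a * mid_x + cos_a * mid_y)
    return scale, angle, (tx, ty)


def apply_transform(point, scale, angle, translation):
    """Apply the similarity transform to a single (x, y) point."""
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    x, y = point
    return (scale * (cos_a * x - sin_a * y) + translation[0],
            scale * (sin_a * x + cos_a * y) + translation[1])


# Hypothetical eye positions in an unaligned photo.
params = eye_alignment_transform((150.0, 170.0), (240.0, 150.0))
print(apply_transform((150.0, 170.0), *params))
print(apply_transform((240.0, 150.0), *params))
```

After the transform both eyes share the same y coordinate and sit symmetrically about the target center, which is the normalization the observation describes.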

The experiments described above were conducted with a very limited number of data samples, as well as limited model training time. The observations and suggestions made above are quite preliminary, and further study is warranted.

Going Forward

There are several possible applications of the cGAN technology (or its extension) that we want to explore in separate posts:

Figure 7. Training cGAN to convert a normal photo (at left) to a depth map (at right).

  1. Use cGAN for achieving monocular depth perception.
    Here we seek to find whether cGAN can be used to convert a normal photo to a depth map, where the grayness in each pixel represents the distance to the target (see Figure 7). We want to know whether cGAN can be trained to achieve this, or will it merely learn to paint the likeness of a depth map that cannot be generalized beyond training samples.
  2. Use cGAN for image segmentation.
    Here we seek to find out whether it is possible to teach cGAN to segment and extract part of an image by learning from examples.
  3. Use cGAN for one-shot learning.
    Here we seek to get cGAN to learn a concept from just one case of supervised learning. For example, after cGAN has processed a number of face images unsupervised, adding a label 'nose' to part of one sample will allow it to correctly point out the noses in all samples.
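
For the depth-map idea in item 1, the target images would encode distance as gray level. A minimal sketch of that encoding is shown below. The convention that nearer is darker, the linear scaling, and the function name `depth_to_gray` are assumptions, since this report does not specify how the depth maps are produced.

```python
import numpy as np


def depth_to_gray(depth, near=None, far=None):
    """Map a 2D array of distances to 8-bit gray levels (near = 0, far = 255).

    `near` and `far` default to the minimum and maximum of `depth`; values
    outside that range are clipped.
    """
    depth = np.asarray(depth, dtype=np.float64)
    near = depth.min() if near is None else near
    far = depth.max() if far is None else far
    if far <= near:
        raise ValueError("far must be greater than near")
    normalized = np.clip((depth - near) / (far - near), 0.0, 1.0)
    return np.round(normalized * 255).astype(np.uint8)


# A tiny made-up depth map, in arbitrary distance units.
gray = depth_to_gray([[1.0, 2.0], [3.0, 5.0]])
print(gray)
```

Fixing `near` and `far` across the whole dataset, rather than per image, would keep a given gray level tied to the same distance in every training pair.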

I want to show my appreciation to the pix2pix team for their excellent paper and implementation, without which this work would have been much harder to complete.

I also want to show my gratitude to Fonchin Chen and Michelle Chen for offering to do the hand-drawn sketches, as well as helping with the unending process of collecting and processing the images needed for the project.
