Generate Photo-realistic Avatars with DCGAN

In this report we explore the feasibility of using DCGAN (Deep Convolutional Generative Adversarial Networks) to generate the neural model of a specific person from a limited amount of images or video, with the aim of creating a controllable avatar with photo-realistic animated expressions from such a neural model.

Here DCGAN holds the promise that the neural model created from it can be used to interpolate arbitrary non-existent images in order to render a photo-realistic and convincing animated avatar that closely resembles the original person.

Context of this research

This is part of a long-term open-source research effort, called the HAI project. The grand vision of the HAI project is to build a crowd-driven and open-source knowledge base (in the spirit of Wikipedia) for replicating our 3D world, enabled and enriched through the use of neural models.

This report is a first step in this direction, using human faces as the subject matter for a detailed study. We want to verify whether DCGAN can be used to build a satisfactory neural model of human faces through unsupervised learning, so that we can proceed to create an avatar out of such a neural model.

A broad survey of some of the DCGAN and related papers that precede this report can be found here; it helps to explain the thought process that led to this report.

Why a Neural Model

The neural model of a physical object differs from traditional 3D graphics formats in that it does not explicitly express the precise geometric structure of a physical 3D object; rather, it is a collection of many levels of visual features encoded in the layers of an artificial neural network.

Recent advances in the DCGAN technology show that it is capable of learning a hierarchical image representation from 2D image samples, unsupervised. This leads to the possibility of extending it and then using it as a representation for static or dynamic physical objects. Such a neural network-based object representation holds long-term benefits in the following sense:

  1. Given the tremendous recent progress in artificial neural networks (e.g., CNN, RNN, LSTM, dilated causal CNN, etc.), having the physical objects also represented in the same form will greatly simplify multi-modal learning (e.g., with text, sounds, etc.) involving physical objects.
  2. The vector representation generated by DCGAN can be used to support various useful operations, such as the vector arithmetic that maps to meaningful operations on the images, as described here (an illustrative sketch follows this list).
  3. Supervised learning can be performed based on such a vector representation in order to acquire the mapping between visual objects and other modalities.
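As an illustration of point 2, below is a minimal sketch of such vector arithmetic in the Z representation, in the spirit of the Radford et al. paper. The function name, the z_dim of 100, and the random stand-in exemplars are illustrative assumptions, not part of any particular DCGAN implementation.

```python
import numpy as np

def latent_arithmetic(zs_smiling_woman, zs_neutral_woman, zs_neutral_man):
    # Average each group of exemplar Z vectors, then apply simple arithmetic
    # in Z space: smiling woman - neutral woman + neutral man ~= smiling man.
    avg = lambda zs: np.mean(np.asarray(zs), axis=0)
    return avg(zs_smiling_woman) - avg(zs_neutral_woman) + avg(zs_neutral_man)

# Illustrative usage with random stand-ins for the exemplar Z vectors (z_dim = 100).
rng = np.random.RandomState(0)
z_new = latent_arithmetic(*[rng.uniform(-1, 1, (3, 100)) for _ in range(3)])
# z_new would then be fed to the trained generator to render the corresponding image.
```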

While the above areas will not be covered in this report, they do explain our motivation for studying the neural model approach.

Our Challenges

There have been many prior experiments where DCGAN is used to generate seemingly realistic random bedroom scenes, faces, flowers, manga, album covers, etc.

Here we seek to push it further to answer the following questions:

  1. Can DCGAN be used as the basis for generating the neural model of a specific object?
    Here as opposed to simply interpolating from many random training examples to generate broadly natural-looking images, we seek to use DCGAN to create a neural model for representing the dynamic views of a specific physical object, and also find practical applications for the method.
  2. Can photo-realistic and animated facial expressions of a specific person be created out of a trained DCGAN model?
    Here we choose human faces as our subject matter for the experiment, because images and videos of human faces are abundant and easy to acquire. And since we are sensitive to even minor deformities in human faces, the bar here is naturally high.
  3. How far can we push DCGAN to work reasonably well with training datasets that are very small and have little variety (since they are all of the same person when building an avatar)?
  4. Do we gain any advantage by training DCGAN on video samples?
The Setup

The setup for our experiments is as follows:

  1. Hardware: Amazon AWS/EC2 g2.2xlarge GPU instance (current generation), 8 vCPUs, 15GB memory, 60GB SSD.
    Note that I have also looked into using the Google Cloud Platform (GCP) for such experiments, but unfortunately GCP does not offer GPU instances at this time. Some comparisons of using AWS/EC2 and GCP for running DCGAN jobs can be found here.
  2. Software:
    1. Amazon AWS EC2 OS image (AMI): ami-75e3aa62 ubuntu14.04-cuda7.5-tensorflow0.9-smartmind. This AMI contains most of the configuration needed for this experiment, such as TensorFlow.
    2. TensorFlow 0.9, Python 2.7, Cuda 7.5
    3. DCGAN implementation: a Tensorflow implementation of DCGAN based on the paper Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks by Radford, et al.
Experiment #1: baseline reference

Here we run a well-tested DCGAN experiment as a baseline reference for additional experiments. Later we will also seek to use the result here for solving some problems encountered.

  1. Dataset: 109,631 celebrity photos scraped from the internet. The photos have been cropped and aligned (by the center point between the eyes) programmatically. This dataset can be found here.
  2. Test parameters: mini-batch size = 64, Adam optimizer, beta1 (momentum of the Adam optimizer) = 0.5, learning rate = 0.0002. A configuration sketch is given after this list.
  3. Result: similar to what other DCGAN experimenters have published, with the generator producing generally convincing faces. The following were produced at 20% of one epoch by randomly sampling the Z vector.
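For reference, here is a minimal sketch of how the above optimizer settings translate into TensorFlow code for the version used in this report; the dummy variables and losses below merely stand in for the actual DCGAN discriminator/generator losses and variable lists.

```python
import tensorflow as tf

learning_rate = 0.0002   # learning rate used in all experiments
beta1 = 0.5              # momentum term of the Adam optimizer
batch_size = 64          # mini-batch size for the baseline

# Dummy variables/losses so the sketch builds on its own; in the real code these
# are the DCGAN discriminator and generator losses and their variable lists.
w_d = tf.Variable(tf.zeros([1]), name='d_dummy')
w_g = tf.Variable(tf.zeros([1]), name='g_dummy')
d_loss = tf.reduce_sum(tf.square(w_d))
g_loss = tf.reduce_sum(tf.square(w_g))

d_optim = tf.train.AdamOptimizer(learning_rate, beta1=beta1).minimize(d_loss, var_list=[w_d])
g_optim = tf.train.AdamOptimizer(learning_rate, beta1=beta1).minimize(g_loss, var_list=[w_g])
```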
Experiment #2: tiny dataset of a specific person

Here we want to find out how small a dataset we can get away with. Note that the dataset used in the baseline above is rather large (> 100,000 photos) and contains photos of diverse identities. In this experiment we push to the other extreme by using a very small dataset of one specific person.

The problem with small datasets has been studied by Nicholas Guttenberg, who notes that the space can become exceedingly jagged, making it difficult for gradient descent to converge to fixed points. So first let's see what happens in this experiment; we will then seek remedies for it.

  1. Dataset: 64 photos of one specific person. Several variations were tested:
    • Manually scraped photos of the Duchess of Cambridge, Kate Middleton, manually cropped with no additional processing (Figure 2: manually scraped and cropped photos with natural backgrounds).
    • Same as above, but the background was manually removed.
    • Stock photos of a multitude of expressions of one person, with consistent background, lighting, hair style, and clothing (Figure 3: stock photos with better consistency; courtesy of faestock, http://faestock.deviantart.com/).
  2. Test parameters: Adam optimizer, beta1 (momentum of the Adam optimizer) = 0.5, learning rate = 0.0002. Mini-batch sizes of 1, 9, and 64 were tested (with 64, each epoch contains only one mini-batch).
  3. Result: the well-reported problems with model collapse or instability (see the Guttenberg or OpenAI articles) were observed, and as such no usable result was achieved.
    More specifically, the model usually falls into one of the following states:
    • Symptom A: the entire model collapses to a very small number of samples which render nearly perfectly, while all other points in the Z representation lead to highly mangled images. The discriminator loss stays consistently low, while the generator loss stays very high. Longer training does not help.
    • Symptom B: randomly sampled points from the Z representation all generate a nearly identical image M, and M changes from one mini-batch to the next. The discriminator loss stays consistently low, while the generator loss stays high. Longer training does not help.

Some previously suggested remedies for such problems include adjusting the batchnorm, applying regularization, adding noise (see the Radford paper), or using the minibatch discrimination technique (see the Salimans et al. paper).

We proceed to test out the minibatch discrimination technique as suggested in the Salimans et al paper. More specifically:

  1. Mini-batches are created not as disjoint subsets of the full dataset, but rather as staggered sets with some overlap, so a training sample can belong to more than one minibatch (see the sketch after this list).
  2. Human judgement is applied when creating the minibatches so that samples in a minibatch tend to be similar to each other. While this kind of intervention might seem like anathema to the ideal of unsupervised learning, it is in fact not an issue when this experiment is extended to video training samples, where adjacent frames are naturally similar to each other. It is in fact my opinion that video is a more natural source of training samples for our goal here.
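A minimal sketch of point 1, assuming the training samples are kept in a fixed order (for video, ordered by time so that adjacent samples are naturally similar); the function name and the stride parameter are illustrative and not part of the DCGAN implementation used here.

```python
def staggered_minibatches(samples, batch_size, stride):
    # Build mini-batches as staggered, overlapping windows over an ordered
    # sample list; stride < batch_size gives the overlap described above.
    batches = []
    for start in range(0, max(len(samples) - batch_size, 0) + 1, stride):
        batches.append(samples[start:start + batch_size])
    return batches

# Example: 64 ordered samples, batch size 9, stride 3 -> adjacent batches share 6 samples.
batches = staggered_minibatches(list(range(64)), batch_size=9, stride=3)
```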

We got somewhat better results from applying minibatch discrimination, in that the model seems less prone to collapse, but the generated images remain largely mangled (see Figure 4).

Figure 4. Mangled faces from Experiment #2, with the 9 images randomly sampled from the Z representation

Experiment #3: tiny dataset of a specific person - aligned

While inspecting the mangled faces from Experiment #2 (see Figure 4), we suspected that the alignment of the training samples is important. Our observations seem to indicate that no amount of training can get rid of the problem.

This suspicion is reinforced by the fact that the celebA dataset shows no such problem; its images have been programmatically processed so that the center point between the two eyes is aligned to the center of the image, and rotated so that both eyes lie on a horizontal line.

As such we repeated Experiment #2, but this time with the training samples further processed to have the same alignment as the celebA dataset.
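In our experiments this alignment was done by hand, but the following sketch shows the kind of celebA-style alignment involved, assuming the two eye coordinates are already known (marked manually or by an eye detector). OpenCV is used here only for illustration, and placing the eye midpoint exactly at the output center is a simplification of celebA's actual offsets.

```python
import cv2
import numpy as np

def align_face(image, left_eye, right_eye, out_w=178, out_h=218):
    # Rotate so both eyes lie on a horizontal line, then shift so the midpoint
    # between the eyes lands at the center of a 178x218 output image.
    (lx, ly), (rx, ry) = left_eye, right_eye
    angle = np.degrees(np.arctan2(ry - ly, rx - lx))   # tilt of the eye line
    center = ((lx + rx) / 2.0, (ly + ry) / 2.0)        # midpoint between the eyes
    M = cv2.getRotationMatrix2D(center, angle, 1.0)    # rotation about that midpoint
    M[0, 2] += out_w / 2.0 - center[0]                 # translate the midpoint to
    M[1, 2] += out_h / 2.0 - center[1]                 # the output image center
    return cv2.warpAffine(image, M, (out_w, out_h))
```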

Following are details regarding this experiment:

  1. Dataset: the same Kate dataset as in Experiment #2, except that the images are manually aligned.
  2. Test parameters: same as Experiment #2.
  3. Result: surprisingly, simple alignment resolved the mangled image problem from Experiment #2, and we were able to produce reasonable results with a small training dataset. With care we were able to generate usable images.
Experiment #4: video dataset

Using a suitable video as a source of training samples is desirable for the following reasons:

  1. It helps to alleviate training difficulties arising from inconsistencies in lighting, hair style, makeup, clothing, image background, aging, etc. Such problems are prevalent in scraped datasets.
  2. Videos are abundant and easy to acquire.
  3. Videos provide critical timing information that helps to make animation more natural. Note that this aspect is left for future research.
  4. Videos provide information about temporal patterns that is otherwise unavailable. Note that this aspect is left for future research.
  5. The implied object persistence in a video (i.e., the object recognized in frame N is likely to be the same as the similar object recognized in frame N+1) affords us a kind of anonymous unsupervised label that opens up many new research directions. Note that this aspect is left for future research.

Following are details regarding this experiment:

  1. Dataset: based on a video published to YouTube in 2015 of an interview with Adele. Segments of the video were sampled at a rate of one frame per second (see the extraction sketch after this list), manually cropped with Adele front and center, then manually aligned and reduced to 178x218 (WxH) pixels, the same size as the celebA dataset. Care was taken to ensure that the resulting images preserve their original order. We intentionally chose a low frame rate so that we can clearly see whether DCGAN is effective in filling in the gaps. We also intentionally chose a celebrity so that it is easier to judge the likeness of what DCGAN generates. Adele was chosen because she tends to have a wide variety of expressions.
  2. Test parameters: same as Experiment #1.
  3. Result: following is a set of 64 images randomly sampled from a trained model.
    This example shows that the DCGAN model has acquired a multitude of expressions from the training samples, and is able to generate reasonable interpolations from them.
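As mentioned in item 1, frames were sampled at one frame per second. Below is a sketch of that extraction step using OpenCV, assuming a reasonably recent cv2 build; the cropping, alignment, and resizing to 178x218 were then done manually as described above.

```python
import cv2

def extract_frames(video_path, out_pattern='adele_%04d.png', every_sec=1.0):
    # Grab one frame per `every_sec` seconds, preserving temporal order.
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0     # fall back if FPS metadata is missing
    step = max(int(round(fps * every_sec)), 1)
    index = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            cv2.imwrite(out_pattern % saved, frame)
            saved += 1
        index += 1
    cap.release()
    return saved
```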
Experiment #5: building a reusable model

Figure 6a. Initial state of Experiment #5 Test #1, showing images randomly sampled from the UFM. Here we use the trained model from Experiment #1 as our UFM. Figure 6b. End state of Test #1, with the Kate dataset trained on top of the UFM for only 75 minibatches; here 64 images are randomly sampled from the trained model. Figure 7a. The initial sampled images from Test #2, which indicate an empty model, i.e., we are training from scratch here. Figure 7b. The sampled images at the end of Test #2 after 75 minibatches, same as in Test #1, but showing much inferior results (compare Figure 6b).

Here we seek to answer two questions:

  1. Can we build some sort of Universal Face Model (UFM) that captures the essence of all human faces, so that it can be reused when training on new face datasets, with the hope of reduced training time and better image quality? Otherwise, each time we want to create an avatar for a new person we would have to retrain DCGAN from scratch, which typically takes considerable time.
  2. How much of a performance gain do we get with such a reusable universal face model?

Experimental setup

  1. The model trained on the celebA dataset (see Experiment #1), which contains >100k distinct faces, is used as our UFM. This UFM was trained for around 20,000 minibatches of 64 photos each, and took nearly a whole day to complete on our low-end GPU instance.
  2. A set of 178 Kate Middleton photos is used as our New Person (NP) dataset. The photos in NP have been cropped and aligned in exactly the same manner as the photos in the UFM dataset.
  3. All sampled images are taken at 64 fixed random points in the Z representation.
  4. Test #1: this test is initialized with the UFM model, then trained on the NP dataset for 75 minibatches (see the fine-tuning sketch after this list). Figure 6a shows the sampled images at the start of this test, which do not yet contain any influence from the NP dataset. Figure 6b shows the sampled images at the end of the test, which show a fairly reasonable likeness to the target subject. This training took 298 seconds on our low-end GPU instance.
  5. Test #2: this test starts with an empty model, then trains on the NP dataset for 75 minibatches, the same as in Test #1. Figure 7a shows the sampled images at the start of this test, which contain just noise. Figure 7b shows the sampled images at the end of the test, which are still quite rough. This training took 3 minutes.
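The essence of Test #1 is simply to restore the UFM weights and continue training on the NP dataset. Below is a minimal sketch of that step, assuming the DCGAN graph, session, and saver have already been built, and with run_one_minibatch standing in as a hypothetical callback that performs one discriminator/generator update.

```python
import tensorflow as tf

def train_from_ufm(sess, saver, ufm_checkpoint_dir, run_one_minibatch, steps=75):
    # Restore the UFM (the celebA-trained model) if available, otherwise start
    # from scratch (as in Test #2), then fine-tune for a small number of mini-batches.
    ckpt = tf.train.latest_checkpoint(ufm_checkpoint_dir)
    if ckpt is not None:
        saver.restore(sess, ckpt)                  # Test #1: start from the UFM weights
    else:
        sess.run(tf.initialize_all_variables())    # Test #2: empty model
    for _ in range(steps):
        run_one_minibatch(sess)                    # one D/G update on the NP dataset
```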

We draw the following conclusions from this test:

Figure 8. Four images showing the noise texture problem, which was found only with the learning-from-scratch approach. Image A is generated when learning from scratch. Image B is a magnified version of A, highlighting the noise texture in the generated image. Image C is generated from a model trained on celebA, which does not have the noise texture problem. Image D also does not have the problem; it is trained the same way as Image A, except that training starts from the UFM (i.e., celebA's trained model).

  1. The trained model from the celebA dataset (see Experiment #1) appears to be an adequate Universal Face Model. Intuitively this makes sense: since the model's generator has been trained on a large number of distinct faces, it must contain layers of features common to most faces. As such we should be able to take advantage of it by training the NP dataset on top of it.
  2. The limited tests above show that training based on the UFM can be several times faster than training from scratch.
  3. We have observed some sort of noise texture in the generated images. Referring to Figure 8, we analyze this as follows:

    • Case #1: the model is trained from scratch using the Adele video dataset (which contains only 81 images). The generated images appear noisy when inspected up close (see Figure 8.a for a normal-size view, and Figure 8.b for a magnified view). It seems that no amount of further training is able to remedy this.
    • Case #2: the noise texture problem does not happen in the baseline experiment (see Figure 8.c for a sample).
    • Case #3: the noise problem also does not happen when the model is trained starting from a UFM (see Figure 8.d for an example). Here the UFM is used as the initial model, and the same Adele dataset is then trained under exactly the same parameters as in Case #1.

    The Radford paper demonstrated a similar phenomenon in its Figure 3, noting repeated noise textures across multiple samples (such as the base boards), which were attributed to under-fitting. Given that our training datasets tend to be relatively tiny, it is not surprising that we observe such an under-fitting problem. Case #2 escaped this problem due to the sheer size and variety of its training dataset. In Case #3 we show that by using the Case #2 model as a starting point (i.e., treating it as a UFM), the under-fitting problem is alleviated.

Create Animated Expressions

Once we have a good model trained out of the photos or videos of one specific person, it is then possible to create photo-realistic animated expressions out of it. A simplistic method for this is as follows:

Figure 9A. A two-second 64x64 animation showing Adele cracking a smile from a slight pout, created by interpolating between two points in the Z representation of a trained DCGAN model. Figure 9B. A more elaborate sequence showing Adele looking up and then smiling, generated the same way as Figure 9A. Figure 9C. Adele growls; generated the same way as the other images.

  1. Visually inspect a gallery of generated images and identify a source expression S (e.g., a neutral expression) and a target expression T (e.g., smiling) of interest.
  2. Plot a straight line in the Z representation from S to T, find the set of points {P} that divide the line into equal parts, then write out the generated images for {P} along the path (see the sketch after this list). In the examples here we divided the line into 20 parts, and the resulting images are animated over a two-second duration, playing forward and then backward.
  3. Use a tool (e.g., ffmpeg, or the Python library MoviePy, etc.) to combine the images {P} into an animated GIF file or video.
  4. Visually select the resulting animated GIF files for those that show the best effect.
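A minimal sketch of steps 2 and 3, assuming a hypothetical generate(z_batch) wrapper around the trained DCGAN generator that returns uint8 images of shape (N, H, W, 3); MoviePy is used for the GIF step, as mentioned in step 3.

```python
import numpy as np
from moviepy.editor import ImageSequenceClip

def animate_expression(generate, z_source, z_target, steps=20, fps=20,
                       out_path='expression.gif'):
    # Walk a straight line from z_source to z_target in the Z representation,
    # render each point, then play the frames forward and backward as a GIF.
    alphas = np.linspace(0.0, 1.0, steps + 1)
    zs = np.stack([(1 - a) * z_source + a * z_target for a in alphas])
    frames = list(generate(zs))         # one generated image per interpolation point
    frames += frames[::-1]              # forward, then backward
    ImageSequenceClip(frames, fps=fps).write_gif(out_path)
```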

The Figure 9 series are examples of animated expressions synthesized entirely from a trained DCGAN model. All animations were created using the same parameters and setup. The jump in the animation is caused by the looping of the GIF file.

The above is a brute-force and simplistic kind of animation, since it completely ignores the patterns of human facial expressions. An animated expression created from a straight line in the Z representation doesn't always look convincing, and thus needs to be carefully screened. There are many possibilities for creating better animations out of a trained model, and these are left for future research.

Conclusions

To answer the questions asked at the top of this report (see the Our Challenges section):

  1. Can DCGAN be used as the basis for generating the neural model of a specific object?
    Our experiment shows that a reusable Universal Face Model helps to reduce training time, as well as alleviate the noise texture problem that comes from the use of small datasets. We believe that given the promising result, further research for objects beyond faces is warranted.
  2. Can photo-realistic and animated facial expressions be created out of a trained DCGAN model?
    We were able to create animated expressions from a trained model, which demonstrated that the basic premise is sound. The resulting quality is somewhat low, in part due to the limited computing power available at the time of the experiment. We have reason to believe that much higher graphic quality and full range of realistic expression is within reach using this approach.
  3. How far can we push DCGAN to work reasonably well with training datasets that are very small and have little variety (since they are all of the same person)?

    We managed to produce reasonable models using DCGAN with as few as 64 training samples. For human faces it seems that alignment is the key. However, we did observe the following problems, which have been reported by other DCGAN experimenters:

    1. Model collapse. We frequently observed the collapse of the model during training, where the model generates only one or very few distinct images, and the moving average of the gLoss/dLoss value (the generator's loss divided by the discriminator's) explodes; a small monitoring sketch follows this list of conclusions. In contrast, this has not been observed in the baseline (i.e., Experiment #1), which is trained on a very large and diverse dataset and where the gLoss/dLoss value tends to stay reasonably stable throughout training.
    2. Degradation from further training. Training for longer does not always create better results.
  4. Do we gain any advantage by training DCGAN on video samples?
    Data acquired from video samples is naturally consecutive (assuming that the sampling frame rate is not too low), with adjacent frames being largely identical. We can look at this from several angles to see what we gain from it:

    1. From the perspective of faster convergence during training or achieving higher-quality results, the benefit is not yet clear. More experimentation is needed in this area.
    2. From the perspective of generating an avatar with finer expressions, such as mouth movements while the subject is talking, we believe that the use of video samples is a must. This is because only video can provide the timing and detailed information needed for recreating finer expressions. This is a topic for further research.
  5. Can DCGAN be used to create a high-quality and dynamic model of a specific object beyond faces?
    The experiments in this report are our first attempt at applying DCGAN to create a neural model of faces, rather than just creating some sort of interpolated images. We are hopeful that there are many interesting research directions here beyond just modeling faces.
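As a practical aside to the model collapse observation above (conclusion 3.1), the following sketch shows the kind of gLoss/dLoss moving-average monitoring we relied on to spot collapse during training; the smoothing factor is an illustrative choice.

```python
def make_loss_ratio_monitor(alpha=0.1):
    # Exponential moving average of gLoss/dLoss; a ratio that keeps climbing
    # (rather than hovering around a stable value, as in the Experiment #1
    # baseline) was our practical signal of model collapse.
    state = {'ema': None}
    def update(g_loss, d_loss, eps=1e-8):
        ratio = g_loss / (d_loss + eps)
        state['ema'] = ratio if state['ema'] is None else alpha * ratio + (1 - alpha) * state['ema']
        return state['ema']
    return update

# Usage inside the training loop (the loss values here are illustrative):
monitor = make_loss_ratio_monitor()
smoothed_ratio = monitor(g_loss=2.5, d_loss=0.7)
```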

Our contributions

We make the following contributions in this work:

  1. We show a practical application of DCGAN in the form of building the neural model of a specific physical object from images or videos without supervision. Such a neural model can conceivably be used as a form of representation for certain visual aspects of a specific physical object.
  2. We show that by interpolating in the Z representation of a trained DCGAN model, it is possible to synthesize photo-realistic animation of the specific object used for training.
  3. We show that with the approach outlined above, using human faces as the subject matter of this study, we are able to synthesize photo-realistic animated expressions from a limited training dataset with good results. With further work such a technique can conceivably be extended to create a full photo-realistic avatar of a person.
  4. We point out a practical bottom-up approach for applying the DCGAN technology: instead of using DCGAN to interpolate from a large number of images with much variety, we can instead focus on one very specific object and create a detailed model of it. Over the longer term we can then accumulate and extend many such detailed models toward practical usage.
  5. We offer an approach to alleviate the inherent under-fitting problem associated with very small training datasets through the use of a reusable DCGAN model. In our experiment we built a Universal Face Model (UFM), which represents a prototypical neural model for all human faces; we then used this UFM when training new, small datasets for building new avatars. We show that the use of the UFM helps to alleviate the said under-fitting problems.
  6. We show that through the use of a Universal Face Model the training time for the neural model of a new subject can be substantially reduced. For the purpose of building avatars, this means that creating an avatar for a new person using the UFM takes only a fraction of the time needed when training from scratch.
  7. We point out a promising future research direction where videos can be used as the training dataset for DCGAN.
Going Forward

It is my belief that DCGAN and its extensions can be used for building the neural models of our physical world, unsupervised, from images and videos.

Here we take the first baby step using human faces as the subject matter for study, and have managed to build neural models for human faces using DCGAN, then subsequently use such neural models to create photo-realistic and animated expressions.

While we have not yet built an avatar with a full range of expressions, we have demonstrated that the approach holds a great deal of promise. Viewed strictly from the perspective of creating an avatar using the DCGAN approach, there is still much to be investigated. More specifically:

  1. Add a controlling element so that another program can treat the neural model as a dynamically controllable avatar. So far in this work we have demonstrated the fundamentals of synthesizing piecemeal expressions out of limited images or videos, which is necessary for building an avatar, but we have not yet provided the dynamic control mechanism.
  2. Generate images in much higher image resolution. Current experiments operate on training images at the resolution of 200x200 pixels or less, which is fairly grainy.
  3. Add automatic segmentation capability for learning from parts of an image.
  4. Automatic separation of spurious factors, such as lighting, clothing, background, hair style, etc.
  5. Learn from videos at higher frame rates. In Experiment #4 above we used a frame rate of one frame per second. This is in part due to the preliminary nature of this research, and obviously a great deal of information is lost with such coarse sampling.
  6. Transfer of features or expressions. The Radford paper demonstrated the possibility of operating on the vector representation to transfer visual features between images, such as adding sunglasses or a smile. It would be interesting to demonstrate that we can make avatar A smile like avatar B through the same principle.
  7. Create avatar with full range of controllable expressions through unsupervised learning.
  8. Acquire finer expressions through learning, such as those around the mouth when the subject is talking. Currently we are able to handle only relatively simple expressions, such as going from neutral to smiling, or turning of the head. Dealing with fine expressions will likely require us to extend DCGAN further, perhaps into the temporal domain.
  9. Perform multi-modal learning, e.g., acquire the relationship between the speech/text and facial expressions.
  10. Convert neural models to 3D models suitable for VR/AR devices or 3D printers. This is an exciting area, since perfecting this would afford us an unsupervised method for creating the large number of dynamic and realistic 3D models needed to support a rich VR/AR world.
Resources
  1. Goodfellow et al., Generative Adversarial Nets, 2014.
  2. Radford et al., Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks, 2016.
  3. Salimans et al., Improved Techniques for Training GANs, 2016.
  4. Nicholas Guttenberg, Stability of Generative Adversarial Networks, 2016.
  5. John Glover's blog, An introduction to Generative Adversarial Networks (with code in TensorFlow), 2016.
  6. Casper Kaae Sønderby's blog, Instance Noise: A trick for stabilising GAN training, 2016.
  7. A good introduction to DCGAN from OpenAI
  8. StackOverflow, How to auto-crop pictures using Python and OpenCV
  9. The DCGAN implementation used in this report: a Tensorflow implementation of DCGAN, contributed by Taehoon Kim (carpedm20).