###### Introduction

Our ultimate goal is to generate 3D models from textual or verbal commands. Here we tackle (for now) the simpler problem of generating 2D images, before moving on to the more complex problem of dealing with 3D models.

Some recent research is relevant to generating 2D images in a way that can also handle lighting, poses, perspective, and (for facial images) emotions. In particular, DCGAN shows promise as a way of discovering high-level image representations through unsupervised learning, which is highly relevant to our goal here. In this post I survey this research in order to find a direction toward the stated goal.

This post is part of the How to build a Holodeck series, which is a long-term crowd-driven open-source project (abbreviated to the name HAI below) that I am working on. The posts in the series serve as a working document for sharing ideas and results with the general research community.

About the HAI project

While we have a fairly long-term goal for the crowd-sourced HAI project, namely generating 3D models from textual descriptions, as a first step we reduce it to the simpler core problem of generating 2D images from textual descriptions.

###### Case #1: CNN+DNN

The paper Learning to Generate Chairs, Tables and Cars with Convolutional Networks proposes a method that learns from 2D images of many types of objects (e.g., chairs, tables, and cars, rendered from 3D models for experimentation) and can then generate realistic images with unseen styles, views, or lighting. The method is based on a convolution-deconvolution architecture (abbreviated to CNN+DNN below).

The following shows a model from the paper. The goal of the model is to reconstruct the given image and segmentation mask from the input parameters. The input parameters include the model identity (which defines the style), the orientation of the camera, and other artificial transformations (e.g., rotation, translation, zoom, horizontal or vertical stretching, and changes in hue, saturation, or brightness).

This model works as follows:

1. (Layers FC-1 to FC-4) The input parameters are independently fed through two fully connected layers, then concatenated and fed through two more fully connected layers to generate a shared high-dimensional representation h.
2. Layers FC-5 and uconv-1 to uconv-4 then generate the image and the segmentation mask from h in two independent streams.
3. The network is trained by minimizing the error of reconstructing the segmented-out chair image and the segmentation mask.
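
The data flow in the steps above can be sketched as follows. This is a minimal, untrained illustration in plain Python. All layer sizes, the random weights, the input encodings, and the helper names (`make_layer`, `dense`, `build_network`, `forward`, `reconstruction_loss`) are assumptions made for this sketch. Plain fully connected layers stand in for the paper's uconv layers, and a sum of squared errors is assumed as the reconstruction error, so the code only shows how the streams are wired together, not the paper's actual network.

```python
import random

def make_layer(rng, n_in, n_out):
    """Random weights and zero biases for a fully connected layer (untrained)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def dense(x, layer):
    """Fully connected layer followed by a ReLU non-linearity."""
    weights, biases = layer
    return [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]

def build_network(rng, input_sizes, hidden=16, image_size=8 * 8 * 3, mask_size=8 * 8):
    """All sizes here are illustrative assumptions, not the paper's values."""
    return {
        # FC-1, FC-2: each input parameter gets its own two-layer branch.
        "branches": [(make_layer(rng, n, hidden), make_layer(rng, hidden, hidden))
                     for n in input_sizes],
        # FC-3, FC-4: applied to the concatenated branch outputs, producing h.
        "shared": (make_layer(rng, hidden * len(input_sizes), hidden),
                   make_layer(rng, hidden, hidden)),
        # Two independent output streams (dense stand-ins for FC-5 and uconv-1..4).
        "image_stream": make_layer(rng, hidden, image_size),
        "mask_stream": make_layer(rng, hidden, mask_size),
    }

def forward(net, inputs):
    """Map the input parameters to (h, image, mask)."""
    branch_outputs = []
    for x, (fc1, fc2) in zip(inputs, net["branches"]):
        branch_outputs.extend(dense(dense(x, fc1), fc2))
    h = dense(dense(branch_outputs, net["shared"][0]), net["shared"][1])
    image = dense(h, net["image_stream"])
    mask = dense(h, net["mask_stream"])
    return h, image, mask

def reconstruction_loss(image, mask, target_image, target_mask):
    """Sum of squared errors over both output streams (assumed error measure)."""
    image_err = sum((a - b) ** 2 for a, b in zip(image, target_image))
    mask_err = sum((a - b) ** 2 for a, b in zip(mask, target_mask))
    return image_err + mask_err

rng = random.Random(0)
style = [1.0, 0.0, 0.0, 0.0]     # one-hot model identity (4 hypothetical chairs)
view = [0.5, 0.87, 0.34, 0.94]   # hypothetical camera-orientation encoding
transform = [0.0] * 6            # hypothetical transformation parameters
net = build_network(rng, [len(style), len(view), len(transform)])
h, image, mask = forward(net, [style, view, transform])
```

Training would adjust the weights to reduce `reconstruction_loss` against the rendered chair image and its mask; that optimisation step is left out of the sketch.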

The challenges here are:

1. Can a high-level representation be learned through such a model? Put in plain language, if we ask the model to interpolate between two known chair styles (or other parameters such as orientation), will we get something that looks like a reasonable chair?
2. How well does this method extend to natural training images, which may have random backgrounds, inconsistent lighting, etc.?

From chairs to faces

zo7/deconvfaces is a Python implementation of the paper above, posted by Michael D. Flynn. It adapts the method to interpolate between images of human faces, with interesting results.

Interpolating between multiple identities and emotions: same lighting and pose (i.e., facial orientation).

Relevant resources for the deconvfaces experiment:

1. The Extended Yale Face Database B: the uncropped and cropped versions are supported by the zo7/deconvfaces implementation above.
2. Additional experimental results from applying deconvfaces to the Yale Face Database B, posted by Michael Flynn on imgur and YouTube.
3. Blog by Michael D. Flynn.

Following are some experimental results reported by Michael D. Flynn.

Interpolate between mixed identities and emotions, based on the Radboud Faces Database.

Interpolate on lighting, based on the Extended Yale Face Database B.

Interpolate on poses, based on the Extended Yale Face Database B.

Significance
From the perspective of the HAI project, this method is significant in the following areas:

1. It is able to acquire a high-level representation of images. Such a representation is essential for performing the various manipulations needed to meet a user's request.
2. It is able to generate reasonable interpolation from given images. This is a sign that the acquired representation is an effective one. This type of capability will allow HAI to generate infinite variations of the target image in order to meet a user's request.
3. It is able to perform some form of extrapolation. From the paper:

The chairs dataset only contains renderings with elevation angles 20° and 30°, while for tables elevations between 0° and 40° are available. We show that we can transfer information about elevations from one class to another.


Such a capability for extrapolation, or generalization, is critical in reducing the amount of learning that is needed.

4. The deconvfaces experiments with human faces show that realistic lighting and poses can be interpolated. This shows promise that it is perhaps possible to generate realistic 3D models out of such 2D images.
###### Case #2: DCGANs: unsupervised learning of image representation

DCGANs (deep convolutional generative adversarial networks) are a class of CNNs that can be trained on image datasets. The paper shows convincing evidence that the deep convolutional adversarial pair learns a hierarchy of representations, from object parts to scenes, in both the generator and the discriminator. DCGANs have also shown great promise for generating realistic-looking images.

The following are realistic images of bedroom scenes generated by the model (from the paper; code at https://github.com/Newmu/dcgan_code).

While the images above look nice, how do we know that the learned representation is meaningful? Again, by looking at how well the model interpolates between images we can get a sense of whether it has learned a good image representation.

The following experimental result shows a series of interpolations between 9 randomly selected images. The significant part is that all of the images look reasonably realistic, and the in-between transitions (say, from a TV to a window, or a window emerging from a wall) look plausible. Previous methods, by contrast, might just produce a blurred morph between images.
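
A sequence like this can be produced by walking through the generator's input space. The sketch below assumes the interpolation is linear, that it is carried out on the generator's input vectors rather than on pixels, and that those vectors are 100-dimensional and drawn uniformly from [-1, 1]. The helper names `lerp` and `latent_path` are made up for illustration, and no trained generator is included.

```python
import random

def lerp(a, b, t):
    """Linear blend of two vectors: t=0 gives a, t=1 gives b."""
    return [(1 - t) * x + t * y for x, y in zip(a, b)]

def latent_path(anchors, steps_between):
    """Visit each anchor in order, with `steps_between` blended points
    inserted between every pair of neighbouring anchors."""
    path = []
    for start, end in zip(anchors, anchors[1:]):
        for i in range(steps_between + 1):
            path.append(lerp(start, end, i / (steps_between + 1)))
    path.append(list(anchors[-1]))
    return path

rng = random.Random(0)
# Nine random anchor points (dimension and distribution are assumptions).
anchors = [[rng.uniform(-1, 1) for _ in range(100)] for _ in range(9)]
path = latent_path(anchors, steps_between=9)
# Each vector in `path` would then be fed to a trained generator to obtain
# one frame of the transition sequence.
```

If the generator has learned a smooth representation, neighbouring vectors in `path` should decode to similar, plausible images, which is the property the figure illustrates.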

As described in the paper, DCGAN is capable of learning a hierarchy of image representation through unsupervised learning. What does this mean, and why is it important for the HAI project?

As mentioned above, our goal is to allow realistic images (and eventually 3D models) to be created and manipulated through verbal commands. For images to be manipulated in complex ways toward such a goal, an image cannot be treated merely as a collection of pixels. Rather, it has to be transformed into a hierarchy of parts, and that transformation has to be learned mostly unsupervised by the system itself.

Vector arithmetic

The following example demonstrates the image representation learned by DCGAN. The representation allows DCGAN to put sunglasses on a female face using what it has learned from other types of faces, even if it has never seen a woman with sunglasses before.

This is an indication that:

1. DCGAN has learned, unsupervised, how to break down the training images into meaningful parts (i.e., facial features are separate from sunglasses); and
2. DCGAN is capable of performing operations on such a representation (e.g., transferring the sunglasses from a male face to a female face) and producing a reasonable result. This is in many ways reminiscent of how Word2vec learns word representations from text, so that a vector operation such as Brother-Man+Woman yields Sister.

So in a sense, DCGAN can already be viewed as a precursor of the HAI system: with some additional training on verbal commands, it may be possible to instruct it to manipulate faces toward what a user wants.
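
The sunglasses example can be written as simple arithmetic on representation vectors. The sketch below uses made-up 3-dimensional vectors in which, purely for illustration, the first component stands for gender and the second for sunglasses; a real DCGAN representation has many more dimensions and no such hand-labelled axes. Averaging several example vectors per concept before combining them is also an assumption of this sketch, as are the names `mean_vector` and `combine`.

```python
def mean_vector(vectors):
    """Component-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def combine(with_attr, without_attr, target):
    """mean(with_attr) - mean(without_attr) + mean(target), component-wise."""
    a = mean_vector(with_attr)
    b = mean_vector(without_attr)
    c = mean_vector(target)
    return [x - y + z for x, y, z in zip(a, b, c)]

# Hypothetical representation vectors: [gender, sunglasses, other variation].
man_with_sunglasses = [[1.0, 1.0, 0.25], [1.0, 1.0, -0.25]]
man_without_sunglasses = [[1.0, 0.0, 0.5], [1.0, 0.0, -0.5]]
woman_without_sunglasses = [[-1.0, 0.0, 0.5], [-1.0, 0.0, 0.0]]

woman_with_sunglasses = combine(
    man_with_sunglasses, man_without_sunglasses, woman_without_sunglasses
)
print(woman_with_sunglasses)  # [-1.0, 1.0, 0.25]
```

The resulting vector keeps the "woman" value on the first axis and gains the "sunglasses" value on the second; feeding such a vector to a trained generator is what would produce the new face.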

Related experiments

1. Here is a DCGAN implementation based on TensorFlow.
2. Here is a blog post, Image Completion with Deep Learning in TensorFlow, showing how DCGAN can be used for image completion, where part of an image can be erased or added in a realistic manner.

Significance

DCGAN is important for the following reasons:

1. It is capable of generating a representation from training images, unsupervised.
2. It is capable of generating realistic images.
3. It is capable of generating realistic interpolations.
4. The Word2vec-like vector operation capability (see the woman-with-sunglasses example above) is intriguing, since it points to the possibility of a rich representation that can do much more than a simple one.

Open questions

1. Can DCGAN support some form of extrapolation? Can the image completion example above be considered a form of extrapolation, and how can it be further extended?
2. How far can we push the vector operation on this representation? How can we extend it to 3D?
###### Case #3: Generate Images from Text

This method uses the DCGAN approach to generate realistic images from text. The following are some of the results shown in the paper:

How it works

It trains a DCGAN conditioned on text features encoded by a hybrid character-level convolutional recurrent neural network. Both the generator network G and the discriminator network D perform feed-forward inference conditioned on the text feature.
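
One simple way to picture the conditioning is that both networks receive the text feature alongside their usual input. The sketch below is an assumption-laden illustration, not the paper's architecture: `encode_text` is a toy character-count stand-in for the hybrid character-level convolutional recurrent encoder, the vector sizes are arbitrary, and plain concatenation is assumed as the way the text feature is joined to the other inputs.

```python
import random

def encode_text(description, dim=8):
    """Toy stand-in for the text encoder: character counts folded into
    `dim` buckets and normalised to sum to 1."""
    features = [0.0] * dim
    for ch in description.lower():
        features[ord(ch) % dim] += 1.0
    total = sum(features) or 1.0
    return [f / total for f in features]

def generator_input(noise, text_features):
    """Input to G: the noise vector followed by the text feature."""
    return list(noise) + list(text_features)

def discriminator_input(image_features, text_features):
    """Input to D: image features followed by the same text feature."""
    return list(image_features) + list(text_features)

rng = random.Random(0)
text = encode_text("a small bird with a red head")
noise = [rng.uniform(-1, 1) for _ in range(100)]   # hypothetical noise size
g_in = generator_input(noise, text)

fake_image_features = [0.0] * 32                   # placeholder for G's output
d_in = discriminator_input(fake_image_features, text)
```

Because the same text feature reaches both networks, the discriminator can penalise images that look realistic but do not match the description, which is what pushes the generator to follow the text.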

Significance

1. Needless to say, this feels pretty much like a primitive Holodeck, where the system creates the target image based on textual descriptions.
2. Furthermore, this system is also capable of separating style from content (i.e., foreground and background information in the image).
3. It is capable of transferring pose and background from query images onto text descriptions.
4. It can be generalized to generate images with multiple objects and variable backgrounds.
###### Case #4: Filling in Details with cGAN

How do we create convincing visual details for a specific object from little information?

The 2016 paper by Isola et al., Image-to-Image Translation with Conditional Adversarial Nets, demonstrates the use of a conditional GAN (cGAN) to generate convincing details from sketchy information, as shown below from the paper:

The figure displays six pairs of images. In each pair, the left image is the input to the cGAN, which then generates the image on the right.

How it works

The standard GAN generator G learns a mapping from random noise vector z to output image y, i.e., G:z→y. In contrast, cGAN learns a mapping from observed image x and random noise vector z to y, i.e., G:{x,z}→y.
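
For reference, the two mappings correspond to the following adversarial objectives, written in the usual GAN notation. The discriminator D and the expectations are not introduced in the text above and are assumed here. The conditional form shown is the common one in which D also observes x, and the paper's full training objective may contain additional terms.

$$
\begin{aligned}
\text{GAN:}\quad & \min_G \max_D \; \mathbb{E}_{y}\big[\log D(y)\big] + \mathbb{E}_{z}\big[\log\big(1 - D(G(z))\big)\big] \\
\text{cGAN:}\quad & \min_G \max_D \; \mathbb{E}_{x,y}\big[\log D(x, y)\big] + \mathbb{E}_{x,z}\big[\log\big(1 - D(x, G(x, z))\big)\big]
\end{aligned}
$$

The only structural difference is that x appears as an argument of both G and D in the conditional case, so the generated image is judged against the input it was supposed to elaborate.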

Significance

1. This gives us a starting point for generating details of a specific object or environment on demand.
2. The use of a conditional term (i.e., the x in cGAN) may give us more control over the behavior of the system.
###### Case #5: Synthesizing facial expressions from video/image sample

By applying a standard DCGAN to image or video samples of a specific person, I was able to create a kind of neural model of that person, and then use the model to generate sequences of non-existent, photo-realistic facial expressions for the person.

While this does not involve technical innovation beyond the standard DCGAN, it does represent a novel way of applying the DCGAN towards specificity (i.e., the facial expressions of a specific person), and not generality (i.e., for generating arbitrary bedroom scenes).

###### Summary

We have surveyed a number of promising research efforts above, from which we may be able to borrow ideas and extend them to achieve what we need for the HAI project.

Here is what we have learnt:

1. The DCGANs (and their variations) show a promising general direction for the HAI project.
2. It would seem that it is possible to generate a meaningful image representation out of it, where operations such as interpolation, extrapolation, and vector operations can be carried out with good quality. Such operations are essential for the HAI project.
3. Case #3 demonstrates that it is possible to separate image background (called style in the paper), and apply it to another context. This is critical for image composition in HAI.

The following are possible future directions, framed as questions we wish to answer:

1. What if we extend the generative approach of DCGAN or the conv-deconv method, but train entirely on photos of a single person (as opposed to the wide variety of subjects used in most previous experiments), in order to create a highly polished and manipulable neural model of that person? More specifically:

   1. Can such a model encompass expressions, poses, ages, and lighting?
   2. What does it take to transfer such parameters to another identity?
   3. Can it learn to remove spurious information, such as the background?
   4. Would DCGAN work well on video of a single person? Would the implied object persistence (i.e., the man in frame N and the man in frame N+1 are most likely the same person) benefit the training process in some way?
   5. Case #3 above shows that a multi-modal DCGAN is a promising method for discovering complex relationships between text and images. How can we extend this into the domain of interactive discourse, so that it is possible to generate the target image through incremental textual commands?

   Such questions will be explored in a separate post.

2. What does it take to be able to *manipulate parts* of an image? For example, in the chair example above the system needs to be able to alter only part of the chair (e.g., the arm rest) on request.
3. The system needs the capability to reason about relationships between parts of an image, such as even spacing, distance, and top/down/left/right relationships.
4. Find a way to accumulate relevant knowledge incrementally, so that we don't have to retrain from scratch every time.
5. What does it take for the system to learn conversational interactions, so that the target image can be generated through a sequence of interactive textual commands? Case #3 points in a direction, although there is still much to be done. Note that we want the system to learn everything without hard-coded knowledge, if possible.
6. What does it take to achieve one-shot learning?
7. What does it take to achieve 3D representation, perhaps in a way similar to what DCGAN made possible for 2D images?

Going forward, we will pursue and extend the research mentioned above in separate posts, including hands-on testing with actual implementations.

###### Other resources

The following are kept here because they are potentially useful, but still pending further investigation:

1. Paper: How Do Humans Sketch Objects?.
Question: can DCGAN be used to create realistic sketches?
2. Paper: Precomputed Real-Time Texture Synthesis with