Monocular Depth Perception with cGAN

Figure 1. Example of an image pair used to train cGAN. The color image at left comes from a standard camera, while the black-and-white image at right is a depth map where the brightness of each pixel is an indication of distance from the same perspective.

Is it possible to train a cGAN (Conditional Generative Adversarial Network) model for monocular depth perception?

If the answer is yes, then we would have a way for an artificial system to acquire a basic concept of distance in the physical world, learning only from flat images and starting with no built-in knowledge.

The type of training proposed in this report goes as follows:

  1. First we train an instance of the cGAN on many pairs of static images of various objects or environments, where the first image in each pair is a full-color photo, and the second image is a depth map of the color photo (see Figure 1 for an example). There is no particular relationship between any two image pairs.
  2. After the training result is satisfactory, this trained cGAN can then be used to convert an unseen photo into a reasonable depth map. In other words, this cGAN would have achieved monocular depth perception.
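Concretely, such training could use the objective given in the pix2pix paper, in which the generator G learns the photo-to-depth mapping against a discriminator D, with an L1 term pulling the output toward the ground-truth depth map:

```latex
G^* = \arg\min_G \max_D \; \mathcal{L}_{cGAN}(G, D) + \lambda\, \mathcal{L}_{L1}(G)
```

where \(\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}[\log D(x, y)] + \mathbb{E}_{x,z}[\log(1 - D(x, G(x, z)))]\) and \(\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y,z}[\|y - G(x, z)\|_1]\), with x the input photo, y the ground-truth depth map, and z the generator's noise input.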

Some may find the premise above questionable, so let's get a few objections out of the way first:

  1. If the system already has the equipment for creating the depth maps needed for training, then why would it need to learn depth perception at all?
    One possible reason is that once the system has been trained to detect depth on its own, it can be deployed very cheaply many times over, without the relatively expensive depth-detection hardware (assuming the higher precision is not needed, etc.).
    Another reason is that it is cool to show that cGAN can do this with no pre-programmed logic, practicality aside.
  2. Is it inevitable that we will need some depth-sensing hardware, at least for the training phase?
    Not necessarily. It is conceivable that we can train such a system in a virtual world, such as the DeepMind Lab, where both the standard camera view and the depth information can be acquired without special hardware. If such a virtual world is sufficiently rich in details, then perhaps it is possible to apply the depth-sensing capability learned there to the physical world.
Goals of experiments

In this report we investigate the above premise with a series of experiments, seeking preliminary answers to the following questions:

Figure 2a. This is a sample image pair from the Regime-V dataset, where artificially-created virtual scenes contain only simple objects and lighting, with perfect depth maps. Figure 2b. This is a sample image pair from the Regime-R dataset, which contains real-world scenes with complex and unpredictable objects, many with faulty depth maps acquired through depth-sensing devices.

  1. Can depth perception be trained from monocular static images, using a method like cGAN which was not invented to deal with depth perception at all? Will cGAN turn out to just learn to paint perfect depth maps during training and then fail miserably during testing?
  2. Which training regime is easier: training from clean and simple virtual scenes (see Figure 2a, referred to as Regime-V, V for virtual), or training from complex and messy real-world scenes (see Figure 2b, referred to as Regime-R, R for real-world)?
  3. Which training regime is more generalizable? In other words, which of the following will give us better results?
    • First train on virtual scenes from Regime-V, then test the trained model using real-world scenes from Regime-R.
    • First train on real-world scenes from Regime-R, then test the trained model using virtual scenes from Regime-V.

It is worth mentioning that this is a preliminary study on whether this research direction warrants further investigation; as such it does not contain large-scale experimentation over vast datasets. Judgement on the quality of the results is based on careful but somewhat subjective analysis, and no attempt is made to support it with precise experimental numbers as is typically done in formal research papers.

Context of this research

In my last experiment, Generate Photo-realistic Avatars with DCGAN, I showed that it is possible to use DCGAN (Deep Convolutional Generative Adversarial Networks) to synthesize photo-realistic animated facial expressions from a model trained on a limited number of images or videos of a specific person.

In another report we investigated the idea of building neural models of human faces using cGAN (as described in the paper Image-to-Image Translation with Conditional Generative Adversarial Networks, referred to as the pix2pix paper below), and applied it to synthesizing photo-realistic images from a black-and-white sketch (either Photoshopped or hand-drawn) of a specific person.

Overall such studies serve the long-term goal of building complex and realistic 3D objects or environments from interactive verbal commands (ref: How to build a Holodeck).

Experimental Setup

The setup for the experiments is as follows:

  1. Hardware: Amazon AWS/EC2 g2.2xlarge GPU instance (current generation), 8 vCPUs, 15GB memory, 60GB SSD.
  2. Software:
    1. Amazon AWS EC2 OS image (AMI): ami-75e3aa62 ubuntu14.04-cuda7.5-tensorflow0.9-smartmind.
    2. Torch 7, Python 2.7, Cuda 8
    3. cGAN implementation: pix2pix: a Torch implementation for cGAN based on the paper Image-to-Image Translation with Conditional Generative Adversarial Networks by Isola, et al.
  3. Datasets: two datasets are used in this report. It should be noted that these two datasets render depth in opposite gray scales: one marks near-to-far as black-to-white, while the other marks it as white-to-black.
    • The Foucard dataset, contributed by Louis Foucard, is generated by a Python Blender script that creates large numbers of randomized 3D scenes with corresponding sets of stereoscopic images and depth maps. See Figure 2a for a sample image pair. This dataset is used as our Regime-V dataset. It contains only a handful of geometric objects, with very simple lighting and colors. Since the scenes are virtual, the depth maps are generated perfectly, without the artifacts and inaccuracies found in real-world depth maps acquired through depth-sensing devices.
      The original dataset comes with stereoscopic views for the color images. In this report we have arbitrarily selected the left-eye view for the experiments.
      Figure 3. Example of an unsuitable depth map image (at right) where there is large area of black artifact at the top.
    • The SUN RGB-D dataset (direct link to a zip file, as well as a 6.9GB processed version shared by Brannon Dorsey) from the SUNRGB-D 3D Object Detection Challenge of the Princeton Vision & Robotics Labs, which is used as the Regime-R dataset for our experiments. A portion of the depth map images in the Princeton dataset are deemed to be of too low quality (see Figure 3) and detrimental to the training of cGAN, so they are manually excluded.
  4. Training parameters. Unless noted otherwise, all training sessions use the following parameters: batch size 1, L1 regularization, beta1 0.5, learning rate 0.0002; images are flipped horizontally to augment the training. All training images are scaled to 286 pixels in width, then cropped to 256 pixels, with a small random jitter applied during the process.
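The scale-then-crop jitter described above can be sketched in pure Python (a simplified nearest-neighbor illustration; the actual pix2pix implementation performs this in Torch, and applies the same offsets to both images of a pair so the photo and depth map stay aligned):

```python
import random

def nearest_resize(img, new_size):
    """Nearest-neighbor resize of a square gray-scale image (list of rows)."""
    old = len(img)
    return [[img[i * old // new_size][j * old // new_size]
             for j in range(new_size)] for i in range(new_size)]

def jitter_crop(img, load_size=286, fine_size=256, flip=True):
    """Scale up to load_size, random-crop to fine_size, optionally flip."""
    scaled = nearest_resize(img, load_size)
    max_off = load_size - fine_size
    top, left = random.randint(0, max_off), random.randint(0, max_off)
    crop = [row[left:left + fine_size] for row in scaled[top:top + fine_size]]
    if flip and random.random() < 0.5:
        crop = [row[::-1] for row in crop]  # horizontal flip for augmentation
    return crop
```

The random offset means each epoch sees a slightly different 256-pixel window of the 286-pixel image, which is the "small random jitter" mentioned above.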
Experiment #1: Training with virtual scenes

In this experiment we use 1500 image pairs from the Foucard dataset for training, and 150 image pairs for testing. Training time is 4 hours using the given setup.

Figure 4a. Example of a good test result, where the trained system generates a depth map (at center) that matches closely with the one from the Foucard dataset (at right). Figure 4b. Example of a not-so-good test result, where the trained system generates a depth map (at center) that misjudges the depth of the cone at left side of the scene.

Evaluation using testing samples shows that the system has learned to convert the input color images to match very closely with the corresponding depth maps from the test dataset. In particular (see Figure 4a), the system has learned to ignore the lighting pattern on the walls, as well as the colors and shading of the objects, which play no role in deciding the depth.

One area where the system shows weakness is in judging depth, which in some cases is either less than perfect or simply incorrect. Figure 4b demonstrates a case where the cone at left is not rendered correctly in the output image cGAN generated (at center) from the test input (at left). Since such deficiencies usually improve with more training samples, we judge cGAN as being capable of learning the depth map overall, and quite efficiently at that.
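While this report deliberately avoids precise experimental numbers, a simple quantitative check would be straightforward to add, e.g. the mean absolute per-pixel error between a generated depth map and its ground truth (a hypothetical metric, not one used in these experiments):

```python
def mean_abs_error(pred, truth):
    """Mean absolute per-pixel difference between two gray-scale images,
    each given as a list of rows of pixel values."""
    total = count = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            total += abs(p - t)
            count += 1
    return total / count
```

A lower score would indicate a closer match to the ground-truth depth map, giving a rough ranking of test outputs like Figures 4a and 4b.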

Experiment #2: Training with real-world scenes

In this experiment we use 86 image pairs from the SUN RGB-D dataset for training, and 198 image pairs for testing. Training time is 12 hours using the given setup. The small number of samples used here is due to the difficulty of manually screening out low-quality depth maps in the dataset, as well as limitations in the resources available at this time.
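The manual screening could conceivably be automated with a crude heuristic, e.g. flagging depth maps where near-black artifacts cover too much of the image; a hypothetical sketch (the thresholds are illustrative guesses, and this was not the procedure actually used):

```python
def looks_faulty(depth_map, black_threshold=10, max_black_fraction=0.25):
    """Flag a depth map whose near-black area exceeds a fraction of the image.

    Large black regions (as in Figure 3) usually indicate sensor dropout
    rather than true depth. Both thresholds are illustrative guesses.
    """
    pixels = [p for row in depth_map for p in row]
    black = sum(1 for p in pixels if p <= black_threshold)
    return black / len(pixels) > max_black_fraction
```

Such a filter would only be a first pass; borderline cases would still need visual inspection.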

Figure 5a. The output image (at center) is generated by the trained cGAN from the input (at left), and is considered generally quite good at resolving the correct depth. Figure 5b. Test output (center image) shows good promise, with lots of room for improvement. Figure 5c. An animated GIF created through Depthy using a depth map learned by cGAN, which demonstrates a reasonable depth effect.

Figure 3 shows a typical faulty depth map, prevalent in the SUN RGB-D dataset; such maps are excluded from training. They are however kept for testing as a benchmark for comparison with the generated depth maps.

Evaluation using test samples shows mixed results. In some cases the model has learned to convert the input color images to match very closely with the corresponding depth maps from the dataset. The example in Figure 5a demonstrates that the model has learned to ignore the lighting pattern on the walls and the colors and shading of the objects, and its depth perception is overall quite good.

Figure 5b shows a test result of intermediate quality. Note that the depth map from the SUN RGB-D dataset (at right) contains a large area of black artifact at the upper-left corner, while in comparison the depth map (center image) produced from the photo (left image) by the trained cGAN shows more reasonable result in the same area. However, the chairs are somewhat incompletely rendered.

It is worth noting that separate experiments conducted by Brannon Dorsey at the Branger_Briz digital R&D lab, with a 3500-sample training dataset and the same pix2pix implementation, do not suffer from the problems shown in Figure 5b, even without manually screening out low-quality samples. It thus seems that such problems were a result of under-training, and that low-quality training samples can be overcome given a sufficiently large dataset.

Overall we believe that the model has demonstrated the ability to learn to produce generally correct depth maps, and the problems observed are likely due to under-training, or perhaps the quality of the training depth maps.

Experiment #3: Extending from virtual to real scenes

In this experiment we use the model trained in Experiment #1 based on Regime-V using virtual scenes, and apply it towards a Regime-R test dataset with real-world scenes.

Figure 6. Test results for this experiment are universally bad; the depth map (center image) generated from a real-world scene (left image) by the model trained on virtual scenes shows no comprehension of depth.

The quality of the test results from this experiment is judged as extremely low: the trained Regime-V model shows little comprehension of depth in real-world scenes (see Figure 6). As can be seen in Figure 6, the output (center image) resembles merely a blurry gray-scale version of the input image, still preserving the irrelevant light and shadow, and its pixel-level gray values have no correlation with depth.

Obviously the virtual scenes from the Regime-V training dataset do not contain sufficient cues to allow the model to cover real-world scenes. For further research it would be interesting to find out what kind of virtual scene dataset would be sufficient for training a model that performs satisfactorily on real-world test scenes. For example, if we train a cGAN agent inside DeepMind Lab's 3D learning environment, would such an agent transfer well to a physical robot navigating the physical world?

Experiment #4: Extending from real to virtual scenes

In this experiment we use the model trained in Experiment #2 based on Regime-R using real-world scenes, and apply it towards a Regime-V test dataset containing virtual scenes.

Figure 7. The generated depth map (center image) compares poorly to the ground-truth depth image (at right). This is the case for the entire Regime-V test dataset.

The quality of the test results from this experiment is judged as very poor (see Figure 7). All 158 test samples look like this, with ghostly shapes of furniture from the Regime-R training data almost visible in the output.

Obviously the samples from the Regime-R and Regime-V training datasets are sufficiently different that the result is not transferable between them.


Conclusions

We find that training cGAN for monocular depth perception from static image pairs is likely feasible, and that the experiments should be expanded with much larger and more varied training datasets. Training on Regime-R with real-world photos takes many times longer than training on the Regime-V dataset, likely due to the complexity of the real-world scenes, as well as the poorer quality of the depth maps acquired through depth-sensing devices.

The experiments above were conducted entirely with existing datasets contributed by others. Aside from problems with the quality of real-world depth maps, there are also problems with the inconsistent depth-map color schemes used in different datasets, which make it difficult to use them together without further processing. With the advent of low-cost depth-sensing devices such as Google Tango, the higher-resolution Kinect, or suitable smartphone-based depth-sensing apps, it would be interesting to expand the experiments using self-generated datasets targeting specific areas (e.g., human faces or poses).
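The color-scheme mismatch itself is mechanically easy to normalize once a single convention is chosen; for an 8-bit gray-scale depth map, flipping the near/far convention is a one-line inversion (a trivial sketch):

```python
def invert_depth(depth_map, max_value=255):
    """Invert an 8-bit gray-scale depth map (list of rows of pixel values),
    swapping the near-as-black and near-as-white conventions."""
    return [[max_value - p for p in row] for row in depth_map]
```

Running every dataset through such a normalization step before training would allow mixing depth maps from sources that use opposite conventions.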

So how could this be put to practical use? While in theory such depth-perception capability could be applied to something like robotic navigation, in its current form it is perhaps too primitive to be competitive with other, more mature ANN-based approaches. However, if the robustness of this approach can be demonstrated in further studies, then it is conceivable that it could be used to add low-precision 3D perspective to the vast number of photos and videos available out there.

Going forward

There are several possible research directions going forward:

  1. Test with much larger datasets to confirm the result.
  2. Test with outdoor scenes, animals, people, and faces.
  3. Test with stereoscopic datasets.
  4. Test with videos, perhaps involving extending cGAN into the time domain, or borrowing some ideas from the VideoGAN.
  5. Test in a rich interactive virtual 3D world, such as the DeepMind Lab. Also learn how to correlate depth perception with agent actions and consequences in such a virtual world.

Can cGAN be trained to perform automatic image operations, such as erasing backgrounds in photos, or aligning and resizing faces? We shall explore this topic in a separate post.


Acknowledgements

The idea of applying cGAN to depth perception came originally from Brannon Dorsey at the Branger_Briz digital R&D lab, who also graciously shared his dataset and model for use in the experiments here.

I want to express my appreciation to the pix2pix team for their excellent paper and implementation, without which this work would have been much harder to complete.

I also want to thank Louis Foucard and the Princeton Vision & Robotics Labs for making their datasets available.

Last but not least, I want to show my gratitude to Fonchin Chen for helping with the unending process of collecting and processing the images needed for the project.

References

  1. Isola et al., Image-to-Image Translation with Conditional Generative Adversarial Networks, 2016.
  2. pix2pix, a Torch implementation for cGAN
  3. The Louis Foucard dataset
  4. The SUN RGB-D dataset from the SUNRGB-D 3D Object Detection Challenge of the Princeton Vision & Robotics Labs
  5. Vondrick et al., Generating Videos with Scene Dynamics, 2016.
  6. Kendall et al., PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization, 2016. Video, article, source code (Caffe, TensorFlow).