<![CDATA[All about Machine Learning...]]>http://www.k4ai.com/Ghost 0.8Wed, 10 May 2017 04:50:54 GMT60<![CDATA[A Strategic Blueprint for Taiwan's IoT Industry]]>
]]>
http://www.k4ai.com/tw-iot/6e46f95a-fec0-4646-a593-8dc06da2b9adMon, 01 May 2017 00:49:44 GMT

Many experts have already offered sound analyses of the various challenges facing Taiwan's IoT (Internet of Things) industry, so those will not be repeated here. The aim of this article is simply to brainstorm a further, concrete, and workable blueprint for Taiwan's industry, in the hope that these initial ideas will draw out better ones and spark more responses and discussion.

The industries within the IoT field are broad. The analysis in this article applies only to the IoT device and systems segments, such as IoT home appliances, IoT terminal equipment, driverless cars, drones, unmanned factories, robots, smart-city devices, logistics and distribution, and so on. The IoT chip and telecommunications industries are different in nature and fall outside the scope of this article. Below, we use AI+IoT (artificial intelligence plus the Internet of Things) as the main axis for discussing development opportunities for Taiwan's IoT industry.

###### The Big Pie of the IoT Industry

1. Traditional hardware is a fiercely competitive business; margins are squeezed and profits are hard to come by.
2. The world's major software and internet companies are pouring their enormous financial and technological advantages into the IoT field, making head-on competition difficult for Taiwan's industry.
3. A highly connected IoT is the trend of the future. Beyond being intelligent, IoT devices must integrate closely with other devices on the network. Control of the IoT market is therefore largely decided by software systems rather than hardware devices. Most Taiwanese companies started out as hardware OEMs, so this kind of intelligence and integration is not their strength.
4. Many small and medium-sized enterprises lack the capacity to develop the advanced software systems their products require, let alone build large IoT platforms on their own. They also face systems that quickly become obsolete, and IoT standards and protocols that evolve too fast to follow.
5. There are plenty of intelligent IoT platforms developed abroad, but they are hard to adopt, expensive, and their features do not fully fit the needs of local vendors. Worse, over-dependence on external IoT platforms makes it difficult for the domestic industry to gain control of its market.

###### Strategy 1: Taiwan's industry jointly builds a shared IoT integration platform

• Control of IoT lies with the IoT integration platform, not with the terminal hardware. Without a software platform that Taiwan's IoT industry controls, and that can integrate devices from many vendors, the industry will struggle to break out of its current position. Most Taiwanese IoT companies lack the resources to build a high-quality integration platform on their own; those that can will only occupy niche markets, never reaching the critical mass needed to compete internationally. The only viable path is for the industry to jointly develop a shared intelligent software platform tailored to Taiwan's needs, and promote it outward once it has taken shape.
• Some argue that existing IoT platforms on the market are good enough, but achieving compatibility with them takes considerable work, and these platforms rise and fall rapidly, often requiring updates or replacements that are hard to keep up with. A jointly developed platform not only secures control, but also gives the industry a single integration interface, reducing the time and resources needed to develop IoT products.
• The IoT field is still evolving rapidly, with new technologies appearing daily; through this platform, companies can easily connect with the most advanced outside technologies. At the same time, since global IoT communication and security standards will take time to mature (and may well remain fragmented), there is no need to sit idle waiting. The platform can define sensible buffer interfaces now, letting Taiwan's industry proceed with product development at full speed while staying ready to match future standard protocols, minimizing the disruption and drag on Taiwan's IoT industry.
• The platform can provide development interfaces and other technical assistance to academia, channeling the new technologies industry needs from academic labs into the platform for shared use. This forms an efficient technology-transfer pipeline and a bridge for close collaboration with academia.

• Taiwan's IoT industry must approach this platform with a "build together, share together" consensus; otherwise it cannot create a system capable of standing up to the world-class giants. Companies should be actively encouraged to collaborate across sectors through the platform, maximizing coverage so that connected devices reinforce one another. This can greatly lower deployment barriers and development costs, shorten time to market, and, grain by grain, accumulate the largest possible market share. But if member companies each go their own way and fragment the protocols and features, the strategic value of the shared platform will be greatly weakened.
• The purpose of developing this platform is not to compete head-on with the heavyweights, but to win the integration and leadership role in the IoT business, make the best use of limited resources, and ease the industry's adoption of new technology. The platform must therefore work closely with Taiwan's related industries to develop features that match their needs. Its design should adopt a rule-based horizontal architecture and, under the principle of simplicity and ease of use, emphasize "glue" capabilities and high scalability, yielding an architecture that is easy to extend and painless to upgrade, so that everyone can get started easily. Where necessary, the platform can integrate in a limited way with other large IoT platforms, but it should never compete with them head-to-head on deep functionality.
• Avoid having IoT companies each build their own platform and fragment the market. For example, system-level IoT products (such as smart cities or smart homes) generally require the support of an IoT platform. If every vendor develops its own IoT system, the result is a chaotic market of incompatible systems, weakening the overall competitiveness of Taiwan's industry abroad.
• Avoid independently developing AI modules that could be shared. AI research is advancing so quickly that a new algorithm can become outdated within 6-12 months, and taking an AI algorithm from research paper to commercial deployment requires considerable manpower and time that typical SMEs cannot sustain alone. Jointly developing shared AI modules of broad utility through this platform, which each company can then adapt to its own needs, is a more economical, mutually beneficial approach.
• Member companies should consider integration with other IoT devices from the product-design stage onward and avoid building closed systems; otherwise they forgo the multiplicative added value that comes from IoT devices interacting with one another.
• Make good use of the internet and community tools (such as g0v, GitHub, etc.) for industry-wide exchange and discussion, so that consensus can be reached efficiently and joint R&D can move quickly. Both IoT and AI evolve extremely fast; a slow pace of collaboration will cost the industry its early-mover advantage.

###### Strategy 3: Without replacing individual brands, use the aforementioned IoT+AI platform as the core to unite the many products on it under a single IoT brand

• Users who already own other products on this platform will be confident that this webcam can easily interoperate with them (e.g., when the webcam sees a particular person, it can automatically turn on the lights, air conditioning, or TV in a specific area for them).
• Users who have had a good experience with other products on this platform will tend to buy a webcam (or other product) supported by the platform, to keep all their devices interoperating smoothly.

###### Other links
1. An abridged version of this article was published in the Economic Daily News (經濟日報) on 2017-04-07: 趨勢觀察／新IoT時代的整合戰略 (Trend Watch: Integration Strategy for the New IoT Era).
2. The author's curated collection of AI and IoT multimedia posts.
]]>
<![CDATA[Image Operations with cGAN]]>

]]>
http://www.k4ai.com/imageops/c4ba5a80-cc1e-49c9-a9e4-2669cb3bea7eTue, 27 Dec 2016 20:01:00 GMT

In this report we explore the possibility of using cGAN (Conditional Generative Adversarial Networks) for performing automatic graphic operations on the photographs or videos of human faces, similar to those typically done manually using a software tool such as Photoshop or After Effects, by learning from examples.

###### Motivation

A good part of my research in Machine Learning has to do with images, videos, and 3D objects (e.g., Monocular Depth Perception, Generate photo from sketch, Generate Photo-realistic Avatars, the How to build a Holodeck series), so I constantly find myself needing an artist for the tedious task of creating large numbers of suitable image/video training datasets for such research. Given that cGAN is a versatile tool for learning some sort of image-to-image mapping, the natural question is whether cGAN is useful for automating some of these tasks.

More importantly, we seek to find out whether we can use cGAN to perform an atypical type of supervised learning, one that involves neither category labels on the training samples nor categorization as the learning goal. Instead, in our experiments the image pairs used for training embody the non-textual intention of the teacher, and the goal is for the system to learn the image-mapping operations that achieve the intended result, so that those operations can be applied successfully to unseen data samples.

For this report we will focus on human facial images. The image operations investigated include erasing the background, image level adjustment, and patching small flaws, as well as translation, scaling, and alignment. Experiments on removing outlier video frames will be reported in a separate post.

This report is part of a series of studies on the possibility of using GAN/cGAN as the latent representation for representing human faces, which is why the datasets used here are mostly images of human faces.

It should be noted that we have used relatively small datasets in the experiments here, mainly because this is only an exploratory study. For a more complete study, some of the more promising ones should be followed up with larger-scale experiments.

###### Experimental Setup

The setup for the experiments is as follows:

1. Hardware: (unless noted otherwise) Amazon AWS/EC2 g2.2xlarge GPU instance (current generation), 8 vCPUs, 15GB memory, 60GB SSD.
2. Software:
    1. Amazon AWS EC2 OS image (AMI): ami-75e3aa62 ubuntu14.04-cuda7.5-tensorflow0.9-smartmind.
    2. Torch 7, Python 2.7, CUDA 8.
    3. cGAN implementation: pix2pix, a Torch implementation of cGAN based on the paper Image-to-Image Translation with Conditional Adversarial Networks by Isola et al.
3. Training parameters: unless noted otherwise, all training sessions use batch size 1, L1 regularization, beta1 = 0.5, learning rate 0.0002, and horizontal flipping for augmentation. All training images are scaled to 286 pixels wide, then cropped to 256 pixels, with a small random jitter applied in the process.
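The scale-then-crop jitter in the training parameters above can be sketched as follows. This is only an illustration of the offset arithmetic (the actual pix2pix code implements it internally), and the helper names are ours:

```python
import random

LOAD_SIZE = 286   # images are first scaled to this size
FINE_SIZE = 256   # then randomly cropped to this size

def random_jitter_params(load_size=LOAD_SIZE, fine_size=FINE_SIZE, flip=True):
    """Pick a random crop offset and a flip decision, mimicking the
    scale-to-286 / crop-to-256 / random-mirror augmentation described above."""
    max_offset = load_size - fine_size        # 30 pixels of jitter
    x = random.randint(0, max_offset)
    y = random.randint(0, max_offset)
    do_flip = flip and random.random() < 0.5
    return x, y, do_flip

def crop_box(x, y, fine_size=FINE_SIZE):
    # (left, upper, right, lower), e.g. usable with PIL's Image.crop
    return (x, y, x + fine_size, y + fine_size)
```

Each training image thus yields one of 31x31 possible crops, optionally mirrored, which stretches a small dataset considerably.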
###### Experiment #1: erasing image background

The purpose of this experiment is to see whether cGAN (using the pix2pix implementation) can be used to learn to erase the background of photos, in particular those of human faces. Figure 1 shows a typical training sample.

Datasets: photos from previous experiments are recycled for use here, augmented with additional facial images scraped from the Internet. These photos are then paired with target images in which the background has been manually erased. Since the manual preparation of the target images is labor-intensive, we started with a very small training dataset of only 118 sample image pairs.

Training: The training session took 10 hours of computing using the setup described above.

Evaluating test results: a test dataset of 75 samples is used. The output images generated by the trained cGAN are subjectively ranked into the following five categories (see Figures 2.a-2.e for samples):

1. 5-stars: almost perfect. Total: 18 samples.
2. 4-stars: pretty close, but with some minor problems. Total: 17 samples.
3. 3-stars: so-so result. Total: 24 samples.
4. 2-stars: not good, but still within bounds. Total: 13 samples.
5. 1-star: complete disaster. Total: 3 samples.
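To summarize the distribution above (counts taken directly from the list):

```python
# Star-rating counts for the 75-sample test set reported above.
ratings = {5: 18, 4: 17, 3: 24, 2: 13, 1: 3}

total = sum(ratings.values())      # 75 test samples in all
good = ratings[5] + ratings[4]     # outputs rated "pretty close" or better
share = 100.0 * good / total
print("%d/%d samples (%.1f%%) rated 4 stars or better" % (good, total, share))
```

Roughly 47% of the outputs are rated 4 stars or better, and only 4% are complete failures.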

The test results above may seem unimpressive, but the following facts should be kept in perspective:

1. The training dataset is actually quite tiny, considering that cGAN training typically requires a very large dataset in order to properly learn its probability distribution. It is in fact quite impressive that such a result can be achieved with so little training, which suggests the direction is quite promising.
2. The training dataset is also not very diverse (aside from being small). For example, it contains no photos that are black-and-white, far off-center, or close-up with no background, so it naturally does not test well against those types of photos (which were intentionally included in the test dataset). Adding more training photos of those types has been shown to almost always improve the result.

Overall we judge that cGAN could work well for background erasure, provided that a large and sufficiently diverse training dataset (relative to the expected test samples) is used.

###### Experiment #2: image alignment

In this experiment we try to get cGAN to learn how to align images from examples, which involves translation, scaling, and cropping.

We actually did not expect this to work, but ran the experiment anyway so that we could set it up as a goalpost for others to explore further.

Datasets: photos from previous experiments (already manually cropped) are re-used here, augmented with their original un-cropped versions. See Figure 3a for a sample training image pair.

Since the manual preparation of the target images is labor intensive, we started with a very small training dataset of only 18 sample image pairs to probe the possibilities.

Training: the training session took 4 hours, after which cGAN was able to map the image pairs in the training set to near perfection. The test results, however, are an entirely different matter.

###### Analysis

All test results look like Figure 3b, where the output image (at right) looks like a jumble of many faces, roughly at the right position, but totally unrecognizable. In other words, this experiment has failed miserably as expected.

So why wouldn't cGAN work for this task? To human eyes the translation and scaling of an image are fairly simple operations, but that is not the case for cGAN. The successive convolutional layers in cGAN are very good at capturing local dependencies among nearby pixels, but global operations such as scaling or translation, which affect all pixels equally, are not what cGAN is designed for. The cGAN design is still in its infancy, and as it stands it does not handle translation, scaling, or rotation well.

So what would it take to make this work? One approach is to incorporate something like the Spatial Transformer Networks to see whether it makes any difference, which we shall explore in a future post.

###### Experiment #3: video processing

In this experiment we want to find out if cGAN can be used for some simple video processing, including background erasure, image tone level adjustments, and making small repairs.

In other words, we seek to find out whether cGAN can be used to learn the intended image operations from a small number of training samples, and then apply the operations to an entire video to achieve satisfactory result.

In this experiment we treat a video pretty much as a collection of images, without taking advantage of its sequential nature (which we shall explore in another report). While it is similar to Experiment #1 for background erasure, there are some differences:

1. We also want to try incorporating other image operations at the same time, such as image tone level adjustment, as well as the patching of minor flaws.
2. The video sequence consists essentially of images of the same person, which allows us to explore more efficient training methods. Here we apply the technique of drill training to get good results from very few training samples. Drill training refers to using a small number of images for intense training, with the goal of getting good results on this particular training set, possibly at the cost of increased test error on a wider test dataset.

###### Experimental Setup

The setup is the same as in Experiments #1 and #2, except for the following:

1. Hardware: a standard laptop (Intel i7 CPU with 8GB RAM) is used instead due to resource constraints. This hardware runs the experiment about 10-15 times slower than an AWS/EC2 g2.2xlarge GPU instance.
2. Model: the pix2pix cGAN model trained in Experiment #1 is used as the initial model.
3. Dataset: a video segment of a celebrity interview is used for the test. The video is sampled at 10 fps and cropped to 400x400 pixels at the image center. 1185 frames are selected for this test, most of which (1128 frames) have the same person as the main subject. No manual alignment or color adjustment is applied. Out of these 1185 frames, 22 are selected and manually modified for use as the training dataset, with the rest used as the test dataset. The manual modifications to the training dataset are as follows:
    1. The background is erased to pure white.
    2. Images are adjusted using the Photoshop Levels tool for better brightness and contrast.
    3. Minor intrusions of other people into the image are erased and patched up as appropriate (see Figure 4a).
4. Training: we apply the technique of drill training to reduce the training time and the number of samples required. Overall the training took several days using the non-GPU setup.
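The frame selection and cropping arithmetic described above can be sketched as plain Python helpers (the function names are ours, for illustration only):

```python
def sample_indices(n_frames, src_fps, target_fps):
    """Indices of the frames kept when downsampling a video's frame rate."""
    step = src_fps / float(target_fps)
    kept, t = [], 0.0
    while int(t) < n_frames:
        kept.append(int(t))
        t += step
    return kept

def center_crop_box(width, height, size):
    """(left, upper, right, lower) box for a centered size-by-size crop."""
    left = (width - size) // 2
    upper = (height - size) // 2
    return (left, upper, left + size, upper + size)

def drill_subset(frames, k):
    """Pick k roughly evenly spaced frames for manual editing (drill training)."""
    step = max(1, len(frames) // k)
    return frames[::step][:k]
```

For example, `drill_subset(list(range(1185)), 22)` spreads the 22 manually edited training frames evenly across the 1185-frame clip, rather than clustering them at one point in the video.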

Figure 4b shows a 10-second segment of the test result. Our observations of the result are:

1. The background erasure worked remarkably well, even with only 20 training samples. The outline of the clothing appears a bit wavy, due to the difficulty of guessing the outline of dark clothing over a dark background, and the current method provides no continuity between frames.
2. The levels adjustment applied in the training samples, which brightens the images, is successfully transferred to the test results, making the resulting video brighter.
3. cGAN can be seen patching up the intruding-fingers problem in some of the test samples (see Figure 4c), even though only two such patching examples (see Figure 4a) were provided in the training dataset. The result is by no means satisfactory, but it points to the possibility of much better results with more training.
###### Conclusion

This is a preliminary study using very small datasets to demonstrate the possibilities. Further comprehensive experimentation is definitely needed.

The experiments conducted in this report are not meant just as fun applications of the cGAN method. The experiments above show that as an atypical type of supervised learning, cGAN can be used to perform certain types of image operations for achieving practical purposes.

Overall in our limited experiments we have shown that operations such as background erasure and image levels adjustment worked well. For such image operations just training 2% of the frames in a video is sufficient to transfer the image operations to the entire video with good result. The operation of patching up minor flaws has worked to some limited degree.

The operations of scaling and alignment did not work at all, which was expected. This actually shows the limitations of the current cGAN architecture. We may conduct a more detailed study on this in a separate post later.

It is worth noting that the background erasure operation may seem to bear some surface resemblance to semantic segmentation (e.g., as described in this paper), in the sense that both can be used to separate certain recognizable targets out of an image. They are in fact very different, because cGAN is generative, and the method here does not require any training on category labels.

###### Going forward

Following are some planned follow-up studies:

1. As extensions to Experiment #3, explore how to take advantage of the sequential nature of a video, where adjacent frames are similar, in order to achieve better test quality or faster training.
2. Use cGAN to synthesize missing frames in a video, or for creating smooth slow-motion replay.
3. Detect outlier video frames that are substantially different from the training dataset. This can be used for semi-automatic cleanup of a video, removing unwanted frames, which is quite useful for my own research since processing video for machine learning is a very tedious process.
The idea is that GAN is generally believed to learn a meaningful latent representation, which implies that unwanted data samples might be easily detected as outliers far from the training dataset in that representation (see Figure 5). It should be interesting to find out whether this in fact works well for the automatic removal of unwanted video frames.
4. Use cGAN for image/video indexing and retrieval. The idea here is related to the last point regarding the latent representation learned by cGAN, since a good latent representation should make it easier to do indexing and retrieval.
###### Acknowledgments

I want to show my appreciation to the pix2pix team for their excellent paper and implementation, without which this work would have been much harder to complete.

Last but not least, I want to show my gratitude to Fonchin Chen for helping with the unending process of collecting and processing the images needed for the project.

###### References
1. Isola et al., Image-to-Image Translation with Conditional Adversarial Networks, 2016.
2. pix2pix, a Torch implementation of cGAN.
]]>
<![CDATA[Monocular Depth Perception with cGAN]]>

]]>
http://www.k4ai.com/depth/12fa3ea4-365c-4234-b604-ab852e881cdaTue, 06 Dec 2016 05:46:40 GMT

Is it possible to train a cGAN (Conditional Generative Adversarial Networks) model for monocular depth perception?

If the answer is yes, then it would mean that we have a way to allow an artificial system to acquire some basic concept about distance in the physical world, learning from only flat images, starting with nothing.

The type of training proposed in this report goes as follows:

1. First we train an instance of the cGAN on many pairs of static images of various objects or environment, where the first image in the pair is a full-color photo, and the second image is a depth map of the color photo (see Figure 1 for an example). There is no particular relationship between any two image pairs.
2. After the training result is satisfactory, this trained cGAN can then be used to convert an unseen photo into a reasonable depth map for that photo. In other words, this cGAN would have achieved monocular depth perception.
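As a sketch of how such training pairs are typically prepared for the pix2pix Torch implementation, which expects each sample as a single image with the input (A) and target (B) concatenated side by side: the helper name and paths below are illustrative, and Pillow is assumed to be available.

```python
from PIL import Image

def make_pair(photo_path, depth_path, out_path, size=256):
    """Concatenate an input photo (A) and its depth map (B) side by side,
    producing the paired AB image format that pix2pix trains on."""
    a = Image.open(photo_path).convert("RGB").resize((size, size))
    b = Image.open(depth_path).convert("RGB").resize((size, size))
    pair = Image.new("RGB", (2 * size, size))
    pair.paste(a, (0, 0))        # input photo on the left
    pair.paste(b, (size, 0))     # depth map on the right
    pair.save(out_path)
    return pair
```

Running this over every (photo, depth map) pair yields a directory of 512x256 AB images ready for training.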

Some may think that the premise above is questionable, so let's get these concerns out of the way first:

1. If the system already has the equipment for creating the depth maps needed for training, why would it need to learn the task at all?
    One possible reason is that once you have trained the system to detect depth on its own, you can deploy it very cheaply many times over, without the relatively expensive depth-detection hardware (assuming you don't need the higher precision, etc.).
    Another reason is that it is cool to show that cGAN can do this with no pre-programmed logic, practicality aside.
2. Is it inevitable that we will need some depth-sensing hardware, at least for the training phase?
    Not necessarily. It is conceivable that we could train such a system in a virtual world, such as the DeepMind Lab, where both the standard camera view and the depth information can be acquired without special hardware. If such a virtual world is sufficiently rich in detail, then perhaps the depth-sensing capability learned there could be applied to the physical world.
###### Goals of experiments

In this report we investigate the above premise with a series of experiments. Here we seek preliminary answers to the following questions:

1. Can depth perception be trained from monocular static images, using a method like cGAN which was not invented to deal with depth perception at all? Will cGAN turn out to just learn to paint perfect depth maps during training and then fail miserably during testing?
2. Which training regime is easier: training from clean and simple virtual scenes (see Figure 2a, referred to as Regime-V, V for virtual), or training from complex and messy real-world scenes (see Figure 2b, referred to as Regime-R, R for real-world)?
3. Which training regime generalizes better? In other words, which of the following gives better results?
• First train on virtual scenes from Regime-V, then test the trained model using real-world scenes from Regime-R.
• First train on real-world scenes from Regime-R, then test the trained model using virtual scenes from Regime-V.

It is worth mentioning that this is a preliminary study on whether this research direction warrants further investigation; as such, it does not contain large-scale experimentation on vast amounts of data. Judgement of the quality of the results is based on careful analysis but remains somewhat subjective, and no attempt is made to support it with precise experimental numbers as is typically done in formal research papers.

###### Context of this research

In my last experiment, Generate Photo-realistic Avatars with DCGAN, I showed that it is possible to use DCGAN (Deep Convolutional Generative Adversarial Networks) to synthesize photo-realistic animated facial expressions using a model trained from a limited number of images or videos of a specific person.

In another report we investigated the idea of building neural models of human faces using cGAN (as described in the paper Image-to-Image Translation with Conditional Adversarial Networks, referred to as the pix2pix paper below), applying it to synthesize photo-realistic images from black-and-white sketches (either Photoshopped or hand-drawn) of a specific person.

Overall, such studies serve the long-term goal of building complex and realistic 3D objects and environments from interactive verbal commands (ref: How to build a Holodeck).

###### Experimental Setup

The setup for the experiments is as follows:

1. Hardware: Amazon AWS/EC2 g2.2xlarge GPU instance (current generation), 8 vCPUs, 15GB memory, 60GB SSD.
2. Software:
    1. Amazon AWS EC2 OS image (AMI): ami-75e3aa62 ubuntu14.04-cuda7.5-tensorflow0.9-smartmind.
    2. Torch 7, Python 2.7, CUDA 8.
    3. cGAN implementation: pix2pix, a Torch implementation of cGAN based on the paper Image-to-Image Translation with Conditional Adversarial Networks by Isola et al.
3. Datasets: two datasets are used in this report. Note that they display depth in opposite gray scales: one maps near-to-far as black-to-white, the other as white-to-black.
    • The Foucard dataset, contributed by Louis Foucard, is generated by a Python Blender script that creates large numbers of randomized 3D scenes with corresponding sets of stereoscopic images and depth maps. See Figure 2a for a sample image pair. This dataset serves as our Regime-V dataset. It contains only a handful of geometric objects, with very simple lighting and colors. Since the scenes are virtual, the depth maps are generated perfectly, without the artifacts and inaccuracies of real-world depth maps acquired through depth-sensing devices. The original dataset comes with stereoscopic views for the color images; in this report we have randomly selected the left-eye view for the experiments.
    • The SUN RGB-D dataset (direct link to a zip file, as well as a 6.9GB processed version shared by Brannon Dorsey) from the SUNRGB-D 3D Object Detection Challenge of the Princeton Vision & Robotics Labs, which serves as the Regime-R dataset for our experiments. A portion of the depth maps in the Princeton dataset were deemed too low in quality (see Figure 3) and detrimental to cGAN training, so they were manually excluded.
4. Training parameters: unless noted otherwise, all training sessions use batch size 1, L1 regularization, beta1 = 0.5, learning rate 0.0002, and horizontal flipping for augmentation. All training images are scaled to 286 pixels wide, then cropped to 256 pixels, with a small random jitter applied in the process.
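Since the two datasets described above encode depth with opposite gray conventions, one of them must be inverted before the datasets can be mixed. A minimal sketch, operating on an 8-bit depth map represented as a list of rows:

```python
def invert_depth(depth, max_val=255):
    """Flip an 8-bit depth map's gray convention (near <-> far), so that a
    black-means-near dataset can be mixed with a white-means-near one."""
    return [[max_val - v for v in row] for row in depth]
```

Applying the inversion twice returns the original map, so it does not matter which of the two conventions is chosen as canonical, as long as the choice is consistent.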
###### Experiment #1: Training with virtual scenes

In this experiment we use 1500 image pairs from the Foucard dataset for training, and 150 image pairs for testing. Training time is 4 hours using the given setup.

Evaluation using testing samples shows that the system has learned to convert the input color images to match very closely with the corresponding depth maps from the test dataset. In particular (see Figure 4a), the system has learned to ignore the lighting pattern on the walls, as well as the colors and shading of the objects, which play no role in deciding the depth.

One area where the system shows weakness is in judging depth, which in some cases is less than perfect or simply incorrect. Figure 4b shows a case where the cone at left is not rendered correctly in the output image that cGAN generated (center) from the test input (left). Since such deficiencies usually improve with more training samples, we judge cGAN as overall capable of learning the depth map, and quite efficiently so.

###### Experiment #2: Training with real-world scenes

In this experiment we use 86 image pairs from the SUN RGB-D dataset for training, and 198 image pairs for testing. Training time is 12 hours using the given setup. The small number of samples used here was due to the difficulty of manually screening out low-quality depth maps in the dataset, as well as the limited resources available at the time.

Figure 3 shows a typical faulty depth map, of a kind prevalent in the SUN RGB-D dataset, which are excluded from training. Such samples are, however, kept for testing as a benchmark against the generated depth maps.

Evaluation using test samples shows mixed results. In some cases the model has learned to convert the input color images to match the corresponding depth maps from the dataset very closely. The example in Figure 5a demonstrates that the model has learned to ignore the lighting pattern on the walls and the colors and shading of the objects, and its depth perception is overall quite good.

Figure 5b shows a test result of intermediate quality. Note that the depth map from the SUN RGB-D dataset (at right) contains a large area of black artifact at the upper-left corner, while in comparison the depth map (center image) produced from the photo (left image) by the trained cGAN shows more reasonable result in the same area. However, the chairs are somewhat incompletely rendered.

It is worth noting that separate experiments conducted by Brannon Dorsey at the Branger_Briz digital R&D lab, with a 3500-sample training dataset and the same pix2pix implementation, do not suffer from the problems shown in Figure 5b, even without manually screening out low-quality samples. So it seems that these problems were a result of under-training, and that low-quality training samples can be overcome given a sufficiently large dataset.

Overall we believe that the model has demonstrated the ability to learn to produce generally correct depth maps, and the problems observed are likely due to under-training, or perhaps the quality of the training depth maps.

###### Experiment #3: Extending from virtual to real scenes

In this experiment we use the model trained in Experiment #1 based on Regime-V using virtual scenes, and apply it towards a Regime-R test dataset with real-world scenes.

The quality of the test results from this experiment is judged as extremely low: the trained Regime-V model shows little comprehension of depth in real-world scenes (see Figure 6). As Figure 6 shows, the output (center image) is merely a blurry gray-scale version of the input image, still preserving the irrelevant light and shadow, and its pixel-level gray values have no correlation with depth.

Obviously the virtual scenes in the Regime-V training dataset do not contain sufficient cues for the model to cover real-world scenes. For further research it would be interesting to find out what kind of virtual-scene dataset would suffice to train a model that performs satisfactorily on real-world test scenes. For example, if we train a cGAN agent inside DeepMind Lab's 3D learning environment, would such an agent transfer well to a physical robot navigating the physical world?

###### Experiment #4: Extending from real to virtual scenes

In this experiment we use the model trained in Experiment #2 based on Regime-R using real-world scenes, and apply it towards a Regime-V test dataset containing virtual scenes.

The quality of the test results from this experiment is judged as very poor (see Figure 7). All 158 test samples look like this, with shades of phantom furniture from the Regime-R dataset almost visible.

Obviously the samples from the Regime-R and Regime-V training datasets are sufficiently different that the result is not transferable between them.

###### Summary

We find that training cGAN for monocular depth perception from static image pairs is likely feasible, and the experiments should be expanded with much larger and more varied training datasets. Training on the Regime-R real-world photos takes many times longer than training on the Regime-V dataset, likely due to the complexity of real-world scenes, as well as the poorer quality of depth maps acquired through depth-sensing devices.

The experiments above were conducted entirely with existing datasets contributed by others. Aside from the quality problems of real-world depth maps, the inconsistent depth-map color schemes used by different datasets make it difficult to use them together without further processing. With the advent of low-cost depth-sensing devices such as the Google Tango, the higher-resolution Kinect, or suitable smartphone-based depth-sensing apps, it would be interesting to expand the experiment using self-generated datasets targeting specific areas (e.g., human faces or poses, etc.).

So how could this be put to practical use? While in theory such depth-perception capability could be applied to something like robotic navigation, in its current form it is perhaps too primitive to be competitive with other, more mature ANN-based approaches. However, if the robustness of this approach can be demonstrated in further studies, it is conceivable that it could be used to add low-precision 3D perspective to the vast number of photos and videos already out there.

###### Going forward

There are several possible research directions going forward:

1. Test with much larger datasets to confirm the result.
2. Test with outdoor scenes, animals, people, and faces.
3. Test with stereoscopic datasets.
4. Test with videos, perhaps involving extending cGAN into the time domain, or borrowing some ideas from the VideoGAN.
5. Test in a rich interactive virtual 3D world, such as the DeepMind Lab. Also learn how to correlate depth perception with agent actions and consequences in such a virtual world.
###### Upcoming

Can cGAN be trained to perform automatic image operations, such as erasing the background of photos, aligning and resizing faces, etc.? We shall explore this topic in a separate post.

###### Acknowledgments

The idea of applying cGAN to depth perception came originally from Brannon Dorsey at the Branger_Briz digital R&D lab, who also graciously shared his dataset and model for use in the experiments here.

I want to show my appreciation to the pix2pix team for their excellent paper and implementation, without which this work would have been much harder to complete.

I also want to thank Louis Foucard and the Princeton Vision & Robotics Labs for making their datasets available.

Last but not least, I want to show my gratitude to Fonchin Chen for helping with the unending process of collecting and processing the images needed for the project.

###### References
1. Isola et al., Image-to-Image Translation with Conditional Generative Adversarial Networks, 2016.
2. pix2pix, a Torch implementation for cGAN
3. The Louis Foucard dataset
4. The SUN RGB-D dataset (SUNRGBD.zip) from the SUNRGB-D 3D Object Detection Challenge of the Princeton Vision & Robotics Labs
5. Vondrick et al., Generating Videos with Scene Dynamics, 2016.
6. Kendall et al., PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization, 2016. Video, article, source code (Caffe, Tensorflow).
]]>
<![CDATA[Generate Photo-realistic image from sketch using cGAN]]>


]]>
http://www.k4ai.com/cgan/254a6854-494b-410f-9317-5363d1c488bfTue, 29 Nov 2016 02:10:53 GMT

In this report we study the possibility of building the neural model of human faces using cGAN.

In my last experiment, Generate Photo-realistic Avatars with DCGAN, I showed that it is possible to use DCGAN (Deep Convolutional Generative Adversarial Networks), the non-conditional variant of GAN, to synthesize photo-realistic animated facial expressions using a model trained from a limited number of images or videos of a specific person.

This report follows up on the same general idea, but this time we use the cGAN described in the paper Image-to-Image Translation with Conditional Generative Adversarial Networks (referred to as the pix2pix paper below), applying it to synthesize photo-realistic images from black-and-white sketch images (either Photoshopped or hand-drawn) of a specific person.

Overall this report is an empirical study on cGAN, with an eye towards finding practical applications for the technology (see the Motivation section below).

###### Motivation

Our long-term goal is to build a crowd-contributed repository of 3D models that represent the objects in our physical world. As opposed to scanning and representing a physical object using the traditional mathematical 3D model representation, we want to explore the idea of using a representation based on Artificial Neural Networks (ANN), for their ability to learn, infer, associate, and encode rich probability distributions of visual details. We call such an ANN-based representation the neural model of a physical object.

The intuition behind studying cGAN here is that if cGAN is capable of generating realistic visual details when given only scanty information, then perhaps it in fact constitutes an adequate representation for many visual aspects of a complex physical object.

The reasons for choosing human faces for this study are:

1. Such images or videos are abundant and easy to acquire.
2. Human facial expressions are fairly complex and a good subject for study.
3. We are instinctively sensitive to images of human faces, thus the bar for the experiments is naturally higher than with other types of images. This allows us to spot problems in the experimental results more quickly.
4. Human faces involve precise geometric relationships among facial features (eyes, nose, etc.). As such, they are a good candidate for studying what it takes for a generative system like cGAN to discover feature structure at the instance level, and not just probability distributions at the population level.
5. There are arguably more practical applications for human faces.

As a first step towards the long-term goal stated above, we choose to use cGAN to build a neural model of the face of a specific person. This differs from typical GAN applications, which tend to be applied to a wide variety of images. If successful, we will proceed to use cGAN or its extensions on other types of physical objects.

###### Goal of Experiments

Using human faces as the subject matter for a series of experiments, we seek to answer the following questions:

1. How far can we push cGAN to fill in satisfactory details when only scanty information is provided in the input image, using a relatively small training dataset?
2. Is cGAN overall suitable as the basis for building the neural model of a specific person's face, representing the multitude of visual details of that face? For example, can cGAN be trained to accommodate artifacts in the test input image, recover from aberrant input, fill in missing parts, etc.?
3. Is the cGAN neural model of one person transferable to another person?
4. How useful would a universal cGAN neural model for all human faces be?
###### The Setup

The setup for our experiments is as follows:

1. Hardware: Amazon AWS/EC2 g2.2xlarge GPU instance (current generation), 8 vCPUs, 15GB memory, 60GB SSD.
2. Software:
1. Amazon AWS EC2 OS image (AMI): ami-75e3aa62 ubuntu14.04-cuda7.5-tensorflow0.9-smartmind.
2. Torch 7, Python 2.7, Cuda 8
3. cGAN implementation: Torch implementation for cGAN based on the paper Image-to-Image Translation with Conditional Generative Adversarial Networks by Isola, et al.
3. Datasets: the following apply unless noted otherwise.
1. The input images (for either training or testing) are grayscale images manually created from the ground truth color images by an artist using various Photoshop filters. A few images are hand-drawn: either copied from a ground truth photo (for training) or drawn freehand without one (for testing).
2. Images are cropped to 400x400 pixel size.
3. Images are manually aligned to have the center point between the two eyes at a fixed point in the 400x400 frame.
4. Training parameters. Unless noted otherwise, all training sessions use the following parameters: batch size: 1, L1 regularization, beta1: 0.5, learning rate: 0.0002, images are flipped horizontally to augment the training. All training images are scaled to 286 pixel width, then cropped to 256 pixel width.
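As an aside, the alignment step in item 3 can be sketched in a few lines of Python. The anchor position (200, 160) below is a hypothetical choice for illustration; our images were actually aligned manually in Photoshop.

```python
# Sketch of eye-center alignment: compute the crop box that places the
# midpoint between the two eyes at a fixed anchor inside a 400x400 frame.
# The ANCHOR value is a made-up illustrative choice, not the actual one used.

FRAME = 400          # output frame is 400x400 pixels
ANCHOR = (200, 160)  # where the between-the-eyes midpoint should land (x, y)

def crop_box(eye_mid_x, eye_mid_y):
    """Return (left, top, right, bottom) of the 400x400 crop in source coordinates."""
    left = eye_mid_x - ANCHOR[0]
    top = eye_mid_y - ANCHOR[1]
    return (left, top, left + FRAME, top + FRAME)

# Example: eye midpoint at (520, 310) in the source photo.
box = crop_box(520, 310)
print(box)  # (320, 150, 720, 550)
```

Applying the same anchor to every photo normalizes the facial-feature locations across the dataset, which (as discussed in the conclusions) matters a great deal for training.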
###### Baseline Experiment: building the AJ Model

Here we attempt to build a neural model of American actress, filmmaker, and humanitarian Angelina Jolie (referred to as the AJ Model below) using a relatively small training dataset.

The ground truth images (or target images) are color photos of Angelina Jolie, manually scraped from the Internet. The input images (for either training or testing) are manually processed by an artist in Photoshop, converting them to black-and-white with filter effects. All input images have one particular style of effect applied, which we shall call the Style A effect. The center image in Figure 1a shows a typical output from the testing phase, sampled from the trained model using the input image at left.

Training

1. Training dataset: 21 image pairs are used for training. The input images are all created from the ground truth images by an artist using a particular Photoshop effect (see Figure 1a, referred to as the Style A effect below).
2. Test dataset: 10 images from the training set, all have the Style A effect applied (see Figure 1a).
3. Training parameters: default. Training time: 4 hours, 2000 epochs.

Analysis

Despite the small size of the training dataset, cGAN overall does a good job converting the black-and-white input images to color, with a great deal of convincing shading and color that closely matches the ground truth photos. Some observations:

• During testing cGAN is able to infer reasonable shading in the output image (e.g., the center image in Figure 1a), such as the pinkish hue around the cheeks, even though the facial area in the input image (at left) is mostly flat gray with no gradient.
• The color and texture of the lips are convincing, even though they are given little detail in the input image.
• Most of the ground truth photos show the subject wearing eye shadow. In Figure 1a, even though the input image shows little sign of eye shadow (especially under the eyes), cGAN was nonetheless able to apply convincing eye shadow.
###### Experiment #1: the case of too many effects

While the baseline experiment above shows good results, the input images are entirely uniform, with only one type of effect applied to reduce them to black-and-white. Here we want to find out what happens when the input images include several varieties of effects.

Test 1.A

The AJ Model from the baseline experiment is used, but the test samples contain some input images with Style B effect applied (see left image in Figure 1b).

Analysis

1. Figure 1b demonstrates a case where the test input image has the Style B effect applied, which is not present in any of the training images. The resulting output images (see the center image in Figure 1b for an example) show a kind of woodblock-printing effect, with only a few colors and almost no gradient. This problem is further studied in Test 1.B below.

Test 1.B

1. Training dataset: same as in Experiment #1, but augmented with more training samples that have the Style B effect applied to the input images. Total 68 training pairs.
2. Training parameters: trained from the model derived in Experiment #1, 6.5 hours training time, 1000 epochs, other parameters same as Experiment #1.

Figure 1c shows the result from this test, where the same test image now appears photo-realistic without the woodblock printing effect.

Further tests with additional effects (see the left images in Figure 1d) show a general pattern: test samples with a new effect (i.e., one the model was never trained on) tend to show poor results, and including such samples in training resolves the problem.

While this result is not entirely surprising, we do wish to find ways to make the model more tolerant of a wider variety of effects, so that we don't have to retrain cGAN for every new effect.

###### Experiment #2: the case of mutilated faces

In this experiment we want to find out whether it is possible to recover missing facial features in the input images. This is of interest because we want a neural model to be able to infer missing information from partial or altered observations.

Test 2.A

We created a set of new test input images with the Style A or Style B effect applied, then manually modified them to erase certain facial features. These test images are then used to sample the AJ Model from Experiment #1 (which has been trained with the Style A & B effects). The result, shown in Figure 2a, demonstrates that the model is unable to recover the facial features omitted from the input images.

Test 2.B

The two samples shown in Figure 2a, which were used only as test samples earlier, are now included here for training.

Figure 2b shows the result after 4000 epochs of training. Note that in the top row of Figure 2b, the output image (at center) has been repaired by cGAN with a somewhat acceptable nose, though smaller than in the ground truth photo. The output image (at center) in the bottom row has been repaired with an eye that seems to be a copy from the ground truth photo, but it is larger and not quite in the right place.

A curious effect is observed (using pix2pix's Display UI tool) during the training phase of this experiment, where successive snapshots show the missing part moving around the face and being resized, with no clear sign of convergence. Figures 2c and 2d give a glimpse of the phenomenon.

The problem was eventually resolved by turning off the random jitter operation, which this cGAN implementation applies by default. The random jitter operation adds a small amount of randomness when cropping and resizing the images, and seems to work well for other types of subject matter. Our conjecture is that it fails in this particular experiment in part because we are extremely sensitive to the precise relative positioning of facial parts: while we tolerate the jitter in other subject matter (e.g., street scenes, building facades), it becomes much more noticeable with faces.

With the random jitter removed, it can be observed during training that missing parts are repaired to near perfection. This of course does not mean much unless the model can also do so with new test images. This is further investigated below.
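For reference, the random jitter operation discussed above can be sketched as follows. This is a simplified Python rendition of the pix2pix-style augmentation (resize to 286 pixels, then take a random 256x256 crop); the resize itself is elided and only the crop offsets are computed.

```python
import random

# pix2pix-style jitter: after upscaling to 286 pixels, a random 256x256 crop
# is taken, so each epoch sees the face shifted by up to 30 pixels per axis.

LOAD_SIZE, FINE_SIZE = 286, 256

def jitter_offsets(rng):
    """Random top-left corner of the 256-crop inside the 286 image."""
    max_off = LOAD_SIZE - FINE_SIZE  # 30 pixels of slack per axis
    return rng.randint(0, max_off), rng.randint(0, max_off)

def no_jitter_offsets():
    """Jitter turned off: always center-crop, so facial features stay put."""
    off = (LOAD_SIZE - FINE_SIZE) // 2
    return off, off

rng = random.Random(0)
print(jitter_offsets(rng))   # a random (x, y) with 0 <= x, y <= 30
print(no_jitter_offsets())   # (15, 15)
```

Switching from `jitter_offsets` to `no_jitter_offsets` is, in effect, what turning off the jitter did in our experiment: the facial parts stop wandering by up to 30 pixels between epochs.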

Test 2.C

The model trained in Test 2.B (referred to as model-2B below) is observed to repair the missing nose and eye satisfactorily during training. The next question is whether such repair is transferable, in the following sense:

1. Given a test input image of the same person with the same defect, can model-2B achieve a satisfactory repair?
The answer is: sort of. The top row of Figure 2f shows that cGAN has learned to repair a missing left eye during training. When a new input image (bottom left) with the same defect is given, the trained model repairs it with a left eye that seemingly belongs to another person.
This brings up an interesting question: can cGAN, as it stands today, learn structured relationships among features in an image, such as the 3D mirror symmetry of the two eyes?
2. Given an image of the same person with a slightly different defect, can model-2B achieve a satisfactory repair?
Figure 2e shows an input image (at left) with a missing right eye (it was a missing left eye in Figure 2b), which is used to sample against model-2B. The resulting output image (at center) shows no repair made to the right eye at all. However, the model chose to repair the good left eye, replacing it with a larger version.
At this point it is a mystery how this happened, and whether a solution can be found.
One conjecture is that this cGAN implementation's flip parameter is in play here, but this remains to be verified.
3. Given an image of another person with the same defect, can model-2B achieve the same repair?
Figure 2g shows the result of this test: first a model is trained to repair a missing left eye, then we use the input image of a different person (i.e., the left image in Figure 2g) to sample against the trained model. The result (center image in Figure 2g) shows that the model was able to produce a faint left eye in the correct position, though it does not match the right eye in shape or color.
This result is expected: since model-2B is trained on AJ's images, it represents the probability distribution of her facial features alone. When the test image of a different person is used to sample against model-2B, the rendered repair will naturally bear AJ's features.
To pass this test, the system must be capable of learning the constraints on the relationships among features, e.g., the fact that the two eyes must match in certain ways. This is a topic beyond the scope of this report.
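The mirror-symmetry question raised in point 1 can be made concrete with a toy consistency check: the left-eye patch, flipped horizontally, should roughly match the right-eye patch. The patches below are made-up grayscale values for illustration; a real repair system would presumably enforce something like this as a loss term during training, which is not what our cGAN does.

```python
# Toy check of the two-eyes mirror-symmetry constraint on grayscale patches
# represented as lists of rows of pixel values.

def hflip(patch):
    """Flip a patch (list of rows) left-to-right."""
    return [row[::-1] for row in patch]

def symmetry_error(left_eye, right_eye):
    """Mean absolute pixel difference between the mirrored left eye and the right eye."""
    flipped = hflip(left_eye)
    total = n = 0
    for frow, rrow in zip(flipped, right_eye):
        for f, r in zip(frow, rrow):
            total += abs(f - r)
            n += 1
    return total / n

left = [[10, 200], [20, 180]]
mirrored_twin = [[200, 10], [180, 20]]  # an exact mirror of `left`
print(symmetry_error(left, mirrored_twin))  # 0.0
```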
###### Experiment #3: from art to photo-realism

All of the black-and-white input images used in the experiments above were processed by an artist using Photoshop. This means that such an input image is a precise reduction of a ground truth photo; it thus retains a great deal of precision regarding the position and arrangement of visual features relative to its ground truth counterpart.

In this experiment we seek to find out whether an input image that is entirely hand-drawn, with all the imprecision of a human hand, can be converted to a photo-realistic image like the Photoshop-processed samples we have seen before. This is somewhat similar to the hand-drawn handbag example in the pix2pix paper, but here we get to test it with human faces.

For this experiment we asked an artist to find a photo of Angelina Jolie and draw several black-and-white sketches by hand based on it; such image pairs (the original photo and a sketch) are then used for additional experiments. Each sketch was made with graphite pencil on paper, then scanned and converted to a 400x400 JPEG file, with manual retouching in Photoshop as needed.

Test 3.A

Here we use the photo-sketch pairs as new test samples against the model from Experiment #2, which was trained on the Style A and Style B effects but never on imprecise hand-sketched samples (let's call this hand-drawn effect Style C). Figure 3a shows the initial result, which is poor but not unexpected, since the model has never been trained on this style.

Test 3.B

Here we include some hand-drawn samples in the training phase to derive a new model (referred to as model-3B below). When the input image in Figure 3b is sampled against model-3B, the result (center image in Figure 3b) shows much improvement over Figure 3a, though it is still somewhat blurry, possibly due to insufficient training. The output image is judged to be too similar to the ground truth photo used to train model-3B, so this experiment should be repeated with more samples.

###### Experiment #4: the case of mistaken identity

Given that we have built a neural model of Angelina Jolie (referred to as the AJ Model), how useful is it when applied to other people? Since a neural model trained exclusively on one person represents the probability distribution of that person's facial features, applying the AJ Model to another person's photos is expected to give somewhat reasonable results, within limits.

Figure 4a shows the result of sampling American actor and producer Brad Pitt against the AJ Model. As expected, the result (center image) shows somewhat reasonable colors and shading, but it also picks up softer feminine lines, lighter stubble, and Jolie's brown hair color.

Similarly in Figure 4b, sampling against the AJ Model with an input image of the American singer, songwriter, and actress Beyoncé results in an output image (at center) that picks up Angelina Jolie's lighter skin tone.

From the perspective of building neural models for human faces, it seems appropriate to have a separate model for each individual of interest. It would be interesting to see a hierarchy of such models, where the top model represents all human faces, the leaf models represent specific individuals, and the models in between represent groups of people (such as by race, by distinct features, etc.). With a well-designed mechanism we might derive considerable training and storage efficiency from such a hierarchical structure of many models.

###### Experiment #5: the case of decomposing faces

In this experiment we want to study how to decompose a face into parts, so that each part can be manipulated individually.

Why is this important? Because if a neural model is composed of parts that can be learned without supervision, and such parts can be treated as shared features across sample instances, then a kind of one-shot learning becomes possible.

For example, assume that cGAN generates facial parts (e.g., eyes, noses, etc.) during its training process (just as a typical deep CNN could), and that the noses in two photos activate the same neuron in cGAN; we can then say that this neuron represents an anonymous concept of a nose.

If we now attach the text label 'nose' to the image of a nose in photo A, the system would know right away that the 'nose' label is likely also applicable to all those other noses in other photos. So here we have achieved a sort of one-shot supervised learning through the common nose neuron mentioned above.
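This one-shot labeling idea can be sketched as follows. The activation values and the notion of a single shared "nose neuron" are hypothetical simplifications for illustration; a real system would work with learned feature maps.

```python
# Toy one-shot label propagation: attaching 'nose' to one photo lets us
# propagate the label to every other photo whose activation on the shared
# hidden unit crosses a threshold. Activation numbers below are made up.

activations = {           # per-photo activation of the hypothetical nose neuron
    "photo_A": 0.91,      # the one photo we hand-label as 'nose'
    "photo_B": 0.87,
    "photo_C": 0.12,      # the unit barely fires here: probably no visible nose
    "photo_D": 0.79,
}

def propagate_label(labeled_photo, label, activations, threshold=0.5):
    """Give `label` to every photo whose shared-unit activation is above threshold."""
    return {photo: label
            for photo, act in activations.items()
            if act >= threshold or photo == labeled_photo}

print(propagate_label("photo_A", "nose", activations))
# {'photo_A': 'nose', 'photo_B': 'nose', 'photo_D': 'nose'}
```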

If we use cGAN as the basis for implementing the neural model in question, then it would mean the following:

1. High-level features created through the training process should be used as the basis for shared features across face instances.
2. We cannot have entirely separate neural models (i.e., completely separate cGAN instances) for two individuals, since there would then be no way to create a generalized concept of a feature (e.g., a nose) across individuals.

This is a topic which will be explored further in a separate post.

###### Experiment #6: from photo to imitated artwork

In this experiment we seek to apply cGAN in the other direction, by mapping from color photos to sketches of a certain style.

This turns out to be very easy, at least for those manually applied Photoshop effects that we have used in the previous experiments.

We use a training dataset of 48 pairs of images, where all the black-and-white images are manually created by applying the same Photoshop effect to the color photos. cGAN is then trained to map from color photos to black-and-white. After training for one hour we use the trained model on a separate set of color photos for testing. Figure 6a shows a typical test input image (left, in color), which the trained cGAN model converts to the black-and-white output image (center). The result is deemed very good when the output image is compared with another image (right) converted manually by an artist using the same Photoshop effect used to create the training dataset.

So it is then possible to use cGAN to bootstrap our own experiments as follows, in order to reduce the amount of manual work:

1. Manually prepare a set of black-and-white photos S1 with the target effect (as seen in experiments 1-5) using a tool such as Photoshop.
2. Use S1 as a training set, but map it in the other direction, so that a cGAN model learns how to reproduce the target effect.
3. This effect-cGAN can then be used as a tool to generate more data without manual work by an artist.
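The bootstrap loop above can be sketched as follows. `apply_effect_model` is a hypothetical stand-in for inference with the trained effect cGAN; here it is a trivial placeholder so the control flow is runnable, whereas in practice it would run the trained Torch model on each photo.

```python
# Sketch of step 3: use the trained effect cGAN to manufacture (input, target)
# training pairs from unlabeled color photos, with no manual artwork.

def apply_effect_model(color_photo):
    # Placeholder for cGAN inference: color photo -> black-and-white effect image.
    return "bw(" + color_photo + ")"

def bootstrap_pairs(color_photos):
    """Generate (input, target) training pairs for the photo-realistic direction."""
    return [(apply_effect_model(p), p) for p in color_photos]

pairs = bootstrap_pairs(["aj_001.jpg", "aj_002.jpg"])
print(pairs)  # [('bw(aj_001.jpg)', 'aj_001.jpg'), ('bw(aj_002.jpg)', 'aj_002.jpg')]
```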

This technique should be applicable to many types of datasets that involve some sort of straightforward information reduction.

It would be interesting to see how far we can push it for generating more artistic and less faithful effects, such as caricatures, etc.

###### Conclusions

In this report we have conducted a series of empirical studies on the possibility of using cGAN as the basis for building a neural representation of human faces, with an eye towards applying the same technique to other types of physical objects in the future.

This particular flavor of the Conditional GAN allows us to map from an input image to another image, which gives us a handle to use cGAN in many ways.

Following is a summary of observations made from this study.

1. cGAN has great potential serving as the basis for modeling complex physical objects such as human faces. It can be used to model the visual features of either an individual or an entire population.
2. Alignment of faces in the images turned out to be quite important. For best results all faces should be aligned and resized uniformly, with the two eyes on a horizontal line at roughly the same position. Deviation leads to many problems. This is not surprising, since such alignment in effect normalizes the locations of the facial features and makes training simpler. Following are problems observed with unaligned samples:
1. In the output images the colors could appear faded, diffuse, or not very photo-realistic.
2. For the missing feature problem studied in Experiment #2, unaligned faces (such as those highlighted in Figure 2c/2d/2e) are much harder to train.
3. Training cGAN to repair mutilated facial images (e.g., missing an eye or part of the nose), especially across identities, proved to be challenging. This is not unexpected, since this highlights the following issues:
1. Need to find a way to learn structured features at the image-instance level. For instance, repairing an eye in a particular image likely cannot be achieved by applying the average eye from all training samples.
2. Need to find a way to manage the relationships at the feature level across different models. For example, an eye from Jolie's face model shares some features with an eye from another person's model, but at the same time they are not interchangeable.

The experiments described above were conducted with a very limited number of data samples, as well as limited model training time. The observations and suggestions made above are quite preliminary, and further study is warranted.

###### Going Forward

There are several possible applications of the cGAN technology (or its extension) that we want to explore in separate posts:

1. Use cGAN for achieving monocular depth perception.
Here we seek to find out whether cGAN can be used to convert a normal photo into a depth map, where the grayness of each pixel represents the distance to the target (see Figure 7). We want to know whether cGAN can be trained to achieve this, or whether it will merely learn to paint the likeness of a depth map that cannot generalize beyond the training samples.
2. Use cGAN for image segmentation.
Here we seek to find out whether it is possible to teach cGAN to segment and extract part of an image by learning from examples.
3. Use cGAN for one-shot learning.
Here we seek to get cGAN to learn a concept from just one case of supervised learning. For example, after cGAN has processed a number of face images unsupervised, adding a label 'nose' to part of one sample will allow it to correctly point out the noses in all samples.
###### Acknowledgments

I want to show my appreciation to the pix2pix team for their excellent paper and implementation, without which this work would have been much harder to complete.

I also want to show my gratitude to Fonchin Chen and Michelle Chen for offering to do the hand-drawn sketches, as well as helping with the unending process of collecting and processing the images needed for the project.

]]>
<![CDATA[Generate Photo-realistic Avatars with DCGAN]]>

]]>
http://www.k4ai.com/avatars/9ca76f3c-85d3-4dff-9574-77231320a2c0Tue, 15 Nov 2016 19:09:21 GMT

In this report we explore the feasibility of using DCGAN (Deep Convolutional Generative Adversarial Networks) to generate the neural model of a specific person from a limited amount of images or videos, with the aim of creating a controllable avatar with photo-realistic animated expressions out of such a neural model.

Here DCGAN holds the promise that the neural model created from it can be used to interpolate arbitrary non-existent images in order to render a photo-realistic and convincing animated avatar that closely resembles the original person.

###### Context of this research

This is part of a long-term open-source research effort, called the HAI project. The grand vision of the HAI project is to build a crowd-driven and open-source knowledge base (in the spirit of the Wikipedia) for replicating our 3D world, enabled and enriched through the use of neural models.

This report is a first step in this direction, using human faces as the subject matter for a detailed study. We want to verify whether DCGAN can be used to build a satisfactory neural model of human faces through unsupervised learning, so that we can proceed to create an avatar out of such a neural model.

A broad survey of some DCGAN and related papers that precede this report can be found here, which helps to explain the thought process that leads to this report.

###### Why Neural Model

The neural model of a physical object differs from the traditional 3D graphic format in that it does not explicitly express the precise geometric structure of a physical 3D object, but rather it is a collection of many levels of visual features encoded in the layers of a certain artificial neural network.

The recent advancement in the DCGAN technology shows that it is capable of learning a hierarchical image representation from 2D image samples, unsupervised. This leads to the possibility of extending and then using it as a representation for static or dynamic physical objects. Such a neural network-based object representation holds long-term benefits in the following sense:

1. Given the tremendous recent progress in artificial neural networks (e.g., CNN, RNN, LSTM, dilated causal CNN, etc.), having the physical objects also represented in the same form will greatly simplify multi-modal learning (e.g., with text, sounds, etc.) involving physical objects.
2. The vector representation generated by DCGAN can be used to support various useful operations, such as the vector arithmetic that maps to meaningful operations on the images, as described here.
3. Supervised learning can be performed based on such a vector representation in order to acquire the mapping between visual objects and other modalities.
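The vector arithmetic in point 2 can be illustrated with a toy sketch. The 4-dimensional vectors below are made-up stand-ins for real Z codes (typically around 100-dimensional), and the semantic labels are hypothetical; the classic example from the DCGAN literature is "smiling woman" minus "neutral woman" plus "neutral man" yielding a smiling man.

```python
# Toy latent-vector arithmetic: isolate a "smile" direction in Z space and
# apply it to a different identity. Vectors here are illustrative only.

def vec_add(a, b):
    return [x + y for x, y in zip(a, b)]

def vec_sub(a, b):
    return [x - y for x, y in zip(a, b)]

smiling_woman = [1.0, 0.25, 0.5, 0.0]
neutral_woman = [0.0, 0.25, 0.5, 0.0]
neutral_man   = [0.0, 0.75, 0.25, 0.5]

# Isolate the smile direction, then apply it to the man's code:
smiling_man = vec_add(vec_sub(smiling_woman, neutral_woman), neutral_man)
print(smiling_man)  # [1.0, 0.75, 0.25, 0.5]
```

Decoding `smiling_man` through a trained generator would then, ideally, render the corresponding image; the arithmetic itself is all that is sketched here.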

While the above areas will not be covered in this report, they do explain our motivation for studying the neural model approach.

###### Our Challenges

There have been many prior experiments where DCGAN is used to generate seemingly realistic random bedroom scenes, faces, flowers, manga, album covers, etc.

Here we seek to push it further to answer the following questions:

1. Can DCGAN be used as the basis for generating the neural model of a specific object?
Here as opposed to simply interpolating from many random training examples to generate broadly natural-looking images, we seek to use DCGAN to create a neural model for representing the dynamic views of a specific physical object, and also find practical applications for the method.
2. Can photo-realistic and animated facial expressions of a specific person be created out of a trained DCGAN model?
Here we choose human faces as our subject matter for the experiment, because such images or videos of human faces are abundant and easy to acquire. And since we are sensitive to even minor deformities in human faces, the bar here is naturally high.
3. How far can we push DCGAN to work reasonably well with training datasets that are very small and have little variety (since the images are all of the same person when building an avatar)?
4. Do we gain any advantage by training DCGAN on video samples?
###### The Setup

The setup for our experiments is as follows:

1. Hardware: Amazon AWS/EC2 g2.2xlarge GPU instance (current generation), 8 vCPUs, 15GB memory, 60GB SSD.
Note that I have also looked into using the Google Cloud Platform (GCP) for such experiments, but unfortunately GCP does not offer GPU instances at this time. Some comparisons of using AWS/EC2 and GCP for running DCGAN jobs can be found here.
2. Software:
1. Amazon AWS EC2 OS image (AMI): ami-75e3aa62 ubuntu14.04-cuda7.5-tensorflow0.9-smartmind. This AMI contains most of the configuration needed for this experiment, such as TensorFlow.
2. TensorFlow 0.9, Python 2.7, Cuda 7.5
3. DCGAN implementation: a Tensorflow implementation of DCGAN based on the paper Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks by Radford, et al.
###### Experiment #1: baseline reference

Here we run a well-tested DCGAN experiment as a baseline reference for additional experiments. Later we will also seek to use the result here for solving some problems encountered.

1. Dataset: 109,631 celebrity photos scraped from the Internet. Photos have been cropped and aligned (by the center point between the eyes) programmatically. This dataset can be found here.
2. Test parameters: mini-batch size = 64, Adam optimizer, beta1 (momentum of Adam optimizer) = 0.5, learning rate = 0.0002.
3. Result was similar to what other DCGAN experimenters have published earlier, where generally convincing faces are created by the generator. The following were produced at 20% of one epoch by randomly sampling the Z vector.
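The random Z sampling mentioned in the result above can be sketched as follows. Per the Radford et al. convention, each latent vector is drawn uniformly from [-1, 1]; the dimension of 100 used here is the paper's default and an assumption on our part:

```python
import random

def sample_z(batch_size, z_dim=100, seed=None):
    """Draw a mini-batch of latent Z vectors uniformly from [-1, 1]^z_dim,
    following the DCGAN convention of Radford et al."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(z_dim)]
            for _ in range(batch_size)]

# A batch of 64 latent vectors, matching the mini-batch size used above;
# each vector, when fed to the generator, yields one sampled face.
z_batch = sample_z(64)
```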
###### Experiment #2: tiny dataset of a specific person

Here we want to find out how small a dataset we can get away with. Note that the dataset used in the baseline above is rather large (> 100,000 photos) and contains photos of diverse identities. In this experiment we want to push to the other extreme by using a very small dataset of one specific person.

The problem with small datasets has been studied by Nicholas Guttenberg: the loss surface can become exceedingly jagged, making it hard for gradient descent to converge to fixed points. So first let's see what happens in this experiment, and then we will seek remedies.

1. Dataset: 64 photos of one specific person. Several variations were tested:
• Manually scraped photos of the Duchess of Cambridge, Kate Middleton. Photos were manually cropped, with no additional processing. See Figure 2:
• Same as above, but the background was manually removed.
• Stock photos of a multitude of expressions of one person, with consistent background, lighting, hair style, and clothing. See Figure 3.
2. Test parameters: Adam optimizer, beta1 (momentum of Adam optimizer) = 0.5, learning rate = 0.0002. Mini-batch sizes of 1, 9, and 64 were tested (in the last case each epoch contains only one batch).
3. Result: the well-reported problems with model collapse or instability (see the Guttenberg or OpenAI articles) were observed, and as such no usable result was achieved.
More specifically, the model usually falls into one of the following states:
• Symptom A: the entire model collapses to a very small number of samples, which render nearly perfectly, but all other points in the Z representation lead to highly mangled images. The discriminator loss stays consistently low, while the generator loss stays very high. Longer training does not help.
• Symptom B: randomly sampled points from the Z representation all generate a nearly identical image M, and M changes from one mini-batch to the next. The discriminator loss stays consistently low, while the generator loss stays high. Longer training does not help.
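Both symptoms share the same loss signature: consistently low discriminator loss with high generator loss. This suggests a simple automated monitor; the sketch below is illustrative and not part of the actual experiment code. It tracks a moving average of the gLoss/dLoss ratio and flags a likely collapse when the average stays high:

```python
from collections import deque

def collapse_monitor(window=50, ratio_threshold=10.0):
    """Return an update function that tracks a moving average of
    gLoss/dLoss and flags a likely collapse: a persistently high ratio
    matches the signature described above (low dLoss, high gLoss)."""
    history = deque(maxlen=window)
    def update(g_loss, d_loss):
        history.append(g_loss / max(d_loss, 1e-8))
        avg = sum(history) / len(history)
        return avg, (len(history) == window and avg > ratio_threshold)
    return update

update = collapse_monitor(window=3, ratio_threshold=10.0)
update(2.0, 1.0)                     # healthy: ratio about 2
update(3.0, 1.0)
avg, collapsed = update(40.0, 0.1)   # gLoss exploding relative to dLoss
```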

Some previously suggested remedies for such problems include dealing with the batchnorm, applying regularization, adding noise (see the Radford paper), or using the minibatch discrimination technique (see the Salimans et al paper).

We proceed to test out the minibatch discrimination technique as suggested in the Salimans et al paper. More specifically:

1. Mini-batches are created not as disjoint subsets of the full batch, but rather as staggered sets with some overlap, so a training sample can belong to two minibatches.
2. Human judgment is applied when creating minibatches, so that samples in a minibatch tend to be similar to each other. While this kind of intervention might seem anathema to the ideal of unsupervised learning, it is in fact not an issue when this experiment is extended to video training samples, where adjacent frames are naturally similar to each other. It is in fact my opinion that video is a more natural source of training samples for our goal here.
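The staggered mini-batch construction in point 1 can be sketched as follows; a minimal illustration, assuming the samples are pre-ordered so that neighbors are similar (as with adjacent video frames):

```python
def staggered_minibatches(samples, batch_size, stride):
    """Create overlapping (staggered) mini-batches: with stride smaller than
    batch_size, consecutive batches share batch_size - stride samples, so a
    training sample can belong to more than one mini-batch."""
    return [samples[start:start + batch_size]
            for start in range(0, len(samples) - batch_size + 1, stride)]

frames = list(range(10))   # stand-in for 10 ordered, similar frames
batches = staggered_minibatches(frames, batch_size=4, stride=2)
# batches[0] is [0, 1, 2, 3] and batches[1] is [2, 3, 4, 5]:
# each consecutive pair of batches overlaps by two samples.
```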

We got somewhat better results by applying minibatch discrimination: the model seems less prone to collapse, but the generated images remain largely mangled (see Figure 4).

###### Experiment #3: tiny dataset of a specific person - aligned

While inspecting the mangled faces from Experiment #2 (see Figure 4), we suspected that alignment of the training samples is important, since no amount of training seemed able to get rid of the problem.

This suspicion is reinforced by the fact that the celebA dataset shows no such problem: its images have been programmatically processed so that the center point between the two eyes is aligned to the center of the image, and rotated so that both eyes lie on a horizontal line.

As such we repeated Experiment #2, but this time the training samples were further processed to have the same alignment as the celebA dataset.
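The celebA-style alignment described above amounts to a similarity transform computed from the two eye coordinates. The sketch below computes only the rotation angle and the eye midpoint; the actual rotation, translation, and crop would be applied with an image library:

```python
import math

def eye_alignment(left_eye, right_eye):
    """Given (x, y) pixel coordinates of the two eyes, return the tilt angle
    (radians) of the eye line and the eye midpoint. Rotating the image by
    minus this angle puts both eyes on a horizontal line, and translating
    the midpoint to a fixed position completes the celebA-style alignment."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    angle = math.atan2(dy, dx)
    midpoint = ((left_eye[0] + right_eye[0]) / 2.0,
                (left_eye[1] + right_eye[1]) / 2.0)
    return angle, midpoint

# Eyes tilting upward to the right give a slightly negative angle.
angle, mid = eye_alignment((70, 112), (110, 104))
```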

Following are details regarding this experiment:

1. Dataset: the same Kate dataset as in Experiment #2 is used, except that images are manually aligned.
2. Test parameters: same as Experiment #2.
3. Result: surprisingly, simple alignment resolved the mangled-image problem seen in Experiment #2, and we were able to produce reasonable results with a small training dataset. With care we were able to generate usable images.
###### Experiment #4: video dataset

Using a suitable video as a source of training samples is desirable for the following reasons:

1. It helps to alleviate training difficulties arising out of inconsistencies in lighting, hair style, makeup, clothing, image background, aging, etc. Such problems are prevalent in scraped datasets.
2. Videos are abundant and easy to acquire.
3. Videos provide critical timing information that helps to make animation more natural. Note that this aspect is left for future research.
4. Videos provide information about temporal patterns that is otherwise unavailable. Note that this aspect is left for future research.
5. A video implies object persistence (i.e., the object recognized in frame N is likely to be the same as the similar object recognized in frame N+1). This affords us a kind of anonymous unsupervised label that opens up many new research directions. Note that this aspect is left for future research.
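Point 5 can be made concrete with the pairing sketch below; this is an illustration of the idea, not something implemented in this report:

```python
def persistence_pairs(frames):
    """Pair each frame with its successor: under the object-persistence
    assumption, (frames[n], frames[n+1]) depict the same object, yielding
    an anonymous 'same object' label with no human annotation."""
    return [(frames[n], frames[n + 1]) for n in range(len(frames) - 1)]

pairs = persistence_pairs(["frame0", "frame1", "frame2", "frame3"])
# Each pair is an unsupervised training signal saying "these two images
# contain the same object."
```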

Following are details regarding this experiment:

1. Dataset: based on a video published to YouTube in 2015 of an interview with Adele. Segments of the video were sampled at a rate of one frame per second, manually cropped with Adele front and center, then manually aligned and reduced to 178x218 (WxH) pixels, the same as the celebA dataset. Care was taken to ensure that the resulting images preserve the original order. We intentionally chose a low frame rate so that we can clearly see whether DCGAN is effective at filling in the gaps, and we intentionally chose a celebrity so that it is easier to judge the likeness of what DCGAN generates. Adele was chosen because she tends to have a wide variety of expressions.
2. Test parameters: same as Experiment #1.
3. Result: following is a set of 64 images randomly sampled from a trained model.
This example shows that the DCGAN model has acquired a multitude of expressions from the training samples, and is able to generate reasonable interpolations from them.
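The one-frame-per-second sampling used for the dataset amounts to keeping every fps-th frame of the source video; a small sketch:

```python
def sample_frame_indices(total_frames, video_fps, sample_fps=1.0):
    """Return the indices of frames to keep when down-sampling a video from
    video_fps to sample_fps (here one frame per second, as in this
    experiment), preserving the original frame order."""
    step = video_fps / sample_fps
    indices = []
    position = 0.0
    while int(round(position)) < total_frames:
        indices.append(int(round(position)))
        position += step
    return indices

# A 10-second clip at 30 fps, sampled at 1 fps, keeps 10 evenly spaced frames.
indices = sample_frame_indices(total_frames=300, video_fps=30)
```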
###### Experiment #5: building a reusable model

Here we seek to answer two questions:

1. Can we build some sort of a Universal Face Model (UFM) that captures the essence of all human faces, so that it can be reused when training on new face datasets, with the hope of achieving reduced training time and better image quality? Otherwise, each time we want to create an avatar for a new person we would have to retrain DCGAN from scratch, which typically takes quite some time.
2. How much of a performance gain can we get from such a reusable universal face model?

Experimental setup

1. The model trained on the celebA dataset (see Experiment #1), which contains >100k distinct faces, is used as our target UFM. This UFM was trained on around 20,000 minibatches of 64 photos each, and took nearly a whole day to complete on our low-end GPU instance.
2. A set of 178 Kate Middleton photos is used as our New Person (NP) dataset. The photos in NP have been cropped and aligned in exactly the same manner as the photos in the UFM dataset.
3. All sampled images are taken at 64 fixed random points in the Z representation.
4. Test #1: this test is initialized with the UFM model, then trained on the NP dataset for 75 minibatches. Figure 6a shows the sampled images at the start of this test, which does not yet contain any influence from the NP dataset. Figure 6b shows the sampled images at the end of the test, which shows fairly reasonable likeness to the target subject. This training took 298 seconds on our low-end GPU instance.
5. Test #2: this test starts with an empty model, then trains on the NP dataset for 75 minibatches, the same as in Test #1. Figure 7a shows the sampled images at the start of this test, which contain just noise. Figure 7b shows the sampled images at the end of the test, which are still quite rough. This training took 3 minutes.
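The UFM scheme is essentially a warm start: initialize training from a pretrained model rather than from random weights. The toy gradient-descent run below illustrates why, within the same fixed budget of 75 steps, the warm start ends much closer to the target; it is a one-parameter stand-in, not the actual DCGAN training code:

```python
def train(w0, target, lr=0.1, steps=75):
    """Minimize (w - target)^2 by gradient descent from initial weight w0,
    returning the final loss after a fixed budget of steps (a stand-in for
    the 75 mini-batches used in the tests above)."""
    w = w0
    for _ in range(steps):
        w -= lr * 2.0 * (w - target)   # gradient of (w - target)^2
    return (w - target) ** 2

target = 1.0                                  # stand-in for the NP model
loss_scratch = train(w0=10.0, target=target)  # Test #2: empty-model start
loss_warm = train(w0=1.5, target=target)      # Test #1: UFM start, already close
# With the same budget, the warm start ends with a much smaller loss.
```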

We draw the following conclusions from this test:

1. The trained model from the celebA dataset (see Experiment #1) appears to be an adequate Universal Face Model. Intuitively this makes sense: since the model's generator has been trained on a large number of distinct faces, it must contain layers of features common to most faces. As such we should be able to take advantage of it by training the NP dataset on top of it.
2. The limited tests above show that training based on a UFM can be several times faster than training from scratch.
3. We have observed a sort of noise texture in the generated images. Referring to Figure 8, we analyze this as follows:

• Case #1: model is trained from scratch using the Adele video dataset (which contains only 81 images). The generated images when inspected up close appear to be noisy (see Figure 8.a for normal-size view, and Figure 8.b for a magnified view). It seems that no amount of further training is able to remedy this.
• Case #2: the noise texture problem does not happen in the baseline experiment (see Figure 8.c for a sample).
• Case #3: the noise problem also does not happen when the model is trained based on a UFM (see Figure. 8.d for an example). Here the UFM is used as the initial model, the same Adele dataset is then trained under exactly the same parameters as in Case #1.

The Radford paper demonstrated a similar phenomenon in its Figure 3, which shows repeated noise textures across multiple samples (such as the baseboards), attributed to under-fitting. Given that our training datasets tend to be relatively tiny, it is not surprising that we observe such an under-fitting problem. Case #2 escaped this problem due to the sheer size and variety of its training dataset. In Case #3 we show that by using the Case #2 model as a starting point (i.e., treating it as a UFM), the under-fitting problem is alleviated.

###### Create Animated Expressions

Once we have a good model trained out of the photos or videos of one specific person, it is then possible to create photo-realistic animated expressions out of it. A simplistic method for this is as follows:

1. Visually inspect a gallery of generated images and identify the source S (e.g., a neutral expression) and target T (e.g., smiling) expressions of interest.
2. Plot a straight line in the Z representation to traverse from S to T, find the set of points {P} that divide the line into equal parts, then write out the generated images for {P} along the path. In the examples here we chose to divide the line into 20 parts, and the resulting images are animated over a 2-second duration, in both the forward and backward directions.
3. Use a tool (e.g., ffmpeg, or the Python library MoviePy, etc.) to combine the images {P} into an animated GIF file or video.
4. Visually select the resulting animated GIF files for those that show the best effect.
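Step 2 above can be sketched as below. The 20-part division matches the setting used in our examples, and appending the reversed path produces the forward-and-backward animation; the GIF assembly itself (step 3) would be done with a tool such as ffmpeg or MoviePy:

```python
def interpolation_path(source, target, parts=20, ping_pong=True):
    """Linearly interpolate between two points in the Z representation,
    dividing the segment into `parts` equal steps; optionally append the
    reversed path so the animation plays forward and then backward."""
    path = []
    for i in range(parts + 1):
        t = i / float(parts)
        path.append([(1.0 - t) * s + t * e for s, e in zip(source, target)])
    if ping_pong:
        path += path[-2::-1]   # walk back, skipping the repeated endpoint
    return path

# Two hand-picked Z points: S (a neutral expression) and T (smiling).
# Real Z vectors would have the generator's input dimension; 2-D here.
S, T = [0.0, 0.0], [1.0, 2.0]
frames = interpolation_path(S, T)
# Feeding each point in `frames` to the generator yields the image sequence
# that is then combined into an animated GIF.
```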

The Figure 9 series are examples of animated expressions synthesized entirely from a trained DCGAN model. All animations were created using the same parameters and setup. The jump in the animation is caused by the looping of the GIF file.

The above is a kind of brute-force, simplistic animation, since it completely ignores the patterns in human facial expressions. An animated expression created from a straight line in the Z representation doesn't always look convincing, and thus needs to be carefully screened. There are many possibilities for creating better animations from a trained model, and these are left for future research.

###### Conclusions

To answer the questions asked at the top of this report (see the Our Challenges section):

1. Can DCGAN be used as the basis for generating the neural model of a specific object?
Our experiments show that this is feasible for human faces: a model of a specific person can be trained from a small dataset, and a reusable Universal Face Model helps to reduce training time as well as alleviate the noise texture problem that comes with small datasets. Given the promising results, we believe further research on objects beyond faces is warranted.
2. Can photo-realistic and animated facial expressions be created out of a trained DCGAN model?
We were able to create animated expressions from a trained model, which demonstrates that the basic premise is sound. The resulting quality is somewhat low, in part due to the limited computing power available at the time of the experiment. We have reason to believe that much higher graphic quality and a full range of realistic expressions are within reach with this approach.
3. How far can we push DCGAN to work reasonably well with training datasets that are very small and have little variety (since the samples are all of the same person)?

We managed to produce reasonable models using DCGAN with as few as 64 training samples. For human faces it seems that alignment is the key. However, we did observe the following problems, which have been reported by other DCGAN experimenters:

1. Model collapse. We have frequently observed the collapse of the model during training, where the model generates only one or very few distinct images, and the moving average of the gLoss/dLoss value (the generator's loss divided by the discriminator's) explodes. In contrast, this has not been observed in the baseline (i.e., Experiment #1), which is trained on a very large and diverse dataset and where the gLoss/dLoss value tends to stay reasonably stable throughout training.
2. Degradation from further training. Training for longer does not always create better results.
4. Do we gain any advantage by training DCGAN on video samples?
Data acquired from video samples is naturally consecutive (assuming that the sampling frame rate used is not too low), where adjacent frames are largely identical. We can look at this from several angles to see what we gain from this:

1. From the perspective of faster convergence during training or achieving higher-quality results, the benefit is not yet clear. More experimentation is needed in this area.
2. From the perspective of generating an avatar with finer expressions, such as mouth movements while the subject is talking, we believe that the use of video samples is a must. This is because only video can provide the timing and detail needed for recreating finer expressions. This is a topic for further research.
5. Can DCGAN be used to create a high-quality, dynamic model of a specific object beyond faces?
The experiments in this report are our first attempt at applying DCGAN to create a neural model of faces, rather than just creating interpolated images. We are hopeful that there are many interesting research directions here beyond just modeling faces.

Our contributions

We make the following contributions in this work:

1. We show a practical application of DCGAN in the form of building the neural model of a specific physical object from images or videos without supervision. Such a neural model can conceivably be used as a form of representation for certain visual aspects of a specific physical object.
2. We show that by interpolating in the Z representation of a trained DCGAN model, it is possible to synthesize photo-realistic animation of the specific object used for training.
3. We show that with the approach outlined above, using human faces as the subject matter, we are able to synthesize photo-realistic animated expressions from a limited training dataset with good results. With further work this technique can conceivably be extended to create a full photo-realistic avatar of a person.
4. We point out a practical bottom-up approach for applying the DCGAN technology. That is, instead of using DCGAN to interpolate from a large number of images with much variety, we can focus on one very specific object and create a detailed model of it. Over the longer term we can then accumulate and extend many such detailed models toward practical uses.
5. We offer an approach to alleviate the inherent under-fitting problem associated with very small training datasets, through the use of a reusable DCGAN model. In our experiments we built a Universal Face Model (UFM) representing a prototypical neural model of human faces, then used this UFM when training on a new, small dataset to build a new avatar. We show that the use of the UFM helps to alleviate the said under-fitting problem.
6. We show that through the use of a Universal Model, the training time for the neural model of a new subject can be substantially reduced. For the purpose of building avatars, this means that creating an avatar for a new person with a UFM takes only a fraction of the time needed to train from scratch.
7. We point out a promising future research direction where videos can be used as the training dataset for DCGAN.
###### Going Forward

It is my belief that DCGAN and its extensions can be used for building the neural models of our physical world, unsupervised, from images and videos.

Here we take the first baby step, using human faces as the subject matter for study: we have managed to build neural models for human faces using DCGAN, and then used such models to create photo-realistic, animated expressions.

While we have not yet built an avatar with a full range of expressions, we have demonstrated that the approach holds a great deal of promise. Viewed strictly from the perspective of creating an avatar using the DCGAN approach, there is still much to be investigated. More specifically:

1. Add a controlling element so that another program can treat the neural model as a dynamically controllable avatar. So far in this work we have demonstrated the fundamentals of synthesizing piecemeal expressions from limited images or videos, which is necessary for building an avatar, but we have not yet provided the dynamic control mechanism.
2. Generate images in much higher image resolution. Current experiments operate on training images at the resolution of 200x200 pixels or less, which is fairly grainy.
3. Add automatic segmentation capability for learning from parts of an image.
4. Automatic separation of spurious factors, such as lighting, clothing, background, hair style, etc.
5. Learn from videos at a higher frame rate. In Experiment #4 above we used a frame rate of one per second. This is in part due to the preliminary nature of this research; obviously we have lost a great deal of information with such coarse sampling.
6. Transfer of features or expressions. The Radford paper has demonstrated the possibility of operating on the vector representation to transfer visual features between images, such as adding sunglasses or a smile. It would be interesting to show that we can make avatar A smile like avatar B through the same principle.
7. Create avatar with full range of controllable expressions through unsupervised learning.
8. Acquire finer expressions through learning, such as those around the mouth when the subject is talking. Currently we are able to handle only relatively simple expressions, such as going from neutral to smiling, or turning the head. Dealing with fine expressions will likely require us to extend DCGAN further, perhaps into the temporal domain.
9. Perform multi-modal learning, e.g., acquire the relationship between the speech/text and facial expressions.
10. Convert neural models to 3D models suitable for VR/AR devices or 3D printers. This is an exciting area, since perfecting this would afford us an unsupervised method for creating the large number of dynamic and realistic 3D models needed to support a rich VR/AR world.
###### Resources
1. Goodfellow et al., Generative Adversarial Nets, 2014.
2. Radford et al., Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks, 2016.
3. Salimans et al., Improved Techniques for Training GANs, 2016.
4. Nicholas Guttenberg, Stability of Generative Adversarial Networks, 2016.
5. John Glover's blog, An introduction to Generative Adversarial Networks (with code in TensorFlow), 2016.
6. Casper Kaae Sønderby's blog, Instance Noise: A trick for stabilising GAN training, 2016.
7. A good introduction to DCGAN from OpenAI
8. StackOverflow, How to auto-crop pictures using Python and OpenCV
9. The DCGAN implementation used in this report: a TensorFlow implementation of DCGAN, contributed by Taehoon Kim (carpedm20).
Image interpolation, extrapolation, and generation

http://www.k4ai.com/dcgan/4576dfc3-a94d-4c85-a307-b594875d39ca
Mon, 07 Nov 2016 00:15:33 GMT

###### Introduction

Our ultimate goal is to generate 3D models out of textual or verbal commands. Here we tackle (for now) the simpler problem of generating 2D images, before moving on to the more complex problem of dealing with 3D models.

There has been some recent research relevant to the generation of 2D images that can also handle lighting, poses, perspective, and emotions (for facial images). In particular, DCGAN shows promise as a way of discovering high-level image representations through unsupervised learning, which is highly relevant to our goal here. In this post I will survey this research in order to find a direction towards the stated goal.

This post is part of the How to build a Holodeck series, which is a long-term crowd-driven open-source project (abbreviated to the name HAI below) that I am working on. The posts in the series serve as a working document for sharing ideas and results with the general research community.

About the HAI project: while the crowd-sourced HAI project has the fairly long-term goal of generating 3D models out of textual descriptions, here as a first step we reduce it to the simpler core problem of generating 2D images from textual descriptions.

###### Case #1: CNN+DNN

The paper Learning to Generate Chairs, Tables and Cars with Convolutional Networks proposes a method for learning from 2D images of many types of objects (e.g., chairs, tables, and cars, created out of 3D models for experimentation), and is then able to generate realistic images with unseen styles, views, or lighting. The method is based on a convolution-deconvolution (abbreviated CNN+DNN below) architecture.

The following shows a model from the paper. The goal of the model is to reconstruct the given image and segmentation mask from the given input parameters. The input parameters include the model identity defining the style, the orientation of the camera, and other artificial transformations (e.g., rotation, translation, zoom, horizontal or vertical stretching, changes to hue, saturation, or brightness).

This model works as follows:

1. (Layers FC-1 to FC-4) The input parameters are independently fed through two fully connected layers each, then concatenated and fed through two more fully connected layers to generate a shared high-dimensional representation h.
2. Layers FC-5 and uconv-1 to uconv-4 then generate the image and segmentation mask in two independent streams from h.
3. The network is trained by minimizing the error of reconstructing the segmented-out chair image and the segmentation mask.
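The shape flow of steps 1 and 2 can be sketched as below; the layer and parameter dimensions are placeholders for illustration, not the paper's exact sizes, and the weights are random rather than trained:

```python
import random

random.seed(0)

def fc(x, out_dim):
    """A fully connected layer (random, untrained weights) with ReLU,
    used here only to illustrate the shape flow of the model."""
    in_dim = len(x)
    w = [[random.gauss(0, 0.01) for _ in range(out_dim)] for _ in range(in_dim)]
    out = [sum(x[i] * w[i][j] for i in range(in_dim)) for j in range(out_dim)]
    return [max(v, 0.0) for v in out]

# Input parameter vectors (sizes are placeholders).
identity  = [random.gauss(0, 1) for _ in range(100)]  # style code
view      = [random.gauss(0, 1) for _ in range(4)]    # camera orientation
transform = [random.gauss(0, 1) for _ in range(12)]   # artificial transformations

# FC-1, FC-2: each input stream is processed independently, then concatenated.
streams = [fc(fc(p, 64), 64) for p in (identity, view, transform)]
concat = streams[0] + streams[1] + streams[2]
# FC-3, FC-4: joint layers produce the shared representation h.
h = fc(fc(concat, 128), 128)
# FC-5 plus the uconv-1..uconv-4 stack would then decode h into the image
# and the segmentation mask in two independent streams (not shown here).
```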

The challenges here are:

1. Can a high-level representation be learned through such a model? Put in plain language: if we ask the model to interpolate between two known chair styles (or other parameters, such as orientation), will we get something that looks like a reasonable chair?
2. How extensible is this method to natural training images that may have random backgrounds, inconsistent lighting, etc.?

From chairs to faces

zo7/deconvfaces is a Python implementation of the paper above, posted by Michael D. Flynn. The method was adapted to interpolate between images of human faces, with interesting results.

Interpolating between multiple identities and emotions: same lighting and pose (i.e., facial orientation).

Relevant resources for the deconvfaces experiment:

1. The Extended Yale Face Database B: the uncropped and cropped versions are supported by the zo7/deconvfaces implementation above.
2. Additional experimental results from applying deconvfaces to the Yale Face Database B, posted by Michael Flynn on imgur and YouTube.
3. Blog by Michael D. Flynn.

Following are some experimental results reported by Michael D. Flynn.

Interpolate between mixed identities and emotions, based on the Radboud Faces Database.

Interpolate on lighting, based on the Extended Yale Face Database B.

Interpolate on poses, based on the Extended Yale Face Database B.

Significance
From the perspective of the HAI project, this method is significant in the following areas:

1. It is able to acquire high-level representations of images. Such high-level representations are essential to the goal of performing various manipulations in order to meet a user's request.
2. It is able to generate reasonable interpolations from given images. This is a sign that the acquired representation is an effective one. This capability will allow HAI to generate infinite variations of the target image in order to meet a user's request.
3. It is able to perform some form of extrapolation. From the paper:

The chairs dataset only contains renderings with elevation angles 20◦ and 30◦, while for tables elevations between 0◦ and 40◦ are available. We show that we can transfer information about elevations from one class to another.


Such a capability for extrapolation, or generalization, is critical for reducing the amount of learning that is needed.

4. The deconvfaces experiments with human faces show that realistic lighting and poses can be interpolated. This shows promise that it is perhaps possible to generate realistic 3D models out of such 2D images.
###### Case #2: DCGANs: unsupervised learning of image representation

This is a class of CNNs called deep convolutional generative adversarial networks (DCGANs), which can be trained on image datasets and show convincing evidence that the deep convolutional adversarial pair learns a hierarchy of representations, from object parts to scenes, in both the generator and discriminator. It has also shown great promise for generating realistic-looking images.

Following are realistic images of bedroom scenery generated by DCGAN, from the paper (code: https://github.com/Newmu/dcgan_code).

While the images above look nice, how do we know that the model is meaningful? Again, by looking at how well it interpolates between images we can get a sense of whether it has learned a good image representation.

Following is an experimental result showing a series of interpolations between 9 randomly selected images. The significant part here is that all of the images look reasonably realistic, and the in-between transitions (say, from a TV to a window, or a window emerging from a wall) look plausible. This is as opposed to previous methods, which might just create a blurred morph between images.

As described in the paper, DCGAN is capable of learning a hierarchy of image representation through unsupervised learning. What does this mean, and why is it important for the HAI project?

As mentioned above, our goal is to allow realistic images (and eventually 3D models) to be created and manipulated through verbal commands. In order to allow images to be manipulated in complex ways toward such a goal, an image cannot be treated merely as a collection of pixels. But rather an image somehow has to be transformed into a hierarchy of parts, and moreover such a transformation has to be learned mostly unsupervised by the system itself.

Vector arithmetic

Following is an example that demonstrates the image representation learned by DCGAN, where the representation allows DCGAN to apply sunglasses to a female face based on what it has learned from other faces, even if it has never seen a woman with sunglasses before.

This is an indication that:

1. DCGAN has learned, unsupervised, how to break down the training images into meaningful parts (i.e., facial features are separate from sunglasses); and
2. DCGAN is capable of performing operations based on such a representation (e.g., transferring the sunglasses on a male face to a female face) with reasonable results. This is in many ways reminiscent of how Word2vec learns word representations from text, so that vector operations like Brother - Man + Woman yield Sister.

So in a sense, DCGAN already can be viewed as a precursor of the HAI system, where (with some additional training about verbal commands) it is perhaps possible to instruct it to manipulate faces towards what a user wanted.
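The Word2vec-style operation can be sketched in Z space as below. Following the Radford paper's recipe, the Z vectors of a few exemplar images per concept are averaged before doing the arithmetic; the vectors here are tiny made-up stand-ins:

```python
def mean_vec(vectors):
    """Average several Z vectors to get a stable concept vector, as the
    Radford paper does with a few exemplars per concept."""
    n = float(len(vectors))
    return [sum(vals) / n for vals in zip(*vectors)]

def vec_arith(a, b, c):
    """Element-wise a - b + c, Word2vec style."""
    return [x - y + z for x, y, z in zip(a, b, c)]

# Tiny made-up 2-D stand-ins for averaged exemplar Z vectors.
man_glasses = mean_vec([[1.0, 0.2], [1.2, 0.0], [0.8, 0.1]])
man         = mean_vec([[0.9, 0.2], [1.1, 0.0], [1.0, 0.1]])
woman       = mean_vec([[-1.0, 0.1], [-1.1, 0.0], [-0.9, 0.2]])

# z(man with glasses) - z(man) + z(woman) ~ z(woman with glasses):
woman_glasses = vec_arith(man_glasses, man, woman)
# Feeding woman_glasses to the generator would, per the paper, render a
# woman wearing glasses even if no such training image was ever seen.
```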

Related experiments

1. Here is a DCGAN implementation based on TensorFlow.
2. Here is a blog Image Completion with Deep Learning in TensorFlow showing how DCGAN can be used for image completion, where part of the image can be erased or added in a realistic manner.

Significance

DCGAN is important for the following reasons:

1. It is capable of generating a representation from training images, unsupervised.
2. It is capable of generating realistic images
3. It is capable of generating realistic interpolations
4. The Word2vec-like vector operation capability (see the woman-with-sunglasses example above) is intriguing, since it points to the possibility of a rich representation that can do much more than a simple one.

Open questions

1. Can DCGAN support some form of extrapolation? Can the image completion example above be considered a form of extrapolation, and how can it be further extended?
2. How far can we push the vector operation on this representation? How can we extend it to 3D?
###### Case #3: Generate Images from Text

This method uses the DCGAN approach for generating realistic images from text. Following are partial results displayed in the paper:

How it works

It trains a DCGAN conditioned on text features encoded by a hybrid character-level convolutional-recurrent neural network. Both the generator network G and the discriminator network D perform feed-forward inference conditioned on the text features.
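The conditioning can be sketched as follows. The text embedding here is random, standing in for the output of the char-level conv-recurrent encoder, and the dimensions are illustrative rather than the paper's exact values:

```python
import random

def generator_input(z_dim=100, text_dim=128, seed=0):
    """Build the input to a text-conditioned generator: the noise vector z
    concatenated with a (stand-in) text feature vector, so that generation
    is conditioned on the description."""
    rng = random.Random(seed)
    z = [rng.uniform(-1.0, 1.0) for _ in range(z_dim)]
    text_feature = [rng.gauss(0.0, 1.0) for _ in range(text_dim)]
    return z + text_feature

g_input = generator_input()
# The discriminator D likewise receives the text feature alongside the
# image, so both G and D perform inference conditioned on the description.
```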

Significance

1. Needless to say, this feels pretty much like a primitive Holodeck, where the system creates the target image based on textual descriptions.
2. Furthermore, this system is also capable of separating style from content (i.e., foreground and background information in the image).
3. Capable of pose and background transfer from query images onto text descriptions.
4. Can be generalized to generate images with multiple objects and variable backgrounds.
###### Case #4: Filling in Details with cGAN

How do we create convincing visual details for a specific object from little information?

The 2016 paper by Isola et al., Image-to-Image Translation with Conditional Adversarial Nets, demonstrates the use of a conditional GAN to generate convincing details from sketchy information, as shown below (from the paper):

which displays six pairs of images, with the left image being the input to cGAN, which then generates the image at right.

How it works

The standard GAN generator G learns a mapping from random noise vector z to output image y, i.e., G:z→y. In contrast, cGAN learns a mapping from observed image x and random noise vector z to y, i.e., G:{x,z}→y.
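The difference between the two mappings can be made concrete with a toy sketch (plain Python with made-up one-layer "networks"; real implementations use deep convolutional nets):

```python
def g_gan(z, weights):
    # Standard GAN: y = G(z); the output depends only on the noise vector.
    return [sum(w * zi for w, zi in zip(row, z)) for row in weights]

def g_cgan(x, z, weights):
    # Conditional GAN: y = G(x, z); the observed image x is concatenated
    # with the noise z before being fed through, so the output is
    # conditioned on x.
    return [sum(w * vi for w, vi in zip(row, x + z)) for row in weights]

z = [1.0, 2.0]              # random noise vector
x = [0.5]                   # observed (conditioning) input, flattened
w_gan = [[1.0, 1.0]]        # toy weights over the 2-dim noise
w_cgan = [[1.0, 1.0, 1.0]]  # toy weights over the 3-dim [x, z]

assert g_gan(z, w_gan) == [3.0]       # depends on z only
assert g_cgan(x, z, w_cgan) == [3.5]  # shifts with the condition x
```

Same noise, different condition, different output: that extra input x is the entire structural difference between GAN and cGAN generators.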

Significance

1. This gives us a starting point for generating details of a specific object or environment on demand.
2. The use of a conditional term (i.e., the x in cGAN) may allow us to have more control over the behavior of the system.
###### Case #5: Synthesizing facial expressions from video/image sample

By applying standard DCGAN but to image or video samples of a specific person, I was able to create some sort of a neural model representation, and then use the model to generate sequences of non-existent photo-realistic facial expressions for the person.

While this does not involve technical innovation beyond the standard DCGAN, it does represent a novel way of applying the DCGAN towards specificity (i.e., the facial expressions of a specific person), and not generality (i.e., for generating arbitrary bedroom scenes).

###### Summary

So we have surveyed a number of promising research directions above, from which we might borrow some ideas and extend them further in order to achieve what we need for the HAI project.

Following is what we have learned:

1. The DCGANs (and their variations) show a promising general direction for the HAI project.
2. It would seem that it is possible to generate a meaningful image representation out of it, where operations such as interpolation, extrapolation, and vector operations can be carried out with good quality. Such operations are essential for the HAI project.
3. Case #3 demonstrates that it is possible to separate image background (called style in the paper), and apply it to another context. This is critical for image composition in HAI.

Following are possible future directions, where we wish to answer these questions:

1. What if we extend the generative approach in DCGAN or the conv-deconv methods, but train entirely on the photos of a single person (as opposed to the wide-variety approach adopted in most previous experiments), in order to create a highly polished and manipulable neural model of such a person? More specifically:

1. Can such a highly polished neural model of a person encompass expressions, poses, ages, and lighting?
2. What does it take to transfer such parameters to another identity?
3. Can it learn to remove spurious information, such as the background?
4. Would DCGAN work well on video of a single person? Would the implied object persistence (i.e., the man in frame N and the man in frame N+1 are most likely the same person) be beneficial to the training process in some way?
5. Case #3 above shows that a multi-modal DCGAN is a promising method for discovering complex relationships between text and images. How can we extend this into the domain of interactive discourse, so that it is possible to generate the target image through incremental textual commands?

Such questions will be explored in a separate post.

2. What does it take to be able to *manipulate parts* of an image? For example, in the chair example above the system needs to be able to alter only part of it (e.g., the arm rest) per request.
3. Need the capability to reason about relationships between parts of an image, such as understanding even spacing, distance, top/down/left/right relationships, etc.
4. Find a way to accumulate relevant knowledge incrementally, so that we don't have to retrain from scratch every time.
5. What does it take for the system to learn conversational interactions, so that the target image can be generated through a sequence of interactive textual commands? Case #3 points out a direction, although there is still much to be done. Note that here we wish to have the system learn everything without hard-coded knowledge, if possible.
6. What does it take to achieve one-shot learning?
7. What does it take to achieve 3D representation, perhaps in a way similar to what DCGAN made possible for 2D images?

Going forward: we will further pursue and extend the research mentioned in separate posts, including hands-on testing with actual implementations.

###### Other resources

The following are kept here because they are potentially useful, but still pending further investigation:

1. Paper: How Do Humans Sketch Objects?
Question: can DCGAN be used to create realistic sketches?
2. Paper: Precomputed Real-Time Texture Synthesis with Markovian Generative Adversarial Networks. From the abstract: "With adversarial training, we obtain quality comparable to recent neural texture synthesis methods. As no optimization is required any longer at generation time, our run-time performance (0.25M pixel images at 25Hz) surpasses previous neural texture synthesizers by a significant margin (at least 500 times faster). We apply this idea to texture synthesis, style transfer, and video stylization. ... An important avenue for future work would be to study the broader framework in a big-data scenario to learn not only Markovian models but also include coarse-scale structure models."
]]>
<![CDATA[Machine Learning on Google Cloud and AWS/EC2, Hands-on]]> Here we look at running computing-intensive machine learning jobs using Google Cloud Platform (GCP) with TensorFlow, and also doing the same on AWS/EC2 GPU instances, from the perspective of cost efficiency, training time, and operational issues.

This investigation is part of my effort in the open-source project terraAI, which requires a great deal of computing power for Machine Learning.

]]>
http://www.k4ai.com/cloudml/e2e266b1-b556-4998-be74-a3f6feb86e28Mon, 17 Oct 2016 20:03:00 GMT

Here we look at running computing-intensive machine learning jobs using Google Cloud Platform (GCP) with TensorFlow, and also doing the same on AWS/EC2 GPU instances, from the perspective of cost efficiency, training time, and operational issues.

This investigation is part of my effort in the open-source project terraAI, which requires a great deal of computing power for Machine Learning.

1. For information on how to set up the GCP+TensorFlow environment, please see my previous post.
2. For information about research related to the DCGAN machine learning systems tested, please see a separate post here. The possible application of DCGAN is investigated in the How to Build a Holodeck series.

The following was recorded in October 2016. Since I expect GCP and TensorFlow to evolve quickly, the following information may not be applicable after a while.

###### Preemptible/Spot instances

Both AWS/EC2 and GCP offer substantial discounts on interruptible VM instances, which use the platform's excess computing capacity. These are called preemptible instances on GCP, or spot instances on AWS/EC2. Such instances can get terminated at any time due to system events (such as when the available capacity is tight), bid price exceeded (AWS/EC2), time limit exceeded (no more than 24 hours on GCP), etc.

Following is a pricing example for GCP:

Machine type: n1-highcpu-32
vCPUs: 32
Memory: 28.80 GB
GCEU: 88
Price (USD) per hour: $0.928
Preemptible price (USD) per hour: $0.240


As can be seen above, the discount is quite substantial. It is also worth noting that AWS/EC2 supports a bidding mechanism, so that it is possible to bid with a lower price for spot instances if you are willing to wait for better prices.

Note the following limitations for GCP:

Preemptible instances cannot live migrate or be set to automatically restart when there is a maintenance event. Due to the above limitations, preemptible instances are not covered by any Service Level Agreement (and, for clarity, are excluded from the Google Compute Engine SLA).


Handling instance interruption is very important, for otherwise you may lose the results from training sessions that take many days to run.

Following are some notable differences between the two platforms on dealing with instance termination:

1. AWS/EC2's spot instances can get terminated on a two-minute warning (as opposed to the more predictable 24-hour limit on GCP). On AWS/EC2 the main cause for instance termination is when your bid price is exceeded by the market price (which changes all the time).
2. AWS/EC2 requires you to poll for termination notice, which is more cumbersome than GCP's asynchronous notification mechanism.
Details about the AWS/EC2 spot interruption polling mechanism can be found here.
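The AWS/EC2 polling just described can be implemented with the standard library alone. The metadata URL below is the endpoint AWS documents for spot termination notices; the wrapper function itself is an illustrative sketch:

```python
import urllib.request
import urllib.error

# Instance metadata endpoint that AWS documents for spot termination
# notices; it returns 404 until a termination notice has been posted.
TERMINATION_URL = "http://169.254.169.254/latest/meta-data/spot/termination-time"

def check_termination(url=TERMINATION_URL, timeout=2):
    """Return the termination timestamp string if the two-minute warning
    has been posted, or None otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("ascii").strip()
    except (urllib.error.URLError, OSError):
        # 404, or no metadata service reachable: no termination notice yet.
        return None
```

In practice you would call this every few seconds from a background thread and trigger a checkpoint as soon as it returns a non-None value.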

Following is what happens when a preemption occurs on GCP:

1. GCP's Compute Engine sends a preemption notice to the instance in the form of an ACPI G2 Soft Off signal. You can use a shutdown script to handle the preemption notice and complete cleanup actions before the instance stops.
2. If the instance does not stop after 30 seconds, Compute Engine sends an ACPI G3 Mechanical Off signal to the operating system.
3. Compute Engine transitions the instance to a TERMINATED state. You can simulate an instance preemption by stopping the instance.
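Step 1 above mentions a shutdown script. As a sketch, a hypothetical cleanup script (the bucket name and checkpoint path are made-up placeholders, and `gsutil` must already be installed on the instance) might look like:

```shell
#!/bin/bash
# Runs when the ACPI G2 preemption signal arrives (~30-second budget):
# flush the latest checkpoints to a persistent bucket before shutdown.
# "my-ml-bucket" and /home/ml/checkpoints are illustrative placeholders.
gsutil -m cp -r /home/ml/checkpoints "gs://my-ml-bucket/checkpoints-$(hostname)"
```

Keep the script short: anything that does not finish within the budget is simply cut off when the G3 signal follows.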

Per TensorFlow documentation, an AbortedError exception is raised in case of such a preemption.
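Either way, the safe pattern is the same: checkpoint frequently and make startup resume from the last checkpoint. Below is a framework-agnostic sketch of that pattern (plain Python with pickle; in real TensorFlow code you would catch the AbortedError above and save/restore with TensorFlow's own checkpointing support):

```python
import os
import pickle

CKPT = "train_state.pkl"

def load_state():
    # Resume from the last checkpoint if one exists; otherwise start fresh.
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "loss": None}

def save_state(state):
    # Write atomically so a preemption mid-write cannot corrupt the file.
    tmp = CKPT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CKPT)

def train(total_steps=10, checkpoint_every=2):
    state = load_state()
    while state["step"] < total_steps:
        state["step"] += 1                # one (toy) training step
        state["loss"] = 1.0 / state["step"]
        if state["step"] % checkpoint_every == 0:
            save_state(state)
    save_state(state)
    return state
```

With this shape, an interrupted run simply restarts `train()` on the next instance and continues from the last saved step instead of from scratch.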

Restarting an instance

To restart a spot/preemptible instance after interruption:

1. GCP: this is a simple matter of restarting the stopped instance from the console. There is no need to reconfigure a new preemptible instance.
2. AWS/EC2: it can be pretty tedious on AWS/EC2 in some situations:
1. If the instance is defined as a one-time spot instance, then you will need to relaunch a new spot instance and go through all the configuration choices again, which is a chore. I wish AWS provided a way to save such configuration choices so that it is possible to relaunch a one-time spot instance with just one click.
2. If the instance is defined as a persistent spot instance, then it can get restarted automatically when conditions are right. Here you need to make sure that your instance is configured to resume your Machine Learning job automatically on reboot, for otherwise you will be wasting money with the instance staying idle.
3. There is no way to temporarily pause (or stop, in AWS/EC2 terminology) a spot instance. The best you can do is to terminate the instance and then relaunch it (going through the hassle of having to reconfigure the launch).
###### Pricing for GCP On-demand Instances

As a comparison, pricing examples for the GCP on-demand instances are given below:

1. GCP Debian VM instance, 1 vCPU (Intel Haswell), 3.75 GB memory, cost: ~USD$30/month
2. GCP Debian VM instance, 8 vCPUs (Intel Haswell), 30 GB memory, cost: ~USD$200-300/month

No GPU instance is available on GCP as of this writing.

The pricing above is for reference only. Since our goal is to run computing-intensive deep learning tasks, below we will compare GCP's 8-vCPU instances with AWS/EC2's GPU instance.

###### GCP vs AWS/EC2's GPU instances

Following are some results from running the same TensorFlow test case on GCP, and also AWS/EC2 with GPU.

Test case used

I used a DCGAN (Deep Convolutional Generative Adversarial Networks) implementation as the test case below, mainly because of my interest in image (and later 3D model) generation (see my How to Build a Holodeck series). My thoughts about how to apply DCGAN towards such a goal can be found here.

1. The TensorFlow implementation of DCGAN from the paper Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks is used.
2. The Celebrity image dataset (celebA) is used.
3. The following test uses the AWS/EC2 Spot Instance and the GCP Preemptible Instance, which are the much cheaper versions of the regular on-demand instances.

Environments tested

1. GCP:
• Hardware: 8 vCPUs, 15GB memory, 20GB disk. Note that as of this writing no GPU instance is available on GCP.
• OS image: Debian GNU/Linux 8 (jessie)
• Software: TensorFlow 0.11, Python 2.7, installed under Anaconda (v 4.2.9)
• Preemptible instances are used for lower cost.
• GCP storage is used to persist changes between VM instances.
2. AWS/EC2:
• Hardware: GPU instance g2.2xlarge (current generation), 8 vCPUs, 15GB memory, 60GB SSD
• OS image (AMI): ami-75e3aa62 ubuntu14.04-cuda7.5-tensorflow0.9-smartmind. This AMI comes with everything needed for the test pre-installed, except for scipy.
• Software: TensorFlow 0.9, Python 2.7, Cuda 7.5
• Spot GPU instances are used for lower cost

Results

Please note the following are rough comparisons. In particular the cost factors often vary greatly by region, are constantly adjusted, and might be affected by many different options (such as load balancing).

1. Time to execute one epoch:
   1. AWS/EC2 with GPU: 5032 seconds
   2. GCP: 34505 seconds
2. Cost:
   1. AWS/EC2:
      1. Spot instance: USD$0.10 per hour
      2. On demand: USD$0.65 per hour
   2. GCP:
      1. Preemptible: USD$0.06 per hour
      2. On demand: USD$0.30 per hour
###### Summary

1. The AWS/EC2 instances are substantially more cost efficient, and also many times faster, than GCP in training the same DCGAN model using TensorFlow. While the Google Cloud Platform has many things going for it, the lack of GPU instance support (or perhaps TPU support one day) really makes it uncompetitive for training Machine Learning models at this time.
2. AWS/EC2's spot instances cannot be paused (i.e., stopped), but GCP's preemptible instances can. This means that if you have allocated a large and very expensive AWS/EC2 GPU spot instance (such as the p2.16xlarge, which costs USD$144.0 per hour), and you wish to pause it for some reason, then it is more problematic to deal with in AWS/EC2 than in GCP. Here you basically need to do the following before terminating an instance:
   1. Make sure that your ML code is well written to checkpoint and reload partial results as needed.
   2. Copy partial results to a persistent storage (such as a mounted AWS/S3 bucket).
   3. Make an image out of the current instance, if you have installed or configured something that you wish the next spot instance to pick up.
3. AWS/EC2 offers several GPU tiers, including the following (spot instance pricing as of this writing, all based on Linux, current-generation GPU):
   1. g2.2xlarge: USD$0.10 per hour (tested above)
   2. g2.8xlarge: USD$0.611 per hour
   3. p2.xlarge: USD$0.1675 per hour
   4. p2.8xlarge: USD$72.0 per hour
   5. p2.16xlarge: USD$144.0 per hour

Curiously, the spot pricing for p2.8xlarge and p2.16xlarge is much higher than the on-demand versions. Not sure why this is the case.

A strange ramp up effect was observed for the test case, where it seems to be unusually slow at the beginning. Details as follows (for the g2.2xlarge instance):

• If extrapolating from the first 2% of the epoch, the cost would be USD$5.9/epoch
• If extrapolating from the first 10% of the epoch, the cost would be USD$1.93/epoch
• If extrapolating from the final 10% of the epoch, it should take 5127 seconds to run one epoch, about the same as a g2.2xlarge instance but at a much higher cost.

As such, the computing times are extrapolated from the latter half of the first epoch to represent the steady-state throughput.

Using the test case above with all other parameters staying the same, the following are partial results for running the test (measured in cost per epoch).

1. g2.2xlarge: 5032 seconds/epoch * USD$0.10/hour = USD$0.14/epoch
2. g2.8xlarge: 5788 seconds/epoch * USD$0.611/hour = USD$0.98/epoch. It is unexpected that this turns out to be slower than g2.2xlarge. Some kind of configuration error is suspected, but none was found.
3. p2.xlarge: 3795 seconds/epoch * USD$0.1675/hour = USD$0.177/epoch.
4. p2.8xlarge: unable to test, due to the error "There is no Spot capacity for instance type p2.8xlarge in availability zone".
5. p2.16xlarge: Not tested.
4. Persistent storage. In my tests I use persistent storage (i.e., AWS/S3 buckets, or Cloud Storage on GCP) for storing computing results independent of the VM instances. This is a very handy arrangement, but the following should be noted:
1. Such persistent storage is much slower (on both AWS and GCP) than the local disk on a VM instance. For example, I have found that simple operations (listing, unpacking, moving, reading) on a large dataset with 200,000 images could take hours or days (!). I ended up putting such datasets on the local disk, which also means that I need to create a launchable image that includes the dataset, so that the next VM instance can pick it up. This is far from ideal.
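As a sanity check, the per-epoch costs listed above are just the measured epoch time converted to hours and multiplied by the hourly spot price:

```python
def cost_per_epoch(seconds_per_epoch, usd_per_hour):
    """Convert a measured epoch time and an hourly spot price into USD/epoch."""
    return seconds_per_epoch / 3600.0 * usd_per_hour

# Figures measured in the tests above:
assert round(cost_per_epoch(5032, 0.10), 2) == 0.14     # g2.2xlarge
assert round(cost_per_epoch(5788, 0.611), 2) == 0.98    # g2.8xlarge
assert round(cost_per_epoch(3795, 0.1675), 3) == 0.177  # p2.xlarge
```

The same one-liner is handy for estimating the cost of a full training run: multiply by the planned number of epochs before committing to an instance type.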
###### Recommendation
1. Overall, AWS/EC2's g2.2xlarge seems to be a good value if you are on a budget. It is the least powerful current-generation GPU instance offered on AWS/EC2, but once you have it set up you can easily scale to a higher GPU tier where you pay more for speed. If you are not in a hurry running your experiments, then one strategy is as follows:
1. Use spot instances which cost a fraction of the on-demand version.
2. Make sure that your program checkpoints its vital contents often, and that it can stand up to frequent unexpected termination and restart. Luckily TensorFlow has good support for saving and restarting models, so most programs written in TensorFlow are pretty good in this respect.
3. Set a spot instance bid price of around half the on-demand instance price. This way your spot instance won't get terminated too often, while you can still take advantage of the long stretches of low prices that are often available for spot instances.

2. The top-end GPU instances, such as AWS/EC2's p2.8xlarge or p2.16xlarge, are not cheap. If you plan on running heavy machine learning jobs constantly, then buying your own GPU (e.g., the NVIDIA GeForce GTX Titan X) could be more cost effective. However, the cloud environment makes it much simpler to scale computing power up and down at will, and also simplifies access, monitoring, and management. Which approach is better really depends on the weight that you give to each factor (e.g., cost, convenience, scalability, ease of management, etc.).
3. I have used a TensorFlow test case in my experiments above, mainly because TensorFlow is an open-source software library designed for scalability. If you expect that your Machine Learning system will need to be deployed at very large scale one day, then I'd recommend that you also implement your code based on TensorFlow.
4. Keep your eyes on GCP: while it is not very useful for doing Deep Learning research at this time, I expect (and hope) that it will catch up with AWS/EC2 soon. Note that GCP does offer a range of Machine Learning services which are supposedly highly scalable, but since my interest is in conducting ground-breaking research, I have no need for those pre-packaged services.

1. Hands-on with TensorFlow on GCP - set up: my experience with setting up a Machine Learning environment using the Google Cloud Platform.
2. Image interpolation, extrapolation, and generation: looking into the possibility of using DCGAN for the purpose of generating images (and eventually 3D models) from textual commands. This is part of the How to Build a Holodeck series.
3. How to Build a Holodeck.
]]>
<![CDATA[Hands-on with TensorFlow on GCP]]>http://www.k4ai.com/gc-setup/da9b102c-df9c-4e0e-b17d-abcd4d411e28Sun, 16 Oct 2016 01:08:00 GMT

Following is my experience with the Google Cloud Platform (GCP). I am already familiar with Amazon's Elastic Compute Cloud (EC2), so this investigation will help me decide which platform better suits the needs of my own terraAI project.

The following was recorded in October 2016. Since GCP and TensorFlow are likely to evolve quickly, I expect that this information could become outdated fairly quickly.

Please note that I am approaching this from the point of view of using GCP+TensorFlow for Machine Learning research, not general computing. However, it is likely that a good part of this will be useful to anyone who wants to use GCP for other purposes.

The TensorFlow version tested as of this writing was 0.11.

###### Why GCP?

Several reasons prompted me to look into GCP:

1. GCP explicitly supports scalable Machine Learning services through TensorFlow, which seems very useful for machine learning tasks that require a lot of computing power.
2. GCP offers credits of $20,000-$100,000 for startup companies. You might want to look into it to see if it is applicable to you.
3. Google is offering a two-month free trial with USD$300 credit.

###### Doing the Free Trial

1. Go to GCP's homepage, click on the TRY IT FREE button and follow the instructions there (ref: GCP docs) to set up the basic environment. It takes perhaps an hour to get everything set up, which was a little tedious, but overall the instructions were pretty clear.
2. Set up an instance for testing out the Google Cloud Machine Learning API (GCML). Note that this involves enabling the relevant API from the Cloud Shell. Cloud Shell is basically a browser-based terminal console for your server instance, which is based on Debian Linux. Note that during this process you will be creating a GCP bucket for persistent storage (similar to Amazon S3).
3. Test out GCML using the Training Quickstart, which is a canned example using the MNIST dataset.
   - Training: running this test example takes only a couple of seconds through the initial Cloud Shell (i.e., without submitting the task to a new VM instance).
   - Inspection: a tool called TensorBoard is available for inspecting the ML model and results. This tool is launched as a web server by typing the command `tensorboard --logdir=data/ --port=8080` on the Cloud Shell's command line; the UI (a browser client) can then be launched by clicking on the Web preview button on the Cloud Shell's menu. Following is a sample view of the computation graph on the TensorBoard. More information about this TensorFlow tutorial can be found here. TensorBoard supports interactive inspection and manipulation, which is very nice.

Quick Observations

1. No GPU instances are available at this time.
2. The available instance tiers are not as extensive as EC2's.
3. The Cloud Shell often feels sluggish, where merely echoing the textual commands entered could take a second or two, which is quite annoying.
4. The pricing for standard GCP bucket storage is USD$0.026 per GB per month, which is roughly comparable to Amazon's S3 (USD$0.03 per GB per month).
5. What you can access through the Cloud Shell is essentially a very small VM instance that is allocated automatically for you. Although you can probably do a wide array of things with it (e.g., installing packages, running programs), you should use the Cloud Shell principally as a management console, since it offers only very small computing capacity and is also ephemeral.
6. For serious local computing that is not yet ready for the cloud, you should create a real instance, which I will discuss later.

###### Setting up a new VM instance

The FREE TRIAL process above gave me a quick taste of what GCP offers, so my next step was to create a new virtual machine instance for more serious computing. My purpose here is to develop and train a certain Machine Learning model, so the instance should have reasonable computing power. A default "real" instance will cost around USD$30 per month. Such an instance can be stopped when not needed, in which case it should cost little, based only on the storage required to keep it.

The configuration and cost for the default VM instance created is as follows:

1. 1 vCPU with 3.75 GB memory @ nominal cost of $36.50/month
2. 10 GB standard persistent disk: $0.40/month
3. Sustained use discount: -$10.95/month
4. $25.95 per month estimated; effective hourly rate $0.036 (730 hours per month)
5. Debian GNU/Linux 8 (jessie)

Somehow trying to set up a real VM instance for the ML package turned out to be much more tedious than I had expected. Following is what I had to go through:

1. It wasn't clear how to do this from the GCP Dashboard after I was done with the quick trial, so I had to google separately to find the instructions. Click on the 'LOCAL:MAC/LINUX' tab there, and find the first step "Install Miniconda for Python 2.7".
2. Launch a browser shell from a selected instance on the GCP Compute Engine console. This shell is annoyingly slow, but it will do for now since I had trouble setting up SSH access (see below).
3. Follow the directions (see the "Linux Anaconda install" section) to download the installation script Anaconda-latest-Linux-x86_64.sh.
4. Execute the installation script; got an error message that bzip2 is missing.
5. Execute `sudo apt-get install bzip2` to get it installed, then install Anaconda again.
6. Got an error message that the directory 'anaconda2' already exists. Remove the directory, then execute the installation script again.
7. Close and reopen the shell as instructed to get the installation to take effect. This got Anaconda installed successfully.
8. Now back on the "Setting up your environment" page, execute SIX more steps to get various components installed. For the last step, which installs TensorFlow, I got the following error message:

        Installing collected packages: funcsigs, pbr, mock, setuptools, protobuf, tensorflow
        Found existing installation: setuptools 27.2.0
        Cannot remove entries from nonexistent file /home/kaihuchen01/anaconda2/envs/cloudml/lib/python2.7/site-packages/easy-install.pth

9. At this point, invoking python and trying `import tensorflow` got `ImportError: No module named tensorflow`, so the installation obviously failed.
10. Googled around for solutions and found this. Following the "frankcarey commented on Jan 9" entry solved the problem. This is great, but we are not done yet.
11. Next, do the step "Install and initialize the Cloud SDK using the instructions for your operating system". There are 14 (!) steps there.
12. Moving along and reaching the step `gcloud components install beta`, I got the following error message:

        You cannot perform this action because this Cloud SDK installation is managed by an external package manager. If you would like to get the latest version, please see our main download page at: https://cloud.google.com/sdk/
        ERROR: (gcloud.components.install) The component manager is disabled for this installation

    Decided to just ignore it and move on.
13. On the step `curl https://storage.googleapis.com/cloud-ml/scripts/check_environment.py | python` I got the following error message:

        ERROR: Unsupported TensorFlow version: 0.6.0 (minimum 0.11.0rc0).

    Perhaps the earlier installation of TensorFlow had failed. Reinstalling it somehow made the problem go away.
14. Got an error when trying to verify with `import tensorflow` in python. Googled around, found this, and used `pip install -U protobuf==3.0.0b2` to solve the problem.
15. At this point I finally got 'Success' at the 'Verifying your environment' step.

Seriously, Google? This whole process really needs to be made much simpler.

###### Google Cloud SDK

The Cloud SDK is described as follows:

> The Cloud SDK is a set of tools for Cloud Platform. It contains gcloud, gsutil, and bq, which you can use to access Google Compute Engine, Google Cloud Storage, Google BigQuery, and other products and services from the command-line. You can run these tools interactively or in your automated scripts.

While it is not stated explicitly, this in fact allows you to manage GCP from your own computer, instead of doing so from the Cloud Shell through the browser. You don't need to install this if your purpose is to do a quick test.
###### Generating SSH key pair

For real work it is vital to be able to transfer files to/from GCP (e.g., using a tool like winscp), as well as to use a native SSH client (e.g., putty) for accessing GCP. This is not needed if your purpose is just a quick trial of GCP. The first step for doing FTP and SSH is to generate an SSH key pair. The GCP documentation describes two ways of doing it.

Generating an SSH key pair using puttygen (failed)

This involves using a native puttygen program on my Windows PC. Following the "To generate a new SSH key-pair on Windows workstations" steps in this instruction, I got the following error message:

    Invalid key. Required format: <protocol> <key-blob> <username@example.com> or <protocol> <key-blob> google-ssh {"userName":"<username@example.com>","expireOn":"<date>"}

Abandoned. Update (2016.10.17): based on the experience below with setting up FTP and SSH, it is likely that this was because the key pair generated in puttygen was not of the right type. However, the instructions are not clear in this respect. Overall it is still easier doing this on the VM instance (see below), since you then do not have to bother with the relevant configuration on the VM instance.

Generating an SSH key pair on the VM instance (works!)

1. Open a browser-based session to the target VM instance.
2. Issue the command `gcloud compute ssh instance-1`, where instance-1 is the name of my instance. It reports that there is no SSH key for the Google Compute Engine, and proceeds to generate an RSA key pair.
3. Enter a passphrase.
4. In the end it reports: `ERROR: (gcloud.compute.ssh) Could not SSH to the instance.` IGNORED.

This is the key pair that we are using for the FTP and SSH setups below.

###### Accessing GCP through Winscp

For setting up winscp (my favorite FTP client), do the following:

1. Download the private key generated above in /home//.ssh/googlecomputeengine. Actually I just displayed it in the console and then cut-and-pasted it into a local file.
2. Use puttygen to convert the private key file from SSH2 format to PUTTY format (as required by winscp).
3. Create an access entry in winscp, using the private key above, as well as the instance's external IP address.
4. Connect to the server instance.

Key caching: if you created your key pair with a passphrase, then you will be prompted to enter the passphrase every time you connect through Winscp or an SSH client. This can be quite annoying, since such sessions do time out quite often, and you will be forced to enter the passphrase every time. A good solution is to use a tool that caches the private key, such as Pageant.

###### Accessing GCP through SSH

1. Use puttygen to convert the private key file from the old SSH2 format to a more updated format (as required by KiTTY or PUTTY).
2. Create an access entry in KiTTY and configure it with the private key above, as well as the instance's external IP address.
3. Connect to the server instance.

###### Accessing Cloud Storage from Python

Google Cloud Storage offers persistent storage similar to Amazon's AWS S3. Creating a GCP bucket (called variably a 'bucket' or 'disk' in GCP) is easy from the GCP admin console. Here my goal is to set things up so that I can access my bucket from a Python program. This turned out to take more effort than I had expected. Following is a log of my quest:

1. First we need to install the client library for Python. The starting point is this page: Google API Client Library > Python.
2. The above page points to this page, which instructs me to execute the command `pip install --upgrade google-api-python-client`; this went without a hitch.
3. At this point if I try to execute `python -c "import cloudstorage"` I get the error message `ImportError: No module named appengine.api`. Does this mean that somehow the Cloud SDK hasn't been installed correctly? Trying to reinstall it following the directions did not help at all. Abandoned for now.
###### Mounting Cloud Storage as a local file system

So how can we mount a cloud bucket as a local file system on a given instance? Doing so is vital for my target setup of having many VM instances created and terminated as needed, but with persistent storage shared among all of these instances. Following are the steps:

1. Follow the steps here to create and mount GCP cloud storage.
2. At some point it indicates the need to install the gcsfuse tool on the instance, which leads to these instructions here. Beware that where the instructions call for the gsutil command, it may report an error unless sudo gsutil is used instead.

After this is done it is possible to have multiple VM instances accessing shared data sets stored in a bucket. However, beware of the following pitfall (excerpt from the instruction page):

Note: Cloud Storage is an object storage system that does not have the same write constraints as a POSIX file system. If you write data to a file in Cloud Storage simultaneously from multiple sources, you might unintentionally overwrite critical data.

Persisting the mounted drive

The above instructions only get a bucket mounted for the session, which means it will not be mounted automatically the next time the instance is rebooted. GCP does not make this information readily available, so I had to google around for how to achieve this; some information is here. By default gcsfuse allows only the user who mounts the file system to access it, but when you put an entry in fstab to auto-mount it on reboot, the root user will own the file system, preventing other users from accessing it. The following is what works, and the lessons learned:

1. Edit /etc/fstab to auto-mount a bucket on reboot, but do not just edit and then reboot the system, since you might brick the instance and have to ditch it. Better to create a snapshot before you do this.
2. For me the following fstab entry works (until I rebooted the system, that is):

   console-xxxxxx.appspot.com /mnt/mybk gcsfuse rw,user,allow_other

   where console-xxxxxx.appspot.com is the GCP name for my bucket, and /mnt/mybk is the mount point on my file system. The allow_other flag allows root to do the mounting while still letting other users access the bucket.
3. Do not just reboot in order to test the updated fstab, since you may brick the whole instance if there is something wrong with the change. Instead use the command sudo mount -a to test it out first.
4. Up to this point everything worked well for me: I was able to get the bucket mounted through the new entry in fstab and verified it with sudo mount -a (i.e., without rebooting), and files in the bucket were accessible as expected. However, the instance was bricked as soon as I rebooted it. I tried this three times, all with the same result.

There is something called the serial console which is useful for limited diagnosis. It is accessed from the VM Instances dashboard: find the instance in question, click on the SSH dropdown menu at far right and select View gcloud command; you will then be able to see the system's boot log and get some sense of what's going on.

With help from the Google Cloud forum, the following steps make it work:

1. You must have the noauto flag in fstab, otherwise the system will hang on reboot. You also need the dir_mode and file_mode options there, otherwise the files in the bucket won't be writable. The fstab entry that works looks like the following:

   bucket-name mount-point gcsfuse rw,noauto,user,allow_other,file_mode=777,dir_mode=777 0 0

2. Add an entry mount mount-point in the /etc/rc.local file to get the bucket mounted on reboot.

Here is some relevant information about the permissions and ownership of the gcsfuse system: by default, all inodes in a gcsfuse file system show up as being owned by the UID and GID of the gcsfuse process itself, i.e. the user who mounted the file system. All files have permission bits 0644, and all directories have permission bits 0755 (but see below for issues with use by other users). Changing inode mode (using chmod(2) or similar) is unsupported, and changes are silently ignored.

Solution, finally

With help from Google support (see here), the following is the correct solution:

1. Upon creating a new instance, the full access scope must be specified.
2. The file /etc/fstab needs to contain the following entry:

   bucket-name mount-point gcsfuse rw,noauto,user,allow_other

3. The rc.local needs to contain the command mount mount-point so that the bucket gets mounted on reboot.
4. Modifying a file in the bucket requires the use of sudo, otherwise permission will be denied.

One loose end is that it is still impossible to make a file executable (e.g., for a shell script), since sudo chmod simply fails silently.

###### Adding or Resizing local disk

Can you resize the local disk attached to a VM instance, or add more disks? This is quite important, since you might start an instance at 10GB and later find that you need more space for some large ML datasets. Fortunately the answer is yes, and it can be done quite easily through the Compute Engine console (see instructions here).

###### Reserving static IP Address

If there is a need to set up a server on GCP that accepts requests over the Internet, then it is vital to have a static IP address. The method for reserving a static IP address can be found here. Reserving a global static address may incur more charges; the pricing for reserved static IP addresses can be found here.

###### Managing ML jobs

Machine Learning jobs usually take a very long time to complete. Following are some techniques that worked well for me. Some of these are not specific to GCP or ML, but are nonetheless kept here as a reference:

1. The Linux Screen tool.
This tool allows me to start a long-running job in a virtual 'window', detach from it to do other things or even close the SSH session, then come back later and re-attach to the same virtual window to check on the progress. Such operations involve only keyboard commands and are much easier than the alternatives. Without such a tool I would either have to run the job manually from a terminal and risk accidental termination of the session, or have to do some more complex setup.

###### Multiple Instance Setup for ML

When using a hosting service (such as GCP or AWS/EC2) for ML computing, we face a dilemma: server instances are charged by the hour and by capacity, and a large instance is great for sustained heavy computation but too expensive if we are just spending time tweaking a model manually. One way to deal with this is the following setup:

1. Create a small instance and configure it with all the software packages needed for the task. This is where you do all the manual tasks that do not require a lot of computing power.
2. Set up a persistent bucket (e.g., GCP's Cloud Storage, or Amazon AWS/S3) and mount it as a local file system on the small instance. Put all of the code and data there. One side benefit is that you now also have almost unlimited storage space, without having to allocate new disks on the instance when you run out of space.
3. When the configuration of the small instance is stable, create a snapshot from it, then use that snapshot to launch a large instance with a lot more computing power. This way you do not have to go through the same tedious configuration process again.
4. Also mount the same bucket on the large instance as a local file system, so that all instances involved share a common storage space.

Given the above setup, the typical workflow is then:

1. You'd normally keep the large instance (or instances) shut down, so it costs you very little.
2. Do all the time-consuming manual coding, tweaking, and exploration on the small instance, which does not cost much.
3. When you are ready to train an ML model, start the large instance(s); all the latest code and data will already be there in the bucket for you to start the training. Make sure that the training output is stored in the shared bucket.
4. Create a script on the large instances so that they shut down automatically at the end of a long-running training session. This way they won't cost you money while doing nothing for you.
5. Since the results are placed in the bucket, you can inspect them from any instance at your leisure.

This way the large instances are used in an on-demand fashion, which should reduce the cost a lot. If you use the on-demand or spot instances offered by the hosting service, it should reduce your cost even further. I have verified that the above setup works well on GCP. One caveat is the warning mentioned in the "Mounting Cloud Storage as a local file system" section above regarding concurrent write operations into a shared bucket.

###### Creating a server image

On Amazon's EC2 I was able to create an AMI image from one of my server instances, then use it to spawn another server instance. This is very useful, since otherwise I would have to go through a tedious configuration process for each new copy of the server. On GCP this can be achieved from the Compute Engine VM Instances dashboard, by creating a snapshot from an existing instance. The snapshot can then be used to launch a new VM instance, with different capacity (say, with 8 virtual CPUs and more disk) if needed. The tests I conducted worked well on GCP. In particular, I have checked the following with success on the new VM instance:

1. All installed packages are present and function correctly.
2. The shared bucket is mounted as expected, so it can be accessed immediately.
3. The TensorFlow code that I placed in the shared bucket can be executed with no problem.
4. The SSH key pair is installed on the new VM instance, so I could SSH into the new instance without additional work.

###### Submitting a Training Job (unfinished)

Cloud ML is a managed service on GCP that supports scalable training and deployment of large Machine Learning jobs. Following is the official description:

Google Cloud Machine Learning is a managed service that enables you to easily build machine learning models, that work on any type of data, of any size. Create your model with the powerful TensorFlow framework that powers many Google products, from Google Photos to Google Cloud Speech. Build models of any size with our managed scalable infrastructure. Your trained model is immediately available for use with our global prediction platform that can support thousands of users and TBs of data. The service is integrated with Google Cloud Dataflow for pre-processing, allowing you to access data from Google Cloud Storage, Google BigQuery, and others.

Cloud ML supposedly offers many benefits. My limited goals here are: to check out the basic mechanism of submitting a Cloud ML job, to see if I am able to train a TensorFlow model much faster through it (as opposed to training the model on my own VM instance), and to get a sense of the cost involved. Note that the Cloud ML API came with the following warning (as of October 2016):

Beta: This is a Beta release of Google Cloud Machine Learning API. This API might be changed in backward-incompatible ways and is not recommended for production use. It is not subject to any SLA or deprecation policy.

I started by following the instructions on gcloud beta ml jobs submit training, which did not go very far. The instructions are terse and without examples, and many things are unclear. For example, it is entirely unclear what should be included in the required tar.gz files.
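For the record, the general shape of the submission command looked roughly like the following. This is only a sketch reconstructed from the beta-era help text; the job name, package layout, bucket, and region are placeholders, and the exact flags may well have changed since, so treat this as configuration guidance rather than a runnable example.

```shell
# Hypothetical sketch of a Cloud ML (beta) job submission; all names and
# flags here are illustrative and should be checked against current docs
gcloud beta ml jobs submit training my_training_job \
  --package-path=trainer/ \
  --module-name=trainer.task \
  --staging-bucket=gs://my-staging-bucket \
  --region=us-central1
```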
I will update this post with more information later, when better instructions become available.

###### Tensor Processing Unit (TPU)

What about the TPU that Google announced in May 2016, which was used in the AlphaGo system and holds great promise for speeding up deep learning tasks? Unfortunately there is no sign of it on GCP as of November 2016, and it is entirely unclear when we will see it on GCP, and at what cost.

###### Support Forum

While looking for a solution to my Python problems here, I found a support link that leads to a forum which seems to have fairly low traffic. I did get prompt and informative responses which eventually helped me resolve my issues.

###### Annoyances

There were some annoyances during my tests:

1. The console terminal used for accessing an instance, either through the browser-based terminal or through a native SSH client, is frustratingly slow. Even just pressing the ENTER key can take a couple of seconds to respond, and this on an instance with nothing else running on it at the time. This is quite unacceptable.
2. Cloud storage performance: copying files between an instance's disk and a bucket is excruciatingly slow. Just copying a directory of a Ghost blog system (~186MB) took hours. In a separate test I tried to un-tar a large ML dataset directly in a bucket, without copying between the local disk and the bucket. The dataset in question was the CSTR VCTK Corpus, which contains many tiny text files (aside from audio samples). Visual inspection showed that such tiny text files were extracted into the bucket at a rate of roughly 40 files per minute. One training dataset that I needed for a Machine Learning job contains 200000 files, which means that just unzipping it into a bucket would take 5000 minutes, or more than three days! In comparison, unzipping the same dataset on an instance's local disk took only 25 seconds.
3. Auto-mounting cloud storage: I had difficulty getting a mounted bucket to survive a reboot. I needed this so that I do not have to remount it manually for every session. Without it, it is impossible to set up my target environment of multiple preemptible servers with buckets shared among them, which would let me take advantage of the much lower cost of preemptible servers. It took me some effort to get this to work.
4. Recovery: I somehow bricked several VM instances, likely while trying to add an entry to /etc/fstab in order to get a GCP bucket remounted automatically upon reboot. The VM Instances dashboard showed the instance as up, with a green check mark and no error messages, but it was not possible to connect to it by any normal means (Winscp, SSH, Cloud Shell, browser-based SSH windows, etc.). I couldn't find any way to roll back the change, so eventually I had to resort to deleting the instance.
5. I wanted to assign a previously reserved static external IP address to a new instance, so that I wouldn't have to update the many Winscp and SSH scripts I had set up. This was pretty easy to do with Elastic IPs in AWS, but in GCP it turns out to be more confusing. The UI for dealing with static IPs kept telling me "Quota 'STATIC_ADDRESSES' exceeded. Limit: 1.0" while I already had two previously allocated static IP addresses, and it was entirely unclear from the dashboard how to get past this. In the end I found through experimentation that I can reserve only one static IP address per region/zone, and the only way to reuse one is as follows:
   1. Create the new instance in the same region/zone as the old instance where the static IP was assigned. The following won't work if the two instances are in different regions/zones.
   2. Detach the static IP address from the old instance from the VM Instances dashboard, by changing its External IP to Ephemeral.
   3. Assign the static IP address to the new instance. This can be done only from the Networks dashboard, not from the VM Instances dashboard.
6. Problems with cloud storage: it took a little while to become obvious to me that files in GCP storage are not normal Linux files, and in many cases they require special handling. While AWS S3 files are also not normal files, S3 is certainly not as finicky or slow as GCP's storage. For example:
   1. I was unable to change the access permissions of a file in a bucket, such as when trying to give a shell script the 'execute' permission. Doing chmod on the file simply has no effect, with no warning message about why it failed. Changing bucket permissions from the Storage dashboard has no effect either.
   2. You cannot move or rename files and directories like normal Linux files and directories. You need special commands for this, as per the instructions here. Such commands can be quite tedious to use if you happen to have very long bucket names, as in my case where I took the default assigned by GCP (e.g., console-xxxxxx.appspot.com). I found that in many cases (if the directory/file is not too big) it is actually much easier to just use Winscp to do the copying.

###### Conclusions

GCP is mostly very well done, and the prospect of easily achieving large-scale ML computing through TensorFlow on the Google Cloud Platform is quite appealing. From the perspective of simple hosting and cloud storage, price-wise GCP is roughly competitive with Amazon's AWS. For the sake of running Machine Learning jobs I really wish that GCP had TPU or GPU support. Even running with 8 vCPUs (the largest configuration available under the free trial) is still not adequate for Machine Learning, and it often takes days for me to complete even modest training tasks.

On the downside, sorting out the setup problems mentioned above took me practically a whole day, which was much more than I had expected. There are also some nagging issues, such as auto-mounting and access permissions for GCP storage, that took quite some back and forth with the GCP support people (who were very helpful) to figure out. Overall, setting up a VM instance with the ML package was unnecessarily complicated, forcing me to go through dozens of steps with many pitfalls. GCP should just create a number of pre-configured ML server images for users to choose from, so that setup becomes a matter of making a few simple configuration choices and is done in minutes instead of hours. Perhaps this is what you get with the ML package being in beta, and I trust that it will get better over time.

Microsoft Azure also offers many Machine Learning packages, and AWS has good support for various GPU instances, so what advantages does GCP have over Amazon AWS or Microsoft Azure for Machine Learning? GCP is a natural choice for my ML needs for the following reasons:

1. My immediate goal is mainly to conduct Machine Learning research, not to use existing Machine Learning packages intended mostly for the business community.
2. Lately many of the leading-edge Machine Learning papers have come from Google's DeepMind group, which also kindly releases much of its source code, often implemented in TensorFlow, and GCP has better support for TensorFlow.
3. TensorFlow appears to be in the leading position when it comes to supporting large-scale deployment of Machine Learning applications.
4. Google does a very good job of making TensorFlow available to the research community.

If you need to do large-scale Machine Learning, either for research or business, GCP+TensorFlow holds the promise of being one of the best choices. While it is possible to install TensorFlow on AWS with multiple GPU instances to meet such needs, at this time that is likely to require more work in scaling and configuring the setup.
The caveat here is that the high-end scalability of GCP+TensorFlow is not yet readily apparent in my limited tests.

1. Machine Learning on Google Cloud and AWS/EC2, Hands-on: practical issues about running computing-intensive jobs on GCP or AWS/EC2. This uses DCGAN (Deep Convolutional Generative Adversarial Networks) as a test case.
2. Image interpolation, extrapolation, and generation: looking into the possibility of using the DCGAN for the purpose of generating images (and eventually 3D models) from textual commands. This is part of the How to Build a Holodeck series.
3. The terraAI manifesto

Word2vec and IPA (http://www.k4ai.com/word2vec/, posted Thu, 18 Aug 2016)

Word2vec is an invaluable tool for finding hidden structure in a text corpus. It is essential for TAI's IPA project, but we will also need to add some refinements over the standard Word2vec in order to meet our needs. This post is part of the TAI thread, which explores how to design and implement the terraAI (a.k.a. TAI) platform. This post is also part of the IPA sub-thread, which is focused on issues related to applying the TAI platform to create an Intelligent Personal Assistant. We explore in this post the additional refinements needed on top of the standard Word2vec to make it usable in the TAI project, which will eventually lead to detailed implementation specifications. This post is also a call to the research community for contributions of insightful comments and development effort. Please read here for the benefits of participating in this project.
###### A quick recap about TAI

TAI, abbreviated from the name terraAI, is a knowledge-based, crowd-driven platform for acquiring knowledge about our world, as well as for serving certain practical purposes, through interaction with its users. This blog is a working document for sorting out the design and implementation issues of the TAI platform. More information about TAI can be found in the TAI Manifesto. TAI's target operating environment is as follows:

1. Online. TAI operates over the Internet, supports the socialization and collaboration of its participants, and also acquires much of its learning material over the Internet.
2. Highly-distributed machine learning. We want machine learning to occur in a highly distributed fashion for several reasons:
   1. It alleviates the bottleneck on a central server and makes the overall system more scalable.
   2. It realizes TAI's design goal of supporting the crowd-driven knowledge acquisition model.
   3. We want to allow TAI to be highly customizable and trainable towards each user's particular needs and habits.
3. Internet-based user interfaces, mainly in the form of web browsers or mobile devices.
4. It is assumed that a default Word2vec skip-gram model is provided by the system. Additional incremental training might be required to satisfy an end-user's needs, which typically will occur on the client side.

###### About the IPA

As mentioned earlier, the IPA (i.e., the Intelligent Personal Assistant) project is an application of the TAI platform, where we seek to build a personal assistant that is capable of adapting to a user's requirements and idiosyncrasies through supervised and unsupervised learning, and of satisfying the user's needs for information processing and various tasks over the Internet. The ultimate goal of the IPA project is to use the power of the crowd to create a long-lasting knowledge base about our world, while eliminating concerns about privacy-related issues. Another TAI application under consideration can be seen in the Holodeck sub-thread.

###### About Word2vec

MORE

###### Using Word2vec in IPA

Why do we need Word2vec in the IPA project? For IPA, the target domain of discourse is the Internet, meaning that IPA will need to learn and perform tasks based on material available over the Internet, such as unstructured webpages, documents, the user's behavior in a web browser, etc. Word2vec (or its derivatives) is very good at capturing the syntactic and semantic relationships hidden in unstructured text through unsupervised learning; as such it is invaluable as a tool for converting unstructured documents into a more meaningful representation for further processing.

###### Overall requirements as per IPA

Following are some refinements needed over the standard Word2vec for the IPA project:

1. Support for client-side training. This is per TAI's distributed design approach. It means that we need it to run in a typical web browser (i.e., written in JavaScript) or on mobile devices (i.e., Java for Android or Objective-C for iOS). For simplicity's sake, let's aim for JavaScript as a start.
2. Automatic tokenization. With the standard Word2vec it is assumed that some tokenization occurs prior to the actual model training. For example, during this process the text string 'New York' gets converted to one token and is treated as an atomic element henceforth. In other words, even though the string New York is technically two words in English, we must treat it as one word in the context of Word2vec. This requirement for tokenization prior to model training is in fact a severe impediment when attempting to use Word2vec in certain real-world applications.
For example, if the input comes from news feeds, then we are going to find new words all the time, such as the name of someone who just became famous (say, Jeremy Lin), and the system wouldn't know how to deal with them properly until such names are tokenized and the model is retrained, which is not a quick process. For TAI it actually gets worse, since it also needs to deal with HTML code. As such, we need tokenization to occur automatically and efficiently, which the standard Word2vec does not support.
3. Fragmental model: we need the word vector model to be stored in a form that allows a client to quickly download only what it needs. This matters because a model trained on GoogleNews is 1.5 GB compressed, which makes it entirely unusable on an Internet-based client device. If this seems an unusual requirement, there are precedents in the video space. In the early days a video was encoded as a single large file using a certain video codec, but this does not work well for live video streaming, especially if we want to let a user skip around in the video, replay part of it, or play in fast forward or slow motion. As a result, Dynamic Adaptive Streaming over HTTP was invented, which essentially breaks a large video into small HTTP-based file fragments, so that a video client (that is, a video player) gets only what it needs at the moment. Conceptually, the requirements for the fragmental Word2vec model are entirely similar, driven by the same desire to reach more lightweight clients and to become more responsive to client requests.
4. Support for token layers. The tokenization of multi-word phrases sometimes also results in the loss of information that is important for performing semantic analysis (which we need in the TAI project). For example, if the name Jeremy Lin is tokenized then we lose the fact that this person's last name is Lin, which might be important in a certain context. As such, it is desirable to have a phrase tokenized in multiple ways.
5. Incremental model training. The standard Word2vec more or less assumes a batch mode of operation: the system performs tokenization, then performs learning over the input text corpus to produce a word vector model, then uses the model for a certain task. If there is a new batch of text material, the process is essentially repeated, which takes quite some time. As such it is not suitable for the more dynamic type of environment we need to deal with in the TAI system. What we need here is a way to allow Word2vec to accept new training material and learn from it dynamically and efficiently.

###### How to meet the requirements

So how do we propose to meet the requirements listed above? Following are some ideas. If you think you have a better idea, by all means please voice your input in the comments section below.

1. MORE

###### Going beyond Word2vec

Word2vec has demonstrated that the word embedding approach is able to capture multiple different degrees of similarity between words. Mikolov et al have also found that semantic and syntactic patterns can be reproduced using vector arithmetic. So how can we build on top of it to achieve what we are aiming for in the TAI/IPA project? As a reference, Word2vec's skip-gram model aims to maximize the corpus probability as follows:

$\arg\max \limits_\theta \prod_{w\in Text}{\left[ \prod_{c\in C(w)} p(c|w;\theta) \right]}$

given a text corpus $Text$, words $w$ and their contexts $c$, with $C(w)$ the set of contexts of word $w$, and $\theta$ the system's parameters. For TAI/IPA we assume that there is a knowledge base $K$ which is built up by various means, such as unsupervised learning, supervised learning, or manual entry.
This knowledge base $K$ is to be used for assisting in semantic analysis and for carrying out the requested tasks, while being accumulated and updated through machine learning methods. Mathematically this can be described as follows:

$\arg\max \limits_\theta \prod_{w\in Text}{ \left[ \prod_{c\in C(w)} \left[ \prod_{k\in K(c,w)} p(k|c,w;\theta) \right] \right] }$

where, given $c$ and $w$, we wish to find the $\theta$ that derives the optimal explanation (i.e., the $k$) of the given words. Further investigations on this topic are discussed in a separate post (upcoming).

###### Looking ahead

It would be fascinating to extend this approach into the multi-modal space, so that it is not just about text, but also brings images, videos, goals, and agent intentions into the picture. This aspect is of particular interest to another sub-thread of this blog, How to Build a Holodeck, which we will explore separately.

###### References

1. The TAI discussion thread in this blog.
2. The TAI Manifesto in this blog.
3. Wikipedia: Word2vec
4. Efficient Estimation of Word Representations in Vector Space, the original 2013 Word2vec paper by Mikolov et al.
5. fastText, an open-source library for efficient learning of word representations and sentence classification, created by Mikolov et al at Facebook.
6. Bag of Tricks for Efficient Text Classification. This paper proposes a simple and efficient approach for text classification and representation learning.
7. Online Word2vec playground: Word2vec Word Vectors in JavaScript
8. MORE

Ghost blog enhancements
(http://www.k4ai.com/ghost2/, posted Tue, 09 Aug 2016)

This post is about the various enhancements that I have added to the Ghost system for this blog. It is a follow-up to the other post regarding how to convert a Ghost blog into a static website. This blog itself mainly serves a certain AI project that I am working on. All of the enhancements discussed here are visible live in this very blog.

###### Customizing the tag header image

By default Ghost picks up the header image defined for the home page and uses the same image for the header of all tag pages. Here a tag page is the page you land on after clicking on a tag in the blog's tag cloud, which shows the list of all posts that carry the tag. My problem with this is that visually a tag page now looks almost identical to the home page, so it takes some effort to tell them apart. Fortunately the tag header image can easily be restyled from CSS. Following is an example:

.main-header.tag-head { background-image: url("newtagimg.jpg") !important; }

###### Display Post Images on Home Page

The Ghost home page displays a list of recent posts in all-text format. It is much preferable to display an image thumbnail for each post, since that is visually much more pleasing. This article shows a way to do this, using the header image defined for each post as the thumbnail. However, if you do not have a header image defined for every post, you will see a broken-image icon for those posts. To handle this, add an 'onerror' attribute to the inserted image element, as follows:

<img src="{{image}}" onerror="this.style.display='none'" />

###### Adding a universal translator

This is fairly straightforward: just go to Google website translator and get the embed code for inserting into your blog.
Visitors to your blog will then be able to view your posts in any of more than 100 languages. For my posts, the quality of translation by this Google service is actually pretty poor in many cases. I kept this feature only because the TAI open-source project described in this blog is meant to serve all of humanity, and as such I wish to broaden its reach as much as possible, even if the translation is quirky.

###### Adding a tag cloud

A tag cloud is very useful for quick viewing of, and navigation to, the different types of posts in a blog. For some reason this feature is not built into Ghost's template engine (as of version 0.8.0), so I had to find outside hacks to make this happen. I found an article that talks about how to add such a tag cloud. Unfortunately it is in Chinese, but it does work. The only problems I found are:

1. The post count is not showing, and as such some tags appear to be clipped on the right side. This is remedied by using CSS to hide the elements that contain a missing post count. I traced the problem on the server side to the file tag_cloud.js:

   tagCloudOptions = { limit: 'all', include: ['post_count'].join(','), context: 'internal' };
   return api.tags.browse(tagCloudOptions).then(function(tags){ ....

   where the result of the call to api.tags.browse does not contain post_count. It turns out that the post count is not in tags.tags.post_count as the original code expected, but rather in tags.tags.count.posts. The query also must be changed from:

   tagCloudOptions = { limit: 'all', include: ['post_count'].join(','), context: 'internal' };

   to:

   tagCloudOptions = { limit: 'all', include: 'count.posts', context: 'internal' };

   Once adjusted for these differences, it works fine.

###### Adding a QR code

A QR code is very handy for opening a webpage on a mobile device quickly and easily through the mobile's camera. Here I have added a QR code on every page, so anyone can open the page simply by scanning it with a smartphone's camera.
My solution is to add the QR code from my own JavaScript code, as follows:

    var url = location.href;
    // The URL must be encoded before being embedded in the chart query string.
    var qrcode = '<img src="https://chart.googleapis.com/chart?chs=120x120&cht=qr&choe=UTF-8&chld=L|0&chl=' +
                 encodeURIComponent(url) + '" title="Show QR Code">';
    var sidebar = $('.sidebar');
    sidebar.append(qrcode);

###### Adding a navigation bar

When you are in the middle of reading a long post, Ghost does not make it easy to navigate around. It takes some effort to scroll back to the top of the page, to find the place for making a comment, or to go to the blog home page, etc. I solved this problem by adding a navigation bar at the top of all pages in the blog.

The behavior of this bar is as follows:

1. The bar offers buttons for navigating to the home page, the top of the page, the comment section, and the help page.
2. When the page is scrolled down, the bar will disappear.
3. When the page is scrolled up, the bar will reappear, which makes it easily available with just a simple scroll or swipe from anywhere.
4. The bar can be collapsed and reopened as needed.
5. The state of the bar (i.e., whether collapsed or not) is remembered using a site-wide non-expiring cookie, so it will keep its state when the blog is revisited later.
6. If the user is a Ghost admin (detected by using JavaScript code to inspect the browser's localStorage as follows):

    var session = JSON.parse(localStorage.getItem('ghost:session'));
    isAdmin = ! $.isEmptyObject(session && session['authenticated']);

then another button is added for quick navigation to the Ghost admin for editing the page. You can test out these features (except for the admin one) right on this page.

###### Customizing the error page

The default 404 error page for Ghost is rather simplistic: it is not vertically centered, and it looks a little off to the side when viewed on small-screen mobile devices. As such I have customized it further by modifying the file core/server/views/user-error.hbs.

Unfortunately I also found out that once my Ghost blog is converted to a static website using Buster, it somehow auto-navigates to the home page on a 404 error. Perhaps I still need to do something on the AWS S3 site that I use to host the blog, but I will leave it at that since this feature is not that important to me.

###### Display math formulas

A key piece of my TAI project is its machine learning capability, and as such there are many discussions around machine learning algorithms in this blog, which involve a lot of math. In order to display math formulas in a nice way I chose to use MathJax for the purpose. More about this can be found in this post. As an example, here is a random math formula rendered using MathJax:

$f(a) = \frac{1}{2\pi i} \oint_\gamma \frac{f(z)}{z-a}\, dz$

I have tried the following approaches:

1. Insert the following into Ghost's code injection footer area. This renders math formulas on the run-time pages.

        <script type="text/javascript" async src="//cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>

    Downside: the math formulas on the admin pages are still rendered in their raw format, which makes composing a post more tedious.

2. Also insert the same script above into the file core/server/views/default.hbs. Result: the math formulas now get rendered in the Ghost admin area, which is nice. However, newly inserted MathJax statements do not get rendered dynamically, so I still have to refresh the entire page in order to see them rendered, which is not good.

3. Next step: find a way to have MathJax statements rendered dynamically in the admin. There is a jsfiddle example which might be helpful. More on this later.

###### Custom search

I tried to use Google Custom Search for supporting searching within this blog (see the How to add Google Custom Search to your Ghost blog link below), but had to give up on it. Problems found include double search fields and malformed search results.

###### History

I added a feature that allows a user to see the list of posts visited, displayed in latest-first order. This is so that a user is able to find a recently visited post, resume reading a long post, etc. This was done entirely using client-side JavaScript code, including storing and retrieving information in a cookie, rendering the list of posts, rendering time stamps in the timeago format, etc. You can find how it works here. It is also accessible from the nav bar at the top of every post.

Update: the above solution is only sufficient for storing the history of a handful of posts, due to the limited capacity of a cookie (~4KB). As such I have updated my code to use local storage instead.

###### Display wrap-around images

A long post with nothing but text is hard to read, and having some images sprinkled through the post helps to improve its readability. However, an in-text image in Ghost may span almost the entire width of the browser and occupy too much screen real estate. A better way is to display smaller images with the text wrapped around them. I settled on the following design (mostly achieved through CSS):

1. Add a CSS style definition such as the following:

        .post-content img[src$='#postHeaderImg'] {
            float: left;
            left: 0;
            transform: initial;
            -webkit-transform: initial;
            margin-right: 30px;
            margin-bottom: 0px;
            max-width: 50%;
        }

2. Wherever an image is needed in the post, enter it using the following Markdown syntax:

![](/content/images/2016/07/ghost_logo_big-100529614-primary-idge.png#postHeaderImg)


where in the above you are supposed to replace the image path with your own. This will cause the image to stick to the left side, with the text flowing around it.

You should make it responsive by refining the CSS definition so that it behaves well on browsers of all sizes. To test it out, just resize the width of your browser now to see the effect.
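Note that the `#postHeaderImg` fragment is only a marker: the browser ignores the fragment when fetching the image, while the CSS attribute selector `img[src$='#postHeaderImg']` matches on it. A tiny helper (hypothetical, not part of the blog's actual code) for tagging image paths consistently:

```javascript
// Hypothetical helper: append the #postHeaderImg marker to an image path
// so that the CSS rule img[src$='#postHeaderImg'] picks it up.
function tagAsWrapAround(src) {
  // Avoid double-tagging if the marker is already present.
  if (/#postHeaderImg$/.test(src)) return src;
  return src + '#postHeaderImg';
}
```

For example, tagAsWrapAround('/content/images/2016/07/logo.png') produces '/content/images/2016/07/logo.png#postHeaderImg'.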

###### Image caption

Having a caption under an image is handy sometimes, especially when the image is not just decorative but also requires some additional explanation.

This article Adding image captions to Ghost shows a way to do it. You can use CSS to style it further to get what you want.

Note that the method mentioned above uses the alt attribute of the image element as the source of the caption, and additional DOM elements are created using JavaScript code to achieve this effect. You can also add HTML syntax inside the caption (see the heavier font weight used to display the names of the artwork and the artist) and it will get rendered properly.

###### Displaying comment counts

When viewing a list of posts, such as when viewing from the home page or a tag page, it is useful to be able to see the number of comments on each of those posts. This way a user is able to follow the discussion by checking whether the count has changed since his/her last visit.

Since this blog uses the Disqus comment service, here are some simple instructions on how to display comment counts on the home page.
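As a sketch (with a hypothetical Disqus shortname), the general pattern is: load the per-site count.js script, and give each post link an href ending in #disqus_thread so Disqus can fill in the count:

```javascript
// Sketch, assuming a hypothetical Disqus shortname "myblog".
// Disqus replaces the text of links whose href ends with #disqus_thread
// with the comment count, once the per-site count.js script is loaded.
function disqusCountScriptSrc(shortname) {
  return 'https://' + shortname + '.disqus.com/count.js';
}

function disqusCountLink(postUrl) {
  return postUrl + '#disqus_thread';
}

// In the browser you would append a <script> element whose src is
// disqusCountScriptSrc('myblog'), and render each post link with
// href = disqusCountLink(postUrl).
```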

###### Viewing material in tooltips

One way to view additional material (say, a video) is to insert it inline. But sometimes this is too disruptive to the flow of the text, so a typical solution is to navigate to another page where the material is, when the user asks for it. This is also somewhat disruptive. Yet another solution is to use tooltips, so the target material can be displayed in a bubble when needed, and dismissed when done, without losing the current position on the page.

Here are two live examples showing YouTube#1, and another video YouTube#2. Just click the underlined text with a mouse (or touch it on a touch device), and the video will appear floating over the text.

I implemented this on top of the Ghost system using the wonderful jQuery library qTip2 from Craig Thompson. It is not hard if you know how to program in JavaScript, but it is a little too involved to be described right here. If you are interested I'd recommend that you just go look at the demo examples offered on the qTip2 website.
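For the curious, here is a rough sketch of the idea using qTip2. The iframe size and the video id passed in are assumptions for illustration, not the exact code used on this blog:

```javascript
// Build a qTip2 options object that shows a YouTube video in a tooltip
// on click, and hides it when the user clicks or taps elsewhere.
function videoTooltipOptions(youtubeId) {
  return {
    content: {
      text: '<iframe width="320" height="180" frameborder="0" allowfullscreen ' +
            'src="https://www.youtube.com/embed/' + youtubeId + '"></iframe>'
    },
    show: { event: 'click' },     // open the bubble on click/touch
    hide: { event: 'unfocus' }    // dismiss when clicking elsewhere
  };
}

// In the browser, with jQuery and qTip2 loaded:
//   $('.video-link').qtip(videoTooltipOptions('VIDEO_ID'));
```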

###### To-do list

Following are some features that I'd like to add, but have not gotten around to doing just yet. By all means please let me know if you have answers for the issues below.

1. Adding the post counts in the tag cloud.
2. Sort the items in the tag cloud by post count, or perhaps even style them based on counts.
3. Add Google Custom Search, so that it is possible to find posts based on text search. This was tried earlier, but somehow the resulting layout (as done by Google's code) was quirky and unusable.
4. Need a more efficient method for deploying files to my static website hosted on Amazon AWS S3. As I write new posts and add code libraries to support new features, it takes longer and longer to crawl the site using Buster and then upload the result to S3. Buster currently creates files with fresh timestamps, so there is no easy way to locate newly changed files, and I end up having to upload everything to S3 every time, even if only one file has changed. Need a way to upload only recently updated files.
Update: the combination of HTTrack and Winscp worked out reasonably well.
1. The Ghost publishing platform.
2. Buster, brute force static site generator for Ghost.
3. MathJax: a library for displaying math formulae, useful when discussing machine learning algorithms.
]]>
<![CDATA[Holodeck - Knowledge Representation - part 2]]> This is part 2 of the Holodeck series, focusing on issues related to knowledge representation (KR). This is a followup on the first post Crowd-driven Holodeck, where a skeletal design was presented for a HAI Holodeck.

###### A quick recap

To put this post in context (more can be found in

]]>
http://www.k4ai.com/holodeck-kr/466c7bf9-7724-435b-821c-772692d4247dSun, 07 Aug 2016 13:58:51 GMT

This is part 2 of the Holodeck series, focusing on issues related to knowledge representation (KR). This is a followup on the first post Crowd-driven Holodeck, where a skeletal design was presented for a HAI Holodeck.

###### A quick recap

To put this post in context (more can be found in the first post):

This series discusses a HAI Holodeck, which is a highly watered-down version of the Holodeck found in the TV series Star Trek - The Next Generation. Here we ignore issues related to realizing goggle-less VR (virtual reality) and any issues on the hardware side.

We settled on a knowledge-driven approach, using machine learning (ML) for knowledge acquisition, and a user interface suitable for drawing assistance from the crowd.

And just in case you haven't figured it out yet, this blog is not a rigorous research paper. It is more like my personal musing on a very serious and difficult topic, with the intention of coming up with some workable directions in the future.

And yes, you do need a solid background in Artificial Intelligence in order to understand what's written in this post.

###### KR issues

Here we will tackle the following questions:

1. The kind of knowledge needed for supporting the HAI Holodeck
2. The mechanism with which such knowledge is represented
3. The mechanism for populating such a knowledge base, either unsupervised or supervised.
###### The knowledge needed

First we look at the various types of knowledge that are needed in order to support the HAI Holodeck.

1. Structural and attribute information about a given 3D object, including its sub-components. For example, we would want to know that apples have skin (exocarp), flesh (receptacle tissue), a core (endocarp), and seeds, arranged in an onion-like structure.
2. Probability distributions for object attributes, structural relationships, and label-object relationships (i.e., categorizations). For example, we would want to know that most apples have red skin and white flesh.
3. User intention. For example, we need to know that if a user requests a blue apple, then the blue attribute is most likely referring to the skin of the apple.
4. Spatial reasoning. HAI should be aware that normally only the skin of an apple is visible, be capable of computing the dimensions and spacing needed for building a staircase, etc. MORE
###### Knowledge Representation

So how do we represent an apple? What does it take so that when a user requests a blue apple the size of a watermelon, HAI knows what to do?

This is an often overlooked issue, but it should be made abundantly clear here:
KR and ML should be tightly coupled. That is, whatever representation scheme (KR) we select here must be conducive to conducting machine learning (ML).

In recent years ANNs (artificial neural networks) have made great advances, and they are currently the indisputable leader among ML technologies.
As of this writing, ANNs are clearly our best choice for knowledge acquisition.

The choice of ANN immediately tilts the balance against many other KR schemes 1, such as First Order Logic (FOL) or Semantic Networks (SN), which cannot be easily coupled with ANN. My personal belief is that we need a KR scheme that is based on ANN, and hopefully this new ANN-KR scheme can be validated to have equal or higher representation power than FOL or SN.

And furthermore, since we are dealing with 3D objects here, this ANN-KR scheme must also account for the representation and recognition of 3D objects, and not just abstract knowledge. It is my belief that with this joining of abstract and physical knowledge, plus the integration of ML capability, we will have something very powerful.

The question of course is how to achieve this grand unification of the following goals:

• Goal#1: Representation of physical knowledge, i.e., 3D objects
• Goal#2: Representation of abstract knowledge
• Goal#3: Machine learning for the acquisition of the above knowledge
• Goal#4: (Later) Inference
• Goal#5: Achieve all of the above using ANN

And on a side note, I also want this scheme to work fully well for the TAI project, but that's a separate topic.

TO BE FILLED

Since we are dealing with visible objects in this HAI Holodeck project, we will start with the standard convolutional neural networks (CNN). The diagram shown above is such an example.

Let's assume that we have done our share of pre-training, so that the lower levels of the CNN contain features that are useful for describing the objects in our target domain (say, fruits, chairs, etc.).

The paper by Dosovitskiy et al, Learning to Generate Chairs with Convolutional Neural Networks2, gives us an excellent starting point: a generative CNN can learn from examples, find a meaningful representation of a 3D chair model, and then generate chairs in new styles given type, viewpoint, and color.

So let's look at it further from the perspective of KR.

Quoted from the paper:

The last two rows show results of activating neurons of FC-3 and FC-4 feature maps. These feature maps contain joint class-viewpoint-transformation representations, hence the viewpoint is not fixed anymore.


Here the interesting part is that the FC-3 and FC-4 feature maps contain joint class-viewpoint-transformation representations, so this is a good hint that if we train a CNN on joint information from multiple sources, then we might be able to treat some layers of the CNN as a meaningful representation.

It is worth noting that we listed five separate goals earlier, but the Dosovitskiy paper points to a situation where views and class labels have become co-mingled.

###### Representing object composition

Given the ANN approach that we are pursuing, how should we represent the composition of an object, such as the fact that chairs typically (but not always) have four legs, a back, a seat, etc.?

Imagine that a user requests that a chair $C_1$ as proposed by HAI be modified by referring to its parts, for example:

User: I'd like the chair legs to be in the French Baroque style, but keep the current color and texture.


In this case HAI must perform the following:

1. Understand which parts of the chair $C_1$ are legs
2. Understand what a French Baroque style chair $C_f$ typically looks like.
3. Extract the legs of the chair $C_f$, and transfer their shapes to the legs of $C_1$
4. Ensure that the change is merged with the body of $C_1$ correctly.
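The part-level bookkeeping implied by these steps can be sketched as a toy data structure. All part names, styles, and attributes below are hypothetical, and this is deliberately not the ANN-based representation the post is seeking, just an illustration of the desired behavior:

```javascript
// Toy sketch: a chair as a composition of named parts, each with shape and
// surface attributes. Transfer only the leg shapes from a donor chair,
// keeping the target chair's color and texture, as in the request above.
function transferLegStyle(target, donor) {
  var result = JSON.parse(JSON.stringify(target));   // deep copy of the target
  Object.keys(result.parts).forEach(function (name) {
    if (result.parts[name].role === 'leg') {
      // Take the donor's leg shape, but keep the target's surface attributes.
      result.parts[name].shape = donor.parts.leg.shape;
    }
  });
  return result;
}

var c1 = {
  parts: {
    frontLeftLeg: { role: 'leg',  shape: 'straight', color: 'brown', texture: 'oak' },
    seat:         { role: 'seat', shape: 'flat',     color: 'brown', texture: 'oak' }
  }
};
var frenchBaroque = { parts: { leg: { role: 'leg', shape: 'cabriole' } } };

var modified = transferLegStyle(c1, frenchBaroque);
// modified.parts.frontLeftLeg.shape is now 'cabriole'; color and texture unchanged.
```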

So how is such knowledge represented?
TO BE FILLED

###### Spatial reasoning

How do we represent spatial knowledge, for example:

1. A chair will not fit inside an egg.
2. A wooden cubical box where each side is one meter long: if the plank the box is made from is 0.2 meter thick, then the inside of the box is a cube with each side 0.6 meter long (the thickness is subtracted on both sides of each dimension).
3. ...
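The box example is exactly the kind of simple spatial computation HAI should be able to perform. As plain arithmetic (assuming the plank thickness is subtracted on both sides of each dimension):

```javascript
// Inner side length of a cubical box, given the outer side length and the
// plank thickness (both in meters). The thickness counts twice per dimension.
function innerSide(outer, thickness) {
  return outer - 2 * thickness;
}
```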

Having the capability to reason about spatial constraints is of utmost importance to HAI, since it is only with this that it is able to understand what's likely, what's difficult, what's impossible, and to make reasonable estimates.

It would appear that so far there is little research on using ANNs for spatial reasoning, and as such we know little about how to represent spatial knowledge in an ANN.

TO BE FILLED

###### Abstract Knowledge

So what about the representation of abstract knowledge, such as the facts that chairs are furniture, or that furniture is usually found indoors, etc.?

We take the position that such abstract knowledge must be grounded on top of physical knowledge. Here we are not trying to start a philosophical debate, but rather this is more of a simplifying engineering decision, just so that this project becomes more feasible. Another way to look at it is that we are limiting the type of abstract knowledge that we deal with to only those that are directly or indirectly related to physical knowledge.

More specifically, we deal with the following types of abstract knowledge (for now):

1. Labels attached by a trainer to an object, or part of an object. These could be categorization labels, or ....
2. Anonymous labels attached by HAI to an object, or part of an object, during the process of unsupervised learning.
3. Bayesian rules acquired in the form of posterior probability between labels and observed physical attributes.
4. MORE?

TO BE FILLED

###### Representing user intention

How do we represent a user's intention (while ignoring the NLP aspect for now)? It is helpful to see the problem of fulfilling a user's request as a goal-oriented task, where HAI must find ways to achieve the goal, possibly including breaking down a goal into multiple sub-goals. In light of this, a user's intention is then a goal state to be fulfilled by HAI.

Solving a goal-oriented task requires the presence of a set of rules, as well as a mechanism to back-chain over those rules. Here we do not mean that we want to bring in a goal-oriented system such as Prolog, but rather that we want to find a way to achieve similar goal-oriented behavior using ANN.

Following are some thoughts regarding how to achieve goal-oriented behavior in the context of ANN:

1. Use some form of ANN to acquire Bayesian distributions relating abstract labels and observed visual facts. These are essentially our rules. For example, we may have a learned rule indicating that Quaker-style chairs have a 90% chance of being brown, even if brownness is not useful for any categorization task.
2. We define a new type of top-down mechanism in a trained ANN. Note that this is unrelated to the back-propagation mechanism used during training. MORE COMING
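The first point can be illustrated with a toy estimate of such a rule from labeled examples. The labels and counts below are entirely made up:

```javascript
// Estimate P(attribute | label) from a list of labeled observations,
// e.g. the probability that a chair is brown given that it is Quaker-style.
function conditionalProb(samples, label, attribute) {
  var withLabel = samples.filter(function (s) { return s.label === label; });
  if (withLabel.length === 0) return 0;
  var withBoth = withLabel.filter(function (s) { return s[attribute]; });
  return withBoth.length / withLabel.length;
}

var chairs = [
  { label: 'quaker',  brown: true  },
  { label: 'quaker',  brown: true  },
  { label: 'quaker',  brown: true  },
  { label: 'quaker',  brown: false },
  { label: 'baroque', brown: false }
];
// conditionalProb(chairs, 'quaker', 'brown') gives 0.75
```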

How do we achieve top-down behavior in ANN, and what does it mean? Cao et al use a top-down mechanism3 to infer the status of hidden neuron activations as a way to control attention. This is in effect a kind of goal-oriented behavior.

TO BE FILLED

###### Representing 3D objects

What kind of "3D model" are we talking about here? Here it is helpful to distinguish two different ways to represent a 3D object.

1. Working 3D model: this is what's being used while the system is still interacting with the user and trying to get something built. Here we need a more abstract 3D object model that is suitable for learning, composition, decomposition, piecemeal transformation, showing relationships, etc. I'd argue that for this purpose it is advantageous to emulate the human brain to some extent, and represent a 3D object as a series of salient images in some way.
Reference: How objects are represented in human brain? Structural description models versus Image-based models
What are the benefits of the image-based models? Why not just deal with traditional 3D models throughout? I'd argue that:

1. The repertoire of useful 3D models is poor, not well indexed, lacks visual details, and lacks contextual information.
2. Today's image search engines are getting smarter every day. By relying more on images we get to piggyback on top of such improvements.
3. Image search engines give us important clues about the relationship between query text and the resulting images.
4. We might gain some advantage by emulating how the human brain remembers 3D objects (see below).
NEED MORE DETAIL HERE
2. Run-time 3D model: this is the 3D object representation used when we are trying to render something that a user can see. This could be in the form of one of the popular 3D formats, such as
3ds, U3D, etc.

Question: which run-time 3D model best suits our purpose? Question: any good argument FOR using the Run-time Format throughout, without using a separate Working Format?

Unless mentioned otherwise, it is assumed that we are always referring to the Working 3D model in this discussion.

###### Unsupervised learning of 3D models

How do we achieve unsupervised learning of 3D models? This is not so much for learning categorization (which is likely to be supervised), but for the capability of representing 3D objects in HAI's memory, and being able to track and recognize objects even without categorization. Think of it as a kind of pre-training for object recognition and memorization.

Given that we want to go with the image-based approach for 3D objects, I would argue that during the initial training phase for acquiring background knowledge, it is beneficial to use videos, and not a set of discrete images, as the training samples. This is because:

1. The time indices in a video training sample contain explicit information about object persistence. For example, if a group of visual features $\{F_i\}$ observed at time $t_1$ is sufficiently similar to the features $\{F'_i\}$ observed in the next video frame, then HAI can safely assume that such features represent different views of the same object.
2. The object persistence assumption above also gives us a way to correlate different views of the same object, and register them as a representation for the object.
3. ...
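The persistence check in point 1 could be as simple as a thresholded similarity between feature vectors. A toy sketch, where the vectors and threshold are made up for illustration:

```javascript
// Toy persistence check: cosine similarity between two feature vectors,
// treated as evidence that consecutive frames show the same object.
function cosineSimilarity(a, b) {
  var dot = 0, na = 0, nb = 0;
  for (var i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function sameObject(featuresT1, featuresT2, threshold) {
  return cosineSimilarity(featuresT1, featuresT2) >= threshold;
}
```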

References

1. (2016) Unsupervised Learning of Video Representations using LSTMs, Srivastava et al, University of Toronto
2. Source code for the paper above
3. 100 other papers that cite the above paper

TO BE FILLED

###### Summary

In this post we have sketched out a rough skeleton for the Knowledge Representation scheme necessary for supporting the HAI Holodeck project.

We have worked out the following:

1. Using ANN as basis for representation
2. How to use video training samples to conduct pre-training, so that the system has the basis for performing object representation and recognition.
3. How to map incremental textual requirements to the target 3D models, with human guidance and using some form of neural networks. We call such acquired and validated mappings knowledge.
4. Unsupervised learning for learning probability distribution.

Remaining work:

1. ...

TO BE FILLED

###### References

1. Wikipedia: Knowledge representation and reasoning
2. (2015) Dosovitskiy, et al, [Learning to Generate Chairs with Convolutional Neural Networks](https://www.robots.ox.ac.uk/~vgg/rg/papers/Dosovitskiy_Learning_to_Generate_2015_CVPR_paper.pdf)
3. (201) ,
4. (2011) Yoshua Bengio, Deep Learning of Representations for Unsupervised and Transfer Learning
5. (2015) Ruslan Salakhutdinov, [Learning Deep Generative Models](http://www.cs.toronto.edu/~rsalakhu/papers/annrev.pdf)
6. (201) ,
]]>
<![CDATA[Blogging with Ghost]]> I share with you below my experience with the blogging system Ghost (version 0.8.0 as of this writing) used for this blog, and the problems that I ran into. This is not an exhaustive survey, but hopefully my experience will be useful to someone out there.

This blog

]]>
http://www.k4ai.com/ghost-post/919770ff-c888-449c-a38f-455ebffef4f1Fri, 05 Aug 2016 14:30:00 GMT

I share with you below my experience with the blogging system Ghost (version 0.8.0 as of this writing) used for this blog, and the problems that I ran into. This is not an exhaustive survey, but hopefully my experience will be useful to someone out there.

This blog in itself is a working document for the terraAI project, an open-source crowd-driven AI platform that I am trying to build.

##### My requirements

Here are my blogging requirements:

1. Must be open source, so that I am able to modify or host it myself if I want to.
2. Must be portable, in the sense that I should be able to move the entire thing to another computer (of the same OS of course) by simply copying a directory, with no need to install many components. This is important to me because I find portable software takes a lot less effort to manage in the long run.
3. Must be relatively easy to turn it into a static website, even if it does not support such a feature out of the box. A static website is treated simply as a collection of static files and can be served from a CDN such as the Amazon S3, and is infinitely more scalable, more stable, more responsive, safer, and much cheaper to host than traditional methods. This is explained further below.
4. Relatively simple to use and manage. I prefer simplicity over richness, so long as the basic blogging functions are acceptable.
5. Easy to hack. Having the original source code doesn't quite cut it. I prefer NOT having to hack into the source code to achieve what I want, since that will cause headaches later when there is a need to upgrade to a newer version. I mainly want the ability to add my own code on top of it to override the default behavior with ease.

In the end I opted to go with Ghost. Since Ghost more or less meets my requirements above, I did not bother to look further into WordPress. If anyone has insight into using WordPress in the context of requirements above, by all means please enter your comments at the bottom of this post.

###### A blog as a static website

There are many benefits to having a blog turned into a static website:

1. More scalable: you no longer have the database, the web server, or the virtual hosting machine as a performance bottleneck, so your blog will always load fast even if a huge number of people are accessing it.
2. More stable: there are far fewer things to break, since your blog is just a collection of files.
3. Safer: there is not much for a hacker to get into. There is no special admin port for databases, or weakness in the web server, etc., for a hacker to exploit. So long as your file-serving host (e.g., Amazon AWS S3, GitHub, etc.) is solid you will be fine.
4. Cheaper: it should take only pennies per month to host such a blog on platforms such as S3 (or for free on GitHub). This compares to somewhere close to USD\$10 per month when hosted on the smallest virtual instance on Amazon EC2, or Rackspace, etc.

There are, however, some downsides:

1. More upfront work. I ran into a problem where the tool for conversion to a static website does not convert hyperlinks correctly, so I had to write some code to correct that. I also had to create some simple scripts to streamline the blog management tasks (e.g., converting to a static website, uploading to the hosting site, etc.).
2. More management work. In a normal Ghost setup it would take just a few clicks to publish a post. Here you will have to use separate tools to convert and then upload, which takes somewhat more work.

###### Converting to a static website

Normally a blog cannot be deployed in the form of a static website, because server-side logic is needed to do work such as storing or finding information in a database, enforcing security control, etc.

However, there are ways to make this happen:

1. Separate the production environment, where visitors get to read the blog, from the development environment, where the blogger writes posts, changes configuration, etc.
2. The development environment can be either installed on a local computer, or somewhere on the cloud, and accessible only to authorized people. In other words, this environment operates fully and normally as it should, but it is not accessible to others.
3. The production environment is a static snapshot of the development environment, taken using some tool (I have tried HTTrack and Buster for this).

But what to do with the server-side logic needed for operating the production environment? This is handled in two parts:

1. First, you don't do blog management directly on the production environment, so most of the complex server-side logic is not needed there. Instead you manage your blog in the development environment, then copy the changes to production later.
2. For visitor activities such as commenting or polling, etc., use an outside service for it. For example, you can use DISQUS for supporting comments, and it is pretty easy to do so. Here is a good article How to integrate Disqus Comments with Ghost that shows you how to do this.
###### Components vs Services

I have explained above how to use a service, such as DISQUS, for supporting commenting by visitors. When compared to, say, using a Comments plugin in WordPress, what are the pros and cons?

Personally I prefer using a service, since it is easier to manage, and it allows me to deploy my blog as a static website (with all the benefits mentioned earlier). Such services also tend to be of higher quality, although you may need to pay for the service if your traffic exceeds certain limits. You can of course create your own services for these, if you know how to code.

At first glance it may seem like we are losing a lot by going with the dumb and thin static website approach, since it means we will lose server-side logic and storage completely. But in fact for the purpose of blogging this is not at all a disadvantage. This is because there are already many high-quality and free services for media display and socialization out there. I would even argue that when considering quality and the need for high-level social interaction, in many cases it is preferable to use external services.

Following are some free services:

1. Video: YouTube (on-demand, playlist, live, integrated chat), Twitch (live, integrated chat)
2. Audio: SoundCloud (on-demand, playlist)
3. Crowd ranking: sMesh (real-time suggest/vote/rank)
4. Crowd emotes: sMesh (real-time emoting in icon or sound with 'crowd chant' effect)
5. Crowd debate: sMesh (real-time debate in two sides with ranking of arguments)
###### Deployment

Deploying Ghost from development to production can be achieved as follows:

1. First use a tool such as Buster or HTTrack to dump the contents of your development server into a local directory.
2. Then use another tool (such as Winscp) to copy the files to a CDN server, such as an Amazon S3 bucket. Here it is assumed that you have configured your S3 bucket to behave like a static website.

Here are the problems that I ran into with Ghost:

1. In Ghost's config.js (as of version 0.8), if you set url to point to your development server, then the links in the Share this post area, as well as the links inside the generated RSS, will be wrong when deployed. But if you set it to point to your production server, then the other links in the development environment will be wrong, which makes it harder to work on.
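For reference, the relevant fragment of Ghost 0.x's config.js looks roughly like this (the production hostname below is a placeholder). Whichever environment is active has its url baked into the generated absolute links, which is the source of the mismatch:

```javascript
// Sketch of the relevant part of Ghost 0.x's config.js, with a placeholder
// production host. The active environment's url is used to build absolute
// links (share links, RSS), hence the dev/production conflict described above.
var config = {
  development: {
    url: 'http://localhost:2368'      // links resolve while authoring locally...
  },
  production: {
    url: 'http://www.example.com'     // ...but the deployed static site needs this
  }
};

module.exports = config;
```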

Aside from the problems mentioned above, following is my experience with Buster and HTTrack for dumping the contents of your development server into a local directory. Both have their share of problems.

Buster

Here are the problems that I ran into with Buster:

1. As of this writing Buster no longer generates the RSS feed. There is a workaround for dealing with this using wget.
2. The dev/production link problems mentioned above are tackled using my own code to make corrections.
3. Hard to get it installed on Windows or earlier version of Linux. Recommend using Ubuntu 14 or later, or equivalent. I ended up using a guest Ubuntu running under Windows 8 through VirtualBox. The guest Ubuntu accesses the Ghost server on the host (instructions here), and pushes the resulting files to the host through a shared folder (instructions here).
4. I want to use the Casper with sidebar theme for Ghost here, mainly for the purpose of making it easier for visitors to navigate around using the sidebar. This however creates a problem where if the pages put into the sidebar are of the 'Ghost static page' type, then somehow Buster is unable to generate proper static webpages for them, so those become dead links on the resulting static website. One workaround is to publish those NOT as static pages, although that clusters up the list of posts displayed at Ghost home page.

I ended up using Buster with some amount of custom code to deal with the problem with links, and so far it has worked out well. Note that Buster has built-in support for publishing your static website to GitHub, although I opted to deploy to Amazon S3 instead.
Update: too many links were incorrectly generated by Buster, especially when Ghost static files are involved. I ended up switching to HTTrack instead.

HTTrack

Here are what I found about HTTrack:

1. This tool is available on many platforms, which is a big plus.
2. It worked well initially, but eventually I got a MIRROR ERROR. It turned out that HTTrack somehow does not like my website URL as localhost:2368 (even though this URL works fine in a browser), whereas 127.0.0.1:2368 does work for HTTrack.
3. HTTrack handles the Ghost static files well (unlike Buster).
4. The links in the meta tags (e.g., og:image, twitter:image, etc.) in the static webpages generated by HTTrack get mangled, which causes problems when a visitor shares a post to a social media site. The same goes for many external links. This is resolved by changing HTTrack's Scan Rules option to exclude the external links that cause problems. Following are some examples of the scan rules used:

-*/*www.terraai.org*
-*/*cdn.mathjax.org*/*
-assets/fonts


After some back-and-forth, I ended up choosing HTTrack over Buster.

For incremental deployment I do it as follows:

1. I use the command-line httrack with the --update option as follows

\bin\httrack\httrack --update http://127.0.0.1:2368 -www.terraai.org -ssl.google-analytics.com -cdn.mathjax.org -assets/fonts


which does not update file timestamps unless necessary.

2. I use the Winscp Synchronize command for uploading files through an S3 proxy, which uploads only updated files.
###### Workflow

Given this setup, the general workflow goes as follows:

1. Edit my blog on my locally-installed Ghost system (under Windows 8.1).
2. When ready to publish, run a script that executes httrack to dump the website content into a collection of files.
3. Use winscp to deploy the files to Amazon S3.

Some custom JavaScript code that I wrote took care of the problem with incorrect links generated by Buster.

###### Summary of problems

To summarize, following are some of the problems that I ran into:

1. Some links were incorrect and still pointing to the development server when deployed. This involves, for example, the links in the slide-out menu, the links in the Share this post area, the links inside the RSS text when clicking on Subscribe, etc. This requires additional hacking to resolve; otherwise the result is not usable.
Solution:
1. In Ghost's config.js file, set the parameter development-url to point to the production server. This will make the URL in RSS, Share this post, social media links, etc., correct.
2. Inject custom code into the admin UI's Code Injection area, for the purpose of scanning the webpage for all links and modify to make them work on the development server, the production server, and as static files.
2. Ghost defaults its link navigation behavior to replacing the current page, which I did not like under most circumstances. There is a longish Markdown syntax for overriding this, but I am too lazy to use it all over the place.
Solution: I added a little JavaScript code to give all applicable links the attribute target=_blank.
3. Since we no longer have server-side logic, we also lose the ability to send emails from the server, for things such as admin password recovery by email, user subscription by email, etc.
Solutions:
1. There is no solution for admin password recovery by email, so you just need to be careful not to lose it. Note that this has no bearing on the production environment.
2. Subscription to new posts by email can be replaced by the RSS feature, which users can subscribe to in order to read new posts in their preferred RSS reader.
3. Users can send email to the blog owner by client-side email (as opposed to server-side), either using the HTML mailto feature or a contact form.
4. When trying to share a Ghost post on Twitter, the resulting tweet does not show the correct description or image. By inspecting the webpage I can see that the meta tags twitter:description and twitter:image:src as created by Ghost contain the correct info. I am still not sure what went wrong nor how to fix it.
Update 1: turns out that Twitter does not like the way my code was changing the meta elements in the page head section (twitter:image:src, twitter:card, and og:image) dynamically. So far the only way to get the image to show in a tweet is to set the post-level header image from the Ghost admin. I would have preferred not using the post-level image (which is too big), and being able to specify any image of my choice embedded in the post.
Update 2: I ended up setting the post image with the Ghost admin to get the image to show up when sharing on Twitter, and also to show the image on the home page as thumbnail for each post (see the Display Post Images on Home Page link below). The post-level header image is then styled to hidden as per my own taste.
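The link fixes described in the solutions above can be sketched roughly as follows. This is only an illustrative sketch, not the actual code injected into this blog; the base URLs, the `fixLink` name, and the DOM loop are my own assumptions. The real production hostname and dev port may differ from the ones shown.

```javascript
// Illustrative base URLs (assumptions): the local Ghost dev server and
// the deployed static site.
const DEV_BASE = 'http://127.0.0.1:2368';
const PROD_BASE = 'http://www.k4ai.com';

// Normalize a single href so it works on the deployed static site:
// any link still pointing at the dev server is rewritten to production.
function fixLink(href) {
  return href.startsWith(DEV_BASE)
    ? PROD_BASE + href.slice(DEV_BASE.length)
    : href;
}

// In the browser (e.g., via Ghost's Code Injection area) this would be
// applied to every anchor, also forcing links to open in a new tab:
//   for (const a of document.querySelectorAll('a')) {
//     a.setAttribute('href', fixLink(a.getAttribute('href')));
//     a.target = '_blank';
//   }
```

For example, `fixLink('http://127.0.0.1:2368/holodeck/')` would yield the corresponding production URL, while external links pass through untouched.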
###### Conclusion

Overall, using Ghost for this blog has worked out relatively well. I did have to hack a bit to deal with "incorrect" links arising from converting it to a static website, but it wasn't too bad.

1. Easy installation: just unzip the Ghost kit from GitHub and run it (assuming that you already have Node.JS). I put it on my Windows laptop, and I am able to work on my blog from anywhere, even without an Internet connection.
2. Simplicity is the strength of Ghost, but it can also be a liability if you prefer having many ready-made features.
3. Markdown is a minimal syntax for marking up your documents with formatting using punctuation and special characters, which is supported by Ghost. It allows me to focus on expressing my thoughts, and not on bells and whistles in the UI.
4. Easy to hack. I am a programmer and like being able to add features that I want, hopefully in the form of extra code without having to hack into the core of Ghost (for otherwise it will be harder to upgrade Ghost later).
5. Node.JS. While the use of Node.JS on the server side is moot once the blog is converted into a static website, it is nonetheless handy if you wish to add advanced features (such as push notification, real-time communication, etc.) one day.

Overall I thought it was worth the effort converting my Ghost instance into a static website, and relying on external services for anything that might require server-side logic. I would recommend that you give it a try, assuming that you are at ease with doing some light coding in JavaScript/CSS.

###### Open questions

1. Where do I put my own code under Ghost? While customizing Ghost to meet my own needs I have added a number of JavaScript, CSS, and data files. Somehow these files are not accessible on production from a browser unless they are placed under /content/images, which is an odd place for these files.

Separately I have also made many enhancements over the standard Ghost. You can find the details about these here.

###### Other resources
1. The Ghost publishing platform.
2. Node.JS. NodeJS is required for running Ghost. Ghost recommends Node v4 LTS.
3. Extending the Ghost Default Theme with a Sidebar, Social Navigation Links, Disqus Comments & a Contact Form
4. Mail Configuration on self-hosted version of Ghost
5. How to add class in image markdown in Ghost - this is useful for inserting images in the post and then using CSS to style their looks for best effect (e.g., with wrap-around text, better size, etc.)
6. Markdown Guide
7. AWS Command Line Interface - useful for uploading static files acquired using Buster to Amazon AWS S3 from the command line, which makes the process much easier to automate.
]]>
<![CDATA[How to build a Holodeck - part 1]]>http://www.k4ai.com/holodeck/a8b6aef9-4316-4024-81db-39a0117a50c6Thu, 04 Aug 2016 14:26:00 GMT

You may have seen the Holodeck device that appeared in the TV series Star Trek: The Next Generation, where a user goes into the Holodeck, issues verbal instructions, and entirely realistic 3D objects or environment would appear instantly. This post explores the relevant AI technology needed in order to support such a vision.

Here we use the Holodeck as an example, but in general terms this is in fact in line with what I am trying to achieve with the terraAI (a.k.a. TAI) project, in the sense that both require some kind of interactive and intelligent knowledge-based assistant for getting something useful done.

For ease of reference we shall call this system HAI (because it is in fact an application of the TAI platform for dealing with 3D objects).

For the type of watered-down Holodeck described in this post, we shall call it the HAI Holodeck, to distinguish it from the well-known TNG Holodeck.

###### Warnings

This Holodeck thread is a serious investigation from the AI (artificial intelligence) perspective on how to realize a holodeck-like experience.

Many previous attempts at building a Holodeck can be found on the Internet, and they are almost invariably focused on the flashier areas related to visualization, scanning, and sensing. This thread eschews most of those, and focuses instead on building a foundation for the less visual but nonetheless vital parts needed to support the Holodeck experience.

As stated in the TAI Manifesto, here we adopt a top-down design approach, which is common in large-scale commercial software design. As such you will find plenty of to-be-filled stubs as we go, since this is NOT a step-by-step how-to recipe.

You should stop reading here if you have received no training in Artificial Intelligence, since I will go into some heavy technical stuff that could be hard to follow for people without a background in AI.

###### Current state towards the Holodeck

There has been great progress made on many fronts:

1. On the hardware side it would appear that soon we will have devices that are capable of rendering highly accurate and realistic AR/VR scenery.
2. There are techniques such as redirected walking which allow a limited physical space to feel much larger. This gives a user the illusion of walking freely around a large neighborhood while inside a relatively small room wearing VR goggles.
3. Gesture control, using devices such as Leap Motion, is maturing.
4. Many haptic technologies are becoming available for recreating the sense of touch.

However, the virtual worlds shown in current AR/VR demos are painstakingly handcrafted over long periods of time, and the HAI Holodeck discussed here is intended to address this deficiency by making it exceedingly easy to build a wide variety of 3D objects and environments, following the inspiration given by the TNG Holodeck.

###### Our approach

We want the HAI system to be so easy to use that almost anybody is able to use it. This decision actually alters the design approach in some fundamental ways, including how 3D objects are represented, how knowledge is acquired by HAI, how a human user interacts with HAI in order to get something built, etc.

For the purpose of this HAI Holodeck thread, we will make the following simplifying assumptions:

1. What we will leave out:

1. We will leave the hardware side to the likes of Kinect, Oculus Rift, Magic Leap and HoloLens, and dovetail with those later when we are ready.
2. We don't do holograms here, so users still need to wear AR/VR goggles in order to see the 3D world that was built, unlike the Holodeck on TV.
3. We are not concerned with middleware 3D graphics rendering engine technologies, such as the types of things that Euclideon deals with.
4. We don't deal with Star Trek replicator or transporter technologies here (just so there is no misunderstanding).
5. We will ignore the gesture control or haptic technology for now.
6. We will ignore issues related to the navigation in the virtual world for now.
7. We are not concerned with 3D scanning technologies for the purpose of acquiring the 3D models needed for the target object or environment.
8. We don't create seemingly sentient human-like characters or complex machinery. Rather, for now we will focus on the simpler tasks of creating static objects and environments as a start.
2. What we will do:

1. We will focus on constructing 3D objects or environments (referred to jointly as 3D models below) based on verbal commands from a user. In other words, our focus here is on how to allow complex 3D models to be built from simple user-system interaction, using a knowledge-driven approach.
2. We will aim to make HAI Holodeck so easy to use that almost anyone is able to use it.
3. We will start with a relatively simple and small target 3D world for now. We will design it for the long term so one day you can ask HAI to build something elaborate such as an entire 19th-century London neighborhood with all the details, but that's for later.
4. We want to make it crowd-driven, in the Wikipedia spirit, so that domain experts in all areas eventually can help with building up an elaborate knowledge base for the HAI Holodeck, even if they don't know anything about AI or VR/AR technologies. Put another way, we aim for the Wikipedia model in terms of its richness and open contribution from the public.
3. Ultimately we aim to create an open, rich, and ever-evolving 3D knowledge base.

More specifically, we adopt the following technical approach:

1. HAI interacts with the user through a natural language (NL) interface, so that it is possible to figure out what the user wants interactively.
2. Knowledge-guided. It is assumed that a large knowledge base is in place, which contains detailed information about the mapping between textual requests and corresponding 3D objects (e.g., what a typical car looks like, what a typical sports car looks like, etc.), as well as all kinds of background knowledge. This allows HAI to offer a partial solution based on little information, and then interact with the user, guided by prior knowledge, to converge on the target model that the user wants.
3. Machine learning assisted. We use supervised and unsupervised machine learning methods to acquire the large knowledge base needed to do the work effectively.
4. Overall we take a top-down design approach. Taking a page from large-scale commercial software design, we start by working out a skeletal architecture with its requisite components, as well as the requirements for each of those components. We may defer the detailed design within each component until later. This is also a way to solicit contributions from the research community, so that if someone comes up with a new algorithm that fits the requirements, we can quickly fit it into this grand HAI Holodeck design and have an instant upgrade. See here for more details.

Put another way, our approach here is to focus on building a huge knowledge base about our world, with the assistance of machine learning and the general public, and then use these to drive a HAI Holodeck that makes the creation of 3D objects or environments exceedingly easy.

These are further explained separately in the sections below, where we will also try to find ways to make further simplifications for the initial phase.

###### NL user interface

HAI interacts with user for the following goals:

• G1: acquire user's initial instruction, and consequently respond with a list of candidate objects $\{C_i\}$ (which are likely off, thus require further refinement). This can be viewed as a matter of knowledge retrieval for objects that match the description.
• G2: acquire user's instruction for modifying the candidate objects. This can be viewed as a matter of retrieving and applying operational knowledge that satisfies the stated goal.
• G3: acquire additional information about $\{C_i\}$. This can be viewed as a form of supervised learning related to the 3D objects, assisted by the user.
• G4: acquire the linguistic terms, expressions, and conventions that the user employs. This can be viewed as a form of language learning, assisted by the user.
###### Knowledge-guided interaction

The goals G1 and G2 above are guided by information stored in the knowledge base. What's special about this repository that we call the knowledge base is that most of its content can be acquired through a machine learning module.

How such knowledge is used to achieve the goals G1 and G2 is described in a separate upcoming post, Knowledge Management.

###### Knowledge acquisition

Machine learning plays a pivotal role in this HAI system. The 3D and knowledge representations are the underpinning of the entire system, and the immense amount of content for these must be acquired largely through a machine learning module.

As described above, achieving the goals G1 and G2 requires the support of a knowledge base KB, and the content of the KB must be populated by a machine learning module, either all by itself (i.e., unsupervised) or with the assistance of human trainers (i.e., supervised).

Furthermore, this machine learning module is also pivotal for achieving goals G3 and G4.

These are discussed further in an upcoming post Knowledge Acquisition.

###### Crowd-driven knowledge acquisition

Building a knowledge base for supporting a HAI Holodeck requires a huge amount of resources, even for a relatively small target domain. The resources are required in several areas:

1. All probability distributions of the 3D objects in the target space need to be acquired.
2. All the minute details of the 3D objects in the target space need to be acquired.
3. The unsupervised pre-training using video training examples.
4. Supervised learning for understanding the numerous categorizations.
5. What else?

While this part is not crucial for producing the first proof-of-concept HAI Holodeck, it is nonetheless vital to its long-term viability. It is helpful to think of the HAI system as supported by a legion of domain experts who are able to make in-depth and persistent contributions to it, similar to what we see in the crowd-contributed Wikipedia, even if such domain experts know nothing about the underlying technology.

This topic is discussed in greater detail in another upcoming post The Crowd-driven Holodeck.

###### Putting things together

So far we have depicted a grand design with many empty stubs. Next let's dissect the basic user interaction flow as follows:

1. User issues a request for a certain object.
Here HAI must search through its knowledge base KB to find a set of best candidates $\{C_i\}$, possibly also making some alterations to what's in the KB, and present them to the user for selection.
2. User selects one, C, and makes a suggestion (likely somewhat vague) on how to further modify C.
3. HAI infers the user's intention based on background knowledge from its KB to produce an updated candidate, and presents it to the user.
4. If user is satisfied then stop.
5. Else user replies with an additional request to modify the candidate. Go to step 2 above.
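The retrieve-and-refine loop above can be sketched in miniature as follows. This is a toy stand-in, not the real HAI design: the knowledge base, the tag-overlap scoring, and all names (`KB`, `retrieve`, `refine`) are my illustrative assumptions.

```javascript
// Toy knowledge base: each candidate object carries descriptive tags
// (in the real system this would be a large learned KB of 3D models).
const KB = [
  { name: 'straight stairs', tags: ['stairs', 'straight'] },
  { name: '4-step porch stairs', tags: ['stairs', 'porch', '4-step'] },
  { name: '2-step porch stairs', tags: ['stairs', 'porch', '2-step'] },
];

// Step 1: rank KB entries by how many of the requested tags they match,
// yielding the candidate set {C_i} to present to the user.
function retrieve(requestTags) {
  return KB
    .map(c => ({
      ...c,
      score: c.tags.filter(t => requestTags.includes(t)).length,
    }))
    .sort((a, b) => b.score - a.score);
}

// Steps 2-5: each user refinement adds constraints; re-rank and return
// the new best candidate.
function refine(requestTags, extraTags) {
  return retrieve(requestTags.concat(extraTags))[0];
}
```

Under this sketch, asking for a porch stair and then refining with a "2-step" constraint converges on the 2-step porch stairs, mirroring the dialog flow described above.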
###### Simple use case #1

The goal of this use case is for HAI to help a handyman create the 3D model of a simple 2-step staircase.

The dialog between the two might go like this:

1. Handyman: HAI, give me a staircase.
2. HAI: (showing the most common type, for lack of information) is this what you wanted?
3. Handyman: I want the type suitable for a porch.
4. HAI: (Showing a typical 4-step porch stair) How's this?
5. Handyman: I need only two steps.
6. HAI: (Showing a typical 2-step porch stair) How's this?
7. Handyman: Better. I want the total height to be 22 inches, each step is 1 inch plank.
8. HAI: (adjusts the spacing under each step to 10 inches) How's this?
9. etc.

If we analyze this dialog in light of our approach earlier, we can see that:

1. HAI's knowledge base needs to contain information such as:
1. All kinds of stairs categorized by certain labels (e.g., straight stairs, winder stairs, stairs with an intermediate landing, etc.), as well as the typical types (e.g., most porch stairs have four steps), etc.
2. Composition information, such as a stair has multiple steps, may have railings.
2. HAI is able to manipulate sub-components individually.
3. HAI has some capability for spatial reasoning, so it is capable of computing the dimensions of spacing, etc., from the given information.
4. Even though we are deferring further discussion about knowledge acquisition process for now, it is clear that we definitely need to work out further details regarding knowledge representation.
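The spatial reasoning in step 8 of the dialog can be sketched as a tiny computation. The function name and formula here are my own illustrative assumptions about how HAI might derive the spacing:

```javascript
// Given the requested total height and the plank thickness, compute the
// open spacing under each step: subtract the wood itself from the total
// height and divide the remainder evenly among the steps.
function spacingPerStep(totalHeightIn, stepCount, plankThicknessIn) {
  return (totalHeightIn - stepCount * plankThicknessIn) / stepCount;
}
```

For the dialog above (22-inch total height, two 1-inch planks), this yields the 10-inch spacing that HAI proposes in step 8.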

These will be discussed further in separate posts.

TO BE FILLED

###### Summary

Here we have worked out a very rough skeleton for realizing this HAI Holodeck, and reduced the problem to a couple of core issues. As mentioned earlier, since we are taking a top-down design approach, our primary concern is in working out a suitable architecture, eventually down to the detail specifications for the major components, with the intention that some of the components may even be contributed by the research community.

There are still a great deal of important details that need to be worked out, in particular in the following areas:

1. How to define a knowledge representation scheme for complex 3D objects and environments using CNNs.
2. How to represent 3D objects, in particular in a way that is conducive to machine reasoning and learning.
3. How to acquire and accumulate knowledge.

To see further discussions on this How to build a Holodeck thread, read the following posts on this topic:

1. (Upcoming) Knowledge representation
2. (Upcoming) From CAPTCHA to Holodeck
3. (Upcoming) User interface
4. (Upcoming) Knowledge acquisition
5. (Upcoming) Knowledge management
6. (Upcoming) A Crowd-driven Holodeck
7. (Upcoming) Putting everything together
1. The terraAI Manifesto
2. The terraAI Design Overview
3. The terraAI Knowledge Management (upcoming)
4. The Untold Story of Magic Leap, the World’s Most Secretive Startup
5. You Can’t Walk in a Straight Line—And That’s Great for VR
6. How objects are represented in human brain? Structural description models versus Image-based models
7. Recurring Star Trek Holodeck Programs, Ranked, just in case that you wish to watch the Star Trek Holodeck programs again.
]]>
<![CDATA[Help]]> This post contains the master index to help you find all kinds of information regarding the terraAI project.

###### Index for this post
]]>
http://www.k4ai.com/toc/a854c3d5-1436-41da-9bb7-a47c4d4b1ebfTue, 02 Aug 2016 15:02:00 GMT

This post contains the master index to help you find all kinds of information regarding the terraAI project.

###### Index for this post

This is a technical blog on the topic of Machine Learning, which also serves as a working document for the terraAI project. The goal of the terraAI project is to implement a crowd-driven platform for building up a crowd-contributed and ML-assisted textual/visual knowledge base about our world, as well as to create useful intelligent tools for personal use. It is a working document in the sense that it is written to facilitate the design and actual implementation of the project.

This blog is also a tool for communicating with knowledgeable readers, and with your input this blog will be constantly updated accordingly. You can help us using any of the following methods:

1. Express your opinion in the Comments section at the bottom of each post.
2. Send email to me at kaihu@smesh.net
3. Using any of the crowd socialization mechanisms provided throughout this blog. Many of these mechanisms will also be incorporated into TAI in order to achieve our design goal of having it crowd-driven. Here are some examples:
1. The Crowd Ranking widget. This is for you to voice your opinion regarding a specific topic, and be ranked by others. This is a free online tool offered by the crowd socialization platform sMesh Central. The following is a live and real-time demo. Try it out!
2. The Crowd Debate widget. This is for two sides to debate against each other in a controlled and crowd-driven manner. A visitor can enter a short paragraph of argument to support one side of the debate. Such arguments are then voted on by the entire community and ranked based on various criteria. TO BE FILLED.
3. (MORE COMING)

Note that if a post is tagged with draft, then it means that it is not done yet and you may simply ignore it unless you are of the inquisitive type.

###### To participate

If you wish to join the project, please check out this post.

You can find quick answers for some of the frequently asked questions here.

###### How to track changes

This blog is a working document for the TAI project, as such the same post could get published multiple times in order to incorporate new material or changes. The best ways to follow such changes are:

1. To find the list of latest changes, use the TAI Change Log as the starting point. This change log lists all non-trivial changes in the latest-first order, so by inspecting the top of this log you can find what got changed since your last visit. You can then navigate to the target area using the links provided there.
2. Non-trivial changes made to a post since your last visit will be highlighted with a yellow background.
Note that this is done using the browser's local storage, which means that the time stamp of your last visit will be lost if you somehow clear the browser's local storage, or if you switch to a different browser.
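The last-visit tracking just described can be sketched as follows. The storage key and function name are my own assumptions; a storage object is passed in so the sketch also runs outside a browser, where `localStorage` would be used instead.

```javascript
// Record the time of this visit in (local) storage and return the list
// of change timestamps that are newer than the previous visit -- these
// are the changes to highlight in yellow.
function changedSinceLastVisit(storage, changeTimestamps, now) {
  const last = Number(storage.taiLastVisit || 0); // 0 on first visit
  storage.taiLastVisit = String(now);             // remember this visit
  return changeTimestamps.filter(t => t > last);
}
```

On a first visit every change is "new" (nothing highlighted before), and clearing the storage object reproduces the lost-timestamp caveat noted above.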
###### Table of contents

For ease of access, following is a table of contents for this blog, categorized by type; the blog is mostly about terraAI, an open-source crowd-driven AI project. You can also select one of the tags to see the articles categorized under that tag.

###### terraAI:
1. Introductory material
2. The terraAI Manifesto
1. Design overview
2. Features of terraAI (upcoming)
3. Crowd-driven socialization
4. User roles (upcoming)
5. Knowledge management (upcoming)
6. Smart query (upcoming)
7. Contributing to the terraAI project
3. Use cases
1. #1
1. #1 find geo-location of image
5. Technical specifications
1. (upcoming)
6. Resources
1. Machine Learning Resources
7. Others
###### Other topics:
1. Blogging with Ghost: my experience using the ghost blogging system for this blog.
###### Infotips

When you see text underlined with a pale red dashed line, like this, then you can get additional information displayed in a bubble by hovering (when using a mouse), or by clicking or touching it (when using a touch device). We call this bubble an infotip.
An infotip is not just a simple tooltip displaying text; rather, it is capable of showing all kinds of media (rich text, videos, images, audio, maps, etc.), an entire article, or even a user interface for helping you with something.

Here are some examples:

2. TO BE FILLED
###### Glossary
• AI: Artificial Intelligence.
• AR: augmented reality.
• KB: Knowledge Base.
• KR: knowledge representation.
• The live onion architecture: the architecture that allows other developers to add their own code modules to a live system, in order to override the default behavior. This facilitates large-scale online development, since a developer can then test out his/her own alternative tweaks with very little effort. Aside from software code, knowledge should also allow layering following the same principle.
• ML: Machine Learning.
• NLP: natural language processing.
• OCS: On-site client-centric socialization.
• TAI: terraAI, the crowd-driven planet-scale AI platform discussed in this blog.
• UI: user interface.
• VR: virtual reality.
###### Learning resources for beginners

This section contains pointers to various resources.

1. To Be FILLED

1. Running Python in a Browser Is More Awesome Than You Think - might be helpful to those who wish to learn Python, which is commonly used in the field of Machine Learning (ML). ML is pivotal in the TAI project.
2. An Introduction to Interactive Programming in Python
###### How to find things
1. Using Tags
The tags on each page of this blog give you an easier way to find the posts of interest to you.
Following is a partial list of the meaning of such tags:

1. AI: posts that discuss issues in the realm of Artificial Intelligence.
2. crowd: posts related to the Crowd Socialization concept.
3. intro: written in plain English (well, as much as possible) for the general public.
4. tech: these are the technical stuff, and you probably will need some background in software or Artificial Intelligence in order to follow.
5. design: posts that delve into design issues. These are casual readings for technical people, but perhaps a little hard to the general public.
6. ideas: posts that are somewhat forward-looking (or impractical, or head in the cloud, or useless, depending on your point of view).
7. internet: topics related to the internet, likely in the context of TAI, socialization, or some new ideas.
8. TAI: posts related to the terraAI project.
9. KR: posts related to Knowledge Representation.
10. KB: posts related to Knowledge Management.
2. Using text search
This is not yet working.

On the history page you can see the list of posts that you have visited, displayed in latest-first order. This is useful for resuming the reading of a long post at a later time, etc. You can find it here.

You can also find a shortcut button at the top of each post, inside the slide-out navigation bar. This allows you to go to your history page with just a swipe or up-scroll (to get the navigation bar to show up) then a click on the button.

To be filled

###### Change log

To be filled

To be filled

]]>
<![CDATA[ML Resources]]>http://www.k4ai.com/mlstuff/7e61506c-0504-457e-8d69-093cdc872149Sat, 30 Jul 2016 18:05:39 GMT

This post is for cataloging those online resources that are useful to my work for the terraAI project, in particular those related to Machine Learning. Hopefully these will also be useful to other Machine Learning researchers.

##### Typesetting math formulas

It would seem that the best way to typeset formulas, which is useful when discussing topics related to Machine Learning, is to use MathJax.

Typesetting in Ghost. We will use the Bellman equation as an example, which looks like this:

$V_x = \min_u \left[ r_{xu} + \gamma V_f \right]$

This equation in the Latex syntax normally looks like the following:

$V_x = \min_u \left[ r_{xu} + \gamma V_f \right]$


This cannot be entered into Ghost directly, since the backslashes and underscores will be consumed by the system. It can be handled by manually adding extra backslashes before those special characters, as follows:

\$V\_x = \\min\_u \\left[ r\_{xu} + \\gamma V\_f \\right]\$


Personally I prefer to use some JavaScript to automate this process, so that all I have to do is cut-and-paste the Latex code acquired from elsewhere into a pre tag as follows:

<pre class="mathjax">$V_x = \min_u \left[ r_{xu} + \gamma V_f \right]$</pre>


The Javascript code then scans the entire article, does the necessary processing, and renders the formulas correctly.
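The original script is not shown in the post, so the following is only a sketch of the scanning step under that assumption. A real implementation would use DOM APIs (e.g. `querySelectorAll('pre.mathjax')`), swap each `<pre>` for an inline element, and then invoke MathJax's typesetting API; here a regex over the raw HTML stands in so the function is self-contained.

```javascript
// Sketch of the scanning step (assumed approach; the original script is
// not shown). Collects the raw LaTeX stored inside <pre class="mathjax">
// blocks, which Markdown leaves untouched. A real version would replace
// each matched node in the DOM and then queue a MathJax typeset pass.
function extractMathBlocks(html) {
  const re = /<pre class="mathjax">([\s\S]*?)<\/pre>/g;
  const blocks = [];
  let match;
  while ((match = re.exec(html)) !== null) {
    blocks.push(match[1]); // the raw LaTeX source
  }
  return blocks;
}
```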

###### Random samples

Following are random samples, kept here for my own convenience as a reference. More samples can be found via the links in the Resources section by inspecting the source code of the respective webpages.

WaveNet


$p({\bf x}|{\bf h}) = \prod_{t=1}^{T}{p(x_t|x_1,...,x_{t-1}, {\bf h})}$


$$\tag{5} \mathrm{z} = \tanh(W_{f,k} * \mathrm{x}) \odot \sigma(W_{g,k} * \mathrm{x})$$


The skip-gram model


$\arg\max \limits_\theta \prod_{w\in Text}{\left[ \prod_{c\in C(w)} p(c|w;\theta) \right]}$

<pre class="mathjax">$\arg\max \limits_\theta \prod_{w\in Text}{\left[ \prod_{c\in C(w)} p(c|w;\theta) \right]}$</pre>


The alternative skip-gram model


$\arg\max \limits_\theta \prod_{(w,c)\in D}{ p(c|w;\theta) }$

<pre class="mathjax">$\arg\max \limits_\theta \prod_{(w,c)\in D}{ p(c|w;\theta) }$</pre>


The TAI knowledge-based "skip-gram" model


$\arg\max \limits_\theta \prod_{w\in Text}{ \left[ \prod_{c\in C(w)} \left[ \prod_{k\in K(c,w)} p(k|w;\theta) \right] \right] }$


$H_0: \mu_{A} = \mu_{B}$

\$H\_0: \mu\_{A} = \mu\_{B}\$


$f(a) = \frac{1}{2\pi i} \oint_\gamma \frac{f(z)}{z-a} \, dz$

\$f(a) = \\frac{1}{2\\pi i} \\oint\_\\gamma \\frac{f(z)}{z-a} \\, dz\$


\begin{align} Q_{xu} &= r_{xu} + \gamma V_f \\ &= r_{xu} + \gamma \min_{u'} Q_{fu'} \end{align}

\\begin{align}
Q\_{xu} &= r\_{xu} + \\gamma V\_f \\\\
&= r\_{xu} + \\gamma \\min\_{u'} Q\_{fu'}
\\end{align}


$\hat{V}_x \leftarrow \min_u \left[ r_{xu} + \gamma \hat{V}_f \right]$

\$\hat{V}\_x \\leftarrow \\min\_u \\left[ r\_{xu} + \\gamma \\hat{V}\_f \\right]\$


$r_{xu} = Q_{xu} - \gamma \min_{u'} Q_{fu'}$

\$r\_{xu} = Q\_{xu} - \\gamma \\min\_{u'} Q\_{fu'}\$



$\hat{y} = \arg\max \limits_y P(y)\sum_{i = 1}^{M}{\log P(x_i|y)}$

<pre class="mathjax">$\hat{y} = \arg\max \limits_y P(y)\sum_{i = 1}^{M}{\log P(x_i|y)}$</pre>


$$x = {-b \pm \sqrt{b^2-4ac} \over 2a}.$$ \$\$x = {-b \\pm \\sqrt{b^2-4ac} \\over 2a}.\$\$

$r_{xu} = Q_{xu} - \gamma \min_{u'} Q_{fu'}$

The following expression only works in this blog, since it relies on the custom Javascript:

<pre class="mathjax">$r_{xu} = Q_{xu} - \gamma \min_{u'} Q_{fu'}$</pre>


Set expressions. Here is the inline version: ${f'_i}$ and $\{f'_{i+1}\}$
\${f'\_i}\$

and here is the display version:
$$\{f' _i\}$$ $$\\{f' _i\\}$$

##### ML Code Libraries

These are candidate code bases for adding the learning capability:

1. TensorFlow
2. Keras
3. Karpathy: Andrej Karpathy's various Javascript-based machine learning systems could be used as the basis for supporting in-browser machine learning, which is highly desirable for improving system scalability.
4. The OpenAI Gym toolkit
5. mxnet: for supporting deep learning on mobile devices. Blurb from the project literature: "Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go."
6. Brain: a JavaScript neural network library
7. Python Natural Language Toolkit. NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.
8. MORE
##### ML learning resources

Following are some introductory Machine Learning materials that might be useful to beginners in this area:

1. Word2vec: Neural Word Embeddings in Java by DeepLearning4j
2. MORE
##### Diagramming tools
1. Drawing a neural network - this uses JavaScript and CSS
2. D3.js : a JavaScript library for manipulating documents based on data. D3 helps you bring data to life using HTML, SVG, and CSS. D3’s emphasis on web standards gives you the full capabilities of modern browsers without tying yourself to a proprietary framework, combining powerful visualization components and a data-driven approach to DOM manipulation.
3. Python script for illustrating Convolutional Neural Network (ConvNet) architectures.
4. MORE
##### Resources
1. MathJax: a library for displaying math formulas, useful when discussing machine learning algorithms.
2. How to display mathematical equations in Ghost
3. Tools and examples of ML-related math formulas
4. How to insert 'LATEX' text dynamically in html
5. MORE
]]>