Holodeck - Knowledge Representation - part 2

This is part 2 of the Holodeck series, focusing on issues related to knowledge representation (KR). It follows up on the first post, Crowd-driven Holodeck, which presented a skeletal design for a HAI Holodeck.

A quick recap

To put this post in context (more can be found in the first post):

This series discusses a HAI Holodeck, which is a highly watered-down version of the Holodeck found in the TV series Star Trek - The Next Generation. Here we ignore issues related to realizing goggle-less VR (virtual reality) and any issues on the hardware side.

We settled on a knowledge-driven approach, using machine learning (ML) for knowledge acquisition, and a user interface suitable for drawing assistance from the crowd.

And just in case you haven't figured it out yet, this blog is not a rigorous research paper. It is more like my personal musing on a very serious and difficult topic, with the intention of coming up with some workable directions in the future.

And yes, you do need a solid background in Artificial Intelligence in order to understand what's written in this post.

KR issues

Here we will tackle the following questions:

  1. The kind of knowledge needed for supporting the HAI Holodeck
  2. The mechanism with which such knowledge is represented
  3. The mechanism for populating such a knowledge base, either unsupervised or supervised.

The knowledge needed

First we look at the various types of knowledge that are needed in order to support the HAI Holodeck.

  1. Structural and attribute information about a given 3D object, including its sub-components. For example, we would want to know that apples have skin (epidermis), flesh (receptacle tissue), core (pericarp), and seeds, arranged in an onion-like structure.
  2. Probability distributions for object attributes, structural relationships, and label-object relationships (i.e., categorizations). For example, we would want to know that most apples have red skin and white flesh.
  3. User intention. For example, we need to know that if a user requests a blue apple, then the blue attribute most likely is referring to the skin of an apple.
  4. Spatial reasoning. HAI should be aware that normally only the skin of an apple is visible, and should be capable of computing the dimensions and spacing needed for building a staircase, etc. MORE

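To make the list above a bit more concrete, here is a minimal sketch of how structure, attribute distributions, visibility, and a guess at user intention could sit together in one toy data structure. The `Part` class, the helper names, and all the probabilities are hypothetical and made up for illustration; this is a plain symbolic stand-in, not the ANN-based scheme discussed later.

```python
from dataclasses import dataclass, field


@dataclass
class Part:
    name: str
    visible: bool = False  # normally visible from the outside?
    attributes: dict = field(default_factory=dict)  # attribute -> {value: probability}
    children: list = field(default_factory=list)    # nested sub-components


def walk(part):
    """Yield the part and all of its sub-components, outermost first."""
    yield part
    for child in part.children:
        yield from walk(child)


def resolve_target(obj, attribute):
    """Guess which part a request such as 'blue apple' refers to:
    the outermost visible part that carries the attribute."""
    for part in walk(obj):
        if part.visible and attribute in part.attributes:
            return part
    return None


def most_likely(part, attribute):
    """Return the most probable value of an attribute for a part."""
    distribution = part.attributes[attribute]
    return max(distribution, key=distribution.get)


# Illustrative numbers only, not measured data.
seed = Part("seed", attributes={"color": {"brown": 0.9, "black": 0.1}})
core = Part("core", children=[seed])
flesh = Part("flesh", attributes={"color": {"white": 0.8, "yellow": 0.2}},
             children=[core])
skin = Part("skin", visible=True,
            attributes={"color": {"red": 0.6, "green": 0.3, "yellow": 0.1}},
            children=[flesh])
apple = Part("apple", visible=True, children=[skin])

target = resolve_target(apple, "color")      # which part does "blue" apply to?
default_color = most_likely(target, "color")  # what we would assume without a request
target.attributes["color"] = {"blue": 1.0}   # the user's request overrides the prior
```

In this toy version the request for a blue apple lands on the skin simply because the skin is the outermost visible part with a color, which is only a crude proxy for real user intention.
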
Knowledge Representation

So how do we represent an apple? What does it take so that when a user requests a blue apple the size of a watermelon, HAI knows what to do?

It is an often overlooked issue, but it should be made abundantly clear here:
KR and ML should be tightly coupled. That is, whatever representation scheme (KR) we select here must be conducive to machine learning (ML).

In recent years, artificial neural networks (ANN) have made great advances and are arguably the leading ML technology today.
As of this writing, ANN looks like our best choice for knowledge acquisition.

The choice of ANN immediately tilts the balance against many other KR schemes [1], such as First Order Logic (FOL) or Semantic Networks (SN), which cannot be easily coupled with ANN. My personal belief is that we need a KR scheme that is based on ANN, and hopefully this new ANN-KR scheme can be shown to have representational power equal to or greater than that of FOL or SN.

Furthermore, since we are dealing with 3D objects here, this ANN-KR scheme must also account for the representation and recognition of 3D objects, and not just for abstract knowledge. It is my belief that by joining abstract and physical knowledge, and integrating ML capability, we will have something very powerful.

The question of course is how to achieve this grand unification of the following goals:

  • Goal#1: Representation of physical knowledge, i.e., 3D objects
  • Goal#2: Representation of abstract knowledge
  • Goal#3: Machine learning for the acquisition of the above knowledge
  • Goal#4: (Later) Inference
  • Goal#5: Achieve all of the above using ANN

On a side note, I also want this scheme to work equally well for the TAI project, but that's a separate topic.


CNN - to start with

Since we are dealing with visible objects in this HAI Holodeck project, we will start with the standard convolutional neural networks (CNN). The diagram shown above is such an example.

Let's assume that we have done our share of pre-training, so that the lower levels of the CNN contain features that are useful for describing the objects in our target domain (say, fruits, chairs, etc.).

The paper by Dosovitskiy et al., Learning to Generate Chairs with Convolutional Neural Networks [2], gives us an excellent starting point: a generative CNN can learn from examples, find a meaningful representation of a 3D chair model, and then generate chairs in new styles given type, viewpoint, and color.

So let's look at it further from the perspective of KR.

Quoted from the paper:

The last two rows show results of activating neurons of FC-3 and FC-4 feature maps. These feature maps contain joint class-viewpoint-transformation representations, hence the viewpoint is not fixed anymore. 

Here the interesting part is that the FC-3 and FC-4 feature maps contain joint class-viewpoint-transformation representations, so this is a good hint that if we train a CNN on joint information from multiple sources, then we might be able to treat some layers of the CNN as a meaningful representation.

It is worth noting that we listed five separate goals earlier, but the Dosovitskiy paper points to a situation where views and class labels have become co-mingled.
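To illustrate what a "joint" representation means here, the following is a toy sketch in plain Python. It is not the architecture from the Dosovitskiy paper: the layer sizes, the random untrained weights, and the names `joint_code`, `W_class`, `W_view`, and `W_joint` are all my own assumptions. It only shows the structural idea that class and viewpoint are encoded separately and then mixed by a fully connected layer, after which no single unit belongs to the class or to the viewpoint alone.

```python
import math
import random


def one_hot(index, size):
    """Encode a class index as a one-hot vector."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def dense(inputs, weights):
    """Fully connected layer with tanh activation; weights is a list of rows."""
    return [math.tanh(sum(w * x for w, x in zip(row, inputs))) for row in weights]


def random_weights(rng, n_out, n_in):
    """Untrained placeholder weights, drawn uniformly from [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]


rng = random.Random(0)
NUM_CLASSES = 3                                # hypothetical number of chair types
W_class = random_weights(rng, 4, NUM_CLASSES)  # class branch
W_view = random_weights(rng, 4, 2)             # viewpoint branch (sin, cos of azimuth)
W_joint = random_weights(rng, 6, 8)            # layer that mixes both branches


def joint_code(class_index, azimuth_deg):
    """Return a code in which class and viewpoint information are mixed."""
    class_code = dense(one_hot(class_index, NUM_CLASSES), W_class)
    angle = math.radians(azimuth_deg)
    view_code = dense([math.sin(angle), math.cos(angle)], W_view)
    return dense(class_code + view_code, W_joint)


same_chair_two_views = (joint_code(0, 30), joint_code(0, 90))
two_chairs_same_view = (joint_code(0, 30), joint_code(1, 30))
```

Changing either the class or the viewpoint moves the whole joint code, which is the co-mingling noted above. In a trained network the hope is that this mixed code is still structured enough to be treated as a representation.
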

Representing object composition

Given the ANN approach that we are pursuing, how should we represent the composition of an object, such as the fact that chairs typically (but not always) have four legs, a back, a seat, etc.?

Imagine that a user asks for a chair \(C_1\) proposed by HAI to be modified, and does so by referring to its parts. For example:

User: I'd like the chair legs to be in the French Baroque style, but keep the current color and texture.

In this case HAI must perform the following:

  1. Understand which parts of the chair \(C_1\) are legs
  2. Understand what a French Baroque style chair \(C_f\) typically looks like.
  3. Extract the legs of the chair \(C_f\), and transfer their shapes to the legs of \(C_1\)
  4. Ensure that the change is merged with the body of \(C_1\) correctly.
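As a point of reference, here is what steps 1 to 3 look like if we pretend the part decomposition is already available as an explicit symbolic structure. Everything in this sketch is hypothetical: the `PartAppearance` class, the `transfer_part_shape` helper, and the placeholder shape names. Step 4 (merging the new legs with the body) is not covered at all, and the real question in this section is how an ANN could hold equivalent knowledge without such a hand-built table.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PartAppearance:
    shape: str
    color: str
    texture: str


def transfer_part_shape(target, source, part_name):
    """Return a copy of `target` whose part takes its shape from `source`
    while keeping the target's own color and texture."""
    if part_name not in target or part_name not in source:
        raise KeyError(part_name)
    updated = dict(target)
    updated[part_name] = replace(target[part_name], shape=source[part_name].shape)
    return updated


# C_1: the chair currently proposed by HAI (placeholder values).
c1 = {
    "legs": PartAppearance("plain_leg_shape", "walnut", "matte"),
    "seat": PartAppearance("flat_seat_shape", "walnut", "matte"),
}
# C_f: a typical French Baroque chair (placeholder values, not real style data).
cf = {
    "legs": PartAppearance("baroque_leg_shape", "gilt", "carved"),
    "seat": PartAppearance("baroque_seat_shape", "gilt", "upholstered"),
}

c1_modified = transfer_part_shape(c1, cf, "legs")
```

The legs of `c1_modified` carry the shape taken from `cf` but keep the walnut color and matte texture, and the original `c1` is left untouched so that HAI can offer an undo.
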

So how is such knowledge represented?

1. (2012) Zhang, et al, A Simple Approach to Describing Spatial Relations in Observer Reference Framework

Spatial reasoning

How do we represent spatial knowledge, for example:

  1. A chair will not fit inside an egg.
  2. Given a wooden cubical box where each side is one meter long, if the planks used to make the box are 0.2 meter thick, then the inside of the box is a cube with each side 0.6 meter long (one plank thickness is lost on each of the two opposite walls).
  3. ...
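The two examples above are easy to state as explicit arithmetic, which is a useful yardstick for whatever an ANN eventually learns. The sketch below is plain Python with hypothetical helper names (`inner_side`, `fits_inside`), and the chair and egg dimensions are rough illustrative guesses. The fit check is deliberately crude: it only compares axis-aligned bounding boxes and ignores diagonal placements and actual shapes.

```python
def inner_side(outer_side, wall_thickness):
    """Inner side length of a cubical box: one wall thickness is lost on each
    of the two opposite walls."""
    inner = outer_side - 2 * wall_thickness
    if inner <= 0:
        raise ValueError("walls are too thick to leave any interior space")
    return inner


def fits_inside(object_dims, cavity_dims):
    """Crude check on axis-aligned bounding boxes: after sorting, every
    dimension of the object must not exceed the matching cavity dimension."""
    return all(o <= c for o, c in zip(sorted(object_dims), sorted(cavity_dims)))


box_inner = inner_side(1.0, 0.2)           # the wooden box example, in meters

chair = (0.5, 0.5, 0.9)                    # rough bounding box of a chair, in meters
egg = (0.04, 0.04, 0.06)                   # rough bounding box of an egg, in meters
chair_in_egg = fits_inside(chair, egg)
egg_in_box = fits_inside(egg, (box_inner,) * 3)
```
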

Having the capability to reason about spatial constraints is of utmost importance to HAI, since only with this capability can it understand what is likely, what is difficult, and what is impossible, and make reasonable estimates.

It would appear that so far there has been little research on using ANN for spatial reasoning, so we know little about how to represent spatial knowledge in an ANN.


Abstract Knowledge

So what about the representation of abstract knowledge, such as chairs are furniture, or furniture is usually found indoors, etc.?

We take the position that such abstract knowledge must be grounded on top of physical knowledge. Here we are not trying to start a philosophical debate, but rather this is more of a simplifying engineering decision, just so that this project becomes more feasible. Another way to look at it is that we are limiting the type of abstract knowledge that we deal with to only those that are directly or indirectly related to physical knowledge.

More specifically, we deal with the following types of abstract knowledge (for now):

  1. Labels attached by a trainer to an object, or part of an object. These could be categorization labels, or ....
  2. Anonymous labels attached by HAI to an object, or part of an object, during the process of unsupervised learning.
  3. Bayesian rules acquired in the form of posterior probability between labels and observed physical attributes.
  4. MORE?
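For item 3, the simplest possible version of a "Bayesian rule" is a posterior estimated from counts of co-occurring labels and attributes. The sketch below applies Bayes' rule to a small made-up list of observations; the labels, the counts, and the `posterior` helper are hypothetical, and a real system would presumably learn such distributions inside the network instead of counting.

```python
from collections import Counter


def posterior(observations, label, attribute):
    """Estimate P(label | attribute) with Bayes' rule from (label, attribute) pairs."""
    total = len(observations)
    label_counts = Counter(lbl for lbl, _ in observations)
    attribute_counts = Counter(attr for _, attr in observations)
    joint_counts = Counter(observations)
    if label_counts[label] == 0 or attribute_counts[attribute] == 0:
        return 0.0
    prior = label_counts[label] / total                                # P(label)
    likelihood = joint_counts[(label, attribute)] / label_counts[label]  # P(attribute | label)
    evidence = attribute_counts[attribute] / total                     # P(attribute)
    return likelihood * prior / evidence


# Made-up observations: (label attached by a trainer, observed color).
observations = (
    [("quaker_chair", "brown")] * 9
    + [("quaker_chair", "white")] * 1
    + [("office_chair", "brown")] * 3
    + [("office_chair", "black")] * 7
)

p_quaker_given_brown = posterior(observations, "quaker_chair", "brown")
```

With these counts, 9 of the 12 brown chairs are Quaker-style, so the estimate works out to 0.75.
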


Representing user intention

How do we represent a user's intention (while ignoring the NLP aspect for now)? It is helpful to see the problem of fulfilling a user's request as a goal-oriented task, where HAI must find ways to achieve the goal, possibly including breaking down a goal into multiple sub-goals. In light of this, a user's intention is then a goal state to be fulfilled by HAI.

Solving a goal-oriented task requires a set of rules, as well as a mechanism to back-chain over those rules. Here we do not mean that we want to bring in a goal-oriented system such as Prolog, but rather that we want to find a way to achieve similar goal-oriented behavior using ANN.
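For readers who want a reminder of what back-chaining looks like in its classic symbolic form, here is a minimal sketch. It is exactly the kind of explicit rule system that we are not proposing to use, and it is shown only to pin down the behavior an ANN would need to imitate. The rule names, the facts, and the `prove` helper are hypothetical, and each goal has at most one rule to keep the example short.

```python
# Each goal maps to the sub-goals that must all hold for it to be achieved.
RULES = {
    "blue_apple_rendered": ["apple_model_selected", "skin_recolored_blue"],
    "skin_recolored_blue": ["skin_identified"],
}
# Things HAI can already do or already knows.
FACTS = {"apple_model_selected", "skin_identified"}


def prove(goal, rules, facts, depth=0, max_depth=20):
    """Back-chain from `goal` to known facts.

    Returns the ordered list of steps that achieves the goal,
    or None if the goal cannot be reached."""
    if goal in facts:
        return [goal]
    if goal not in rules or depth >= max_depth:
        return None
    plan = []
    for sub_goal in rules[goal]:
        sub_plan = prove(sub_goal, rules, facts, depth + 1, max_depth)
        if sub_plan is None:
            return None
        plan.extend(step for step in sub_plan if step not in plan)
    plan.append(goal)
    return plan


plan = prove("blue_apple_rendered", RULES, FACTS)
```

The resulting plan lists the known facts first and the user's goal last, which is the sub-goal decomposition described above.
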

Following are some thoughts regarding how to achieve goal-oriented behavior in the context of ANN:

  1. Use some form of ANN to acquire Bayesian distributions among many abstract labels and observed visual facts. These are essentially our rules. For example, we may have a learned rule indicating that Quaker-style chairs have a 90% chance of being brown, even if brownness is not useful for any categorization task.
  2. We define a new type of top-down mechanism in a trained ANN. Note that this is unrelated to the back-propagation mechanism used during training. MORE COMING

How do we achieve top-down behavior in ANN, and what does it mean? Cao et al. use a top-down mechanism [3] to infer the status of hidden neuron activations as a way to control attention. This is in effect a kind of goal-oriented behavior.


Representing 3D objects

What kind of "3D model" are we talking about here? Here it is helpful to distinguish two different ways to represent a 3D object.

  1. Working 3D model: this is what is used while the system is still interacting with the user and trying to get something built. Here we need a more abstract 3D object model that is suitable for learning, composition, decomposition, piecemeal transformation, showing relationships, etc. I'd argue that for this purpose it is advantageous to emulate the human brain to some extent, and to represent a 3D object as a series of salient images in some way.
    Reference: How objects are represented in human brain? Structural description models versus Image-based models
    What are the benefits of image-based models? Why not just deal with traditional 3D models throughout? I'd argue that:

    1. The repertoire of useful 3D models is poor, not well indexed, lacks visual details, and lacks contextual information.
    2. Today's image search engines are getting smarter every day. By relying more on images, we get to piggyback on such improvements.
    3. Image search engines give us important clues about the relationship between query text and the resulting images.
    4. We might gain some advantage by emulating how the human brain remembers 3D objects (see below).
  2. Run-time 3D model: this is the 3D object representation used when we are trying to render the object into something that a user can see. This could be one of the popular 3D formats, such as 3ds, U3D, etc.

Question: which run-time 3D model best suits our purpose?

Question: is there any good argument FOR using the Run-time Format throughout, without a separate Working Format?

Unless mentioned otherwise, it is assumed that we are always referring to the Working 3D model in this discussion.

Unsupervised learning of 3D models

How do we achieve unsupervised learning of 3D models? This is not so much for learning categorization (which is likely to be supervised), but for the capability of representing 3D objects in HAI's memory, and being able to track and recognize objects even without categorization. Think of it as a kind of pre-training for object recognition and memorization.

Given that we want to go with the image-based approach for 3D objects, I would argue that during the initial training phase for acquiring background knowledge, it is beneficial to use videos, and not a set of discrete images, as the training samples. This is because:

  1. The time indices in a video training sample contain explicit information about object persistence. For example, if a group of visual features \(\{F_i\}\) observed at time \(t_1\) is sufficiently similar to the group \(\{F'_i\}\) observed in the next video frame, then HAI can reasonably assume that the two groups are different views of the same object.
  2. The object persistence assumption above also gives us a way to correlate different views of the same object, and register them together as a representation of the object.
  3. ...
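The persistence argument above can be sketched in a few lines. This is a toy version under strong assumptions: each frame is already reduced to a single feature vector for one object, similarity is plain cosine similarity, and the threshold of 0.9 is an arbitrary value I picked. The helper names `cosine_similarity` and `group_views` and the feature values are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def group_views(frame_features, threshold=0.9):
    """Split a time-ordered sequence of per-frame feature vectors into runs.

    Consecutive frames whose features are similar enough are assumed to be
    different views of the same object and are registered together."""
    tracks = []
    for features in frame_features:
        if tracks and cosine_similarity(tracks[-1][-1], features) >= threshold:
            tracks[-1].append(features)  # same object, new view
        else:
            tracks.append([features])    # similarity broke: start a new object
    return tracks


# Made-up features for five consecutive frames: three views of one object,
# followed by a cut to two views of another.
frames = [
    [1.0, 0.0, 0.2],
    [0.9, 0.1, 0.3],
    [0.8, 0.2, 0.3],
    [0.0, 1.0, 0.1],
    [0.1, 0.9, 0.0],
]
tracks = group_views(frames)
view_counts = [len(track) for track in tracks]
```

Each resulting track is a set of views that can be stored as the image-based representation of one object, without any category label being involved.
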


  1. (2016) Unsupervised Learning of Video Representations using LSTMs, Srivastava et al, University of Toronto
  2. Source code for the paper above
  3. 100 other papers that cite the above paper



In this post we have sketched out a rough skeleton for the Knowledge Representation scheme necessary for supporting the HAI Holodeck project.

We have worked out the following:

  1. Using ANN as basis for representation
  2. How to use video training samples to conduct pre-training, so that the system has a basis for performing object representation and recognition.
  3. How to map incremental textual requirements to the target 3D models, with human guidance and using some form of neural networks. We call such acquired and validated mapping knowledge.
  4. Unsupervised learning of probability distributions.

Remaining work:

  1. ...


  1. Wikipedia: Knowledge representation and reasoning

  2. (2015) Dosovitskiy, et al, Learning to Generate Chairs with Convolutional Neural Networks (https://www.robots.ox.ac.uk/~vgg/rg/papers/Dosovitskiy_Learning_to_Generate_2015_CVPR_paper.pdf)

  3. (2016) Cao et al, Look and Think Twice: Capturing Top-Down Visual Attention with Feedback Convolutional Neural Networks

  4. (2011) Yoshua Bengio, Deep Learning of Representations for Unsupervised and Transfer Learning

  5. (2015) Ruslan Salakhutdinov, Learning Deep Generative Models (http://www.cs.toronto.edu/~rsalakhu/papers/annrev.pdf)

  1. How objects are represented in human brain? Structural description models versus Image-based models
  2. (2016) Aäron van den Oord, et al., Conditional Image Generation with PixelCNN Decoders