Word2vec is an invaluable tool for finding hidden structure in a text corpus. It is essential for TAI's IPA project, but we will also need to add some refinements over the standard Word2vec in order to meet our needs.
This post is part of the TAI thread, which explores how to design and implement the terraAI (a.k.a. TAI) platform. It is also part of the IPA sub-thread, which focuses on issues related to applying the TAI platform to create an Intelligent Personal Assistant.
In this post we explore the additional refinements needed on top of the standard Word2vec to make it usable in the TAI project, which will eventually lead to detailed implementation specifications.
This post is also a call to the research community for contribution of insightful comments and development effort. Please read here for the benefits of participating in this project.
A quick recap about TAI
TAI, abbreviated from the name terraAI, is a knowledge-based, crowd-driven platform for acquiring knowledge about our world, as well as for serving certain practical purposes, through interaction with its users. This blog is a working document for sorting out the design and implementation issues for the TAI platform. More information about TAI can be found in the TAI Manifesto.
TAI's target operating environment is as follows:
- Online. TAI operates over the Internet, supports the socialization and collaboration of its participants, and acquires much of its learning material over the Internet.
- Highly-distributed machine learning. We want machine learning to occur in a highly distributed fashion for several reasons:
- Alleviate the bottleneck on a central server and make the overall system more scalable.
- Realize TAI's design goal of supporting the crowd-driven knowledge acquisition model.
- Allow TAI to be highly customizable and trainable towards each user's particular needs and habits.
- Internet-based user interfaces, mainly in the form of web browsers or mobile devices.
- It is assumed that a default Word2vec skip-gram model is provided by the system. Additional incremental training might be required to satisfy an end user's needs, which will typically occur on the client side.
About the IPA
As mentioned earlier, the IPA (i.e., the Intelligent Personal Assistant) project is an application of the TAI platform, where we seek to build a personal assistant that is capable of adapting to a user's requirements and idiosyncrasies through supervised and unsupervised learning, and of satisfying the user's needs for information processing and various tasks over the Internet. The ultimate goal of the IPA project is to use the power of the crowd to create a long-lasting knowledge base about our world, while eliminating privacy-related concerns.
Another TAI application under consideration can be seen in the Holodeck sub-thread.
About Word2vec
MORE
Using Word2vec in IPA
Why do we need Word2vec in the IPA project? For IPA, the target domain of discourse is the Internet, meaning that IPA will need to learn and perform tasks based on material available over the Internet, such as unstructured webpages, documents, a user's behavior in a web browser, etc. Word2vec (or its derivatives) is very good at capturing the syntactic and semantic relationships hidden in unstructured textual documents through unsupervised learning, and as such it is invaluable as a tool for converting unstructured documents into a more meaningful representation for further processing.
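As a concrete illustration of what that representation looks like, here is a minimal sketch of training a skip-gram model on a toy corpus and querying the learned vectors. The gensim library and the toy sentences are assumed purely for illustration; per the requirements below, the actual system would need a client-side (JavaScript) implementation.

```python
# Minimal sketch: training a skip-gram Word2vec model on a toy corpus
# and querying the learned vectors (gensim 4.x assumed).
from gensim.models import Word2Vec

# Toy corpus; in practice this would be text extracted from webpages, etc.
sentences = [
    ["the", "assistant", "opens", "the", "browser"],
    ["the", "assistant", "reads", "the", "news"],
    ["the", "user", "opens", "the", "browser"],
    ["the", "user", "reads", "the", "document"],
]

model = Word2Vec(
    sentences,
    vector_size=50,   # dimensionality of the word vectors
    window=2,         # context window size
    min_count=1,      # keep every token in this tiny corpus
    sg=1,             # 1 = skip-gram, the variant discussed in this post
    epochs=50,
)

# Words that occur in similar contexts end up with similar vectors.
print(model.wv.most_similar("assistant", topn=3))
```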
Overall requirements for IPA
Following are some refinements needed over the standard Word2vec for the IPA project:
- Need to support client-side training. This is per TAI's distributed design approach. This means that we need it to run in a typical web browser (i.e., written in JavaScript) or on mobile devices (i.e., Java for Android or Objective-C for iOS devices). For simplicity's sake, let's aim for JavaScript as a start.
- Automatic tokenization. With the standard Word2vec it is assumed that some tokenization will occur prior to the actual model training. For example, during this process the text string 'New York' gets converted to one token and is treated as an atomic element from then on. In other words, even though the string New York technically consists of two words in English, we must treat it as one word in the context of Word2vec.
This requirement for tokenization prior to model training is in fact a severe impediment when attempting to use Word2vec in certain real-world applications. For example, if the input comes from news feeds, then we are going to encounter new words all the time, such as the name of someone who has just become famous (say, Jeremy Lin), and the system wouldn't know how to deal with them properly until such names are tokenized and the model is retrained, which is not a quick process. For TAI it actually gets worse, since it also needs to deal with HTML code. As such, we need tokenization to occur automatically and efficiently, which the standard Word2vec does not support (see the tokenization sketch after this list).
- Fragmental model: we need the word vector model to be stored in a form that allows a client to quickly download only what it needs (see the sharding sketch after this list). This is needed considering that a model trained on GoogleNews has a size of 1.5 GB compressed, which makes it entirely unusable on an Internet-based client device.
If this seems to be an unusual requirement, there are actually precedents for it in the video space. For example, in the early days a video was encoded as a single large file using a certain video codec. This obviously does not work too well for live video streaming, especially if we want to allow a user to skip around in the video, replay part of it, or play it in fast forward or slow motion. As a result, Dynamic Adaptive Streaming over HTTP (DASH) was invented, which essentially breaks a large video into small HTTP-based file fragments, and a video client (that is, a video player) fetches only what it needs at the moment. Conceptually speaking, the requirements for the fragmental Word2vec model are entirely similar, driven by the same desire to reach more lightweight clients and to be more responsive to client requests.
- Support token layers. The tokenization of multi-word phrases sometimes also results in the loss of information that is important for performing semantic analysis (which we need in the TAI project). For example, if the name Jeremy Lin is tokenized into a single token, then we lose the fact that this person's last name is Lin, which might be important in a certain context. As such, it is desirable to have a phrase tokenized in multiple ways (the tokenization sketch after this list illustrates this as well).
- Incremental model training. The standard Word2vec more or less assumes a batch mode of operation: the system performs tokenization, then performs learning over the input text corpus to produce a word vector model, then uses the model for a certain task. If there is a new batch of text material, the whole process is essentially repeated, which takes quite some time. As such it is not suitable for the more dynamic type of environment that we need to deal with in the TAI system.
What we need here is a way to allow Word2vec to accept new training material and perform learning over it dynamically and efficiently (see the incremental-training sketch after this list).
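To make the automatic tokenization and token-layer requirements more concrete, here is a sketch using statistical phrase detection as found in gensim's Phrases model. It is illustrative only (the production implementation would be client-side JavaScript), and the helper token_layers is hypothetical; the point is that frequently co-occurring words are merged into phrase tokens while the constituent words are also kept, which is one possible way to realize the token-layer idea.

```python
# Illustrative sketch only: statistical phrase detection (gensim 4.x assumed),
# keeping both token layers (merged phrase tokens and their constituent words).
from gensim.models.phrases import Phrases

sentences = [
    ["jeremy", "lin", "scored", "again"],
    ["jeremy", "lin", "joined", "the", "team"],
    ["the", "team", "signed", "jeremy", "lin"],
]

# Learn which word pairs co-occur often enough to be treated as one token.
phrases = Phrases(sentences, min_count=1, threshold=0.1)

def token_layers(sentence):
    """Hypothetical helper: original tokens plus detected phrase tokens."""
    merged = phrases[sentence]   # e.g. ['jeremy_lin', 'scored', 'again']
    return sentence + [t for t in merged if "_" in t]

for s in sentences:
    print(token_layers(s))
```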
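For the fragmental-model requirement, one possible direction (by analogy with DASH above) is to shard the vector table, e.g. by a hash of the word, so that a client can compute locally which shards it needs and fetch only those. The layout below is purely hypothetical and only illustrates the client-side lookup, not an actual storage or wire format.

```python
# Hypothetical sketch of a fragmented (sharded) word-vector store:
# vectors are grouped into small downloadable shards by a hash of the word,
# and a client fetches only the shards covering the words it actually needs.
import hashlib
from typing import Dict, List

NUM_SHARDS = 1024  # assumed shard count, chosen arbitrarily for illustration

def shard_id(word: str) -> int:
    """Map a word to its shard; the client can compute this locally."""
    return int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % NUM_SHARDS

def shards_needed(words: List[str]) -> Dict[int, List[str]]:
    """Group the requested words by the shard that would store their vectors."""
    needed: Dict[int, List[str]] = {}
    for w in words:
        needed.setdefault(shard_id(w), []).append(w)
    return needed

# A client processing a short document touches only a handful of shards,
# instead of downloading the full multi-gigabyte model.
print(shards_needed(["new_york", "jeremy_lin", "browser"]))
```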
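For the incremental-training requirement, gensim's Word2Vec already supports vocabulary expansion followed by further training on new material, which gives a feel for the desired behavior. The sketch below (gensim 4.x assumed) is again only illustrative; whether this maps cleanly onto a browser-based JavaScript implementation is exactly the open question addressed in the next section.

```python
# Illustrative sketch: growing an existing skip-gram model with new
# material, without retraining from scratch (gensim 4.x assumed).
from gensim.models import Word2Vec

base_sentences = [
    ["the", "user", "opens", "the", "browser"],
    ["the", "assistant", "reads", "the", "news"],
]
model = Word2Vec(base_sentences, vector_size=50, window=2, min_count=1, sg=1)

# New material arrives later, e.g. from a fresh news feed, possibly
# containing previously unseen tokens such as 'jeremy_lin'.
new_sentences = [
    ["the", "news", "mentions", "jeremy_lin"],
    ["jeremy_lin", "opens", "the", "season"],
]

# Grow the vocabulary with the new tokens, then continue training.
model.build_vocab(new_sentences, update=True)
model.train(new_sentences, total_examples=len(new_sentences), epochs=model.epochs)

print("jeremy_lin" in model.wv.key_to_index)  # True after the update
```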
How to meet the requirements
So how do we propose to meet the requirements listed above? Following are some ideas. If you think you have a better idea, by all means please voice your input in the comments section below.
- MORE
Going beyond Word2vec
Word2vec has demonstrated that the word embedding approach is able to capture multiple different degrees of similarity between words. Mikolov et al. have also found that semantic and syntactic patterns can be reproduced using vector arithmetic. So how can we build on top of it to achieve what we are aiming for in the TAI/IPA project?
As a reference, Word2vec's skip-gram model aims to maximize the corpus probability as follows:
\[\arg\max \limits_\theta \prod_{w\in Text}{\left[ \prod_{c\in C(w)} p(c|w;\theta) \right]}\]
given a corpus \(Text\) of words \(w\) and their contexts \(c\), with \(C(w)\) the set of contexts of word \(w\), and \(\theta\) the model's parameters.
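In the standard skip-gram parameterization, each conditional probability is a softmax over inner products of word and context vectors:

\[ p(c|w;\theta) = \frac{e^{v_c \cdot v_w}}{\sum_{c'\in C} e^{v_{c'} \cdot v_w}} \]

where \(v_w\) and \(v_c\) are the vector representations of the word and the context respectively, and \(C\) is the set of all available contexts. In practice this softmax is approximated, e.g. via hierarchical softmax or negative sampling, for efficiency.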
For TAI/IPA we assume that there is a knowledge base \(K\) which is built up using various means, such as unsupervised learning, supervised learning, or manual entry. This knowledge base \(K\) is to be used for assisting in semantic analysis and for carrying out the requested tasks, while allowing \(K\) itself to be accumulated and updated through machine learning methods. Mathematically this can be described as follows:
\[\arg\max \limits_\theta \prod_{w\in Text}{ \left[ \prod_{c\in C(w)} \left[ \prod_{k\in K(c,w)} p(k|c,w;\theta) \right] \right] } \]
where, given \(c\) and \(w\), we wish to find the \(\theta\) that derives the optimal explanation (i.e., the \(k\)) of the given words.
Further investigations on this topic are discussed in a separate post (upcoming).
Looking ahead
It would be fascinating to extend this approach into the multi-modal space, so that it is not just about text, but also brings images, videos, goals, and agent intentions into the picture. This aspect is of particular interest to another sub-thread of this blog, How to build a Holodeck, which we will explore separately.
References
- The TAI discussion thread in this blog.
- The TAI Manifesto in this blog.
- Wikipedia: Word2vec
- Efficient Estimation of Word Representations in Vector Space, the original 2013 Word2vec paper by Mikolov et al.
- fastText is an open-source library for efficient learning of word representations and sentence classification. It was created by Mikolov et al at Facebook.
- Bag of Tricks for Efficient Text Classification. This paper proposes a simple and efficient approach for text classification and representation learning.
- Online Word2vec playground: Word2vec Word Vectors in JavaScript
- MORE