The terraAI Manifesto

Keywords: artificial intelligence, machine learning, deep learning, loosely-coupled large-scale distributed machine learning, knowledge-based, crowd social, knowledge agent discourse, intelligent personal assistant, Holodeck.

This is my first post regarding the terraAI thread.


As Artificial Intelligence technologies mature and their practical use becomes prevalent (and sometimes invasive) in our lives, their dark sides also begin to emerge. One such dark side is the accumulation of public and personal knowledge as the private property of commercial companies, which may have adverse effects on all of us. The open-source terra•AI project (or TAI for short) is meant to rectify this. More on this below.

Another motivation for the TAI project is to create a highly accessible and crowd-driven Knowledge Agent Discourse platform that is available online to all. It is called a platform because it can be used to create many different types of practical applications. It is called crowd-driven because we aim to use extensive machine learning and crowd socialization techniques to allow the general public to participate in training the system. It is a Knowledge Agent Discourse system in the sense that the system acts as an intelligent agent that interacts with the user under a knowledge-driven approach in order to fulfill the user's requests.

The overall goal of this project is to facilitate the accumulation of a long-lasting knowledge base for the benefit of all, as well as to use it to enable the creation of practical and easy-to-use intelligent applications.

The name of the project terra•AI is meant to signify that it is a very large scale open-source crowd-driven AI platform, intended for us all.

Target audience

This blog is meant to be a working document where ideas are presented and discussed (or rebuffed) in order to work out a feasible technical design, with the aim of achieving actual implementation and deployment in the near future. In other words, it will be constantly revised as needed.

I strive to write this blog in plain English wherever possible, so that anybody who finds the topic of interest can follow. I will mark such posts with the tag 'intro'. However, as the project progresses some posts will begin to go into heavy technical details, which only software or AI professionals can follow. I will mark those posts with the tag 'tech'.

What is TAI?

The TAI platform can be described as follows:

  1. TAI is an online community and collaborative environment for AI researchers and tool developers.
  2. TAI is an online development platform for AI researchers, as well as developers of intelligent applications.
  3. TAI is open source, so anyone can fork the code for their own purposes if they wish.
  4. TAI offers loosely-coupled large-scale distributed deep learning engines, inference engines (among other things), and crowd-sourcing capability to AI researchers who might need them.
  5. TAI is crowd-driven, so the general public are able to help with training this system and share the knowledge with all. As such it is very different in spirit from systems such as TensorFlow, which is intended for use on the server side strictly by AI experts.
  6. TAI is meant to be an open knowledge repository (in the spirit of Wikipedia but with an AI flavor), including tools such as various machine learning modules and an online knowledge spider for building up a knowledge base for the target domain. Such a knowledge base can be used to develop many types of intelligent applications.
  7. TAI includes a knowledge spider capability, which allows it to crawl the Internet for available material for use in either supervised or unsupervised learning.
  8. It is not just for developers or AI researchers, but it also offers useful online intelligent applications to the general public.

Sample application domains

A pivotal piece of the TAI system is its distributed machine learning (ML) capability, which is central to TAI's knowledge acquisition process. Since the usefulness of an ML system is largely dependent on the quality of the provided training samples, it is prudent to select application domains with the following characteristics:

  1. Training samples are abundant, and are of relatively high quality.
  2. It is relatively easy to get crowd participation in online supervised learning.

Currently two target application domains have been identified:

  1. For creating an intelligent personal assistant, using the Knowledge Spider capability. See the Design Overview post for further details.
  2. For building a Holodeck. This is a somewhat speculative topic, explored in a separate Holodeck thread.

About the "knowledge" thingy

The term "knowledge-driven" has been mentioned loosely above as a key feature of the TAI system. What exactly does the term knowledge mean in the context of TAI here?

More specifically:

  1. We use the term knowledge to indicate codified structural information that can be acquired, modified, and persisted through machine learning methods, either supervised or unsupervised.
  2. New knowledge can be inferred from existing knowledge through inference rules.
  3. Knowledge is modular, in the sense that it is not monolithic and can be applied in new circumstances. For example, we are not interested in an ANN trained to discriminate among five classes of objects with no way to apply it in another setting with a mixture of some new and known classes.
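The three points above can be sketched in code. The following is a minimal illustration, not any actual TAI component: knowledge is stored as modular subject-predicate-object triples, and new facts are inferred from existing ones through a simple transitive-closure rule. All names here (`KnowledgeBase`, `add_fact`, `infer_transitive`) are hypothetical.

```python
# Sketch: knowledge as modular triples, plus a simple inference rule.
class KnowledgeBase:
    def __init__(self):
        self.facts = set()  # each fact is a (subject, predicate, object) triple

    def add_fact(self, subj, pred, obj):
        self.facts.add((subj, pred, obj))

    def infer_transitive(self, pred):
        """Apply a transitive rule: (a, p, b) and (b, p, c) imply (a, p, c)."""
        added = True
        while added:
            added = False
            for (a, p1, b) in list(self.facts):
                for (b2, p2, c) in list(self.facts):
                    if p1 == pred and p2 == pred and b == b2:
                        if (a, pred, c) not in self.facts:
                            self.facts.add((a, pred, c))
                            added = True

kb = KnowledgeBase()
kb.add_fact("cat", "is_a", "mammal")
kb.add_fact("mammal", "is_a", "animal")
kb.infer_transitive("is_a")
print(("cat", "is_a", "animal") in kb.facts)  # True: new knowledge was inferred
```

Because the facts are individual triples rather than a monolithic trained model, knowledge acquired in one setting can be merged with facts from another setting and reused.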

Prior art and lessons

Today we have Wikipedia, where numerous experts create and maintain its online content. It is instantly accessible to anyone for free, and comes in many languages. It is currently the largest text-based corpus of public knowledge on the planet. It is much better than any proprietary encyclopedia created by a select group of experts working for a for-profit encyclopedia company.

What this teaches us is that when given a suitable platform and incentive, a crowd-driven approach can work much better than a for-profit approach.

Then there are all those online knowledge bases that we don't see. Let's call these the Dark KBs. Such dark KBs are produced by large internet companies from information collected through emails, search engines, and your every click online. The dark KBs are getting more sophisticated every day, they know about you in a very pervasive and invasive way, and they are the private property of their respective companies, thus out of the control of the general public. As mentioned earlier, this is one of the motivations for starting the TAI project, since we want to wrest control of such KBs out of the hands of commercial companies.

Design for feasibility

So it sounds like TAI is going to be a huge project. How can TAI be feasible? Unlike most AI researchers who are laser-focused on solving very specific problems, here we tackle the project from the perspective of large-scale software engineering.

More specifically:

  1. We take the top-down design approach, which is commonly used when building large-scale commercial software. Here we start by working out a skeletal architecture with its requisite components, as well as the requirements for each of its components, even if we don't have all the details about how to implement specific components (such as a machine learning engine, or an inference engine, etc.). This has the following benefits:
    1. The system is more robust in the long run, because components are replaceable so long as they meet the specifications, so work can be more easily distributed to other researchers, and the system can be more easily upgraded when better components become available.
    2. From the perspective of a large-scale open-source project, this also makes it easier to get other contributors involved to work on a specific component with well-defined scope.
    3. From the perspective of coalescing the efforts of the research community, TAI provides a framework for collaboration, for finding interesting problems to solve (e.g., see the enhancements that TAI needs over the standard Word2vec), for being able to see immediately how a small solution works (e.g., a better machine learning algorithm) by plugging it into a large framework, all the while staying focused on a larger common goal (e.g., see an example How to Build a Holodeck).
  2. Online development. Typical open-source projects are shared with the community by publishing the source code. Other researchers are expected to download and install the source code for their own purposes. This gets more burdensome as the scale of a project becomes larger, because: 1. an adopter has to deal with the entire system even if he/she only wants to work on a specific component of it; 2. the constantly evolving knowledge base needed to get TAI going is likely to be quite large and hard to manage; and 3. collaboration among researchers gets harder for larger projects, since combining research results is accomplished through incessant code branching.

    We tackle such problems by supporting the live layering capability. That is, TAI allows a contributing developer the option of overlaying his/her work over the live central TAI system, which eliminates all of the problems mentioned above.

  3. Support large-scale social collaboration for the knowledge base. Implementing good algorithms for machine learning or inference is one thing; accumulating the large and elaborately detailed knowledge base needed to drive a TAI application is an entirely different matter. Here we tackle the problem by allowing the general public to participate in a controlled and simplified form of supervised learning, and use a crowd-driven approach to eliminate noisy data points.
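The live-layering idea in point 2 above can be sketched with Python's standard `ChainMap`: a contributor's overlay takes precedence over the central system's components without forking the code base. The component registry and its names here are hypothetical illustrations, not the actual TAI design.

```python
from collections import ChainMap

# The live central system's component registry (names are illustrative).
central_components = {
    "ml_engine": "central-ml-v1",
    "inference_engine": "central-infer-v1",
}

# A contributing researcher overlays only the piece they are working on.
my_overlay = {"ml_engine": "experimental-ml-v2"}

# Lookups consult the overlay first, falling back to the live central system.
layered = ChainMap(my_overlay, central_components)

print(layered["ml_engine"])         # experimental-ml-v2 (from the overlay)
print(layered["inference_engine"])  # central-infer-v1 (from the central system)
```

The contributor sees their experimental component in place, while everything they did not touch still comes from the shared live system.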
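One simple way to realize the crowd-driven noise elimination in point 3 above is majority voting: collect several labels per sample from the public and keep only samples where a clear majority agrees. The function name and threshold below are illustrative assumptions, not a committed TAI design.

```python
from collections import Counter

def aggregate_labels(crowd_labels, min_agreement=0.66):
    """crowd_labels maps sample id -> list of labels supplied by the crowd.
    Returns {sample_id: majority_label} for samples with enough agreement;
    noisy samples (no clear majority) are dropped."""
    accepted = {}
    for sample_id, labels in crowd_labels.items():
        label, count = Counter(labels).most_common(1)[0]
        if count / len(labels) >= min_agreement:
            accepted[sample_id] = label
    return accepted

votes = {
    "img1": ["cat", "cat", "cat", "dog"],  # strong agreement -> kept
    "img2": ["cat", "dog", "bird"],        # too noisy -> dropped
}
print(aggregate_labels(votes))  # {'img1': 'cat'}
```

Real deployments would likely weight voters by track record, but even this simple scheme filters out most one-off labeling errors.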

It so happens that my entire professional career has been in the areas of Artificial Intelligence, large-scale enterprise software, and crowd socialization platforms, which gives me a head start on designing such an architecture.

And of course it will take a lot of help from the AI research and developer community as well as the general public to fill in a lot of stubs in this project, and I welcome anyone interested to join me in this endeavor!

Going forward

The ultimate goal of this blog series is to create a functional and useful platform, built and owned by all, and available online for easy access by all. In the latter part of this blog I will lay down the targeted features and design principles, all the way down to technical implementation details.

It is worth repeating here that for this TAI project we have adopted the top-down design approach, where we will first work out a high-level architecture with detailed functional requirements and interface specifications for its components. This should make it much easier for the AI research community to jump in and contribute something large or small, while still allowing the overall project to progress towards a common goal.

Future work is broken down into the following areas. These will be explored in greater detail in subsequent posts. The overall structure of this blog can be found in the Index of Contents.

So far we have described the TAI system in high-level terms, which understandably is fairly vague. If you wish to go into more technical detail, I'd recommend that you follow one of the two application threads, which go into much more depth:

  1. The IPA (intelligent personal assistant) thread (upcoming).
  2. The Holodeck thread.

Since this blog is meant to be a working document for a community-based effort, please do not hesitate to enter your comments below if you have something to say.

  1. Q: Is this project actually about creating an open-source intelligent assistant, like Siri, GoogleNow, etc.?
    A: This project is about creating a multi-purpose platform; an intelligent personal assistant is one of the many possible applications of this system, not the only one.
  2. Q: If this system is used to create an intelligent assistant, what can it do that others (like Siri, Google Now) cannot?
    A: Users will be able to run and train this system's intelligent personal assistant (IPA) to handle highly private affairs with no fear of privacy issues. This IPA can also be trained for highly personal purposes.
  3. Q: Artificial intelligence systems like this must involve many things, such as learning, inference, natural language processing, etc. Why are you highlighting the knowledge base (KB) issue above?
    A: This is because the KB is the core of such a system, and it is also the part that is most likely to be abused.
  4. Q: Doesn't open-source and crowd-driven mean about the same thing?
    A: The TAI system is both open-source and crowd-driven. Here by open-source we mean that its source code will be released to the public. By crowd-driven we mean that when the TAI platform is operational, we will solicit participation from the general public (who may not be technical) to help with improving this platform. This will include using techniques such as supervised machine learning, crowd socialization, etc.
  5. Q: What is the distributed loosely-coupled machine learning capability?
    A: Simply put, we want to allow the machine learning (ML) process to occur inside a typical web browser, so that the typically CPU-intensive ML process can be distributed among many browsers when needed. A casual user can just open a web browser to contribute CPU power to the TAI system. We call it loosely-coupled because we are not trying to break apart a complex ML algorithm for distribution to multiple computers, which might require high network bandwidth and incur high data-reconciliation costs. Rather, we aim to find suitable solutions that allow each ML client to do its own intensive computation and then perform the reconciliation at a much later stage.
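The loosely-coupled scheme described in the answer above can be sketched as follows: each client runs its own full training pass locally (as a browser could), and reconciliation happens much later by averaging the resulting model parameters. This is an illustration of the idea under simplifying assumptions (a one-parameter linear model, two clients), not the actual TAI protocol.

```python
def local_train(data, epochs=200, lr=0.01):
    """Fit y = w * x on a client's local data with plain SGD."""
    w = 0.0
    for _ in range(epochs):
        for x, y in data:
            w -= lr * 2 * (w * x - y) * x  # gradient of the squared error (w*x - y)^2
    return w

# Two clients each hold their own slice of the data (true relation: y = 3x).
client_data = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(3.0, 9.0), (4.0, 12.0)],
]

# The heavy computation happens independently on each client ...
local_models = [local_train(d) for d in client_data]

# ... and reconciliation is a cheap, late-stage averaging step.
global_w = sum(local_models) / len(local_models)
print(round(global_w, 2))  # close to 3.0
```

No network traffic is needed during training itself; only the final parameters cross the wire, which is what makes the browser-based setting plausible.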

Relationship to other projects
  1. OpenAI: As stated, OpenAI's mission is to build safe AI and ensure that AI's benefits are as widely and evenly distributed as possible. Broadly speaking its goals are to some degree in line with the goals of our effort here, although currently its effort is focused on building basic building blocks targeting coders who wish to build AI systems. In comparison, the TAI effort is closer to the side of the general public. It is quite likely that many of the code modules offered by OpenAI will be useful to this effort.
  2. Wikipedia contains a vast amount of human knowledge which is potentially useful to the TAI project, but that knowledge is not directly usable since it is written in natural languages (e.g., English, Chinese, etc.). TAI will have a semi-automatic module (with crowd validation and contribution), called the Knowledge Spider, for extracting useful knowledge out of Wikipedia.
  3. DBpedia is an effort to extract structured content from Wikipedia that can be used for driving various AI systems. The construction and usage of such structured content requires highly skilled AI practitioners, while TAI will attempt to combine Machine Learning methods with crowd contribution to achieve the same thing, but with much greater ease of use. It is likely that the knowledge in the current DBpedia can be used to bootstrap the TAI effort.
  4. Semantic web. There have been efforts to create structured content for the web (as opposed to HTML, which is mainly about how to display information, not about what the information means). We opt not to take this route because it is too labor intensive, and it has not been widely adopted. We believe it is in fact more practical to use machine learning methods with crowd assistance to extract information out of existing websites.