This article on information extraction is authored by Laura Chiticariu and Yunyao Li.

We are all hungry to extract more insight from data. Unfortunately, most of the world’s data is not stored in neat rows and columns. Much of the world’s information is hidden in plain sight in text. As humans, we can read and understand the text. The challenge is to teach machines how to understand text and further draw insights from the wealth of information present in text. This problem is known as Text Analytics.

An important component of Text Analytics is Information Extraction. Information extraction (IE) refers to the task of extracting structured information from unstructured or semi-structured machine-readable documents. It has been a well-known task in the Natural Language Processing (NLP) community for a few decades.

Two New Information Extraction Courses

We just released two courses on BDU that get you up and running with Information Extraction in no time.

The first one, Text Analytics – Getting Results with System T introduces the field of Information Extraction and how to use a specific system, SystemT, to solve your Information Extraction problem. At the end of this class, you will know how to write your own extractor using the SystemT visual development environment.

The second one, Advanced Text Analytics – Getting Results with System T goes into details about the SystemT optimizer and how it addresses the limitations of previous IE technologies. For a brief introduction to how SystemT will solve your Information Extraction problems, read on.

Common Applications of Information Extraction

The recent rise of Big Data analytics has led to reignited interest in IE, a foundational technology for a wide range of emerging enterprise applications. Here are a few examples.

Financial Analytics. For regulatory compliance, companies submit periodic reports about their quarterly and yearly accounting and financial metrics to regulatory authorities such as the Securities and Exchange Committee. Unfortunately, the reports are in textual format, with most of the data reported in tables with complex structures. In order to automate the task of analyzing the financial health of companies and whether they comply with regulations, Information Extraction is used extract the relevant financial metrics from the textual reports and make them available in structured form to downstream analytics.

Data-Driven Customer Relationship Management (CRM).  The ubiquity of user-created content, particularly those on social media, has opened up new possibilities for a wide range of CRM applications. IE over such content, in combination with internal enterprise data (such as product catalogs and customer call logs), enables enterprises to have a deep understanding of their customers to an extent never possible before.Besides demographic information of their individual customers, IE can extract important information from user-created content and allows enterprises to build detailed profiles for their customers, such as their opinions towards a brand/product/service, their product interests (e.g. “Buying a new car tomorrow!” indicating the intent to buy car), and their travel plans (“Looking forward to our vacation in Hawaii” implies intent to travel) among many other things.  Such comprehensive customer profiles allow the enterprise to manage customer relationship tailored to different demographics at

Besides demographic information of their individual customers, IE can extract important information from user-created content and allows enterprises to build detailed profiles for their customers, such as their opinions towards a brand/product/service, their product interests (e.g. “Buying a new car tomorrow!” indicating the intent to buy car), and their travel plans (“Looking forward to our vacation in Hawaii” implies intent to travel) among many other things. Such comprehensive customer profiles allow the enterprise to manage customer relationship tailored to different demographics at

Such comprehensive customer profiles allow the enterprise to manage customer relationship tailored to different demographics at fine granularity, and even to individual customers. For example, a credit card company can offer special incentives to customers who have indicated plans to travel abroad in the near future and encourage them to use credit cards offered by the company while overseas.

Machine Data Analytics. Modern production facilities consist of many computerized machines performing specialized tasks. All these machines produce a constant stream of system log data. Using IE over the machine-generated log data it is possible to automatically extract individual pieces of information from each log record and piece them together into information about individual production sessions. Such session information permits advanced analytics over machine data such as root cause analysis and machine failure prediction.

A Brief Introduction to SystemT

SystemT is a state-of-the-art Information Extraction system. SystemT allows to express a variety of algorithms for performing information extraction, and automatically optimizes them for efficient runtime execution. SystemT started as a research project in IBM Research – Almaden in 2006 and is now commercially available as IBM BigInsights Text Analytics.

On the high level, SystemT consists of the following three major parts:

1. Language for expressing NLP algorithms. The AQL (Annotation Query Language) language is a declarative language that provides powerful primitives needed in IE tasks including:

  • Morphological Processing including tokenization, part of speech detection, and finding matches of dictionaries of terms;
  • Other Core primitives such as finding matches of regular expressions, performing span operations (e.g., checking if a span is followed by another span) and relational operations (unioning, subtracting, filtering sets of extraction results);
  • Semantic Role Labeling primitives providing information at the level of each sentence, of who did what to whom, where and in what manner;
  • Machine Learning Primitives to embed a machine learning algorithm for training and scoring.

2. Development Environment. The development environment provides facilities for users to construct and refine information extraction programs (i.e., extractors). The development environment supports two kinds of users:

  • Data scientists who do may not wish to learn how to code can develop their extractor in a visual drag-and-drop environment loaded with a variety of prebuilt extractors that they can adapt for a new domain and build on top of. The visual extractor is converted behind the scenes into AQL code.

Information Extraction

  • NLP engineers can write extractors directly using AQL. An example simple statement in AQL is shown below. The language itself looks a lot like SQL, the language for querying relational databases. The familiarity of many software developers with SQL helps them in learning and using AQL.

AQL Information Extraction

3. Optimizer and Runtime Environment. AQL is a declarative language: the developer declares the semantics of the extractor in AQL in a logical way, without specifying how the AQL program should be executed. During compilation, the SystemT Optimizer analyzes the AQL program and breaks it down into specialized individual operations that are necessary to produce the output.

The Optimizer then enumerates many different plans, or ways in which individual operators can be combined together to compute the output, estimates the cost of these plans, and chooses one plan that looks most efficient.

This process is very similar to how SQL queries are optimized in relational database systems, but the types of optimizations are geared towards text operations which are CPU-intensive, as opposed to I/O intensive operations as in relational databases. This helps the productivity of the developer since they only need to focus on “what” to extract, and leave the question of the “how” to do it efficiently to be figured out by the Optimizer.

Given a compiled extractor, the Runtime Environment instantiates and executes the corresponding physical operators. The runtime engine is highly optimized and memory efficient, allowing it to be easily embedded inside the processing pipeline of a larger application. The Runtime has a document-a-time executive model: It receives a continuous stream of documents, annotates each document and output the annotations for further application-specific processing. The source of the document stream depends on the overall applications.

Advantages of SystemT

SystemT handles gracefully requirements dictated by modern applications such as the ones described above. Specifically:

  • Scalability. The SystemT Optimizer and Runtime engine ensures the high-performance execution of the extractors over individual documents. In our tests with many different scenarios, SystemT extractors run extremely fast on a variety of documents, ranging from very small documents such as Twitter messages of 140 bytes to very large documents of tens of megabytes.
  • Expressivity. AQL enables developers to write extractors in a compact manner, and provides a rich set of primitives to handle both natural language text (in many different languages) as well as other kinds of text such as machine generated data, or tables. A few AQL statements may be able to express complex extraction semantics that may require hundreds or thousands lines of code. Furthermore, one can implement functionalities not yet available via AQL natively via User Defined Functions (UDFs). For instance, developers can leverage AQL to extract complex features for statistical machine learning algorithms, and in turn embed the learned models back into AQL.
  • Transparency. As a declarative language, AQL allows developers to focus on what to extract rather than how to extract when developing extractors. It enables developers to write extractors in a much more compact manner, with better readability and maintainability. Since all operations are declared explicitly, it is possible to trace a particular result and understand exactly why and how it is produced, and thus to correct a mistake at its source. Thus, AQL extractors are easy to comprehend, debug and adapt to a new domain.

If you’d like to learn more about how SystemT handles these requirements and how to create your own extractors, enroll today in Text Analytics – Getting Results with System T and then Advanced Text Analytics – Getting Results with System T.