One of the more famous IE tasks is identifying terrorist events (specifically, South American terrorist events) in documents. For each event, we have to identify the victim(s), the date, the type of event (bombing, etc.), the culprits, and so on. These fields define our extraction template, and our goal is to fill it in based on a document. Of course, some documents mention no terrorist events and some mention several, and not every field will be mentioned. This data was used in the MUC (Message Understanding Conference) competitions two decades ago, and current approaches still achieve only about 50% accuracy!
One way to approach this problem is as a sequence labeling task, akin to NER. Since you can imagine how that works, we'll talk about the other style of approach: pattern-based methods.
The idea of pattern-based approaches is that our system consists of a collection of extraction patterns, of the form "<Subj> was assassinated" => Victim. These lexico-syntactic patterns tell us how and when to extract certain slots (aka fields) from text. The expressiveness of patterns depends entirely on how much preprocessing you want to do, but usually some sort of syntactic processing is assumed.
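As a concrete (and very simplified) sketch, here is what applying such patterns might look like, using regular expressions as a stand-in for real lexico-syntactic patterns; the patterns, slot names, and example sentence are invented for illustration:

```python
import re

# Toy extraction patterns: a regex (standing in for a parsed
# "<Subj> was assassinated"-style pattern) paired with the slot it fills.
PATTERNS = [
    (re.compile(r"(\w[\w ]*?) was assassinated"), "Victim"),
    (re.compile(r"(\w[\w ]*?) was bombed"), "Target"),
    (re.compile(r"claimed responsibility for the attack on (\w[\w ]*)"), "Target"),
]

def extract(text):
    """Apply every pattern to the text; return (slot, filler) pairs."""
    fills = []
    for pattern, slot in PATTERNS:
        for match in pattern.finditer(text):
            fills.append((slot, match.group(1)))
    return fills

print(extract("The mayor of Bogota was assassinated yesterday."))
# → [('Victim', 'The mayor of Bogota')]
```

A real system would match against parse trees rather than raw strings, so that "was assassinated" as a passive verb phrase is distinguished from, say, the same words inside a quotation.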
The key question is: where do these patterns come from? The trend in IE has been to move toward mostly unsupervised approaches that don't need large amounts of training data. Successful approaches are akin to the Yarowsky algorithm for WSD.
Suppose we had labeled data, where for each document we have one (for simplicity) partially filled template. We can go into the document, find occurrences of the strings in that template, and mark them as the things we want to extract. For each of these, we can find a bunch of extraction patterns that would potentially extract that string, and collect these patterns over the whole data set. We then need to find the "best" ones. A common metric is the "R log F" metric: the probability that a given pattern extracts the right slot, times the log of the frequency of that pattern. The "log F" term is there because we want good coverage: a pattern that is always right but fires only once is less useful than one that is usually right and fires often.
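A minimal sketch of the metric as described above; the pattern names and counts are hypothetical:

```python
from math import log2

def r_log_f(correct, total):
    """Score = (correct / total) * log2(total): precision weighted by
    log frequency, so frequent, mostly-correct patterns rise to the top."""
    if correct == 0 or total == 0:
        return 0.0
    return (correct / total) * log2(total)

# Hypothetical counts: (correct extractions, total firings) per pattern.
patterns = {
    "<Subj> was assassinated": (45, 50),   # precise and frequent
    "<Subj> was killed":       (30, 120),  # frequent but noisy
    "attack on <Obj>":         (3, 4),     # precise but rare
}

for name, (c, t) in sorted(patterns.items(),
                           key=lambda kv: -r_log_f(*kv[1])):
    print(f"{r_log_f(c, t):5.2f}  {name}")
```

Note that variants exist: in some formulations the count inside the log is the number of *correct* extractions rather than total firings, but either way the precise-and-frequent patterns score highest.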
Of course, you needn't start with labeled data. You can start with small lists of slot fillers (e.g., al-Qaeda as a perpetrator, and so on) and bootstrap away. As always, the quality of your seeds directly affects how well your algorithm works.
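A toy version of such a bootstrapping loop might look like the following; the corpus, the seed "FARC", and the one-word-either-side context patterns are all invented for illustration (real systems generalize over parses and score candidates rather than promoting everything):

```python
# Toy corpus: each "pattern" is just one word of context on either side.
corpus = [
    "members of FARC attacked the village",
    "members of ELN attacked the convoy",
    "spokesmen of ELN denied the charges",
    "spokesmen of M19 denied the charges",
]

def context(tokens, i):
    """One word of context on either side of position i."""
    left = tokens[i - 1] if i > 0 else "<s>"
    right = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
    return (left, right)

def bootstrap(corpus, seeds, rounds=2):
    lexicon, patterns = set(seeds), set()
    for _ in range(rounds):
        # 1. Promote patterns: contexts in which a known filler appears.
        for sent in corpus:
            toks = sent.split()
            for i, tok in enumerate(toks):
                if tok in lexicon:
                    patterns.add(context(toks, i))
        # 2. Harvest fillers: tokens appearing in a promoted context.
        for sent in corpus:
            toks = sent.split()
            for i, tok in enumerate(toks):
                if context(toks, i) in patterns:
                    lexicon.add(tok)
    return lexicon, patterns

lexicon, _ = bootstrap(corpus, {"FARC"})
print(sorted(lexicon))  # → ['ELN', 'FARC', 'M19']
```

The failure mode to watch for is semantic drift: because everything that matches gets promoted, one bad pattern admitted early can drag in unrelated words, which then license more bad patterns. This is one reason seed quality matters so much, and why real systems keep only the top-scoring few patterns and fillers each round.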
One can get even more unsupervised by doing the following. Take a collection of documents that talk about terrorist events, and a collection of documents that don't. Look for patterns that are significantly more common in the terrorist-event collection than in the other, and rank them by something like "R log F". The top patterns are often very good extraction patterns, but we don't know what they are supposed to extract. Have a human look down the list of the top 100 and voilà, you're done, and it only takes a few minutes.
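A sketch of that ranking step, assuming candidate patterns have already been matched and counted in both collections (the function name and all counts below are invented):

```python
from math import log2
from collections import Counter

def rank_patterns(relevant_counts, background_counts, min_freq=2):
    """Rank candidate patterns by an R-log-F-style relevance score:
    the fraction of a pattern's firings that fall in the relevant
    collection, weighted by the log of its relevant frequency."""
    scores = {}
    for p, rel in relevant_counts.items():
        total = rel + background_counts.get(p, 0)
        if total < min_freq or rel == 0:
            continue
        scores[p] = (rel / total) * log2(rel)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical occurrence counts in the two document collections.
relevant = Counter({"<Subj> was bombed": 40, "<Subj> said": 200,
                    "exploded in <Obj>": 25})
background = Counter({"<Subj> was bombed": 2, "<Subj> said": 600,
                      "exploded in <Obj>": 1})

print(rank_patterns(relevant, background))
# → ['<Subj> was bombed', 'exploded in <Obj>', '<Subj> said']
```

Domain-neutral patterns like "<Subj> said" sink to the bottom because they fire everywhere; the human's only remaining job is to skim the top of the list and label what each good pattern should extract.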
Most approaches to IE fall into one of these two camps: sequence labeling or pattern-based approaches. It seems that sequence labeling works well when most of the words in the text get extracted for something (e.g., turning free-text citations into BibTeX entries), while pattern-based approaches work well for needle-in-a-haystack problems.
There have been a few recent trends in IE:
- Using Wikipedia infoboxes as training data
- Trying to extract knowledge without pre-defined templates (akin to our discussion of mining world knowledge)