30 November 2010

Lecture 24: Information Extraction

Information extraction is, roughly, the task of going from unstructured text (aka text) to structured data.  Think of it as mapping language to a database.

One of the more famous IE tasks is identifying terrorist events (specifically South American terrorist events) in documents.  For each event, we have to identify the victim(s), date, type of event (bombing, etc.), culprits, and so on.  These fields define our extraction template and our goal is to fill it up based on a document.  Of course some documents mention no terrorist events and some mention multiple.  And not all fields will be mentioned.  This data was available for the MUC (message understanding conference) competitions two decades ago and current approaches still only get about 50% accuracy!

One way of going about this problem is as a sequence labeling task, akin to NER.  Since you can imagine how this works, we'll talk about the other style of approach: pattern-based methods.

The idea of pattern-based approaches is that our system consists of a collection of extraction patterns, of the form "<Subj> was assassinated" => Victim.  These lexico-syntactic patterns tell us how and when to extract certain slots (aka fields) from text.  The expressiveness of patterns depends entirely on how much preprocessing you want to do, but usually some sort of syntactic processing is assumed.

The key question is: where do these patterns come from?  The trend in IE has been to move toward mostly unsupervised approaches that don't need large amounts of training data.  Successful approaches are akin to the Yarowsky algorithm for WSD.

Suppose we had labeled data, where for each document we have one (for simplicity) partially filled template.  We can go into the document and find occurrences of the strings in that template as the things we want to extract.  For each of these, we can find a bunch of extraction patterns that would potentially extract that string and collect them over the whole data set.  We now need to find the "best" ones.  A common metric is the "R log F" metric, which is simply the probability that a given pattern extracts the right slot, times log of the frequency of that pattern.  The "log F" term is in there because we want to make sure that we get good coverage.
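
To make the metric concrete, here's a minimal sketch of ranking candidate patterns by "R log F", assuming we've already counted, for each pattern, how many times it fired and how many of those firings extracted a correct filler.  The patterns and counts below are made up for illustration, and variants of the metric differ on exactly which frequency goes inside the log.

  import math

  # Hypothetical counts gathered from the (weakly) labeled documents:
  #   pattern -> (firings that extracted a correct filler, total firings)
  pattern_counts = {
      "<Subj> was assassinated": (40, 50),
      "<Subj> was killed":       (90, 300),
      "attack on <NP>":          (12, 15),
  }

  def r_log_f(correct, total):
      # R: probability the pattern extracts the right slot; log F: reward patterns with good coverage
      if correct == 0:
          return float("-inf")
      return (correct / total) * math.log(correct)

  ranked = sorted(pattern_counts, key=lambda p: r_log_f(*pattern_counts[p]), reverse=True)
  for p in ranked:
      print(p, round(r_log_f(*pattern_counts[p]), 3))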

Of course, you needn't start with labeled data.  You can start with small lists of slot fillers (e.g., Al Qaeda as a perpetrator, and so on) and bootstrap away.  As always, the quality of your seeds directly affects how well your algorithm works.

One can get even more unsupervised by doing the following.  Take a collection of documents that talk about terrorist events, and a collection of documents that don't.  Look for patterns that are significantly more common in the terrorist-events collection than in the other collection.  Rank these by something like "R log F".  The top patterns there are often very good extraction patterns, but we don't know what they are supposed to extract.  Have a human look down the list of the top 100 and voilà, you're done, and it only takes a few minutes.

Most approaches to IE fall into one of these two camps: sequence labeling or pattern based approaches.  It seems that sequence labeling approaches work well when most of the words in the text are extracted for something (i.e., turning free text citations into bibtex entries), but pattern based approaches work well for needle in a haystack problems.

There have been a few recent trends in IE:
  1. Using Wikipedia infoboxes as training data
  2. Trying to extract knowledge without pre-defined templates (akin to our discussion of mining world knowledge)

23 November 2010

Lecture 23: Rhetorical Structure Theory

So far we've seen flat representations of discourse.  RST is an example of a hierarchical discourse representation.
In the example tree from lecture, we broke some imagined text into 7 "units" and depicted the role each of these units plays in the text.  As the name implies, this theory is mostly applicable to rhetoric, which is essentially persuasive language.

An RST structure is essentially a dependency tree over elementary discourse units (EDUs), where relations on the edges of the tree tell us the relationship between two EDUs.  In the above example, we're saying that EDUs 1-3 provide background to EDUs 4-7.  And so on down the tree.

For most relations, there is a distinction between the nucleus and satellite of that relation: basically this is just like headedness in syntax.  The nucleus contains the important stuff.

The relations in RST are given by communicative intent.  This is a big separation between RST and other theories of discourse.

Here is an example of the evidence relation:
  • The program as published for the calendar year 1980 really works.
  • In only a few minutes, I entered all the figures from my 1980 tax return and got a result which agreed with my hand calculations to the penny.
Here, the first sentence is the nucleus and the second is the satellite.  The evidence schema states (R=reader, W=writer, N=nucleus, S=satellite):
  • constraints on N: R might not believe N to a degree satisfactory to W
  • constraints on S: R believes S or will find it credible
  • constraints on N+S: R's comprehending S increases R's belief of N
  • the effect: R's belief of N is increased
  • locus of effect: N
 Here's an example for concession:
  • Concern that this material is harmful to health or the environment may be misplaced.
  • Although it is toxic to certain animals,
  • evidence is lacking that it has any serious long-term effect on human beings.
Here, the 2nd EDU is a concession to the 3rd EDU (note that 2 and 3 are both in the same sentence).  The concession schema looks like:
  • constraints on N: W has positive regard for the situation presented in N
  • constraints on S: W is not claiming that the situation presented in S doesn't hold
  • constraints on N+S: W acknowledges a potential or apparent incompatibility between the situations presented in N and S; W regards the situations presented in N and S as compatible; recognizing the compatibility between the situations presented in N and S increases R's positive regard for the situation presented in N
  • the effect: R's positive regard for the situation presented in N is increased
  • locus of effect: N and S
Here is the original list of relations from the Mann+Thompson paper, though others have been added over time:
  • Circumstance
  • Solutionhood
  • Elaboration
  • Background
  • Enablement and Motivation
  • Evidence and Justify
  • Relations of Cause
  • Antithesis and Concession
  • Condition and Otherwise
  • Interpretation and Evaluation
  • Restatement and Summary
  • Sequence
  • Contrast
Spotting discourse effects is quite hard, which I have previously bemoaned.  In general, there are two tasks: splitting sentences into EDUs and then doing discourse parsing.  Unfortunately, the main foothold we have for both of these tasks is lexical cues.  (For EDU splitting, embedded S-BARs often, though not always, indicate a new EDU.)

For instance, concession is often identified by the word "although."  And "Evidence" is often identified by "For instance."  And "Elaboration" is often identified by "And."  And so on.

One clever idea a few years ago was to try to mine lexical relations that are indicative of discourse structure.  For example, I can find all sentences that begin "for example" and look at that sentence, and the preceding sentence.  I can assume that this is an example of Evidence, and then look at features of those two sentences to try to figure out why this is an evidence relation.  Then, in the future, when I see sentences that don't have this lexical cue, I can apply whatever I've learned.
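
As a rough sketch of the mining step (the cue list and relation names are just for illustration, not taken from the paper):

  # Harvest (previous sentence, cue sentence) pairs as noisy training data for
  # discourse relations, keyed on explicit lexical cues.
  CUES = {
      "for example": "Evidence",
      "for instance": "Evidence",
      "although": "Concession",
  }

  def mine_relation_pairs(sentences):
      pairs = []
      for prev, curr in zip(sentences, sentences[1:]):
          lowered = curr.lower()
          for cue, relation in CUES.items():
              if lowered.startswith(cue):
                  # Strip the cue so a downstream classifier can't just memorize it,
                  # then keep (prev, rest-of-curr) as an example of the relation.
                  stripped = curr[len(cue):].lstrip(" ,")
                  pairs.append((relation, prev, stripped))
      return pairs

From pairs like these, you extract whatever features you like and train a classifier that can (hopefully) spot the relation even when no explicit cue is present.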

The hope is that if you mine contrast relations, you can find contrasting pairs like love/hate or iPhone/Android or whatever.  As was shown in the assigned paper for today, that didn't work particularly well.  My feeling is that lexical information is not "deep" enough to really get you to discourse except in very simple cases (see that post I linked to before).

22 November 2010

P3 grading updated, deadline Wed 24th, 5p

Looks like I underestimated the difficulty of the gender classification.  I've adjusted the scoring to be easier on you.  The new scoring is:
  • 35 < e < 37 : 10%
  • 34 < e < 35 : 25%
  • 33 < e < 34 : 32%
  • 32 < e < 33 : 34%
  • 31.5 < e < 32 : 36%
  • 31 < e < 31.5 : 37%
  • 30.5 < e < 31.0 : 38%
  • 30 < e < 30.5 : 39%
  • e < 30: 40%

18 November 2010

Midterm solution is posted

See here.

Hw11 is posted

Lecture 22: Document Coherence

Documents are not just collections of sentences.  Sentences serve purposes, and a well-written document puts its sentences together well.

There are (at least) two notions of "well":
  • Coherence: This is what makes a text meaningful.  This is the idea that the semantic interpretation of the sentences in a text should fit together into a larger picture.

  • Cohesion: This is what makes a text hang together.  This is the idea that sentences help you understand the relationship between what came before and what is coming later.
Coherence includes things like anaphora and coreference, as well as things that tie the text to the real world, like presuppositions and implications.  Cohesion includes things like lexical repetition, topical repetition, ellipsis, etc.


This is a lot like the syntax/semantics distinction: cohesion is like syntax, coherence is like semantics.  A document can be cohesive but still make no sense (all syntax, no semantics).  On the other hand, a document can be meaningful, but still not feel like the sentences go together properly (all semantics, no syntax).

TextTiling is an algorithm/model for discovering cohesive subparts (despite what the original paper says).  This is basically a partitioning problem.  We're given a text as a sequence of sentences, and we want to break it into pieces, each of which is internally cohesive.  The algorithm looks roughly like:
  • For each sentence, collect all the word stems that occur in that sentence.
  • Define the similarity between any two token sequences (subsequences of the text) as the cosine similarity between their stem vectors.
  • For each position i in the text, compute the similarity between the block i-21:i-1 and i:i+20.  (20 is arbitrary.)
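Roughly, in code (a sketch: raw stem counts, cosine similarity, and an arbitrary block size and threshold):

  import math
  from collections import Counter

  def cosine(a, b):
      dot = sum(v * b[k] for k, v in a.items() if k in b)
      norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
      return dot / norm if norm else 0.0

  def text_tile(sentences, block=20, threshold=0.1):
      # sentences: one list of word stems per sentence
      stems = [Counter(s) for s in sentences]
      boundaries = []
      for i in range(block, len(stems) - block):
          left = sum(stems[i - block:i], Counter())    # block of sentences ending at i-1
          right = sum(stems[i:i + block], Counter())   # block of sentences starting at i
          if cosine(left, right) < threshold:          # similarity dips => candidate topic shift
              boundaries.append(i)
      return boundaries
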
We can now plot the similarity between blocks as we move through the sentences, versus where people think the text should be divided.

We then define the splits as the points where the similarities drop below some threshold.

One major (linguistic) weakness of this approach is its inability to handle synonymy, etc.  When we talk about lexical repetition for cohesion, we also typically include things like "train / car / engine" even though they're not the same word.

Argumentative Zoning is more of a coherence-type model (though it's not entirely clear that it falls into only one of the two categories).  It is primarily a model for research papers, where we talk about the role that each sentence plays in the paper.  Like TextTiling, the model is flat: a document is just a sequence of sentences, and we're going to assign each sentence to a class.  The three major classes are { Background, Other, Own }.

The original approach to AZ is to annotate data, and then train a classifier.  The features used break down into a handful of categories:
  • explicit structure (where is this sentence in the document)
  • relative location (what %age through the document is this sentence)
  • citation (to self versus to others)
  • syntax (tense, aspect, voice, negation, etc.)
  • semantic (type of verb, etc.)
  • content (word overlap with title, etc.)
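For concreteness, a toy feature extractor along these lines might look like the following (the regexes and feature names are invented for illustration; the real feature set is much richer):

  import re

  def az_features(sentence, index, num_sentences, title_words):
      words = set(re.findall(r"\w+", sentence.lower()))
      return {
          "relative_location": index / num_sentences,                     # %age through the document
          "cites_other": bool(re.search(r"\(\w+ et al\.,? \d{4}\)", sentence)),
          "cites_self": "our previous work" in sentence.lower(),
          "own_verb": bool(re.search(r"\bwe (propose|present|show)\b", sentence.lower())),
          "title_overlap": len(words & title_words) / max(len(title_words), 1),
      }
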
Overall, the results suggest that it's fairly easy to find OWN stuff, but relatively difficult to figure out what the other parts are.

There has been a lot of interesting work in this area since then, including things like sentiment analysis of citations, better features, better learning, etc...

16 November 2010

Lecture 21: Local Discourse Context

Coreference analysis means figuring out which expressions refer to the same thing.  Figuring out pronouns in sentences like "John saw him in the mirror" versus "John saw himself in the mirror" is anaphor resolution.  More generally, in "John saw Bill and said 'My friend, you look lovely today,'" we need to figure out that My=John and friend=Bill.

Generally we care about three "mention" types:
  • Pronouns (he, she, it, I, etc...)
  • Proper nouns (John, Bill, Congress, the White House, etc...)
  • Common nouns (the soccer player, the man, etc.)
Not all pronouns are referential, as with pleonastic "it" ("It is raining"), but most are.

There are lots of signals of coreference:
  • Apposition: "John, mayor of Tinyville, is happy."  Here, it's pretty clear that John=mayor
  • Syntactic structure with pronouns/-self.  "John saw him" versus "... himself"
  • Gender / number / type agreement.  "He" is probably not "Mary" and "him" is probably not "players".  Similarly, "he" is probably not "the White House".  (Mr. and Mrs. are helpful here.)
  • Locality.  "He" probably refers back to a nearby referent, not something far away.
  • Discourse focus.  New entities are likely to transition from being objects to being subjects, not the other way around.
  • Sub/super-string matching.  "Bill Clinton" and "Clinton" are likely coreferent.
  • New entities are often introduced with indefinite NPs ("a player") and then later referred to with definite NPs
  • World knowledge: sometimes we just know that Obama is the President
In general, matches between different mention types work as follows:
  • Named to named: very easy, 95% accuracy just using substring matching
  • Named to pronoun: pretty easy, 85-90% using Hobbs' heuristics, locality and gender
  • Named to common: very hard (often need world knowledge)
  • Common to common: very hard (often need lexical semantics)
  • Common to pronoun: don't bother -- do common to named and named to pronoun
Basic algorithms for coreference resolution treat it as a binary classification problem: are these two mentions coreferent or not?  (That is, after we've done mention detection, a la named entity recognition.)  Some issues with this approach:
  • You then need to go back and fix things up to make sure that transitivity is enforced; or use left-to-right search
  • Number of -ve examples >> number of +ve examples; often subsample the negatives
  • These decisions aren't independent
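As a sketch, generating the training pairs (closest coreferent antecedent as the positive example, a subsample of the intervening mentions as negatives) might look like this; the mention representation here is simplified for illustration:

  import random

  def make_training_pairs(mentions, neg_rate=0.1):
      # mentions: list of (mention_string, entity_id) in document order
      pairs = []
      for j, (m_j, e_j) in enumerate(mentions):
          antecedent = None
          for i in range(j - 1, -1, -1):               # walk backwards to the closest antecedent
              if mentions[i][1] == e_j:
                  antecedent = i
                  break
          if antecedent is None:
              continue                                  # first mention of this entity
          pairs.append((mentions[antecedent][0], m_j, +1))
          for i in range(antecedent + 1, j):            # intervening, non-coreferent mentions
              if random.random() < neg_rate:
                  pairs.append((mentions[i][0], m_j, -1))
      return pairs
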
Some things that make life difficult:
  • Metonymy: when we say "the White House" meaning "Obama"
  • Common nouns that aren't mentions of an entity: "This life is wonderful"
  • Quantified pronouns: "Every man loves his mother" (might or might not be coreferent)
  • World knowledge
My general feeling about world knowledge is that if it's commonly known it might not be stated (e.g., we won't always see "Obama, President of the US, ...") but if it's not commonly known it will be made really obvious (e.g., "Joe Shmoe, soccer player for the team").  We can, however, mine this common knowledge using the sorts of techniques we talked about last week.

15 November 2010

P2 is mostly graded...

If you'd like your partial score, please send me an email with the ID of the person who handed it in.  I would love to just spam you all by sending your report to the hander-in at their id, but I don't know how to do that automatically.  I tried sending an email to myid@umd.edu but it bounced... if anyone knows how to do that instead, let me know.

Overall people seemed to do quite well.

14 November 2010

TURKERS Final Project

Looks like there's no time tomorrow (Monday) that works for everyone.  Please vote on your time preference, trying to be as flexible as possible, and figuring that we'll spend 30 minutes.  Also, please list your email address in your signup so I can contact you!  :)

11 November 2010

HW10 posted

and HW09 deadline pushed back to Sunday night

Lecture 20: World Knowledge from Text

Today we'll talk about the DIRT algorithm and related ideas.  There are actually tons of names for problems very similar to what we're looking at: discovering inference rules, (automatic) paraphrasing, semantic parsing, etc.  (The last one is a bit of a stretch, IMO.)

But the whole idea is to identify cases of substitutability.  For example, knowing that "X wrote Y" means roughly the same thing as "X is the author of Y", we can potentially do a better job answering questions like "Who wrote the Star Spangled Banner?"

The main "magic" we'll employ to get this to work is the Distributional Hypothesis, which is one of those things that no one cites anymore because everyone knows it.  (The citation is: Harris, Z. 1985. Distributional Structure. In: Katz, J. J. (ed.) The Philosophy of Linguistics. New York: Oxford University Press. pp. 26-47.... Zellig Harris is a famous linguist!)  DH states that words in similar contexts tend to have similar meanings, and is often paraphrased as "you shall know a word by its friends."

Before talking about what DIRT does, we can talk about previous approaches that take a very "neighborly" approach to similarity.
  • For each word w in your vocabulary (e.g., words that appear >= 100 times)
  • Find all contexts in which w appears, where a context is just two words to the left and two words to the right.
  • For each context, a b w c d, increment counters as follows:
    • c(w,a,left2) ++
    • c(w,b,left1) ++
    • c(w,c,right1) ++
    • c(w,d,right2) ++
  • Do this for all words and all counters
  • Now, we have a representation of each word in terms of its contexts
  • Define the similarity between two words w and w' as, for example, the cosine similarity between their context representations:
    • sim(w,w') = <c(w), c(w')> / ( ||c(w)|| ||c(w')|| )
    • where <x,y> means dot product, ||x|| means norm (= sqrt(sum_i x_i^2))
    • and c(w) is the context vector obtained by looking at all c(w,?,?)
  • Say that two words are similar if their similarity crosses some threshold
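In code, this whole pipeline might look something like the following (the frequency cutoff, window size, and threshold are all arbitrary):

  import math
  from collections import Counter, defaultdict

  def context_vectors(sentences, min_count=100):
      vocab = Counter(w for s in sentences for w in s)
      vectors = defaultdict(Counter)
      for s in sentences:
          for i, w in enumerate(s):
              if vocab[w] < min_count:
                  continue
              for offset, tag in [(-2, "left2"), (-1, "left1"), (1, "right1"), (2, "right2")]:
                  j = i + offset
                  if 0 <= j < len(s):
                      vectors[w][(s[j], tag)] += 1     # c(w, context word, position)++
      return vectors

  def sim(u, v):
      dot = sum(c * v[k] for k, c in u.items() if k in v)
      norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
      return dot / norm if norm else 0.0               # <c(w), c(w')> / (||c(w)|| ||c(w')||)

  # Two words are "similar" if sim(vectors[w], vectors[w2]) crosses some threshold.
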
In DIRT, they use a syntactic notion of similarity, rather than a proximity notion of similarity.  In order to do so, they leverage dependency trees as a representation of syntax.  A dependency tree is a representation of syntax (that we didn't talk about) in which words are nodes in a graph and edges point from a word to its immediate dependents.  You can get from a constituency tree to a dependency tree by having each head (e.g., a V is the head of a VP) point to all of its arguments and adjuncts.  But it's easier to get dependency trees directly (Google for MSTParser or MaltParser).

Once we have a dependency tree, the idea is very similar to what we talked about in semantic role labeling: pick two words in a sentence and look at the path that connects them.  There's a good example in the paper on p2.

The DIRT algorithm now works by a process similar to the above.
  • For each dependency path p from w to w' in a sentence
  • Increment the count of (p, X, w) and (p, Y, w'), indicating that w is playing the role of X in "X wrote Y" and w' is playing the role of Y
  • Compute the similarity between any two paths by Eqs (1), (2) and (3) in the paper.  Basically first compute similarities between slots and fillers, then between slots and slots, and then between paths and paths.
  • Do some heuristicy things to make it run efficiently.

09 November 2010

Correction: Bob Carpenter talk in AVW 3258, not whatever I said in class! (10a tomorrow)

Lecture 19: Structured Perceptron

Classification (e.g., with decision trees or a perceptron) is great when we're trying to predict a single bit.  It's not so great when we're trying to predict a structured object, like a POS sequence or parse tree or Arabic sentence (e.g., in translation).

The field of predicting complicated structured outputs is called one of {structured prediction, structured output prediction, predicting structured outputs, etc}.  I prefer the first term.

One of the simplest structured prediction problems is sequence labeling.  I'm given a sequence of inputs and I want to predict a corresponding sequence of outputs, where there is a one-to-one matching between inputs and outputs.  We've already seen POS tagging as an example of this task.

Another example of this task is named entity recognition.  Here, the input is a sentence, and the goal is to spot all mentions of "named entities" in the text.  Typically we also want to label them from a finite set { Person, Organization, Location, etc... }.

Since names can span more than one token (unlike POS tags), we need a way of encoding this as a sequence labeling problem.  A popular encoding (though definitely not the only one) is the BIO encoding: Begin/In/Out.  This is probably most clear by an example or two:
  • Hal_B-PER is_O at_O UMD_B-ORG today_O
  • Barack_B-PER Obama_I-PER is_O President_O of_O the_O United_B-GPE States_I-GPE of_I-GPE America_I-GPE
  • College_B-LOC Park_I-LOC Maryland_B-LOC is_O nice_O
Here, B-X signals the start of a phrase of type X, I-X continues it, and O means not-an-entity.  Note in the last example that [College Park] and [Maryland] are different entities, so we have an I-LOC followed by a B-LOC.  Note that it's impossible to get to I-X from anything other than B-X or I-X.
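
A handy utility is converting a BIO tag sequence back into entity spans; here's a sketch of that decoding direction (a stray I-X is simply treated as starting a new entity):

  def bio_to_spans(tags):
      # Turn a BIO tag sequence into (type, start, end) spans, end exclusive.
      spans, start, etype = [], None, None
      for i, tag in enumerate(tags):
          if tag.startswith("B-") or tag == "O" or (tag.startswith("I-") and tag[2:] != etype):
              if start is not None:
                  spans.append((etype, start, i))       # close the currently open entity
              start, etype = ((i, tag[2:]) if tag != "O" else (None, None))
          # an I-X matching the open entity type just extends it
      if start is not None:
          spans.append((etype, start, len(tags)))
      return spans

  tags = ["B-PER", "I-PER", "O", "O", "O", "O", "B-GPE", "I-GPE", "I-GPE", "I-GPE"]
  print(bio_to_spans(tags))   # [('PER', 0, 2), ('GPE', 6, 10)]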

Obviously we could attempt to solve this problem with an HMM.  The problem is that we might want to include a bunch of additional features about words, their contexts and their spellings, into the model.  It's difficult, if not impossible, to do this with an HMM.

So instead, we will use a structured variant of a perceptron.

The high-level idea is fairly straightforward.  Instead of extracting features over just the input, we'll extract features over the input and output together.  Call this feature vector f(x,y), where x is a complete sentence and y is a complete output.  For instance, we might have a feature: how many times is Maryland paired with B-LOC?  Or how many times is a capitalized word paired with O?  Or how many times did I see B-LOC followed by I-LOC?

The perceptron algorithm will hinge on being able to solve the "argmax" problem.  Namely, for a given x (sentence) and given weight vector w, we want to find the y that maximizes the dot product between w and f(x,y).  Note that in general this is VERY HARD.  However, if all our features obey Markov assumptions, we can use dynamic programming.

Now we can sketch the algorithm:
  • Initialize w=<0 0 0 0 ... 0>
  • For each input/output example (x,y)
    • Compute current prediction: y' = argmax_y w*f(x,y)
    • If y == y', celebrate
    • otherwise:
      • Update w = w + f(x,y) - f(x,y')
That's it!

All of the nice properties of perceptron are maintained for this algorithm.  Plus, for sequence labeling, we can solve the argmax problem efficiently using a variant of the Viterbi algorithm.
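
Here's a compact sketch of the whole thing for sequence labeling.  The feature templates (emission, capitalization, transition) are illustrative, but because they obey the Markov assumption, the argmax is just a Viterbi search:

  from collections import defaultdict

  def features(words, tags):
      # f(x, y): emission + transition counts over the whole sequence
      f = defaultdict(int)
      prev = "<s>"
      for word, tag in zip(words, tags):
          f[("emit", word, tag)] += 1
          f[("cap", word[0].isupper(), tag)] += 1
          f[("trans", prev, tag)] += 1
          prev = tag
      return f

  def viterbi(words, w, tagset):
      # argmax_y w . f(x, y); the Markov features let us do this with dynamic programming
      def local(word, prev, tag):
          return (w[("emit", word, tag)] + w[("cap", word[0].isupper(), tag)]
                  + w[("trans", prev, tag)])
      chart = [{t: (local(words[0], "<s>", t), None) for t in tagset}]
      for i in range(1, len(words)):
          chart.append({t: max((chart[i-1][p][0] + local(words[i], p, t), p) for p in tagset)
                        for t in tagset})
      tag = max(tagset, key=lambda t: chart[-1][t][0])
      tags = [tag]
      for i in range(len(words) - 1, 0, -1):            # follow backpointers
          tag = chart[i][tag][1]
          tags.append(tag)
      return list(reversed(tags))

  def train(data, tagset, epochs=5):
      # data: list of (word list, gold tag list) pairs
      w = defaultdict(float)
      for _ in range(epochs):
          for words, gold in data:
              pred = viterbi(words, w, tagset)
              if pred != gold:                          # on an error, update toward gold, away from pred
                  for k, v in features(words, gold).items():
                      w[k] += v
                  for k, v in features(words, pred).items():
                      w[k] -= v
      return w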

If you want to know more about structured prediction, I'm giving a guest lecture in Lise Getoor's ML class on Nov 30, 3:30-4:45, in CSIC 1121.

04 November 2010

Lecture 18: Linear Models for Learning

Decision trees are easy to understand, but they often die because each split fragments the training data into smaller and smaller pieces.

Another idea is to think of a classification as being a weighted linear combination of features.  In other words, each feature (word) gets a weight.  Highly (positive) weighted words are really indicative of class 1, and highly negative weighted words are really indicative of class -1.  We add up the weights for all the words that occur and use the sign of this as a prediction.

This is the idea behind the perceptron algorithm, which dates back to the 1960s.  It is online, which means that it processes one example at a time.  For each example, it makes a prediction.  If this prediction is right, it does nothing.  If it's wrong, it makes a small update:
  • For each example (x,y):
    • Compute the score: s = sum_{words in x} weight(word), and predict y' = sign(s)
    • If y' = y, celebrate and continue
    • otherwise we've made an error:
      • For each word in x:
        • Set weight(word) := weight(word) + y
        • (In other words, if y is negative, reduce the weight; if y is positive, increase the weight)
That's it!
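
In code, with words as the features and labels in {+1, -1}, a minimal sketch (with a fixed number of passes over the data) looks like:

  from collections import defaultdict

  def train_perceptron(data, epochs=10):
      # data: list of (list_of_words, label) pairs, label in {+1, -1}
      weight = defaultdict(float)
      for _ in range(epochs):
          for words, y in data:
              score = sum(weight[w] for w in words)
              y_pred = 1 if score > 0 else -1
              if y_pred != y:                      # made an error
                  for w in words:
                      weight[w] += y               # nudge weights toward the true label
      return weight

  def predict(weight, words):
      return 1 if sum(weight.get(w, 0.0) for w in words) > 0 else -1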

It turns out that if your data is such that there is a set of weights that would give perfect classifications on the training data, the perceptron will eventually find one!  (Of course by that time it may have overfit, so you'll probably want to stop it before it has converged.)

There is a linear algebra interpretation of the perceptron: it's optimizing a weight vector so that w*x is >0 for positive examples and w*x is <0 for negative examples.  Here, "*" means "dot product."  If you interpret dot products as projections, then it's trying to find a weight vector so that when the data points are projected onto this vector, the positive points all lie to one side, and the negative points all lie to the other side.  (Note that the usual pictures you see for this are not representative of NLP applications, where our vectors are usually corners of a very high dimensional hypercube.)

Once you think of it like this, you can think about trying to directly optimize w according to some loss function, which leads to a whole host of models (linear regression, ridge regression, logistic regression, support vector machines, boosting, etc.), depending on how you choose your loss function.

The basic idea is to define a cost, something like "cost = 1 if the prediction is wrong, and cost = 0 if the prediction is right."  Now, try to find a w that minimizes this cost.  Unfortunately, this cost is discrete, which makes it hard (actually NP-hard) to optimize.  So we relax it and say something like "cost gets bigger as the prediction gets more wrong" which is now continuous and easy to optimize.

The problem is that big weights are bad news: it means that our classification decision is really sensitive to just a few features.  We'd like to keep the weights small.  This thinking leads to regularization by penalizing large weights.  The standard penalization is to penalize the sum of squared weights, basically because this is continuous and differentiable.

So we're left with a cost that looks like "sum of costs on training examples + lambda * regularizer", where lambda controls the overfitting/underfitting tradeoff.
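
Concretely, with the logistic loss and a squared-weight penalty, plain gradient descent on "sum of costs on training examples + lambda * regularizer" looks roughly like this (step size, lambda, and the number of iterations are all things you'd tune on development data):

  import math

  def train_logreg(data, lam=0.1, step=0.1, iters=100):
      # data: list of (feature dict, label) pairs, label in {+1, -1}
      # objective: sum_i log(1 + exp(-y_i * w.x_i))  +  lam * ||w||^2
      w = {}
      for _ in range(iters):
          grad = {f: 2 * lam * v for f, v in w.items()}            # gradient of the regularizer
          for x, y in data:
              margin = y * sum(w.get(f, 0.0) * v for f, v in x.items())
              coeff = -y / (1.0 + math.exp(min(margin, 30)))       # derivative of the logistic loss (clipped for stability)
              for f, v in x.items():
                  grad[f] = grad.get(f, 0.0) + coeff * v
          for f, g in grad.items():
              w[f] = w.get(f, 0.0) - step * g
      return w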

P3 is posted, due Nov 23

02 November 2010

Lecture 17: Classification with Decision Trees

Machine learning is all about predicting the future given the past.

A particular type of machine learning problem that crops up in all sorts of NLP applications is classification.  We get an object and try to put it into one of K buckets.  For example, spam filtering (K=2) or sentiment detection (K=2) or news versus sports versus entertainment (K=3).  When K=2 it's called binary classification.  Otherwise, multiclass.

The general setup is:
  • Someone gives us a bunch of examples (x) together with their correct labels (y) -- this is our training data.
  • We learn some function h (for "hypothesis", maybe I'll call it f) that maps x to y
  • Then at "test time" someone gives us a bunch more (x) and we have to predict the corresponding (y)s
  • We are evaluated based on how accurate we were at this prediction
Key question #1: Why don't I just store the training data in a database and perform query lookups?

Key question #2: Why don't I just figure out what the most popular class (y) is and always report that?

Bad news: In general, in theory, learning is impossible.
Good news: In general, in practice, learning is EASY and works even when you might not expect it to.

Key concept is a feature vector.  We take our object and extract features of that object, which is the representation that the learning will use.  Good "feature engineering" is often worth more than good machine learning.  (I say that even as a machine learning person!)

A very simple learning algorithm is a decision tree.  It's like playing "20 questions."  The idea: you have your training data, and you get to pick only one feature to look at.  You pick that feature (let's pretend it's binary) and then split your data based on that feature.  We now have two data sets.  For each of these, we recurse and pick a new feature (it need not be the same on the two branches).  At the end, we'll be down to a "leaf" (only one data point left), in which case we don't need to ask any more questions.  (Or we'll have run out of features.)
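
A stripped-down version, with binary "does this feature fire?" tests and error count as the splitting criterion (a sketch; real implementations use information gain and smarter stopping rules):

  from collections import Counter

  def majority(labels):
      return Counter(labels).most_common(1)[0][0]

  def train_tree(data, features, max_depth):
      # data: list of (set_of_active_features, label); features: set of binary feature names
      labels = [y for _, y in data]
      if max_depth == 0 or not features or len(set(labels)) == 1:
          return ("leaf", majority(labels))
      def errors_if_split(f):
          err = lambda ys: len(ys) - Counter(ys).most_common(1)[0][1] if ys else 0
          return (err([y for x, y in data if f in x]) +
                  err([y for x, y in data if f not in x]))
      f = min(features, key=errors_if_split)               # the "best" question to ask
      yes = [(x, y) for x, y in data if f in x]
      no = [(x, y) for x, y in data if f not in x]
      if not yes or not no:                                # the split didn't separate anything
          return ("leaf", majority(labels))
      return ("node", f, train_tree(yes, features - {f}, max_depth - 1),
                         train_tree(no, features - {f}, max_depth - 1))

  def classify(tree, x):
      if tree[0] == "leaf":
          return tree[1]
      _, f, yes_branch, no_branch = tree
      return classify(yes_branch if f in x else no_branch, x)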

Running the decision tree all the way to the leaves is like storing the examples in a database (why?).  Running a zero-level decision tree is like remembering the most popular class (why?).

In general, we want something in between, and to choose the depth heuristically.

To do that, we'll set aside a bit of our training data as "development data" (or "validation data") in order to pick the "best" depth.  (Or other stopping criteria.)  Important point: why can't I just use the training data?

01 November 2010

Owen Rambow talk today IN AVW 3165!!!! 11am.

In the 80s and 90s of the last century, in subdisciplines such as
planning, text generation, and dialog systems, there was considerable
interest in modeling the cognitive states of interacting autonomous
agents. Theories such as Speech Act Theory (Austin 1962), the
belief-desire-intentions model of Bratman (1987), and Rhetorical
Structure Theory (Mann and Thompson 1988) together provide a framework
in which to link cognitive state with language use. However, in
general natural language processing (NLP), little use was made of such
theories, presumably because of the difficulty at the time of some
underlying tasks (such as syntactic parsing). In this talk, I propose
that it is time to again think about the explicit modeling of
cognitive state for participants in discourse. In fact, that is the
natural way to formulate what NLP is all about. The perspective of
cognitive state can provide a context in which many disparate NLP
tasks can be classified and related. I will present two NLP projects
at Columbia which relate to the modeling of cognitive state:

Discourse participants need to model each other's cognitive states,
and language makes this possible by providing special morphological,
syntactic, and lexical markers. I present results in automatically
determining the degree of belief of a speaker in the propositions in
his or her utterance.

Bio: PhD from University of Pennsylvania, 1994, working on German
syntax. My office mate was Philip Resnik. I worked at CoGentex, Inc (a
small company) and AT&T Labs -- Research until 2002, and since then at
Columbia as a Research Scientist. My research interests cover both the
nuts-and-bolts of languages, specifically syntax, and how language is
used in context.