22 December 2010
Summer internships at JHU/COE
... in case anyone is reading this, I just got the following email. I know this program and it's good.
Please share with graduate and undergraduate students looking for summer internships. The Johns Hopkins University Human Language Technology Center of Excellence (COE) is seeking candidates for our summer internship program as part of SCALE, our yearly summer workshop (Summer Camp for Advanced Language Exploration). Interns will work on research in speech, text and graph processing as part of a larger workshop team. Internship positions are fully funded, including travel, living expenses and stipend. This summer's workshop topic is "Combining Speech, Text and Graphs," which will draw from a number of research areas:
*** Natural Language Processing ***
- machine translation and NLP on informal texts
- opinion extraction from informal texts
- topic models
- information extraction
- domain adaptation and transfer learning
- multi-lingual learning
*** Speech ***
- topic detection
- keyword spotting
- resource constrained speech processing
- robust speech processing
*** Graphs ***
- finding communities in social networks
- anomaly detection
- learning on large scale graphs
- graph-based topic modeling
Candidates should be currently enrolled in an undergraduate or graduate degree program. Applications submitted by Friday Jan 14, 2011 will receive full consideration. For more information: http://web.jhu.edu/HLTCOE/scaleinterns2011.html
Applicants will be required to obtain a US security clearance, which requires US citizenship. If you do not already have a clearance, we will work with you to obtain one.
21 December 2010
Grades (Almost) Posted, Semester Over
Hi all --
I figure you're likely to read to the end to find out about grades, so before I get to that, let me just take this chance to say that I really enjoyed this class this semester. You were all great. Everyone did awesome on both the final exam and final projects, and I'm really thrilled. If you couldn't tell already, I love language stuff and I encourage you all to continue on and learn everything there is to know. Philip is teaching the follow-on course in the Spring, which should be awesome. I'm also running an unofficial seminar on "Large Data" stuff in the spring; you can get more info here (sign up for the mailing list if you're interested). Anyway, I had a great time teaching; I hope you had fun in class.
Regarding grades, I need to submit them by midnight tonight. And since I don't plan on staying up until midnight, this really means tonight by about 11p.
I've posted "unofficial" grades on grades.cs.umd.edu, so you can see what your grade is. Note that the "total" column on that spreadsheet is completely misleading, since it doesn't include all the weirdo grading rules (dropping of worst projects/homeworks, inclusion of extra credit, etc.). I have all the numbers in a separate spreadsheet, so if something looks odd to you and you'd like the full computation, please let me know. It's of course possible to change grades later, but it's a pain, so I'd rather hear about any issues now.
That's it. Have a great break and I hope to see some of you again in the Spring!
-h
ps., per second comment below, I added an extra column, MyOverall. The grading is as follows:
98 = A+
95 = A
92 = A-
88 = B+
85 = B
82 = B-
78 = C+
75 = C
72 = C-
Note that your score will be exactly one of these numbers: This is just my way of encoding your grade. This isn't actually what your score was :).
14 December 2010
12 December 2010
Interested in Language Science?
Just got the following email from Colin Phillips in Linguistics / Cognitive Neuroscience. This is regarding language science. Please see below... feel free to email me if you have questions:
I'm hoping that you can help us to reach students in the CS/iSchool universe who might be interested in taking advantage of our unique interdisciplinary language science opportunities. We're particularly interested in reaching 1st and 2nd year students. We'll be holding an informational meeting for students tomorrow at 1pm in 1108B Marie Mount, but I'd be happy to meet at another time with anybody who is interested but not available at that time. We'll tell students about the opportunities and benefits, and also talk about the resources that are available to help them, including new plans to help them to develop interdisciplinary training plans that are both innovative and feasible. Csilla Kajtar already circulated a message about this, but we know that people often just ignore messages sent to mailing lists.
As you know, the closer integration of NLP and cognitive work in language is right at the top of our list of opportunities-that-we'd-be-idiots-not-to-pursue, and student training is one of the best ways to achieve this.
09 December 2010
Final Exam, Due Dec 17, 3:30pm
Here's a copy of the final exam as well as the source LaTeX. Please feel free to either print it and do it by hand, or to do it in LaTeX and print the solution. You may turn it in any time between now and 3:30pm on Dec 17. (Because our official exam time is 1:30-3:30 on Dec 17.) Please hand it in in one of three ways: (1) give it to me in person in my office or otherwise; (2) slide it completely under my office door (AVW 3227); (3) give it to Amit in person.
If you have any clarification questions, please post them here.
06 December 2010
P4 deadline pushed back to Dec 14
The 9th is apparently the deadline for the ML project.
05 December 2010
P4: Small error in example numbers....
There are some sentences in the training data that contain a TAB character. The reasonable thing to do would be to just treat it as whitespace. For some reason, in my example of DF computation, I didn't do this, which somewhat changes all the remaining numbers.
Instead of rerunning everything I'll just tell you what the updated top frequency words are if you do it "properly." In general, for this assignment, don't worry if your numbers are slightly different than mine -- it may have to do with how you handle the non-ascii characters that appear once in a while in the data.
2999 .
2999 ,
2998 of
2997 the
2997 and
2994 in
2989 to
2988 a
2885 as
2875 by
2862 for
2860 )
2859 (
2836 with
2832 that
2801 ''
2788 ``
2759 on
2717 from
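For reference, a DF computation that treats TABs like any other whitespace is only a few lines. This is just a sketch; the filename and the one-document-per-line assumption are placeholders, not the actual P4 setup.

```python
# A minimal sketch of document-frequency (DF) counting, treating every
# whitespace character (including TABs) the same. Filename is a placeholder.
from collections import Counter

df = Counter()
with open("training_data.txt") as f:
    for line in f:                  # assume one document per line
        words = set(line.split())   # str.split() eats tabs and spaces alike
        df.update(words)            # each word counted once per document

for word, count in df.most_common(20):
    print(count, word)
```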
Last seminar of the semester: Michael Paul Dec 8, 11am
December 8: Michael Paul: Summarizing Contrastive Viewpoints in Opinionated Text
AVW 2120
Performing multi-document summarization of opinionated text has unique challenges because it is important to recognize that the same information may be presented in different ways from different viewpoints. In this talk, we will present a special kind of contrastive summarization approach intended to highlight this phenomenon and to help users digest conflicting opinions. To do this, we introduce a new graph-based algorithm, Comparative LexRank, to score sentences in a summary based on a combination of both representativeness of the collection and comparability between opposing viewpoints. We then address the issue of how to automatically discover and extract viewpoints from unlabeled text, and we experiment with a novel two-dimensional topic model for the task of unsupervised clustering of documents by viewpoint. Finally, we discuss how these two stages can be combined to both automatically extract and summarize viewpoints in an interesting way. Results are presented on two political opinion data sets. This project was joint work with ChengXiang Zhai and Roxana Girju.
Bio: Michael Paul is a first-year Ph.D. student of Computer Science at the Johns Hopkins University and a member of the Center for Language and Speech Processing. He earned a B.S. from the University of Illinois at Urbana-Champaign in 2009. He is currently a Graduate Research Fellow of the National Science Foundation and a Dean's Fellow of the Whiting School of Engineering.
02 December 2010
Lecture 25: Mapping Text to Actions
There has been a bunch of work recently on trying to automatically find relationships between language and the "real world", where "real world" actually often means some sort of simulated environment. Here are a few papers along these lines:
- M. B. Fleischman and D. Roy. Intentional Context in Situated Language Learning. Ninth Conference on Computational Natural Language Learning (CoNLL), Ann Arbor, MI, June 2005.
- Raymond J. Mooney. Learning to Connect Language and Perception. Proceedings of the 23rd AAAI Conference on Artificial Intelligence (AAAI), Senior Member Paper, Chicago, IL, pp. 1598-1601, July 2008.
- S.R.K. Branavan, Harr Chen, Luke Zettlemoyer and Regina Barzilay. Reinforcement Learning for Mapping Instructions to Actions. Proceedings of ACL, 2009. Best Paper Award.
- Percy Liang, Michael I. Jordan and Dan Klein. Learning Semantic Correspondences with Less Supervision. Proceedings of ACL-IJCNLP, 2009.
- Adam Vogel and Dan Jurafsky. Learning to Follow Navigational Directions. Proceedings of ACL, Uppsala, Sweden, 2010.
In the first paper, which is the one we'll talk about most, the key idea is that of hierarchical plans, represented as a PCFG. For instance we might have a rule "OfferCup -> PickUpCup MoveCup ReleaseCup", where each of the subactions might either be atomic (correspond to actual muscle movements) or might itself be broken down further. (Question: how context free is this problem?)
The key ambiguity is due to the fact that actions do not select for exactly one interpretation, as in the Blicket example.
In this paper, they hand constructed a PCFG for actions and the key learning question was whether you could figure out the level of ambiguity automatically. The basic idea is to look at relative frequencies of occurrence between lexical items and nodes in the PCFG tree for the actions.
01 December 2010
P4, grading rules
So P4 has been posted for a while. It is "optional" in the sense that your project grades will be based on your best three out of four grades. In particular, here's what will happen.
Suppose that your grades on P1, ..., P4 are a,b,c,d. (If you don't do P4, then d=0.)
Let x = [ (a + b + c + d) - min { a, b, c, d } ] / 3
Then x is your average grade on your best three projects.
We will use x as your overall project grade (i.e., since each project is weighed equally, it will be like you got a score of x on all FOUR of them).
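For the curious, here's that rule as a quick sanity check in code (the scores are made up):

```python
# "Best three of four" project rule: drop the lowest score, average the rest.
def project_grade(a, b, c, d=0):
    scores = [a, b, c, d]
    return (sum(scores) - min(scores)) / 3.0

print(project_grade(90, 80, 100))        # skipped P4: (270 - 0) / 3 = 90.0
print(project_grade(90, 80, 100, 85))    # did P4: drop the 80 -> 91.67
```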
30 November 2010
Lecture 24: Information Extraction
Information extraction is, roughly, the task of going from unstructured text (aka text) to structured data. Think of it as mapping language to a database.
One of the more famous IE tasks is identifying terrorist events (specifically South American terrorist events) in documents. For each event, we have to identify the victim(s), date, type of event (bombing, etc.), culprits, and so on. These fields define our extraction template and our goal is to fill it up based on a document. Of course some documents mention no terrorist events and some mention multiple. And not all fields will be mentioned. This data was available for the MUC (message understanding conference) competitions two decades ago and current approaches still only get about 50% accuracy!
One way of going about this problem is as a sequence labeling task, akin to NER. Since you can imagine how this works, we'll talk about the other style of approach: pattern-based methods.
The idea of pattern-based approaches is that our system consists of a collection of extraction patterns, of the form "<Subj> was assassinated" => Victim. These lexico-syntactic patterns tell us how and when to extract certain slots (aka fields) from text. The expressiveness of patterns depends entirely on how much preprocessing you want to do, but usually some sort of syntactic processing is assumed.
The key question is: where do these patterns come from? The trend in IE has been to move toward mostly unsupervised approaches that don't need large amounts of training data. Successful approaches are akin to the Yarowsky algorithm for WSD.
Suppose we had labeled data, where for each document we have one (for simplicity) partially filled template. We can go into the document and find occurrences of the strings in that template as the things we want to extract. For each of these, we can find a bunch of extraction patterns that would potentially extract that string and collect them over the whole data set. We now need to find the "best" ones. A common metric is the "R log F" metric, which is simply the probability that a given pattern extracts the right slot, times log of the frequency of that pattern. The "log F" term is in there because we want to make sure that we get good coverage.
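In code, the scoring itself is trivial. This is a sketch only: the counts are assumed to come from matching candidate patterns against the labeled templates (not shown), and the log base is my choice, not something specified in lecture.

```python
# "R log F" scoring: R = P(pattern extracts a correct filler),
# F = how often the pattern fired; score = R * log(F).
import math

def r_log_f(correct, total):
    if total == 0:
        return 0.0
    return (correct / float(total)) * math.log(total)

# e.g., suppose "<Subj> was assassinated" fired 50 times, 45 of them on true Victims
print(r_log_f(45, 50))   # 0.9 * log(50) ~= 3.52
```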
Of course, you needn't start with labeled data. You can start with small lists of slot fillers (e.g., Al Qaeda as a perpetrator and so on) and bootstrap away. As always, the quality of your seeds directly affects how well your algorithm works.
One can get even more unsupervised by doing the following. Take a collection of documents that talk about terrorist events, and a collection of documents that don't. Look for patterns that are significantly more common in the terrorist-events collection than in the other collection. Rank these by something like "R log F". The top patterns there are often very good extraction patterns, but we don't know what they are supposed to extract. Have a human look down the list of the top 100 and voilà, you're done, and it only takes a few minutes.
Most approaches to IE fall into one of these two camps: sequence labeling or pattern-based approaches. It seems that sequence labeling approaches work well when most of the words in the text are extracted for something (e.g., turning free-text citations into BibTeX entries), but pattern-based approaches work well for needle-in-a-haystack problems.
There have been a few recent trends in IE:
- Using Wikipedia infoboxes as training data
- Trying to extract knowledge without pre-defined templates (akin to our discussion of mining world knowledge)
23 November 2010
Lecture 23: Rhetorical Structure Theory
So far we've seen flat representations of discourse. RST is an example of a hierarchical discourse representation.
Imagine some text broken into 7 "units", with a tree depicting the role each of these units plays in the text. As is implied by the title, this theory is mostly applicable to rhetoric, which is essentially persuasive language.
An RST structure is essentially a dependency tree over elementary discourse units (EDUs), where relations on the edges of the tree tell us the relationship between two EDUs. In the above example, we're saying that EDUs 1-3 provide background to EDUs 4-7. And so on down the tree.
For most relations, there is a distinction between the nucleus and satellite of that relation: basically this is just like headedness in syntax. The nucleus contains the important stuff.
The relations in RST are given by communicative intent. This is a big separation between RST and other theories of discourse.
Here is an example of the evidence relation:
- The program as published for the calendar year 1980 really works.
- In only a few minutes, I entered all the figures from my 1980 tax return and got a result which agreed with my hand calculations to the penny.
- constraints on N: R might not believe N to a degree satisfactory to W
- constraints on S: R believes S or will find it credible
- constraints on N+S: R's comprehending S increases R's belief of N
- the effect: R's belief of N is increased
- locus of effect: N
Here is an example of the concession relation:
- Concern that this material is harmful to health or the environment may be misplaced.
- Although it is toxic to certain animals,
- evidence is lacking that it has any serious long-term effect on human beings.
- constraints on N: W has positive regard for the situation presented in N
- constraints on S: W is not claiming that the situation presented in S doesn't hold
- constraints on N+S: W acknowledges a potential or apparent incompatibility between the situations presented in N and S; W regards the situations presented in N and S as compatible; recognizing the compatibility between the situations presented in N and S increases R's positive regard for the situation presented in N
- the effect: R's positive regard for the situation presented in N is increased
- locus of effect: N and S
The full set of relations defined in RST includes:
- Circumstance
- Solutionhood
- Elaboration
- Background
- Enablement and Motivation
- Evidence and Justify
- Relations of Cause
- Antithesis and Concession
- Condition and Otherwise
- Interpretation and Evaluation
- Restatement and Summary
- Sequence
- Contrast
Many of these relations are often signaled by particular cue words or phrases. For instance, Concession is often identified by the word "although." "Evidence" is often identified by "for instance." "Elaboration" is often identified by "and." And so on.
One clever idea a few years ago was to try to mine lexical relations that are indicative of discourse structure. For example, I can find all sentences that begin "for example" and look at that sentence, and the preceding sentence. I can assume that this is an example of Evidence, and then look at features of those two sentences to try to figure out why this is an evidence relation. Then, in the future, when I see sentences that don't have this lexical cue, I can apply whatever I've learned.
The hope is that if you mine contrast relations, you can find contrasting pairs like love/hate or iPhone/Android or whatever. As was shown in the assigned paper for today, that didn't work particularly well. My feeling is that lexical information is not "deep" enough to really get you to discourse except in very simple cases (see that post I linked to before).
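Here's a rough sketch of the mining step just described. The cue list, the relation labels and the toy sentences are made up for illustration; they are not the cues used in the assigned paper.

```python
# Mine (previous sentence, cue sentence) pairs as noisy training examples
# for discourse relations, using sentence-initial cue phrases.
CUES = {"for example": "Evidence", "although": "Concession"}

def mine_pairs(sentences):
    pairs = []
    for prev, cur in zip(sentences, sentences[1:]):
        for cue, relation in CUES.items():
            if cur.lower().startswith(cue):
                # strip the cue so a later classifier can't just memorize it
                pairs.append((prev, cur[len(cue):].lstrip(" ,"), relation))
    return pairs

sents = ["I love this phone.", "For example, the battery lasts two days."]
print(mine_pairs(sents))
# [('I love this phone.', 'the battery lasts two days.', 'Evidence')]
```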
22 November 2010
P3 grading updated, deadline Wed 24th, 5p
Looks like I underestimated the difficulty of the gender classification. I've adjusted the scoring to be easier on you. The new scoring is:
- 35 < e < 37 : 10%
- 34 < e < 35 : 25%
- 33 < e < 34 : 32%
- 32 < e < 33 : 34%
- 31.5 < e < 32 : 36%
- 31 < e < 31.5 : 37%
- 30.5 < e < 31.0 : 38%
- 30 < e < 30.5 : 39%
- e < 30: 40%
18 November 2010
Lecture 22: Document Coherence
Documents are not just collections of sentences. Sentences serve purposes, and a well-written document puts its sentences together well.
There are (at least) two notions of "well":
- Coherence: This is what makes a text meaningful. This is the idea that the semantic interpretation of the sentences in a text should fit together into a larger picture.
- Cohesion: This is what makes a text hang together. This is the idea that sentences help you understand the relationship between what came before and what is coming later.
This is a lot like syntax/semantics. A document can be cohesive but still make no sense. This is like syntax. On the other hand, a document can be meaningful, but still not feel like the sentences go together properly. This is like semantics.
TextTiling is an algorithm/model for discovering cohesive subparts (despite what the original paper says). This is basically a partitioning problem. We're given a text as a sequence of sentences, and we want to break it into pieces, each of which is internally cohesive. The algorithm looks roughly like:
- For each sentence, collect all the word (stems) that occur in that sentence.
- Define the similarity between any two token sequences (subsequences of the text) as the cosine similarity between their stem vectors.
- For each position i in the text, compute the similarity between the block i-21:i-1 and i:i+20. (20 is arbitrary.)
We then define the splits as the points where the similarities drop below some threshold.
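Here's a minimal sketch of that computation. Stemming is skipped (words are used as-is), and the block size of 20 is the arbitrary choice from above; the toy sentences are made up.

```python
# TextTiling-style gap similarities: compare the bag of words before each
# sentence gap with the bag of words after it, via cosine similarity.
import math
from collections import Counter

def cosine(c1, c2):
    dot = sum(c1[w] * c2[w] for w in c1 if w in c2)
    norm = math.sqrt(sum(v * v for v in c1.values())) * \
           math.sqrt(sum(v * v for v in c2.values()))
    return dot / norm if norm > 0 else 0.0

def gap_similarities(sentences, block=20):
    sims = []
    for i in range(1, len(sentences)):
        left = Counter(w for s in sentences[max(0, i - block):i] for w in s)
        right = Counter(w for s in sentences[i:i + block] for w in s)
        sims.append((i, cosine(left, right)))
    return sims   # split wherever the similarity dips below a threshold

sents = [s.split() for s in ["the cat sat", "the cat purred",
                             "stocks fell sharply", "the markets fell"]]
print(gap_similarities(sents, block=2))   # the dip at i=2 marks the boundary
```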
One major (linguistic) weakness of this approach is its inability to handle synonymy, etc. When we talk about lexical repetition for cohesion, we also typically include things like "train / car / engine" even though they're not the same word.
Argumentative Zoning is more of a coherence-type model (though it's not precisely clear that it's only in one of the two categories). It is primarily a model for research papers, where we talk about the role that each sentence plays in a paper. Like TextTiling, the model is also flat: a document is just a sequence of sentences and we're going to assign each sentence to a class. The three major classes are { Background, Other, Own }.
The original approach to AZ is to annotate data, and then train a classifier. The features used break down into a handful of categories:
- explicit structure (where is this sentence in the document)
- relative location (what %age through the document is this sentence)
- citation (to self versus to others)
- syntax (tense, aspect, voice, negation, etc.)
- semantic (type of verb, etc.)
- content (word overlap with title, etc.)
There has been a lot of interesting work in this area since then, including things like sentiment analysis of citations, better features, better learning, etc...
16 November 2010
Lecture 21: Local Discourse Context
Coreference resolution means figuring out which expressions in a text refer to the same entity. For pronouns, think "John saw him in the mirror" versus "John saw himself in the mirror" -- that's anaphora resolution. More generally, in "John saw Bill and said 'My friend, you look lovely today,'" we need to figure out that My=John and friend=Bill.
Generally we care about three "mention" types:
- Pronouns (he, she, it, I, etc...)
- Proper nouns (John, Bill, Congress, the White House, etc...)
- Common nouns (the soccer player, the man, etc.)
There are lots of signals of coreference:
- Apposition: "John, mayor of Tinyville, is happy." Here, it's pretty clear that John=mayor
- Syntactic structure with pronouns/-self. "John saw him" versus "... himself"
- Gender / number / type agreement. "He" is probably not "Mary" and "him" is probably not "players". Similarly, "he" is probably not "the White House". (Mr. and Mrs. are helpful here.)
- Locality. "He" probably refers back to a nearby referent, not something far away.
- Discourse focus. New entities are likely to transition from being objects to being subjects, not the other way around.
- Sub/super-string matching. "Bill Clinton" and "Clinton" are likely coreferent.
- New entities are often introduced with indefinite NPs ("a player") and then referred to later with definite NPs
- World knowledge: sometimes we just know that Obama is the President
Roughly, how hard resolution is depends on the types of the two mentions involved:
- Named to named: very easy, 95% accuracy just using substring matching
- Named to pronoun: pretty easy, 85-90% using Hobbs' heuristics, locality and gender
- Named to common: very hard (often need world knowledge)
- Common to common: very hard (often need lexical semantics)
- Common to pronoun: don't bother -- do common to named and named to pronoun
A standard approach is to treat coreference as binary classification over pairs of mentions (see the sketch after these lists), which raises a few issues:
- After classifying pairs, you need to go back and fix things up to make sure that transitivity is enforced; or use left-to-right search
- Number of -ve examples >> number of +ve examples; often subsample the negatives
- These decisions aren't independent
Finally, some phenomena make coreference especially hard:
- Metonymy: when we say "the White House" meaning "Obama"
- Common nouns that aren't mentions of an entity: "This life is wonderful"
- Quantified pronouns: "Every man loves his mother" (might or might not be coreferent)
- World knowledge
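Here's a minimal sketch of the mention-pair setup mentioned above: every pair of mentions becomes a binary example (coreferent or not), and we subsample the far more numerous negatives. The feature function (exact match, substring) and the toy mentions are stand-ins, not a real feature set.

```python
# Generate mention-pair training examples with negative subsampling.
import random

def pair_features(m1, m2):
    return {"exact_match": m1.lower() == m2.lower(),
            "substring": m1.lower() in m2.lower() or m2.lower() in m1.lower()}

def make_pairs(mentions, gold_entity, neg_rate=0.1):
    """mentions: list of strings; gold_entity: parallel list of entity ids."""
    examples = []
    for i in range(len(mentions)):
        for j in range(i + 1, len(mentions)):
            label = gold_entity[i] == gold_entity[j]
            if label or random.random() < neg_rate:   # keep all +, some -
                examples.append((pair_features(mentions[i], mentions[j]), label))
    return examples

ments = ["Bill Clinton", "Clinton", "the White House"]
print(make_pairs(ments, gold_entity=[1, 1, 2], neg_rate=1.0))
```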
15 November 2010
P2 is mostly graded...
If you'd like your partial score, please send me an email with the ID of the person who handed it in. I would love to just spam you all by sending your report to the hander-in at their id, but I don't know how to do that automatically. I tried sending an email to myid@umd.edu but it bounced... if anyone knows how to do that instead, let me know.
Overall people seemed to do quite well.
14 November 2010
TURKERS Final Project
Looks like there's no time tomorrow (Monday) that works for everyone. Please vote on your time preference, trying to be as flexible as possible, and figuring that we'll spend 30 minutes. Also, please list your email address in your signup so I can contact you! :)
11 November 2010
HW10 posted
and HW09 deadline pushed back to Sunday night
Lecture 20: World Knowledge from Text
Today we'll talk about the DIRT algorithm and related ideas. There are actually tons of names for problems very similar to what we're looking at: discovering inference rules, (automatic) paraphrasing, semantic parsing, etc. (The last one is a bit of a stretch, IMO.)
But the whole idea is to identify cases of substitutability. For example, knowing that "X wrote Y" means roughly the same thing as "X is the author of Y", we can potentially do a better job answering questions like "Who wrote the Star Spangled Banner?"
The main "magic" we'll employ to get this to work is the Distributional Hypothesis, which is one of those things that no one cites anymore because everyone knows it. (The citation is: Harris, Z. 1985. Distributional Structure. In: Katz, J. J. (ed.) The Philosophy of Linguistics. New York: Oxford University Press. pp. 26-47.... Zellig Harris is a famous linguist!) DH states that words in similar contexts tend to have similar meanings, and is often paraphrased as "you shall know a word by its friends."
Before talking about what DIRT does, we can talk about previous approaches that take a very "neighborly" approach to similarity.
- For each word w in your vocabulary (eg., words that appear >= 100 times)
- Find all contexts in which w appears, where a context is just two words to the left and two words to the right.
- For each context, a b w c d, increment counters as follows:
- c(w,a,left2) ++
- c(w,b,left1) ++
- c(w,c,right1) ++
- c(w,d,right2) ++
- Do this for all words and all counters
- Now, we have a representation of each word in terms of its contexts
- Define the similarity between two words w and w' as, for example, the cosine similarity between their context representations:
- sim(w,w') = <c(w), c(w')> / ( ||c(w)|| ||c(w')|| )
- where <x,y> means dot product, ||x|| means norm (= sqrt(sum_i x_i^2))
- and c(w) is the context vector obtained by looking at all c(w,?,?)
- Say that two words are similar if their similarity crosses some threshold
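Here's a minimal runnable sketch of that context counting and similarity computation. The toy corpus is made up, and no frequency cutoff is applied.

```python
# Two-words-left / two-words-right context vectors, compared with cosine.
import math
from collections import defaultdict

def context_vectors(sentences):
    ctx = defaultdict(lambda: defaultdict(float))
    for sent in sentences:
        for i, w in enumerate(sent):
            for offset, slot in [(-2, "left2"), (-1, "left1"),
                                 (1, "right1"), (2, "right2")]:
                j = i + offset
                if 0 <= j < len(sent):
                    ctx[w][(sent[j], slot)] += 1.0
    return ctx

def sim(c1, c2):
    dot = sum(v * c2.get(k, 0.0) for k, v in c1.items())
    norm = math.sqrt(sum(v * v for v in c1.values())) * \
           math.sqrt(sum(v * v for v in c2.values()))
    return dot / norm if norm else 0.0

corpus = [["the", "cat", "chased", "the", "mouse"],
          ["the", "dog", "chased", "the", "ball"]]
vecs = context_vectors(corpus)
print(sim(vecs["cat"], vecs["dog"]))   # similar contexts -> high similarity
```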
Once we have a dependency tree, the idea is very similar to what we talked about in semantic role labeling: pick two words in a sentence and look at the path that connects them. There's a good example in the paper on p2.
The DIRT algorithm now works by a process similar to the above.
- For each dependency path p from w to w' in a sentence
- Increment the count of (p, X, w) and (p, Y, w'), indicating that w is playing the role of X in "X wrote Y" and w' is playing the role of Y
- Compute the similarity between any two paths by Eqs (1), (2) and (3) in the paper. Basically first compute similarities between slots and fillers, then between slots and slots, and then between paths and paths.
- Do some heuristicy things to make it run efficiently.
09 November 2010
Lecture 19: Structured Perceptron
Classification (eg., with decision trees or perceptron) is great when we're trying to predict a bit. It's not so great when we're trying to predict a structured object, like a POS sequence or parse tree or Arabic sentence (eg., in translation).
The field of predicting complicated structured outputs is called one of {structured prediction, structured output prediction, predicting structured outputs, etc}. I prefer the first term.
One of the simplest structured prediction problems is sequence labeling. I'm given a sequence of inputs and I want to predict a corresponding sequence of outputs, where there is a one-to-one matching between inputs and outputs. We've already seen POS tagging as an example of this task.
Another example of this task is named entity recognition. Here, the input is a sentence, and the goal is to spot all mentions of "named entities" in the text. Typically we also want to label them from a finite set { Person, Organization, Location, etc... }.
Since names can span more than one token (unlike POS tags), we need a way of encoding this as a sequence labeling problem. A popular encoding (though definitely not the only one) is the BIO encoding: Begin/In/Out. This is probably most clear by an example or two:
- Hal_B-PER is_O at_O UMD_B-ORG today_O
- Barack_B-PER Obama_I-PER is_O President_O of_O the_O United_B-GPE States_I-GPE of_I-GPE America_I-GPE
- College_B-LOC Park_I-LOC Maryland_B-LOC is_O nice_O
Obviously we could attempt to solve this problem with an HMM. The problem is that we might want to include a bunch of additional features about words, their contexts and their spellings, into the model. It's difficult, if not impossible, to do this with an HMM.
So instead, we will use a structured variant of a perceptron.
The high-level idea is fairly straightforward. Instead of extracting features over just the input, we'll extract features over the input and output together. Call this feature vector f(x,y), where x is a complete sentence and y is a complete output. For instance, we might have a feature: how many times is Maryland paired with B-LOC? Or how many times is a capitalized word paired with O? Or how many times did I see B-LOC followed by I-LOC?
The perceptron algorithm will hinge on being able to solve the "argmax" problem. Namely, for a given x (sentence) and given weight vector w, we want to find the y that maximizes the dot product between w and f(x,y). Note that in general this is VERY HARD. However, if all our features obey Markov assumptions, we can use dynamic programming.
Now we can sketch the algorithm:
- Initialize w=<0 0 0 0 ... 0>
- For each input/output example (x,y)
- Compute current prediction: y' = argmax_y w*f(x,y)
- If y == y', celebrate
- otherwise:
- Update w = w + f(x,y) - f(x,y')
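Here's that sketch as minimal runnable code, with simple emission and transition features so the argmax can be computed with a Viterbi-style dynamic program. The feature set and toy data are illustrative assumptions, not the full feature set from lecture.

```python
# A minimal structured perceptron for sequence labeling (BIO-style tags).
from collections import defaultdict

def features(words, tags):
    """Count emission (word, tag) and transition (prev_tag, tag) features."""
    f = defaultdict(float)
    prev = "<s>"
    for w, t in zip(words, tags):
        f[("emit", w, t)] += 1.0
        f[("trans", prev, t)] += 1.0
        prev = t
    return f

def viterbi(words, w, tagset):
    """argmax_y w . f(x, y) under first-order Markov features."""
    V = [{}]   # V[i][tag] = (best score, backpointer to previous tag)
    for t in tagset:
        V[0][t] = (w[("emit", words[0], t)] + w[("trans", "<s>", t)], None)
    for i in range(1, len(words)):
        V.append({})
        for t in tagset:
            best = max((V[i-1][p][0] + w[("trans", p, t)], p) for p in tagset)
            V[i][t] = (best[0] + w[("emit", words[i], t)], best[1])
    tag = max(tagset, key=lambda t: V[-1][t][0])
    out = [tag]
    for i in range(len(words) - 1, 0, -1):   # follow backpointers
        tag = V[i][tag][1]
        out.append(tag)
    return list(reversed(out))

def train(data, tagset, iterations=5):
    w = defaultdict(float)
    for _ in range(iterations):
        for words, gold in data:
            pred = viterbi(words, w, tagset)
            if pred != gold:                 # made an error: update
                for k, v in features(words, gold).items():
                    w[k] += v
                for k, v in features(words, pred).items():
                    w[k] -= v
    return w

toy = [(["Hal", "is", "at", "UMD"], ["B-PER", "O", "O", "B-ORG"])]
tags = ["B-PER", "I-PER", "B-ORG", "I-ORG", "O"]
weights = train(toy, tags)
print(viterbi(["Hal", "is", "at", "UMD"], weights, tags))
```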
All of the nice properties of perceptron are maintained for this algorithm. Plus, for sequence labeling, we can solve the argmax problem efficiently using a variant of the Viterbi algorithm.
If you want to know more about structured prediction, I'm giving a guest lecture in Lise Getoor's ML class on Nov 30, 3:30-4:45, in CSIC 1121.
08 November 2010
04 November 2010
Lecture 18: Linear Models for Learning
Decision trees are easy to understand, but they often die because they keep splitting up the training data.
Another idea is to think of a classification as being a weighted linear combination of features. In other words, each feature (word) gets a weight. Highly (positive) weighted words are really indicative of class 1, and highly negative weighted words are really indicative of class -1. We add up the weights for all the words that occur and use the sign of this as a prediction.
This is the idea behind the perceptron algorithm, which dates back to the 1960s. It is online, which means that it processes one example at a time. For each example, it makes a prediction. If this prediction is right, it does nothing. If it's wrong, it makes a small update:
- For each example (x,y):
- Compute prediction: y' = sum_{words in x} weight(word)
- If sign(y') = y, celebrate and continue
- otherwise we've made an error:
- For each word in x:
- Set weight(word) := weight(word) + y
- (In other words, if y is negative, reduce the weight; if y is positive, increase the weight)
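Here's the same update loop as a minimal runnable sketch, for bag-of-words binary classification with labels in {-1, +1}. The toy documents are made up.

```python
# A minimal bag-of-words perceptron.
from collections import defaultdict

def predict(weights, words):
    return sum(weights[w] for w in words)

def train_perceptron(data, iterations=10):
    weights = defaultdict(float)
    for _ in range(iterations):
        for words, y in data:
            y_hat = 1 if predict(weights, words) > 0 else -1
            if y_hat != y:              # error: nudge word weights toward y
                for w in words:
                    weights[w] += y
    return weights

data = [("this movie was great fun".split(), +1),
        ("terrible boring waste of time".split(), -1)]
w = train_perceptron(data)
print(predict(w, "great fun".split()) > 0)   # True
```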
It turns out that if your data is such that there is a set of weights that would enable us to get perfect classifications on the training data, the perceptron will eventually find it! (Of course by that time it may have overfit, so you'll probably want to stop it before it's converged.)
There is a linear algebra interpretation of the perceptron: it's optimizing a weight vector so that w*x is >0 for positive examples and w*x is <0 for negative examples. Here, "*" means "dot product." If you interpret dot products as projections, then it's trying to find a weight vector so that when the data points are projected onto this vector, the positive points all lie to one side, and the negative points all lie to the other side. (Note that the usual pictures you see for this are not representative of NLP applications, where our vectors are usually corners of a very high dimensional hypercube.)
Once you think of it like this, you can think about trying to directly optimize w according to some loss function, which leads to a whole host of models, like linear regression, ridge regression, logistic regression, support vector machines, boosting, etc., depending on how you choose your loss function.
The basic idea is to define a cost, something like "cost = 1 if the prediction is wrong, and cost = 0 if the prediction is right." Now, try to find a w that minimizes this cost. Unfortunately, this cost is discrete, which makes it hard (actually NP-hard) to optimize. So we relax it and say something like "cost gets bigger as the prediction gets more wrong" which is now continuous and easy to optimize.
The problem is that big weights are bad news: it means that our classification decision is really sensitive to just a few features. We'd like to keep the weights small. This thinking leads to regularization by penalizing large weights. The standard penalization is to penalize the sum of squared weights, basically because this is continuous and differentiable.
So we're left with a cost that looks like "sum of costs on training examples + lambda * regularizer", where lambda controls the overfitting/underfitting tradeoff.
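To make the shape of that objective concrete, here's a tiny gradient-descent sketch using squared loss and an L2 penalty. The choice of loss, the step size and the two data points are all made up for illustration; they aren't the specific loss functions discussed above.

```python
# Minimize: sum_i (w.x_i - y_i)^2 + lam * ||w||^2 by gradient descent.
def objective(w, data, lam):
    loss = sum((sum(wi * xi for wi, xi in zip(w, x)) - y) ** 2 for x, y in data)
    return loss + lam * sum(wi ** 2 for wi in w)

def gradient_step(w, data, lam, step=0.01):
    grad = [2 * lam * wi for wi in w]            # gradient of the regularizer
    for x, y in data:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for i, xi in enumerate(x):
            grad[i] += 2 * err * xi              # gradient of the squared loss
    return [wi - step * gi for wi, gi in zip(w, grad)]

data = [([1.0, 0.0], 1.0), ([0.0, 1.0], -1.0)]
w = [0.0, 0.0]
for _ in range(100):
    w = gradient_step(w, data, lam=0.1)
print(w, objective(w, data, lam=0.1))
```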
02 November 2010
Lecture 17: Classification with Decision Trees
Machine learning is all about predicting the future given the past.
A particular type of machine learning problem that crops up in all sorts of NLP applications is classification. We get an object and try to put it into one of K buckets. For example, spam filtering (K=2) or sentiment detection (K=2) or news versus sports versus entertainment (K=3). When K=2 it's called binary classification. Otherwise, multiclass.
The general setup is:
- Someone gives us a bunch of examples (x) together with their correct labels (y) -- this is our training data.
- We learn some function h (for "hypothesis", maybe I'll call it f) that maps x to y
- Then at "test time" someone gives us a bunch more (x) and we have to predict the corresponding (y)s
- We are evaluated based on how accurate we were at this prediction
Key question #1: Why don't I just memorize the training data and look it up at test time?
Key question #2: Why don't I just figure out what the most popular class (y) is and always report that?
Bad news: In general, in theory, learning is impossible.
Good news: In general, in practice, learning is EASY and works even when you might not expect it to.
Key concept is a feature vector. We take our object and extract features of that object, which is the representation that the learning will use. Good "feature engineering" is often worth more than good machine learning. (I say that even as a machine learning person!)
A very simple learning algorithm is a decision tree. It's like playing "20 questions." The idea is to pretend that you have data, and get to pick only one feature to look at. You pick that feature (let's pretend it's binary) and then split your data based on that feature. We now have two data sets. For each of these, we recurse and pick a new feature (it need not be the same on the two branches). At the end, we'll be down to a "leaf" (only one data point left) in which case we don't need to ask any more questions. (Or we'll have run out of features.)
Running the decision tree all the way to the leaves is like storing the examples in a database (why?). Running a zero-level decision tree is like remembering the most popular class (why?).
In general, we want something in between, and to choose the depth heuristically.
To do that, we'll set aside a bit of our training data as "development data" (or "validation data") in order to pick the "best" depth. (Or other stopping criteria.) Important point: why can't I just use the training data?
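For concreteness, here's a minimal sketch of picking the depth on development data using scikit-learn (assuming a reasonably recent version is available). The random features and labels are just stand-ins for real feature vectors.

```python
# Pick the decision-tree depth that does best on held-out dev data.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.rand(500, 10)                          # stand-in feature vectors
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)      # stand-in labels

X_train, X_dev, y_train, y_dev = train_test_split(X, y, test_size=0.2,
                                                  random_state=0)
best_depth, best_acc = None, -1.0
for depth in range(1, 21):
    tree = DecisionTreeClassifier(max_depth=depth).fit(X_train, y_train)
    acc = tree.score(X_dev, y_dev)             # accuracy on dev data
    if acc > best_acc:
        best_depth, best_acc = depth, acc
print(best_depth, best_acc)
```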
01 November 2010
Owen Rambow talk today IN AVW 3165!!!! 11am.
In the 80s and 90s of the last century, in subdisciplines such as planning, text generation, and dialog systems, there was considerable interest in modeling the cognitive states of interacting autonomous agents. Theories such as Speech Act Theory (Austin 1962), the belief-desire-intentions model of Bratman (1987), and Rhetorical Structure Theory (Mann and Thompson 1988) together provide a framework in which to link cognitive state with language use. However, in general natural language processing (NLP), little use was made of such theories, presumably because of the difficulty at the time of some underlying tasks (such as syntactic parsing). In this talk, I propose that it is time to again think about the explicit modeling of cognitive state for participants in discourse. In fact, that is the natural way to formulate what NLP is all about. The perspective of cognitive state can provide a context in which many disparate NLP tasks can be classified and related. I will present two NLP projects at Columbia which relate to the modeling of cognitive state: Discourse participants need to model each other's cognitive states, and language makes this possible by providing special morphological, syntactic, and lexical markers. I present results in automatically determining the degree of belief of a speaker in the propositions in his or her utterance.
Bio: PhD from University of Pennsylvania, 1994, working on German syntax. My office mate was Philip Resnik. I worked at CoGentex, Inc (a small company) and AT&T Labs -- Research until 2002, and since then at Columbia as a Research Scientist. My research interests cover both the nuts-and-bolts of languages, specifically syntax, and how language is used in context.
29 October 2010
28 October 2010
Lecture 16: Computational Semantics
A recent development in NLP-land is the interest in "semantic role labeling." I tend to think of this as "Parsing++", but it's basically a halfway point between "syntax" (which we've already passed a bit) and "semantics" (which we don't know how to do).
The idea in SRL is to find predicates (mostly verbs) and identify the arguments of these verbs (eg., Agents, Patients, Experiencers, Themes, etc.). For example:
- John ate a sandwich
- For the predicate ate, John is the agent and sandwich is the patient
- A sandwich was eaten by John
- Same as above.
- Bill saw John eat a sandwich
- For the predicate ate, it's the same as above
- For the predicate saw, the agent is Bill
The general approach to doing semantic role labeling is a pipeline of "first parse, second find verbs, third find the arguments to those verbs." (There's some argument as to whether parsing is strictly necessary, but it's clearly at least helpful.)
The biggest challenge here is finding the arguments. Basically we'll go over the parse tree, find potential arguments (basically internal nodes in the tree), and then try to use statistical information to figure out which nodes correspond to which arguments (or none).
- John saw a man with a telescope.
First, we'll identify all potential argument candidates:
- John
- a
- man
- a man
- with
- a
- telescope
- a telescope
- with a telescope
- John
- man
- a man
- telescope
- a telescope
- with a telescope
A more sane approach is to use instead (or in addition) tree paths. This idea has come up in a lot of places, and often works pretty well. If we look at the parse of this sentence:
- (S (NP John) (VP (V saw) (NP (DT a) (NN man)) (PP (P with) (NP (DT a) (NN telescope)))))
There are lots of ways to represent this path, but a baseline might be to say that the path looks like:
- up:VP down:3:PP down:2:NP down:2:NN
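Here's a small sketch of extracting exactly that path with nltk (assuming nltk is installed); the up:/down: encoding follows the example above, and the helper names are my own:

import nltk

parse = nltk.Tree.fromstring(
    "(S (NP John) (VP (V saw) (NP (DT a) (NN man)) "
    "(PP (P with) (NP (DT a) (NN telescope)))))")

def tree_path(tree, src_pos, dst_pos):
    # longest common prefix of the two tree positions = lowest common ancestor
    k = 0
    while k < min(len(src_pos), len(dst_pos)) and src_pos[k] == dst_pos[k]:
        k += 1
    steps = []
    # walk up from the source to the common ancestor...
    for depth in range(len(src_pos) - 1, k - 1, -1):
        steps.append("up:" + tree[src_pos[:depth]].label())
    # ... then down to the destination, recording 1-based child indices
    for depth in range(k + 1, len(dst_pos) + 1):
        child = dst_pos[depth - 1] + 1
        steps.append("down:%d:%s" % (child, tree[dst_pos[:depth]].label()))
    return " ".join(steps)

# positions of the predicate (V saw) and a candidate argument head (NN telescope)
verb = next(p for p in parse.treepositions()
            if isinstance(parse[p], nltk.Tree) and parse[p].label() == "V")
tele = next(p for p in parse.treepositions()
            if isinstance(parse[p], nltk.Tree) and parse[p].leaves() == ["telescope"])
print(tree_path(parse, verb, tele))  # up:VP down:3:PP down:2:NP down:2:NN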
However we encode the path, we now need to estimate something like P(means | path). A naive approach would be to Bayes' rule this to get P(means) * P(path | means) and then we could model the path with a Markov model. Or, if you've taken machine learning, you can use the path as a collection of features in some learning algorithm (or if you're being fancy, turn it into a kernel!). We'll talk about this more when we talk about learning.
As a final note, Matthew Gerber and Joyce Chai got best paper at ACL 2010 for their paper Beyond NomBank: A Study of Implicit Arguments for Nominal Predicates, where they look at similar things for nominal predicates rather than verbal predicates. In this case, I agree: this is a great paper.
26 October 2010
Lecture 15: Computational Lexical Semantics
Today's topic is word sense disambiguation. This is a very "heated" topic and we'll discuss in class (not on the public blog!) why this might be!
Here are some examples of the verb "drive"... which are different senses?
- Can you drive this four-wheel truck?
- We drove to the university every morning.
- She drove me to school every day.
- He drives me mad.
- She is driven by her passion.
- They drove back the invaders.
- She finally drove him to change jobs.
- He drove a nail into the wall.
- The player drove the ball far out into the field.
- She is driving away at her doctoral thesis.
- What are you driving at?
- My new truck drives well.
- He drives for the taxi company in DC.
- The car drove around the corner.
- The farmers drove the cows into the barn.
- We drive the beltway to work.
- He drove a golf ball across the street.
- I hit the ball with a bat and drove it into left field.
- The workers drove a tunnel through the hill.
- These engines are driven by steam.
- The hunter drove the forest looking for deer.
- After finding deer, he drove it into open ground.
Two of the most important concepts in WSD are:
- One sense per discourse (which may or may not be true...)
- One sense per collocation (which also may or may not be true...)
21 October 2010
HW07 is posted
(I'm trying to get better about posting these early, like in the beginning of the semester when I wasn't as behind schedule on everything!)
Lecture 14: Linguistic Challenges
Final project details:
- Groups, just like normal projects.
- Scope: something that with an extra month of work could turn into a paper at a major conference (ACL, EMNLP, NAACL, etc....)
- Four styles of projects:
- Reimplement some cool, interesting system from the past 2 or 3 years in an area of NLP/CL that you think is interesting. (+1 month would be to figure out what it does wrong, propose some new idea, and then get a paper out of it.)
- Take some problem you care about from your own research and look at how NLP can help it. (You're a better judge than I am about turning this into a paper, but at least you care about the topic!)
- Pick some interesting and under-studied linguistic phenomenon (like the things we're talking about today) and do a study of this phenomenon. Things not in English are totally allowed, but it would help if you knew the language you were working with. If you want to do data collection using mechanical turk, I have a budget of $200 for this: we can see how many people want to do this and then try to divide fairly (note: fairly is not the same as evenly).
- Something else you can convince me is interesting...
- You will be graded entirely on your writeup, which should look more or less like a 4-6 page ACL workshop paper (you're encouraged to use the ACL style files).
- Metaphor. This can be narrowly construed ("John devoured his steak like a lion") or broadly construed ("John drove Mary crazy"). The usual theory is that there are concrete, physical verbs/nouns; and then when we want to talk about something abstract, we ground it in one of these physical verbs. So a metaphor will usually consist of one abstract thing and one physical thing. There are interesting things to be done both on interpreting metaphors as well as generating metaphors. I have a small amount of data.
- Quantifier degree. I can say "Most men are jerks" and I can say "Most jerks drink beer" but does this mean that most men drink beer (or rather, that I think that most men drink beer?). What about other quantifiers? How consistent is this across different statements ("Most men are jerks" versus "Most animals are invertebrates" -- is the meaning of "most" the same)? I don't have any data, but you could collect it.
- Metonymy. This is when I say "The White House" and really mean "Obama" (or at least I mean "someone from the White House"). This combines with ideas of selectional preferences, where we might be able to figure out that the agent (or subject) of "said" should be animate, and might know that the White House is not animate, and so it must be metonymic. Can we identify metonymic usages? Can we resolve them? I have a reasonable amount of data.
- Quantifier scope. This is a classic semantics problem. "All the women on the island have a husband." Does this mean that each woman has her own husband, or that there's one man who is married to all of the women simultaneously? Similarly, "Each man entered the bar with a dog" -- how many dogs are there? I have no data.
- Negation scope. This is perhaps particularly interesting in the case of sentiment analysis. "I did not like the plot but liked the acting." -- Does "not" scope over just "like the plot" or over the full "like the plot but liked the acting"? There is also an interaction with quantifiers ("No man should marry a leprechaun"), but this is both rare and hard. I have no data, but I think we could find some.
- Language change. We have English documents going back 500 years, but English back then looks a lot different than English now. Moreover, our part of speech taggers and parsers that work well now won't work well on Shakespeare (I would guess :P). Can we track changes back in time, and use this information to get today's models to do a good job way back when? I have a lot of data.
- Traces and Ellipsis. "John went to the store but Jerry didn't ______" -- what was it that Jerry didn't do. This is a form of a trace. The Penn Treebank comes with some but not all traces spelled out. Try to figure out where they should be and what they refer to.
- Exaggeration. No one has ever looked at this problem. Okay that's not true, but only because it's an overstatement. Can we automatically identify overstatements in text? Can we tone them down or at least identify the degree of overstatement? I have no data, but it shouldn't be too hard to find some.
- Emotion from text. Can we identify characters' emotions by looking at what they say, for instance in novels? I have data of quoted speech, paired with emotional strength -- can we predict emotional strength given the quoted speech and perhaps the context in which it was spoken? I have lots of data.
- Language and X. Often we can get language paired with objects in another medium, for instance images and captions. Can we identify correspondences? Or perhaps generate captions from images? We have (or can get) similar data for comments and code, or formal specifications (variant of first order logic) and descriptions. Or database queries and text. I have some data, depending on what you want "X" to be.
- Shannon game (word/character prediction) with domain knowledge.... The Shannon game is a way of measuring entropy of a language. Perhaps make an online game (like hangman?) where we can get people to play this game, and then can also tell them things like "this is a news story" or "this is a children's book" or "this is a medical text" or something and see how it affects their ability to guess the next word.
- "Translating" from tweets to language that your parents could understand. We have lots of SMS/twitter data, and lots of normal text. A lot of this is abbreviation expansion, but some is not. Can we figure out which terms in tweets should be expanded, and then try to find their expansions in normal text? There are at least two types ("something" -> "sth" or "laugh out loud" -> "lol"), but maybe more. We can get lots of data easily.
- Identifying character relationships in literature. We can track character interactions in literature, but the relationship between these characters is not known. And perhaps it changes over time. Can we figure out how two characters in a novel feel about each other on the basis of what they do/say to each other? I have some data and we can get a lot more easily.
- Many other languages have phenomena that English does not. For example, subject dropping happens in lots of languages (eg., Japanese) when it's clear from context. Also stuff in the Semantic Analysis chapter of J+M is pretty much up for grabs.
19 October 2010
Lecture 13: Interpretation as Abduction
"Interpretation as Abduction" is one of the more infuential computational semantics papers from its era (and really, ever).
The basic claim is that through (weighted) abduction, one can explain a whole host of complex linguistic phenomena:
- Reference resolution ("John_1 saw himself_1 in the mirror" -- where "John" and "himself" co-refer)
- Interpretation of compound nominals ("computer science" versus "cs department")
- Syntactic ambiguity (eg., PP attachment)
- Metonymy ("The White House announced that..." really means that someone in the White House announced something)
- Etc....
Abduction is the type of inference we try to do in science. We observe something, and we try to posit what caused this to happen. For instance, we may observe that apples fall from trees. We can posit this is due to gravity. Or we can posit that this is due to little elves that hang out in apples and little pots of gold in the earth and the elves are trying to get to the pots of gold. One of these is a more plausible explanation than the other, since it requires us to invent fewer things. (Though as far as I know, no one really knows how gravity works... so maybe it's a bad example... maybe it is elves/gold.) This is often referred to as Occam's Razor: of all explanations, choose the "simplest."
The key idea to use abduction to do sentence interpretation is:
- Prove the logical form of the sentence
- together with the constraints that predicates impose on their arguments (eg., the subject of "run" must be animate)
- allowing for coercions (eg., we might coerce someone who works at the White House into the White House)
- Merging redundancies where possible
- Making assumptions where necessary
In weighted abduction, we assume that there is a cost for coercions and making assumptions, and that we want to find the set of things we can assume that have the least possible cost.
The idea is that our database might contain entries like:
- If P and Q then R
- If P / w1 and Q / w2 then R
If we have redundancy... eg., something like "blah blah q(x) and q(y) blah blah" and at some point we derive x=y, then we can merge this to just "blah blah q(x) blah blah" and the cost of q(x) is the minimum of the costs of q(x) and q(y).
For example, we might have:
- (forall x) car(x) / 0.8 AND notop(x) / 0.4 IMPLIES convertible(x)
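Here's a toy illustration (my own, not from the paper) of how such a weighted rule gets used: each antecedent is either already proven (free) or must be assumed at weight times the goal's cost.

# score one candidate proof of convertible(x): antecedents already in the
# database cost nothing; anything else is assumed at weight * goal_cost
def abduction_cost(goal_cost, antecedents, known):
    total = 0.0
    for literal, weight in antecedents:
        if literal not in known:
            total += weight * goal_cost
    return total

rule = [("car(x)", 0.8), ("notop(x)", 0.4)]

# nothing known: assuming both antecedents costs 1.2 * goal cost, which is
# worse than just assuming convertible(x) itself at cost 1.0, so don't bother
print(abduction_cost(1.0, rule, known=set()))       # 1.2 (modulo floating point)
# but if car(x) is already in the database, we only pay for notop(x)
print(abduction_cost(1.0, rule, known={"car(x)"}))  # 0.4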
One kind of cool outcome of this is that you can make "reverse" inferences fairly easily. For instance, maybe I want to be able to assume that sometimes when people talk about animals, they're actually talking about cats. But they're being vague. I can just add a rule:
- (forall x) animal(x) AND etc(x) IMPLIES cat(x)
It turns out (see section 6 in the paper) that you can actually handle all of syntax, semantics and pragmatics in this single framework!
15 October 2010
P2 deadline and ML midterm
I've been informed that there's a midterm in Lise's ML class on the 19th, which coincides with the P2 deadline.
To try to help you not lose too much sleep, I've pushed the deadline for P2 back to Friday night at 10p. However, to be fair to those who don't have ML, or who have planned their schedule around the deadline on the 19th, I'll give you 15% extra credit if you hand it in by the original deadline (19th). If you do hand it in early, please make a note in your writeup that you've done so. Hopefully 15% is enough to balance out anything that you would lose by not being able to compete up to the last minute with your classmates on the competition parts.
Also, a note on pruning:
This is now in the .pdf writeup, but:
- Be sure that you prune all cells, including the right-most ones. Your labeled F measure for unpruned should be about 78%.
- Don't forget to debinarize before you evaluate!
- If you prune every cell, including the root, with K=2, your labeled F measure should be about 23%. You should have about 83 unparsable sentences.
- If you prune every cell except the root (which is a totally sensible thing to do -- thanks to several of you who have pointed this out), your labeled F should be about 59%, and you should have about 38 unparsable sentences. (As you can see, it's a really good idea not to prune the root, especially since it doesn't make things any slower -- In fact, it might actually make things faster!)
- You can implement it either of the above two ways -- just be sure to document which you're doing.
- Be sure to prune before applying unary rules. Thus, even with K=2, you will often end up with more than 2 items in a chart cell. However, only 2 of these will have binary derivations; all the rest should have unary derivations.
14 October 2010
Lecture 12: First Order Logic
So far we've talked a lot about syntax. Of course, syntax isn't the end-all be-all of language processing: we want to "understand" language. Often this means mapping natural language sentences to some form of formal language sentences. This means we have to decide on a formal language. A common choice is first order logic (FOL).
Expressions in FOL take one of the following forms:
- Atomic symbols like "Hal"
- Relation symbols like "Eats(.,.)", for instance in "Eats(Hal, Sandwich)". Relations can have arbitrary arity (i.e., take arbitrarily many arguments)
- Variables like "a" (see later for quantification)
- Combinations via and, or, not, implies, iff, etc. Eg., Eats(Hal, Sandwich) implies not OnThePlate(Sandwich)
- Existential and universal quantifiers. For example, "(exists a) Eats(Hal, a)" (which roughly means "Hal eats something.")
- Equality symbol. For example, "(exists a) Eats(Hal, a) and (a = Sandwich)"
FOL is a nice language to map into because we know a lot about how to reason in FOL. Reasoning is, of course, hard (take an AI class), but sort of do-able.
The general way in which FOL systems work is:
- We have a bunch of background knowledge expressed as "axioms" -- basically a database of FOL sentences
- We get a text, from which we derive new FOL sentences
- Someone issues a query, which we will try to "prove" given our collection of sentences. A proof of a query is an answer to the question
- Someone might query "Who shot Lincoln?"
- We map this to something like "(exists a) Shot(a, Lincoln)"
- We try to prove this statement given our database, which might include some statement like "Shot(Booth, Lincoln)". Thus, "a=Booth" is a "proof" of our query, and gives us the answer
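Here's a minimal sketch of that query-as-proof idea, with facts as tuples and query variables written as strings starting with "?". It's toy pattern matching, not a real theorem prover:

FACTS = {("Shot", "Booth", "Lincoln"),
         ("Eats", "Hal", "Sandwich")}

def match(query, fact):
    # return a variable binding if query unifies with fact, else None
    if len(query) != len(fact):
        return None
    binding = {}
    for q, f in zip(query, fact):
        if q.startswith("?"):            # variable: bind it (consistently)
            if binding.get(q, f) != f:
                return None
            binding[q] = f
        elif q != f:                     # constant mismatch
            return None
    return binding

def prove(query):
    return [b for fact in FACTS if (b := match(query, fact)) is not None]

# "Who shot Lincoln?"  ->  (exists a) Shot(a, Lincoln)
print(prove(("Shot", "?a", "Lincoln")))  # [{'?a': 'Booth'}]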
The common assumption (which is not always true) is that language is compositional. Namely, the meaning of a sentence is derived from the meanings of its parts. For examples:
- Hal ate a sandwich with mayo.
- Parses to (Hal (ate (a sandwich (with mayo))))
- Bottom up, we can look at (mayo) and derive "(exists a) Mayo(a)"
- Then (sandwich (with mayo)) and derive "(exists b) Sandwich(b) and Contains(b, ___)" where ___ comes from the interpretation of (with mayo), yielding "(exists a,b) Mayo(a) and Sandwich(b) and Contains(b,a)"
- Then (a sandwich ...), takes what we had before and makes "b" unique, usually written as "(exists a, exists!b) ..."
- Then we take Hal and get "Hal"
- And finally, "x ate y" will give us something like "(exists e) Eat(e, x, y) and InThePast(e)" which when we plug in with x and y gives the final interpretation:
(exists a, exists !b, exists e) Mayo(a) and Sandwich(b) and Contains(b,a) and Eat(e, Hal, b) and InThePast(e)
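A toy sketch of the compositionality idea, where each constituent's meaning is a small function and the sentence meaning falls out of applying them bottom-up (the helper names and exact strings are just illustrative, not a real semantic grammar):

# each constituent's meaning is a little function; applying them bottom-up
# builds the logical form
def mayo():
    return "Mayo(a)"

def sandwich_with(modifier):
    # "sandwich with X": the sandwich b contains whatever the modifier describes
    return "Sandwich(b) and Contains(b,a) and " + modifier

def ate(subject, object_form):
    return ("(exists a, b, e) " + object_form +
            " and Eat(e, " + subject + ", b) and InThePast(e)")

print(ate("Hal", sandwich_with(mayo())))
# (exists a, b, e) Sandwich(b) and Contains(b,a) and Mayo(a) and Eat(e, Hal, b) and InThePast(e)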
In order to make this happen, we usually need a semantic grammar that tells us how to get from syntax to semantics. The grammar will consist of rules that are akin to the type of rules we applied to do the semantic analysis of the above example sentence.
Key challenges with FOL:
- What are the right sets of symbols and relations?
- Where does the necessary background knowledge come from?
- You can try to get humans to do it for you: see the Cyc project
- Or you can try to extract it automatically: see ReadTheWeb or ODIE or TextRunner, for example
- How to get the semantic grammar
- You can hand code it (see verbnet and the unified verb index for a verb-centric view).
- Or you can try to learn it from examples (see Ray Mooney's talk yesterday)
- Sometimes need more than FOL
- How to handle "Many students love NLP"?
- Or "John did something yesterday" -- for this we may need second order logic, where you can quantify over relations -- "(exists R, e) R(e, John) and InThePast(e)"
12 October 2010
Project 2 warning on pruning results
I had a bug in my pruning, and wasn't pruning the right-most cells. (It was an off-by-one error.) Since English is mostly right-branching, the cells that touch the right edge (i.e., end of sentence) are incredibly important, so not pruning them makes life a lot better than otherwise. I'm working on fixing it right now and checking to make sure there aren't other bugs. I'll post a comment to this post when it's done and I've updated the results in the .pdf, but your pruning results should be much worse than what is in the writeup (i.e., with K=2, you should get no parse on 80-something sentences, and your f-score should be in the 10s or 20s).
Project 2 stuff...
I fixed the P2 pdf (just the one linked on the web page, not the one in the .tar.gz file) so that it includes unary productions in the examples. This is only relevant for part 2 -- nothing else changed.
Also, I'm not sure if I included them previously, but here are the links for the online test scripts:
HW05 is posted
(Sorry for the delay -- there's just one question :P)
Lecture 11: Incorporating Context
Many languages express some form of agreement. In English it is mostly Subject/Verb agreement (i.e., "He eats a sandwich" versus "They eat a sandwich"). In other languages agreement can be more widespread: Adjective/Noun, Determiner/Noun, Verb/Object, Noun/Relative Clause, etc.
Suppose we want to capture this with a (P)CFG.... the only real way to do this is to expand our set of non-terminals. I.e., we might start with a grammar like:
S -> NP VP
NP -> Pro
NP -> Det Noun
VP -> Verb NP
Pro -> he
Pro -> they
Verb -> eat
Verb -> eats
Det -> a
Which will generate our good sentences, but also bad sentences like "He eat a sandwich" and "They eats a sandwich". We can fix this by adding labels to our non-terminals:
S -> NP+sing VP+sing
S -> NP+plur VP+plur
NP+sing -> Pro+sing
NP+plur -> Pro+plur
NP+sing -> Det Noun+sing
NP+plur -> Det Noun+plur
VP+sing -> Verb+sing NP
VP+plur -> Verb+plur NP
Pro+sing -> he
Pro+plur -> they
Verb+sing -> eats
Verb+plur -> eat
Det -> a
Now life is great: we will no longer produce our bogus sentences. (Note that we would also have to account for "a" being singular, etc...)
Of course there are other phenomena we might want to capture, for instance pronouns sitting in object position have to change case: "A sandwich eats him" not "A sandwich eats he". If we ignore the singular/plural distinction, we can solve this separately.
S -> NP+nom VP
VP -> Verb NP+acc
NP+nom -> Pro+nom
NP+acc -> Pro+acc
NP -> Det Noun
Pro+nom -> he
Pro+acc -> him
Pro+nom -> they
Pro+acc -> them
And so on...
We've now solved the nominative/accusative issue, so we can combine the sing/plur grammar with the nom/acc grammar. The problem is exponential blowup: handling one roughly doubles the size of the grammar, and handling the other doubles it again. In general, for every agreement-like phenomenon we want to handle, we'll double the size of our grammar again. This is bad (a) because it makes our grammar huge and (b) because we'll need tons and tons and tons of data to get reliable estimates of these probabilities from a treebank!
(Other English phenomena you might want to capture are selectional preferences -- which are really more semantics issues than syntactic issues; or determiner/noun agreement.)
One solution to this problem is to augment our grammar with features. That is, basically do what linguists do and write "+sing" on the pre-terminal, but not split the grammar into bits. Instead, we need to come up with ways of propagating these features up the tree, and checking for disagreement. This process is known as "unification".
The idea of unification is that everything has a bunch of features with values. Values can be atomic (number=sing) or disjunctions (number={sing,plur}) or variables (number=A). Two values unify basically if they have a non-empty intersection (it's actually more complicated than this, but this is the basic idea). So sing and sing unify and produce sing. Similarly sing and {sing,plur} unify and produce sing. A and sing unify and produce sing with a note off on the side of the page that A=sing. sing and plur don't unify. Our grammar will now encode what unifications happen at each step (eg., when producing an S, you want to unify the number feature of the NP and VP; or when producing a VP, you want to unify the object NP with "case=accusative" and likewise for subject NPs).
Our grammar is now a set of context free productions and unification rules. One nice thing is that chart parsing basically still works, but our parse cells become more complicated because they have to remember the values of features. But at least our grammars don't blow up!
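Here's a minimal sketch of that unification operation on feature values, with atomic values, disjunctions (as frozensets), and variables (uppercase strings) resolved through a bindings dictionary; the real machinery is more involved, but this captures the examples above:

def is_var(v):
    return isinstance(v, str) and v.isupper()   # e.g. "A" is a variable

def unify(a, b, bindings):
    # unify two feature values; return the result, or None if they clash
    if is_var(a):
        a = bindings.get(a, a)
    if is_var(b):
        b = bindings.get(b, b)
    if is_var(a) and is_var(b):
        bindings[b] = a
        return a
    if is_var(a):
        bindings[a] = b
        return b
    if is_var(b):
        bindings[b] = a
        return a
    # both concrete: treat values as sets of possibilities and intersect
    sa = a if isinstance(a, frozenset) else frozenset([a])
    sb = b if isinstance(b, frozenset) else frozenset([b])
    merged = sa & sb
    if not merged:
        return None
    return next(iter(merged)) if len(merged) == 1 else merged

bindings = {}
print(unify("sing", "sing", bindings))                       # sing
print(unify("sing", frozenset({"sing", "plur"}), bindings))  # sing
print(unify("A", "sing", bindings), bindings)                # sing {'A': 'sing'}
print(unify("sing", "plur", bindings))                       # None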
07 October 2010
Lecture 10: Statistical Parsing
Context free grammars are great, but as with any X, if X is great, then probabilistic X is greater still!
A probabilistic context free grammar (PCFG) is a CFG with "weights" attached to each of the rules. These weights must be non-negative. For any left-hand-side, the sum of weights must be one. The following is a small PCFG:
S 0.5 -> NP VP
S 0.5 -> VP
NP 0.7 -> Det Noun
NP 0.2 -> Proper
NP 0.1 -> NP PP
VP 0.6 -> Verb NP
VP 0.3 -> Verb
VP 0.1 -> VP PP
Note that for any given LHS (eg., "NP") the sum of weights (probabilities) is one.
The probability of a tree is just the product of probabilities of rules used in that tree.
We can generate random sentences with a PCFG by starting at "S" and then picking a rule at random (weighted by the probabilities) and then recursively generating the corresponding children.
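A small sketch of that generation procedure for the toy PCFG above (I've added a PP rule and made-up lexical rules so the example bottoms out in actual words):

import random

PCFG = {
    "S":      [(0.5, ["NP", "VP"]), (0.5, ["VP"])],
    "NP":     [(0.7, ["Det", "Noun"]), (0.2, ["Proper"]), (0.1, ["NP", "PP"])],
    "VP":     [(0.6, ["Verb", "NP"]), (0.3, ["Verb"]), (0.1, ["VP", "PP"])],
    "PP":     [(1.0, ["Prep", "NP"])],      # added so PPs expand to something
    "Det":    [(1.0, ["the"])],             # made-up lexical rules
    "Noun":   [(1.0, ["sandwich"])],
    "Proper": [(1.0, ["Hal"])],
    "Verb":   [(1.0, ["ate"])],
    "Prep":   [(1.0, ["with"])],
}

def generate(symbol="S"):
    if symbol not in PCFG:                  # it's a word, not a non-terminal
        return [symbol]
    probs, rhss = zip(*PCFG[symbol])
    rhs = random.choices(rhss, weights=probs, k=1)[0]   # pick a rule at random
    return [word for child in rhs for word in generate(child)]

print(" ".join(generate()))   # e.g. "the sandwich ate Hal"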
The probability of a sentence under this grammar is obtained by summing (marginalizing) out all the trees that can generate that sentence.
To parse with a PCFG, we just parse with the corresponding CFG, but then weigh items in the chart by their probabilities. If we find a higher-probability derivation of the same item in the chart, we throw out the old one and replace it with the new one. In CKY this is easy.
A treebank is a collection of human-labeled trees. Given a treebank, we can extract the corresponding PCFG by counting productions and normalizing. Most trees are not binary so you need to binarize your rules.
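Counting and normalizing is only a few lines with nltk's Tree and productions() helpers (assuming nltk is installed); the two-tree "treebank" here is obviously made up:

from collections import Counter, defaultdict
import nltk

treebank = [   # two made-up trees standing in for a real treebank
    nltk.Tree.fromstring("(S (NP (D the) (N man)) (VP (V ate) (NP (D a) (N sandwich))))"),
    nltk.Tree.fromstring("(S (NP (D the) (N sandwich)) (VP (V likes) (NP (D the) (N man))))"),
]

counts = Counter(prod for tree in treebank for prod in tree.productions())
lhs_totals = defaultdict(int)
for prod, c in counts.items():
    lhs_totals[prod.lhs()] += c

# relative frequency estimate: count and normalize per left-hand side
pcfg = {prod: c / lhs_totals[prod.lhs()] for prod, c in counts.items()}
for prod, p in sorted(pcfg.items(), key=str):
    print("%.2f  %s" % (p, prod))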
Note on smoothing: you have to smooth if you want your grammar to be able to parse out-of-sample trees. Smoothing can be done just like in HMMs, but you want to be careful not to compute stuff that is irrelevant.
Note on efficiency: The grammars we produce are huge (tens or hundreds of thousands of rules). As a result, paying O(N^3|G|) is often prohibitive, even for N around 30 or 40. A simple solution is pruning: after you process each cell in the chart, throw out its low probability items. You can either do count-based pruning (only keep the top 5 in each cell), ratio-based pruning (only keep the ones within a factor of 10 probability of the best one), or probability-based pruning (only keep the ones with probability at least 10e-6). Usually the last sucks, but the first two work reasonably well.
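Here's what the three pruning strategies look like on a single chart cell, represented as a dictionary from label to probability (the cell contents and thresholds are just illustrative):

def prune_count(cell, k=5):            # keep only the top k items
    return dict(sorted(cell.items(), key=lambda kv: -kv[1])[:k])

def prune_ratio(cell, factor=10.0):    # keep items within a factor of the best
    best = max(cell.values())
    return {label: p for label, p in cell.items() if p * factor >= best}

def prune_prob(cell, threshold=1e-6):  # keep items above an absolute threshold
    return {label: p for label, p in cell.items() if p >= threshold}

cell = {"NP": 0.02, "VP": 0.015, "S": 1e-4, "PP": 3e-7}
print(prune_count(cell, k=2))   # {'NP': 0.02, 'VP': 0.015}
print(prune_ratio(cell))        # NP and VP survive
print(prune_prob(cell))         # everything except PP survives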
04 October 2010
No class Tue 05 Oct 2010
Due to Hal being sick again. Sigh.
So here's a puzzle to entertain you while you're not learning all about treebanking and PCFGs....
Suppose Hal presents symptoms that make it 90% certain that he has some illness X. And suppose that the rate of transmission of X is 5%. Suppose that the average rate of incidence of X in a general population is 0.01%. What is the probability that you would get sick if we had class tomorrow (and you showed up)? There are 45 students in class: what's the probability that at least one would get sick? Finally, suppose that your utility function is defined as U(learn about pcfgs and don't get sick) = +10, U(don't learn and don't get sick) = 0, U(learn and get sick) = -20, U(don't learn and get sick) = -100. (How annoying would that be.) What is the expected utility if I do teach class? What is the expected utility if I cancel class? Am I doing the right thing?
Feel free to post solutions :).
01 October 2010
Project 2 is posted, due Oct 19
It's all about parsing. The writeup is really long, but don't be too intimidated by it. The project is all about grammars and parsing, so go off and do syntax!
28 September 2010
P1: free token for everyone!
Everyone should have one extra token, as a token (hahahahahahaha) of good will.
Remember that P1 is due at 2:20pm, and that HW04 is due next Tuesday before class.
27 September 2010
Lecture 9: Dynamic programming and the CKY algorithm
We will begin by briefly attempting to recover from my confusing explanation of X-bar theory, and talk about specifiers, complements and adjuncts. We'll then finish up the DP versus NP distinction and IP for S, and then movement. We'll then get to the computational side...
Parsing as search
At a basic level, the parsing problem is to find a tree rooted at S (or IP...) that spans the entire sentence. We can think of this as a (bottom-up) search problem over partial hypotheses (eg., hypotheses that span an initial segment of the sentence).
- At any given time, we've built a partial tree over the first n words
- Want to extend it to include the n+1st word
- A search step involves:
- Choose a rule from the grammar
- Choose existing constituents that match that rule
- Create a new constituent
Dynamic Programming
Search can be expensive because we end up re-creating the same constituents over and over again. This suggests dynamic programming. For the following, we will assume our grammar has been binarized, so rules always look like "A -> B C". Unary rules (eg., "A -> B") are do-able, but are a simple extension.
We will write sentences as "0 The 1 man 2 ate 3 a 4 sandwich 5" where the numbers indicate positions. A partial hypothesis has a label (eg., "NP") and a span (eg, "from 0 to 2"). We will write this as NP[0,2].
The key insight in dynamic programming is to store a chart. A chart is a two-dimensional array of (eg.) sets. Chart[i,j] will store all possible constituents that can span the range from i to j.
Initialization:
- For each position i=1..|sentence|
- For each possible part of speech P for word i
- Insert P into Chart[i-1,i]
- (This means that we've found a way to cover i-1 to i using non-terminal P)
- For phrase lengths l=2..|sentence|
- For each start position i=0..|sentence|-l
- Set end position k=i+l
- For each split point j=i+1..k-1
- For each grammar rule X -> Y Z
- If we can find a Y in Chart[i,j] and a Z in Chart[j,k]:
- Add X to Chart[i,k] with backpointers ala Viterbi
- (This means we've found a way to cover i to k using the non-terminal X)
Time complexity is O(|sentence|^3 * |Grammar|).
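Here's the chart-filling pseudocode above turned into a runnable CKY recognizer for a binarized grammar; the tiny grammar and lexicon are invented, and backpointers are left out to keep it short:

from collections import defaultdict

BINARY = {                    # binarized rules X -> Y Z, keyed by (Y, Z)
    ("NP", "VP"): {"S"},
    ("D", "N"): {"NP"},
    ("V", "NP"): {"VP"},
}
LEXICON = {                   # possible parts of speech for each word
    "the": {"D"}, "man": {"N"}, "sandwich": {"N"}, "ate": {"V"},
}

def cky(words):
    n = len(words)
    chart = defaultdict(set)              # chart[(i, k)] = labels spanning i..k
    for i, w in enumerate(words):         # initialization
        chart[(i, i + 1)] |= LEXICON.get(w, set())
    for length in range(2, n + 1):        # phrase lengths
        for i in range(0, n - length + 1):    # start positions
            k = i + length                    # end position
            for j in range(i + 1, k):         # split points
                for y in chart[(i, j)]:
                    for z in chart[(j, k)]:
                        chart[(i, k)] |= BINARY.get((y, z), set())
    return chart

chart = cky("the man ate the sandwich".split())
print("S" in chart[(0, 5)])   # True: the whole sentence parses as an S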
23 September 2010
Lecture 8: Context Free Grammars
Context free grammars:
* strictly more powerful way of expressing languages than a regular expression
* its languages are called context free languages
* many human languages are approximately context-free (there are always exceptions), while most are not regular
Context free grammars consist of:
* An alphabet (the "words" in the language)
* A set of non-terminals
* A "start" non-terminal
* A set of rewrite rules for getting from non-terminals to either other non-terminals or terminals
Simple context free grammar:
S -> NP VP
NP -> D N
VP -> V NP
D -> "the"
D -> "a"
N -> "man"
N -> "sandwich"
V -> "ate"
V -> "likes"
This is actually a finite language (what are the sentences in this language)?
Derivations are "proofs" that a sentence belongs to the language by giving a/the sequence of productions that leads to that sentence.
We'll usually draw trees as proofs, but other formats are more useful as text, for instance: (S (NP (D the) (N sandwich)) (VP (V ate) (NP (D the) (N man))))
In such grammars, we will often call the things just above the terminals (parts of speech) "pre-terminals".
We can make this grammar more fun by adding rules:
NP -> NP PP
VP -> VP PP
PP -> P NP
P -> "with"
Now we have an infinite grammar that can generate strings like "the man with the man with the sandwich with the man with the ..."
Note also that this grammar is now ambiguous. "the man likes the man with the sandwich" could either be (S (NP the man) (VP (V likes) (NP (NP the man) (PP (P with) (NP the sandwich))))) or (S (NP the man) (VP (VP (V likes) (NP the man)) (PP (P with) (NP the sandwich)))). The first one has the "correct" meaning, while the second would mean that he uses the sandwich to like the man.
A simple context free language that is not regular is:
S -> a S b
S -> ab
Which describes the language "{ a^n b^n : n > 0}" of strings of equal numbers of as and bs. You can show this language is not regular formally, or you can recognize that an FSM would kind of have to have N-many states to recognize the finite version of this language up to sequences of length N.
A popular approach for context free models of human language is "X-bar" theory. Here, we have base categories (X0 = N,V,A,P), then "projections" of these categories: X' (aka X-bar), and X'' (aka X-double-bar aka XP). The general form of a phrase is (XP (YP (X' (X0 ZP)))), where YP is the "specifier", X0 is the head, and ZP is the complement (argument). Only full projections can fit in as adjuncts, specifiers or complements.
The idea is that the arguments fit in the "bar" slots, and the adjuncts fit in the "phrase" slots. Roughly, an argument is something that is sort-of necessary and an adjunct is sort-of optional. Usually you can have arbitrarily many adjuncts, but only a fixed number of arguments. For instance in "the man ate a sandwich of turkey with lettuce on a plate with gusto at the restaurant" the verb "ate" has one argument (sandwich) and two adjuncts ("with gusto" and "at the restaurant") while "sandwich" has three adjuncts ("of turkey", "with lettuce" and "on a plate").
Here are some example sentences in X-bar style analyses:
- (PP (DP (D' (D so))) (P' (ADVP (ADV' (ADV completely))) (P' (P in) (NP (N' (DP (D' (D the))) (N wrong))))))
- (IP (NP Poirot) (I' (Infl will) (VP (V' (V abandon)) (NP (N' (DP (D' (D the))) (N investigation))))))
- Identical structure for the second half of "I did not expect Poirot to abandon the investigation."
- Under some theories, (IP (NP Poirot) (I' (Infl +ed) (VP (V abandon) (NP the investigation)))) is appropriate
(Warning: by now, lots of people no longer think NPs are really headed by nouns, but are rather headed by determiners, so that what we usually think of as an "NP" is really a "DP." That is, instead of "(NP (N' (DP (D' (D the))) (N man)))" you would have "(DP (D' (D the) (NP (N' (N man)))))". In this class, we'll always use NP and not DP, but you should know that this is slightly outdated.)
21 September 2010
HW3: you may ignore the part about Chomsky normal form
If you've already done it, or choose to do it anyway, I'll give you a tiny bit of extra credit. I thought we'd be a bit further along by now when I wrote that homework a few weeks ago. (You should still make a semi-valiant effort on the first part of the CFG question).
16 September 2010
Lecture 6: Part of Speech Tagging
Woohoo, some linguistics!
Parts of speech are useful for several things:
- Knowing how to pronounce a word (object noun vs verb)
- Knowing what words come around it (personal pronouns vs possessive pronouns)
- Knowing how it behaves syntactically -- most verbs behave the same and are "substitutable" from a syntactic perspective (though probably not from a semantic perspective)
Note: not all parts of speech exist in every language, and if a word is in one part of speech in language X, its translation isn't necessarily the same part of speech in language Y. (Eg., beautiful in English/Korean, or green in English/Japanese).
Some major open classes:
- Nouns: things. Concrete (ship, chair), abstract (happiness) or verb-like terms "gerunds" (running), proper "Hal"
- Verbs: actions. Eat, drink, sleep; English has auxiliaries but these are usually considered closed class (see below)
- Adjectives: mostly modify nouns (blue, big, old)
- Adverbs: mostly modify verbs (slowly, very) but also other weird things (every, unfortunately, not)
Some closed class:
- Prepositions: the bunny ran ___ the hill
- Determiners: articles: a, an, the; others: this blog
- Pronouns: (personal) she, I, who, (possessive) my, her, (wh) what, how
- Conjunctions: and, but, either/or
- Auxiliary verbs: would, could, can, to -- but not the preposition "to"
- Particles: like prepositions, but go with verbs (put down, raise up, etc.)
- Numerals: one, two, three, first, second, third, etc.
Rule based approaches usually consist of:
- A small lexicon that covers some common open-class words and most common closed-class words
- A set of rules for disambiguation
- A set of heuristics for guessing the tag of unknown (open class) words
- Does it end in "ed"? "ly"? "ing"? "s"?
- Is it capitalized? (Doesn't apply to P1)
Rules get more complicated...
- Noun/Verb disambiguation is usually the biggest issue
- Is there a determiner or adjective to my left? How far?
- Am I at the end of a sentence?
- Is there an auxiliary to my left? An adverb?
- There's also Verb/Aux disambiguation
- Verb Aux is a rare order, but Aux Verb is common
- Is there a verb anywhere in the sentence?
- Noun/WH-complementizer disambiguation
- John saw that book
- John saw that book burn up
- Is there a verb to the right, nearby?
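Here's a toy version of this rule-based recipe, with a tiny lexicon, suffix heuristics for unknown words, and one contextual disambiguation rule (all of the word lists and rules are just for illustration, not a real tagger):

LEXICON = {"the": "DET", "a": "DET", "he": "PRO", "they": "PRO",
           "man": "NOUN", "sandwich": "NOUN", "eats": "VERB", "ate": "VERB"}

def guess_unknown(word):              # suffix heuristics for unknown words
    if word.endswith("ly"):
        return "ADV"
    if word.endswith("ing") or word.endswith("ed"):
        return "VERB"
    return "NOUN"                     # default open-class guess

def tag(words):
    tags = [LEXICON.get(w.lower(), guess_unknown(w.lower())) for w in words]
    # one contextual rule: a "verb" right after a determiner is probably a noun
    for i in range(1, len(tags)):
        if tags[i] == "VERB" and tags[i - 1] == "DET":
            tags[i] = "NOUN"
    return list(zip(words, tags))

print(tag("the man eats a sandwich slowly".split()))
# [('the', 'DET'), ('man', 'NOUN'), ('eats', 'VERB'), ('a', 'DET'),
#  ('sandwich', 'NOUN'), ('slowly', 'ADV')]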
15 September 2010
Slightly updated notes.sty file
If you're annoyed with the "solution" environment in the notes.sty file that I've provided, please download the new one off the web page. It is much friendlier and will let you have better equations, line breaks, etc., inside solutions. (Thanks to your fellow student, Ke Wu, for the patch!)
14 September 2010
Lecture 5: N-gram models
Today we'll talk first about estimation of:
- Bernoulli random variables (coin flips)
- p(Heads) ~= Count(Heads) / Count(total flips)
- Multinomial random variables (die rolls)
- p(die comes up "4") ~= Count("4"s) / Count(total rolls)
- Now think about a giant die with |Vocabulary|-many sides!
We'll then run through the chain rule to break p(s) = p(w_1 w_2 ... w_N) = prod_n p(w_n | w_{1..n-1}) and then apply a Markov assumption to the terms in the product to arrive at the n-gram language model.
This language model will be a product of Multinomial random variables, so we can apply everything we know about estimation.
Finally, we'll talk about:
- Smoothing (adding fake counts)
- Structural zeros versus statistical zeros
(Warning: the treatment we give of out of vocabulary -- "OOV" -- terms will not be very detailed!)
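Here's a small sketch of a bigram version with add-one smoothing, i.e. p(w_n | w_{n-1}) ~= (count(w_{n-1} w_n) + 1) / (count(w_{n-1}) + V); the two-sentence corpus and the <s>/</s> boundary markers are made up for illustration:

from collections import Counter
import math

corpus = [["the", "man", "ate", "a", "sandwich"],
          ["the", "man", "likes", "the", "sandwich"]]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    toks = ["<s>"] + sent + ["</s>"]
    unigrams.update(toks[:-1])                  # contexts (everything but </s>)
    bigrams.update(zip(toks[:-1], toks[1:]))
V = len(set(w for s in corpus for w in s)) + 2  # vocabulary size incl. markers

def logprob(sentence):
    toks = ["<s>"] + sentence + ["</s>"]
    lp = 0.0
    for prev, cur in zip(toks[:-1], toks[1:]):
        lp += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + V))
    return lp

print(logprob(["the", "man", "ate", "the", "sandwich"]))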
Slight change to P1 "tagger" script
The old rules were that if you ran the online tagger at time X, you would have to wait until time X+24 hours to run it again.
To be nice to people who have work schedules, this has been "softened." The new policy is that every 24 hours you earn one "token." You spend tokens by running the tagger. However, you can only keep a total of three tokens at any one time.
The effect of this policy is that you could continue doing one handin every 24 hours, or you could do two every 48 hours, or three every 72 hours. But not more than that. So on average it's the same, but you get a bit more flexibility.
The script is a bit more complicated now (the logic is harder), so please let me know if you run into any bugs.
Also, if you run the handin and just enter your ID, but not a file to upload, it will tell you how many tokens you have left and, if it's less than three, the date/times that those tokens were used so you can figure out how long until you get a new one (you get a new one when the oldest one "dies" -- i.e., 72 hours pass since the oldest one).
Hopefully this is clear :).
13 September 2010
HW03 is posted
(You'll need to look at it for the HMM part of P1.)
10 September 2010
Project matchmaking
If you need, or want, one or more partners for the first project (you're free to switch up between projects if you'd like), please email me now (before you forget!). We'll have a 5-10 minute matchmaking session at the end of class on Tuesday. When you email me, please give me your name and your department (CS, EE, Linguistics, etc...) and whether you have a team at all (eg., a team of two who wants one more) or are currently alone (i.e., need one or two people to make a team).
09 September 2010
Lecture 4: Probability and statistics
You should be sure that you are comfortable with the basic rules of probability:
- Conditional probabilities: what does p(a|b) mean?
- Marginalization: p(a) = sum_b p(a,b)
- Chain rule: p(a,b) = p(a) p(b|a) = p(b) p(a|b)
- Iterated chain rule: p(a,b,c,d,...,z) = p(a) p(b|a) p(c|a,b) ... p(z|a,b,c,...,y)
- Bayes' rule: p(a|b) = p(a) p(b|a) / p(b)
- Generalized Bayes' rule: how to do BR on p(a|b,c) or p(a,b|c)
- Expectations: E_p[f(x)] = sum_x p(x) f(x)
You should also be comfortable with some basic statistics:
- Maximum likelihood estimates
- = relative frequencies
- = counting and normalizing
- The difference between
- Structural zeros
- Statistical zeros
- The importance of smoothing
07 September 2010
Project 1 is posted, due 30 Sept
Use this space to ask questions / discuss the project.
[Update 14 Sep] Several people have asked what tools/data are allowed for the rule-based tagger and for the "do as well as you can" part. Here is an attempt to clarify. My goal isn't to make life hard on you, but to not give unfair advantages to some subsets of the class.
- For the rule-based tagger, any rules are fair game, but it would be better if you made your tagger WITHOUT looking at the labeled training data. That is, feel free to look at the sentences, but pretend that you don't have access to the tags. After all, that's how it would be in the real world. Yes, this means that you'll have to use some linguistics. That's not a terrible thing :).
- For the last part of the project, where you have to build the best tagger possible, people have asked about using external tools. For this project, I would ask that anything you use beyond normal library / unix tools stuff, you implement yourself. Yes, you could probably do better using well-engineered tools, but that probably won't help you learn as much as you will if you do something slightly less good, but do it on your own.
- (Note that this final rule will not be the case for the final class project: there, anything is fair game!)
06 September 2010
How to get a little extra credit...
Students always ask me about extra credit opportunities at the end of the semester. This is your opportunity to be pro-active. The CL lab here has a regular colloquium series, which has a shared calendar here. Roughly every other week we have an external speaker coming through. The first few are Roger Levy (psycho/cognitive computational linguistics from UCSD), William Cohen (machine learning and information extraction, CMU) and Eugene Charniak (parsing and everything else NLP, Brown). (Eugene is actually my great-grand-advisor!). I highly encourage all of these talks, and they should all be quite different. We'll have more folks posted on the schedule as the semester moves on. At any rate, if you go to the talk, and write a "response" that's approximately one page long, I'll give you some extra credit. You'll get (up to) 3 points (one homework) for the first one you do, (up to) 2 points for the second and third, and 1 point for the fourth and fifth. That's 9 points of possible extra credit. You won't get further credit for more than five, but of course I encourage you to keep going to the seminars :).