Everyone should have one extra token, as a token (hahahahahahaha) of good will.
Remember that P1 is due at 2:20pm, and that HW04 is due next Tuesday before class.
28 September 2010
27 September 2010
Lecture 9: Dynamic programming and the CKY algorithm
We will begin by briefly attempting to recover from my confusing explanation of X-bar theory, and talk about specifiers, complements and adjuncts. We'll then finish up the DP-versus-NP distinction and IP-for-S, and then movement. We'll then get to the computational side...
Parsing as search
At a basic level, the parsing problem is to find a tree rooted at S (or IP...) that spans the entire sentence. We can think of this as a (bottom-up) search problem over partial hypotheses (e.g., hypotheses that span an initial segment of the sentence).
- At any given time, we've built a partial tree over the first n words
- Want to extend it to include the n+1st word
- A search step involves:
- Choose a rule from the grammar
- Choose existing constituents that match that rule
- Create a new constituent
Dynamic Programming
Search can be expensive because we end up re-creating the same constituents over and over again. This suggests dynamic programming. For the following, we will assume our grammar has been binarized, so rules always look like "A -> B C". Unary rules (e.g., "A -> B") can be handled as well, via a simple extension.
We will write sentences as "0 The 1 man 2 ate 3 a 4 sandwich 5" where the numbers indicate positions. A partial hypothesis has a label (e.g., "NP") and a span (e.g., "from 0 to 2"). We will write this as NP[0,2].
The key insight in dynamic programming is to store a chart. A chart is a two-dimensional array of (e.g.) sets. Chart[i,j] will store all possible constituents that can span the range from i to j.
Initialization:
- For each position i=1..|sentence|
- For each possible part of speech P for word i
- Insert P into Chart[i-1,i]
- (This means that we've found a way to cover i-1 to i using non-terminal P)
- For phrase lengths l=2..|sentence|
- For each start position i=0..|sentence|-l
- Set end position k=i+l
- For each split point j=i+1..k-1
- For each grammar rule X -> Y Z
- If we can find a Y in Chart[i,j] and a Z in Chart[j,k]:
- Add X to Chart[i,k] with backpointers, à la Viterbi
- (This means we've found a way to cover i to k using the non-terminal X)
Time complexity is O(|sentence|^3 * |Grammar|).
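To make the loop structure concrete, here is a minimal recognizer sketch in Python. It assumes the grammar is handed to us as two dictionaries (a lexicon from words to preterminals, and a map from right-hand-side pairs to left-hand sides); those names and that representation are just illustrative, not part of P1.

from collections import defaultdict

def cky_recognize(words, lexicon, binary_rules):
    """lexicon: word -> set of preterminals; binary_rules: (Y, Z) -> set of X for X -> Y Z."""
    n = len(words)
    chart = defaultdict(set)               # chart[i, k] = labels spanning positions i..k

    # Initialization: preterminals over single words.
    for i, w in enumerate(words):
        chart[i, i + 1] |= lexicon.get(w, set())

    # Build longer spans from shorter ones.
    for length in range(2, n + 1):         # phrase lengths l = 2..n
        for i in range(n - length + 1):    # start positions
            k = i + length                 # end position
            for j in range(i + 1, k):      # split points
                for Y in chart[i, j]:
                    for Z in chart[j, k]:
                        chart[i, k] |= binary_rules.get((Y, Z), set())

    return "S" in chart[0, n]

(In a real parser you would also store backpointers, as in Viterbi, so the tree can be read back out.)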
24 September 2010
23 September 2010
Lecture 8: Context Free Grammars
Context free grammars:
* are a strictly more powerful way of expressing languages than regular expressions
* generate what are called context-free languages
* many human languages are approximately context-free (there are always exceptions), while most are not regular
Context free grammars consist of:
* An alphabet (the "words" in the language)
* A set of non-terminals
* A "start" non-terminal
* A set of rewrite rules ("productions") for expanding a non-terminal into a sequence of other non-terminals and/or terminals
Simple context free grammar:
S -> NP VP
NP -> D N
VP -> V NP
D -> "the"
D -> "a"
N -> "man"
N -> "sandwich"
V -> "ate"
V -> "likes"
This is actually a finite language (what are the sentences in this language?).
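One way to answer that question is to just expand the grammar exhaustively; here is a quick throwaway sketch (it only terminates because the language above is finite):

from itertools import product

GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["D", "N"]],
    "VP": [["V", "NP"]],
    "D":  [["the"], ["a"]],
    "N":  [["man"], ["sandwich"]],
    "V":  [["ate"], ["likes"]],
}

def expand(symbol):
    """All word sequences derivable from symbol."""
    if symbol not in GRAMMAR:                 # terminal
        return [[symbol]]
    results = []
    for rhs in GRAMMAR[symbol]:
        for combo in product(*(expand(s) for s in rhs)):
            results.append([w for piece in combo for w in piece])
    return results

for sentence in expand("S"):
    print(" ".join(sentence))                 # 2*2*2*2*2 = 32 sentences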
Derivations are "proofs" that a sentence belongs to the language: a derivation gives a sequence of productions that leads from the start symbol to that sentence.
We'll usually draw trees as proofs, but other formats are more useful as text, for instance: (S (NP (D the) (N sandwich)) (VP (V ate) (NP (D the) (N man))))
In such grammars, we will often call the things just above the terminals (parts of speech) "pre-terminals".
We can make this grammar more fun by adding rules:
NP -> NP PP
VP -> VP PP
PP -> P NP
P -> "with"
Now the grammar generates an infinite language, including strings like "the man with the man with the sandwich with the man with the ..."
Note also that this grammar is now ambiguous. "the man likes the man with the sandwich" could either be (S (NP the man) (VP (V likes) (NP (NP the man) (PP (P with) (NP the sandwich))))) or (S (NP the man) (VP (VP (V likes) (NP the man)) (PP (P with) (NP the sandwich)))). The first one has the "correct" meaning, while the second would mean that he uses the sandwich to like the man.
A simple context free language that is not regular is:
S -> a S b
S -> a b
This describes the language { a^n b^n : n > 0 }: strings consisting of some number of a's followed by the same number of b's. You can show formally that this language is not regular, or you can observe that an FSM would need on the order of N states just to recognize the finite version of this language restricted to strings of length up to N.
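For intuition, here is a tiny sketch contrasting the two: checking a^n b^n is trivial with a counter (more memory than any finite-state machine has), while a regular expression can only check the shape, not that the counts match.

import re

def is_anbn(s):
    """Accept exactly a^n b^n with n > 0."""
    half = len(s) // 2
    return len(s) > 0 and len(s) % 2 == 0 and s == "a" * half + "b" * half

shape_only = re.compile(r"^a+b+$")            # right shape, wrong language

for s in ["ab", "aabb", "aaabbb", "aab", "abab"]:
    print(s, is_anbn(s), bool(shape_only.match(s)))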
A popular approach for context free models of human language is "X-bar" theory. Here, we have base categories (X0 = N, V, A, P), then "projections" of these categories: X' (aka X-bar) and X'' (aka X-double-bar, aka XP). The general form of a phrase is (XP YP (X' X0 ZP)), where YP is the "specifier", X0 is the head, and ZP is the complement (argument). Only full projections (XPs) can fill the specifier, complement, or adjunct slots.
The idea is that the arguments fit in the "bar" slots, and the adjuncts fit in the "phrase" slots. Roughly, an argument is something that is sort-of necessary and an adjunct is sort-of optional. Usually you can have arbitrarily many adjuncts, but only a fixed number of arguments. For instance, in "the man ate a sandwich of turkey with lettuce on a plate with gusto at the restaurant", the verb "ate" has one argument (the sandwich) and two adjuncts ("with gusto" and "at the restaurant"), while "sandwich" has three adjuncts ("of turkey", "with lettuce", and "on a plate").
Here are some example sentences in X-bar style analyses:
- (PP (DP (D' (D so))) (P' (ADVP (ADV' (ADV completely))) (P' (P in) (NP (N' (DP (D' (D the))) (N wrong))))))
- (IP (NP Poirot) (I' (Infl will) (VP (V' (V abandon)) (NP (N' (DP (D' (D the))) (N investigation))))))
- Identical structure for the second half of "I did not expect Poirot to abandon the investigation."
- Under some theories, (IP (NP Poirot) (I' (Infl +ed) (VP (V abandon) (NP the investigation)))) is appropriate
(Warning: by now, lots of people no longer think NPs are really headed by nouns, but are rather headed by determiners, so that what we usually think of as an "NP" is really a "DP." That is, instead of "(NP (N' (DP (D' (D the))) (N man)))" you would have "(DP (D' (D the) (NP (N' (N man)))))". In this class, we'll always use NP and not DP, but you should know that this is slightly outdated.)
21 September 2010
HW3: you may ignore the part about Chomsky normal form
If you've already done it, or choose to do it anyway, I'll give you a tiny bit of extra credit. I thought we'd be a bit further along by now when I wrote that homework a few weeks ago. (You should still make a semi-valiant effort on the first part of the CFG question.)
16 September 2010
Lecture 6: Part of Speech Tagging
Woohoo, some linguistics!
Parts of speech are useful for several things:
- Knowing how to pronounce a word ("OBject" the noun vs. "obJECT" the verb)
- Knowing what words come around it (personal pronouns vs possessive pronouns)
- Knowing how it behaves syntactically -- most verbs behave the same and are "substitutable" from a syntactic perspective (though probably not from a semantic perspective)
Note: not all parts of speech exist in every language, and if a word is in one part of speech in language X, its translation isn't necessarily the same part of speech in language Y. (E.g., beautiful in English/Korean, or green in English/Japanese.)
Some major open classes:
- Nouns: things. Concrete (ship, chair), abstract (happiness), verb-like "gerunds" (running), or proper nouns ("Hal")
- Verbs: actions. Eat, drink, sleep; English has auxiliaries but these are usually considered closed class (see below)
- Adjectives: mostly modify nouns (blue, big, old)
- Adverbs: mostly modify verbs (slowly, very) but also other weird things (every, unfortunately, not)
Some closed class:
- Prepositions: the bunny ran ___ the hill
- Determiners: articles: a, an, the; others: this blog
- Pronouns: (personal) she, I, who, (possessive) my, her, (wh) what, how
- Conjunctions: and, but, either/or
- Auxiliary verbs: would, could, can, to -- but not the preposition "to"
- Particles: like prepositions, but go with verbs (put down, raise up, etc.)
- Numerals: one, two, three, first, second, third, etc.
Rule based approaches usually consist of:
- A small lexicon that covers some common open-class words and most common closed-class words
- A set of rules for disambiguation
- A set of heuristics for guessing the tag of unknown (open class) words
- Does it end in "ed"? "ly"? "ing"? "s"?
- Is it capitalized? (Doesn't apply to P1)
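Here is a minimal sketch of that recipe. The tag names, the tiny lexicon, and the suffix rules are made-up placeholders for illustration, not the actual P1 tag set or rules.

LEXICON = {
    "the": "DET", "a": "DET", "an": "DET",
    "he": "PRON", "she": "PRON", "it": "PRON",
    "and": "CONJ", "in": "PREP", "on": "PREP",
    "can": "AUX", "will": "AUX",
}

def guess_unknown(word):
    """Suffix heuristics for open-class words not in the lexicon."""
    if word.endswith("ly"):
        return "ADV"
    if word.endswith("ing") or word.endswith("ed"):
        return "VERB"
    if word.endswith("s"):
        return "NOUN"        # crude: plurals, but also 3rd-person verbs
    return "NOUN"            # default guess for open-class words

def tag(words):
    return [(w, LEXICON.get(w.lower(), guess_unknown(w.lower()))) for w in words]

print(tag("the dog barked loudly".split()))
# [('the', 'DET'), ('dog', 'NOUN'), ('barked', 'VERB'), ('loudly', 'ADV')]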
Rules get more complicated...
- Noun/Verb disambiguation is usually the biggest issue
- Is there a determiner or adjective to my left? How far?
- Am I at the end of a sentence?
- Is there an auxiliary to my left? An adverb?
- There's also Verb/Aux disambiguation
- "Verb Aux" is a rare order, but "Aux Verb" is common
- Is there a verb anywhere in the sentence?
- Noun/WH-complementizer disambiguation
- John saw that book
- John saw that book burn up
- Is there a verb to the right, nearby?
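Contextual rules like the ones above might be sketched as a second pass over an initial tagging; again, the tag names and the two rules below are illustrative assumptions, not a complete system:

def disambiguate_noun_verb(tagged):
    """Revisit NOUN/VERB decisions using the tag immediately to the left."""
    result = []
    for word, tag in tagged:
        prev_tag = result[-1][1] if result else None
        if tag == "VERB" and prev_tag in ("DET", "ADJ"):
            tag = "NOUN"     # "the run", "a quick run"
        if tag == "NOUN" and prev_tag == "AUX":
            tag = "VERB"     # "will run"
        result.append((word, tag))
    return result

print(disambiguate_noun_verb([("the", "DET"), ("run", "VERB")]))
print(disambiguate_noun_verb([("they", "PRON"), ("will", "AUX"), ("run", "NOUN")]))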
15 September 2010
Slightly updated notes.sty file
If you're annoyed with the "solution" environment in the notes.sty file that I've provided, please download the new one off the web page. It is much friendlier and will let you have better equations, line breaks, etc., inside solutions. (Thanks to your fellow student, Ke Wu, for the patch!)
14 September 2010
Lecture 5: N-gram models
Today we'll talk first about estimation of:
- Bernoulli random variables (coin flips)
- p(Heads) ~= Count(Heads) / Count(total flips)
- Multinomial random variables (die rolls)
- p(die comes up "4") ~= Count("4"s) / Count(total rolls)
- Now think about a giant die with |Vocabulary|-many sides!
We'll then run through the chain rule to break p(s) = p(w_1 w_2 ... w_N) = prod_n p(w_n | w_{1..n-1}) and then apply a Markov assumption to the terms in the product to arrive at the n-gram language model.
This language model will be a product of Multinomial random variables, so we can apply everything we know about estimation.
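As a concrete (toy-sized) illustration, here is a sketch of maximum likelihood bigram estimation with add-one smoothing; the corpus and the boundary markers are made up for the example.

from collections import Counter

corpus = [
    "<s> the man ate the sandwich </s>".split(),
    "<s> the man likes the man </s>".split(),
]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((u, v) for sent in corpus for u, v in zip(sent, sent[1:]))
V = len(unigrams)                        # vocabulary size

def p_bigram(w, prev, alpha=1.0):
    """Add-alpha smoothed estimate of p(w | prev)."""
    return (bigrams[(prev, w)] + alpha) / (unigrams[prev] + alpha * V)

def p_sentence(words):
    """Markov (bigram) approximation of p(w_1 ... w_N)."""
    prob = 1.0
    for prev, w in zip(words, words[1:]):
        prob *= p_bigram(w, prev)
    return prob

print(p_sentence("<s> the man ate the man </s>".split()))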
Finally, we'll talk about:
- Smoothing (adding fake counts)
- Structural zeros versus statistical zeros
(Warning: the treatment we give of out of vocabulary -- "OOV" -- terms will not be very detailed!)
Slight change to P1 "tagger" script
The old rules were that if you ran the online tagger at time X, you would have to wait until time X+24 hours to run it again.
To be nice to people who have work schedules, this has been "softened." The new policy is that every 24 hours you earn one "token." You spend tokens by running the tagger. However, you can only keep a total of three tokens at any one time.
The effect of this policy is that you could continue doing one handin every 24 hours, or you could do two every 48 hours, or three every 72 hours. But not more than that. So on average it's the same, but you get a bit more flexibility.
The script is a bit more complicated now (the logic is harder), so please let me know if you run into any bugs.
Also, if you run the handin and just enter your ID, but not a file to upload, it will tell you how many tokens you have left and, if it's less than three, the date/times that those tokens were used so you can figure out how long until you get a new one (you get a new one when the oldest one "dies" -- i.e., 72 hours pass since the oldest one).
Hopefully this is clear :).
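If it helps, the accounting can be pictured as a rolling 72-hour window; here is an illustrative sketch of the bookkeeping (not the actual handin script):

from datetime import datetime, timedelta

WINDOW = timedelta(hours=72)             # a spent token comes back 72 hours later
MAX_TOKENS = 3

def tokens_left(past_uses, now):
    """Three tokens, minus however many were spent in the last 72 hours."""
    recent = [t for t in past_uses if now - t < WINDOW]
    return MAX_TOKENS - len(recent)

uses = [datetime(2010, 9, 14, 10, 0), datetime(2010, 9, 15, 9, 0)]
print(tokens_left(uses, datetime(2010, 9, 16, 12, 0)))   # -> 1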
13 September 2010
HW03 is posted
(You'll need to look at it for the HMM part of P1.)
10 September 2010
Project matchmaking
If you need, or want, one or more partners for the first project (you're free to switch up between projects if you'd like), please email me now (before you forget!). We'll have a 5-10 minute matchmaking session at the end of class on Tuesday. When you email me, please give me your name, your department (CS, EE, Linguistics, etc.), and whether you already have a partial team (e.g., a team of two that wants one more) or are currently alone (i.e., need one or two people to make a team).
09 September 2010
Lecture 4: Probability and statistics
You should be sure that you are comfortable with the basic rules of probability:
- Conditional probabilities: what does p(a|b) mean?
- Marginalization: p(a) = sum_b p(a,b)
- Chain rule: p(a,b) = p(a) p(b|a) = p(b) p(a|b)
- Iterated chain rule: p(a,b,c,d,...,z) = p(a) p(b|a) p(c|a,b) ... p(z|a,b,c,...,y)
- Bayes' rule: p(a|b) = p(a) p(b|a) / p(b)
- Generalized Bayes' rule: how to do BR on p(a|b,c) or p(a,b|c)
- Expectations: E_p[f(x)] = sum_x p(x) f(x)
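If you want to sanity-check those identities, here is a throwaway sketch that verifies the chain rule and Bayes' rule numerically on a made-up joint distribution over two binary variables:

p_joint = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}

def p_a(a):                                  # marginalization: p(a) = sum_b p(a,b)
    return sum(p for (x, _), p in p_joint.items() if x == a)

def p_b(b):
    return sum(p for (_, y), p in p_joint.items() if y == b)

def p_b_given_a(b, a):                       # p(b|a) = p(a,b) / p(a)
    return p_joint[(a, b)] / p_a(a)

def p_a_given_b(a, b):
    return p_joint[(a, b)] / p_b(b)

a, b = 1, 0
assert abs(p_joint[(a, b)] - p_a(a) * p_b_given_a(b, a)) < 1e-12              # chain rule
assert abs(p_a_given_b(a, b) - p_a(a) * p_b_given_a(b, a) / p_b(b)) < 1e-12   # Bayes' rule
print("chain rule and Bayes' rule check out")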
You should also be comfortable with some basic statistics:
- Maximum likelihood estimates
- = relative frequencies
- = counting and normalizing
- The difference between
- Structural zeros
- Statistical zeros
- The importance of smoothing
07 September 2010
Project 1 is posted, due 30 Sept
Use this space to ask questions / discuss the project.
[Update 14 Sep] Several people have asked what tools/data are allowed for the rule-based tagger and for the "do as well as you can" part. Here is an attempt to clarify. My goal isn't to make life hard on you, but to not give unfair advantages to some subsets of the class.
- For the rule-based tagger, any rules are fair game, but it would be better if you made your tagger WITHOUT looking at the labeled training data. That is, feel free to look at the sentences, but pretend that you don't have access to the tags. After all, that's how it would be in the real world. Yes, this means that you'll have to use some linguistics. That's not a terrible thing :).
- For the last part of the project, where you have to build the best tagger possible, people have asked about using external tools. For this project, I would ask that anything you use beyond normal library / unix tools stuff, you implement yourself. Yes, you could probably do better using well-engineered tools, but that probably won't help you learn as much as you will if you do something slightly less good, but do it on your own.
- (Note that this final rule will not be the case for the final class project: there, anything is fair game!)
06 September 2010
How to get a little extra credit...
Students always ask me about extra credit opportunities at the end of the semester. This is your opportunity to be pro-active. The CL lab here has a regular colloquium series, which has a shared calendar here. Roughly every other week we have an external speaker coming through. The first few are Roger Levy (psycho/cognitive computational linguistics from UCSD), William Cohen (machine learning and information extraction, CMU) and Eugene Charniak (parsing and everything else NLP, Brown). (Eugene is actually my great-grand-advisor!). I highly encourage all of these talks, and they should all be quite different. We'll have more folks posted on the schedule as the semester moves on. At any rate, if you go to the talk, and write a "response" that's approximately one page long, I'll give you some extra credit. You'll get (up to) 3 points (one homework) for the first one you do, (up to) 2 points for the second and third, and 1 point for the fourth and fifth. That's 9 points of possible extra credit. You won't get further credit for more than five, but of course I encourage you to keep going to the seminars :).
Lecture 3: Regular languages and finite state machines
A language, in a formal sense, is simply a set of strings. Languages can be defined explicitly, as in L = { the , a , an }, which is a language containing three strings, or implicitly, as in L = all valid English sentences (to the extent that the latter is well defined).
Languages can be finite or infinite. The first example is finite. The second is (probably) infinite, for instance:
- Bob watched Alice eat the apple.
- Claire watched Bob watch Alice eat the apple.
- Doug watched Claire watch Bob watch Alice eat the apple.
- Elif watched Doug watch Claire watch Bob watch Alice eat the apple.
- ... and so on
A simpler way to get an infinite language is as follows:
- Bob is happy.
- Bob is very happy.
- Bob is very very happy.
- Bob is very very very happy.
- ... and so on
Regular expressions are built from pieces like the following (a short usage example follows the list):
- /a/ means the string "a"
- /ab/ means match "a" followed by "b"
- /a*/ means a sequence of zero or more "a"s
- /a?/ means an optional "a"
- /a|b/ means "a" or "b"
- /^a/ means match "a" but only at the beginning of a string
- /a$/ means match "a" only at the end
- /a(b|c)/ means match "a" followed by one of "b" or "c"
- /a+/ means /aa*/ (i.e., one or more "a"s) -- note that "grep" doesn't support this: you have to use "egrep"
- /./ means any single character
- /[abcde]/ means /a|b|c|d|e/
- /[a-z]/ means /a|b|c|...|y|z/
- If you need to match a literal "." or "*" or "+", put a "\" before it.
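And a quick usage example in Python's re module (the pattern and test strings are made up):

import re

pattern = re.compile(r"^a+(b|c)$")        # one or more "a"s, then "b" or "c"

for s in ["ab", "aaac", "abc", "b"]:
    print(s, bool(pattern.match(s)))
# ab -> True, aaac -> True, abc -> False, b -> False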
One way of defining a "regular language" is "any language that can be described by a regular expression."
Note: regular languages are closed under: concatenation, intersection, disjunction, Kleene closure.
Another way to approach languages is through "machines" or "automata." A finite state automaton (FSA) is defined by:
- A (finite) set of states
- A set of labeled directed edges ("transitions") connecting states. The labels are from the alphabet of our language.
- A set of initial states (usually we assume just one)
- A set of final states (usually we assume just one)
You can think of an FSA as a machine that jumps around states, and emits the "letters" on the edges as it goes. It can only stop in a final state. This is the generative view.
The alternative view is that you give an FSA a string, and it tries to follow the corresponding edges until it reaches a final state. If it can reach a final state, it "accepts" the string. If it cannot, it "rejects" it. (Note that if one path leads to acceptance and another to rejection, the FSA accepts.)
What we defined above is a non-deterministic FSA. A deterministic FSA is one in which two outgoing edges from a single state never have the same label.
An additional augmentation is to allow epsilon-transitions, namely, transitions that don't emit any words.
It's not too hard to verify that the set of languages that can be defined by an FSA is exactly the set of regular languages. Moreover, adding epsilon transitions and non-determinism don't give you any extra power.
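To make the acceptance view concrete, here is a small sketch of a non-deterministic (epsilon-free) FSA simulator; the particular machine encoded below, which accepts "a"s followed by at least one "b", is just an illustration:

def accepts(transitions, start, finals, string):
    """Track the set of reachable states; accept if a final state is reachable at the end."""
    states = {start}
    for symbol in string:
        states = {dst for s in states for dst in transitions.get((s, symbol), set())}
        if not states:                     # no live path left
            return False
    return bool(states & finals)

# A machine for a* b+ : loop in state 0 on "a", move to and loop in state 1 on "b".
transitions = {(0, "a"): {0}, (0, "b"): {1}, (1, "b"): {1}}

for s in ["b", "aab", "abbb", "aba", "aa"]:
    print(s, accepts(transitions, start=0, finals={1}, string=s))
# b, aab, abbb -> True; aba, aa -> False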
FSAs are very powerful despite being so simple. They are useful (at least) for the following tasks:
- Grammars of person names
- Times, dates, emails, addresses, etc.
- Morphological analysis / segmentation
- (By contrast, phenomena that require unbounded memory are beyond finite state power, e.g. center embedding -- "The mouse the cat chased died." -- and recursion of arbitrary depth -- "The man who [some sentence] is nice.")
Morphological analysis maps a surface form to a stem plus features, e.g.:
- Cats -> cat +N +PL
- Cat -> cat +N +SG
- Cities -> city +N +PL
- Merging -> merge +V +present-participle
- Caught -> catch +V +past-participle
Turkish is a classic example of very rich morphology; for instance, forms built on the verb "sleep":
- uyuyorum I am sleeping
- uyuyorsun you are sleeping
- uyuyor he/she/it is sleeping
- uyuyoruz we are sleeping
- uyuyorsunuz you are sleeping
- uyuyorlar they are sleeping
- uyuduk we slept
- uyudukça as long as (somebody) sleeps
- uyumalıyız we must sleep
- uyumadan without sleeping
- uyuman your sleeping
- uyurken while (somebody) is sleeping
- uyuyunca when (somebody) sleeps
- uyutmak to cause somebody to sleep
- uyutturmak to cause (somebody) to cause (another) to sleep
- uyutturtturmak to cause (somebody) to cause (some other) to cause (yet another) to sleep
- ... and so on
Finite state transducers (FSTs) extend FSAs by attaching an output to each transition; they are useful for:
- Recognizing pairs of strings
- Mapping input string to (sets of) output strings
- Mapping output strings to (sets of) input strings
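As a toy illustration of the string-to-string mapping idea, here is a sketch of a non-deterministic transducer simulator with a two-word lexicon ("cat"/"cats"); the state numbering and output format are invented for the example.

def transduce(arcs, start, finals, inp):
    """Return the set of output strings an (epsilon-free) FST produces for inp."""
    configs = {(start, "")}                        # (state, output so far)
    for symbol in inp:
        configs = {(dst, out + emit)
                   for (state, out) in configs
                   for (dst, emit) in arcs.get((state, symbol), set())}
    return {out for (state, out) in configs if state in finals}

arcs = {
    (0, "c"): {(1, "c")},
    (1, "a"): {(2, "a")},
    # Non-deterministic guess at the analysis: singular or plural.
    (2, "t"): {(3, "t +N +SG"), (4, "t +N +PL")},
    (4, "s"): {(5, "")},                           # the plural "s" adds nothing more
}
print(transduce(arcs, start=0, finals={3, 5}, inp="cat"))    # {'cat +N +SG'}
print(transduce(arcs, start=0, finals={3, 5}, inp="cats"))   # {'cat +N +PL'}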
Don't discount this approach for being too simple, either... people have done the following with finite state transducers:
- Transliteration
- Summarization
- Translation
- Question answering
- Dialogue agents
- Parsing (needs some extensions)
- ... pretty much everything :)
02 September 2010
Python Tutorial
For those of you who'd like to learn python, I've used the following tutorial for previous classes I've taught, and I think it's pretty good (I didn't write it: I stole most of it from some friends at Berkeley): http://www.cs.utah.edu/~hal/courses/2010S_AI/p0/p0.html
01 September 2010
HW01 is posted
(Reminder: if you need an extension for Rosh Hashanah, please let me know by this coming Tuesday, the 7th.)
Lecture 2: NLP Applications and Some History
There are loads of applications of NLP technology; we'll talk about some in more detail at the tail end of the course.
One of the first applications was translation
- Perhaps one of the first CS problems
- Wanted to translate Russian to English during the cold war
- Primary trend: think of translation as a problem of code breaking
- This eventually led to information theory
- Claude Shannon's noisy channel model
- Shannon's notion of entropy and the Shannon game
In the 60s, there was a split into the AI camp (which was primarily symbolic) and the statistical camp (which was mostly EE folks working on speech or OCR).
In 1966, the ALPAC report killed much of the funding for machine translation (and related computational linguistics) research in the US. This, together with Chomsky's anti-statistics views (1959), led to a decade in which a lot of this work was not mainstream.
In the 70s and early 80s, the IBM folks (statisticians) pushed statistical speech recognition using hidden Markov models. Logic arose as a language for talking about grammar (e.g., LFG, which we'll see later). NLU came about, beginning with a blocks world ("put the red block next to the green one") and then later the Yale school (Schank and colleagues) with conceptual knowledge. This was also when lots of work was being done on discourse.
In the late 80s and early 90s, finite state methods and statistics made a comeback, led largely by the translation efforts at IBM, and the work on morphology and phonology by Kaplan and Kay, and syntax by Church. This was the beginning of the return of empiricism.
Nowadays we have tons of data, empiricism is at its peak (maybe???), and there's a lot you can do when you have lots of data. We'll see some examples in class for several of the linguistic phenomena we talked about last time.
Other big applications:
- Automatically summarizing documents, or collections of documents
- Web search (only some people consider this true NLP)
- Question answering
- Dialogue systems
- Speech recognition
- Natural language generation
- Sentiment/opinion analysis
- ... lots more ...
HW00
Sorry, meant to post this earlier. There will usually be an announcement on the blog when each homework and project is posted. This gives you a spot to ask questions about them.
Since I didn't post this earlier, I got the following question by email, which I'll answer here:
"I was looking at HW00 and I am unsure what you meant by 'clause structure ambiguity'?" Clause structure ambiguity is syntactic ambiguity: the "Structural ambiguity" part from the Lecture 1 post, or the "woman on the hill with the telescope" example from class.
As an aside, if you have difficulty submitting HW00 (i.e., the system doesn't work), please email me. HW00 is the only homework we'll accept late, but only because of bugs on my end, not because of bugs on your end :). But if you weren't registered for the class when I created the database for the handin script, then it'll probably reject you. So just email me with your uID and name, and I'll add you manually.