22 December 2010
Summer internships at JHU/COE
... in case anyone is reading this, I just got the following email. I know this program and it's good.
Please share with graduate and undergraduate students looking for summer internships. The Johns Hopkins University Human Language Technology Center of Excellence (COE) is seeking candidates for our summer internship program as part of SCALE, our yearly summer workshop (Summer Camp for Advanced Language Exploration). Interns will work on research in speech, text and graph processing as part of a larger workshop team. Internship positions are fully funded, including travel, living expenses and stipend. This summer's workshop topic is "Combining Speech, Text and Graphs," which will draw from a number of research areas:
*** Natural Language Processing ***
- machine translation and NLP on informal texts
- opinion extraction from informal texts
- topic models
- information extraction
- domain adaptation and transfer learning
- multi-lingual learning
*** Speech ***
- topic detection
- keyword spotting
- resource constrained speech processing
- robust speech processing
*** Graphs ***
- finding communities in social networks
- anomaly detection
- learning on large scale graphs
- graph-based topic modeling
Candidates should be currently enrolled in an undergraduate or graduate degree program. Applications submitted by Friday Jan 14, 2011 will receive full consideration. For more information: http://web.jhu.edu/HLTCOE/scaleinterns2011.html
Applicants will be required to obtain a US security clearance, which requires US citizenship. If you do not already have a clearance, we will work with you to obtain one.
21 December 2010
Grades (Almost) Posted, Semester Over
Hi all --
I figure you're likely to read to the end to find out about grades, so before I get to that, let me just take this chance to say that I really enjoyed this class this semester. You were all great. Everyone did awesome on both the final exam and final projects, and I'm really thrilled. If you couldn't tell already, I love language stuff and I encourage you all to continue on and learn everything there is to know. Philip is teaching the follow-on course in the Spring, which should be awesome. I'm also running an unofficial seminar on "Large Data" stuff in the spring; you can get more info here (sign up for the mailing list if you're interested). Anyway, I had a great time teaching; I hope you had fun in class.
Regarding grades, I need to submit them by midnight tonight. And since I don't plan on staying up until midnight, this really means tonight by about 11p.
I've posted "unofficial" grades on grades.cs.umd.edu, so you can see what your grade is. Note that the "total" column on that spreadsheet is completely misleading, since it doesn't include all the weirdo grading rules (dropping of worst projects/homeworks, inclusion of extra credit, etc.). I have all the numbers in a separate spreadsheet, so if something looks odd to you and you'd like the full computation, please let me know. It's of course possible to change grades later, but it's a pain, so I'd rather hear about any issues now.
That's it. Have a great break and I hope to see some of you again in the Spring!
-h
ps., per second comment below, I added an extra column, MyOverall. The grading is as follows:
98 = A+
95 = A
92 = A-
88 = B+
85 = B
82 = B-
78 = C+
75 = C
72 = C-
Note that your score will be exactly one of these numbers: This is just my way of encoding your grade. This isn't actually what your score was :).
14 December 2010
12 December 2010
Interested in Language Science?
Just got the following email from Colin Phillips in Linguistics / Cognitive Neuroscience. This is regarding language science. Please see below... feel free to email me if you have questions:
I'm hoping that you can help us to reach students in the CS/iSchool universe who might be interested in taking advantage of our unique interdisciplinary language science opportunities. We're particularly interested in reaching 1st and 2nd year students. We'll be holding an informational meeting for students tomorrow at 1pm in 1108B Marie Mount, but I'd be happy to meet at another time with anybody who is interested but not available at that time. We'll tell students about the opportunities and benefits, and also talk about the resources that are available to help them, including new plans to help them to develop interdisciplinary training plans that are both innovative and feasible. Csilla Kajtar already circulated a message about this, but we know that people often just ignore messages sent to mailing lists.
As you know, the closer integration of NLP and cognitive work in language is right at the top of our list of opportunities-that-we'd-be-idiots-not-to-pursue, and student training is one of the best ways to achieve this.
09 December 2010
Final Exam, Due Dec 17, 3:30pm
Here's a copy of the final exam as well as the source LaTeX. Please feel free to either print it and do it by hand, or to do it in LaTeX and print the solution. You may turn it in any time between now and 3:30pm on Dec 17. (Because our official exam time is 1:30-3:30 on Dec 17.) Please hand it in in one of three ways: (1) give it to me in person in my office or otherwise; (2) slide it completely under my office door (AVW 3227); (3) give it to Amit in person.
If you have any clarification questions, please post them here.
06 December 2010
P4 deadline pushed back to Dec 14
The 9th is apparently the deadline for the ML project.
05 December 2010
P4: Small error in example numbers....
There are some sentences in the training data that contain a TAB character. The reasonable thing to do would be to just treat it as whitespace. For some reason, in my example of DF computation, I didn't do this, which somewhat changes all the remaining numbers.
Instead of rerunning everything I'll just tell you what the updated top frequency words are if you do it "properly." In general, for this assignment, don't worry if your numbers are slightly different than mine -- it may have to do with how you handle the non-ascii characters that appear once in a while in the data.
2999 .
2999 ,
2998 of
2997 the
2997 and
2994 in
2989 to
2988 a
2885 as
2875 by
2862 for
2860 )
2859 (
2836 with
2832 that
2801 ''
2788 ``
2759 on
2717 from
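For reference, a DF computation that treats TABs like any other whitespace is only a few lines. This is just a sketch; the filename and the one-document-per-line assumption are placeholders, not the actual P4 setup.

```python
# A minimal sketch of document-frequency (DF) counting, treating every
# whitespace character (including TABs) the same. Filename is a placeholder.
from collections import Counter

df = Counter()
with open("training_data.txt") as f:
    for line in f:                  # assume one document per line
        words = set(line.split())   # str.split() eats tabs and spaces alike
        df.update(words)            # each word counted once per document

for word, count in df.most_common(20):
    print(count, word)
```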
Last seminar of the semester: Michael Paul Dec 8, 11am
December 8: Michael Paul: Summarizing Contrastive Viewpoints in Opinionated Text
AVW 2120
Performing multi-document summarization of opinionated text has unique challenges because it is important to recognize that the same information may be presented in different ways from different viewpoints. In this talk, we will present a special kind of contrastive summarization approach intended to highlight this phenomenon and to help users digest conflicting opinions. To do this, we introduce a new graph-based algorithm, Comparative LexRank, to score sentences in a summary based on a combination of both representativeness of the collection and comparability between opposing viewpoints. We then address the issue of how to automatically discover and extract viewpoints from unlabeled text, and we experiment with a novel two-dimensional topic model for the task of unsupervised clustering of documents by viewpoint. Finally, we discuss how these two stages can be combined to both automatically extract and summarize viewpoints in an interesting way. Results are presented on two political opinion data sets. This project was joint work with ChengXiang Zhai and Roxana Girju.
Bio: Michael Paul is a first-year Ph.D. student of Computer Science at the Johns Hopkins University and a member of the Center for Language and Speech Processing. He earned a B.S. from the University of Illinois at Urbana-Champaign in 2009. He is currently a Graduate Research Fellow of the National Science Foundation and a Dean's Fellow of the Whiting School of Engineering.
02 December 2010
Lecture 25: Mapping Text to Actions
There has been a bunch of work recently on trying to automatically find relationships between language and the "real world", where "real world" actually often means some sort of simulated environment. Here are a few papers along these lines:
- M. B. Fleischman and D. Roy. Intentional Context in Situated Language Learning. Ninth Conference on Computational Natural Language Learning (CoNLL), Ann Arbor, MI, June 2005.
- Raymond J. Mooney. Learning to Connect Language and Perception. Proceedings of the 23rd AAAI Conference on Artificial Intelligence (AAAI), Senior Member Paper, Chicago, IL, pp. 1598-1601, July 2008.
- S.R.K. Branavan, Harr Chen, Luke Zettlemoyer and Regina Barzilay. Reinforcement Learning for Mapping Instructions to Actions. Proceedings of ACL, 2009. Best Paper Award.
- Percy Liang, Michael I. Jordan and Dan Klein. Learning Semantic Correspondences with Less Supervision. Proceedings of ACL-IJCNLP, 2009.
- Adam Vogel and Dan Jurafsky. Learning to Follow Navigational Directions. Proceedings of ACL, Uppsala, Sweden, 2010.
In the first paper, which is the one we'll talk about most, the key idea is that of hierarchical plans, represented as a PCFG. For instance we might have a rule "OfferCup -> PickUpCup MoveCup ReleaseCup", where each of the subactions might either be atomic (correspond to actual muscle movements) or might itself be broken down further. (Question: how context free is this problem?)
The key ambiguity is due to the fact that actions do not select for exactly one interpretation, as in the Blicket example.
In this paper, they hand constructed a PCFG for actions and the key learning question was whether you could figure out the level of ambiguity automatically. The basic idea is to look at relative frequencies of occurrence between lexical items and nodes in the PCFG tree for the actions.
01 December 2010
P4, grading rules
So P4 has been posted for a while. It is "optional" in the sense that your project grades will be based on your best three out of four grades. In particular, here's what will happen.
Suppose that your grades on P1, ..., P4 are a,b,c,d. (If you don't do P4, then d=0.)
Let x = [ (a + b + c + d) - min { a, b, c, d } ] / 3
Then x is your average grade on your best three projects.
We will use x as your overall project grade (i.e., since each project is weighed equally, it will be like you got a score of x on all FOUR of them).
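For the curious, here's that rule as a quick sanity check in code (the scores are made up):

```python
# "Best three of four" project rule: drop the lowest score, average the rest.
def project_grade(a, b, c, d=0):
    scores = [a, b, c, d]
    return (sum(scores) - min(scores)) / 3.0

print(project_grade(90, 80, 100))        # skipped P4: (270 - 0) / 3 = 90.0
print(project_grade(90, 80, 100, 85))    # did P4: drop the 80 -> 91.67
```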
30 November 2010
Lecture 24: Information Extraction
Information extraction is, roughly, the task of going from unstructured text (aka text) to structured data. Think of it as mapping language to a database.
One of the more famous IE tasks is identifying terrorist events (specifically South American terrorist events) in documents. For each event, we have to identify the victim(s), date, type of event (bombing, etc.), culprits, and so on. These fields define our extraction template and our goal is to fill it up based on a document. Of course some documents mention no terrorist events and some mention multiple. And not all fields will be mentioned. This data was available for the MUC (message understanding conference) competitions two decades ago and current approaches still only get about 50% accuracy!
One way of going about this problem is as a sequence labeling task, akin to NER. Since you can imagine how this works, we'll talk about the other style of approach: pattern-based methods.
The idea of pattern-based approaches is that our system consists of a collection of extraction patterns, of the form "<Subj> was assassinated" => Victim. These lexico-syntactic patterns tell us how and when to extract certain slots (aka fields) from text. The expressiveness of patterns depends entirely on how much preprocessing you want to do, but usually some sort of syntactic processing is assumed.
The key question is: where do these patterns come from? The trend in IE has been to move toward mostly unsupervised approaches that don't need large amounts of training data. Successful approaches are akin to the Yarowsky algorithm for WSD.
Suppose we had labeled data, where for each document we have one (for simplicity) partially filled template. We can go into the document and find occurrences of the strings in that template as the things we want to extract. For each of these, we can find a bunch of extraction patterns that would potentially extract that string and collect them over the whole data set. We now need to find the "best" ones. A common metric is the "R log F" metric, which is simply the probability that a given pattern extracts the right slot, times log of the frequency of that pattern. The "log F" term is in there because we want to make sure that we get good coverage.
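In code, the scoring itself is trivial. This is a sketch only: the counts are assumed to come from matching candidate patterns against the labeled templates (not shown), and the log base is my choice, not something specified in lecture.

```python
# "R log F" scoring: R = P(pattern extracts a correct filler),
# F = how often the pattern fired; score = R * log(F).
import math

def r_log_f(correct, total):
    if total == 0:
        return 0.0
    return (correct / float(total)) * math.log(total)

# e.g., suppose "<Subj> was assassinated" fired 50 times, 45 of them on true Victims
print(r_log_f(45, 50))   # 0.9 * log(50) ~= 3.52
```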
Of course, you needn't start with labeled data. You can start with small lists of slot fillers (e.g., Al Qaeda as a perpetrator and so on) and bootstrap away. As always, the quality of your seeds directly affects how well your algorithm works.
One can get even more unsupervised by doing the following. Take a collection of documents that talk about terrorist events, and a collection of documents that don't. Look for patterns that are significantly more common in the terrorist-events collection than in the other collection. Rank these by something like "R log F". The top patterns there are often very good extraction patterns, but we don't know what they are supposed to extract. Have a human look down the list of the top 100 and voilà, you're done, and it only takes a few minutes.
Most approaches to IE fall into one of these two camps: sequence labeling or pattern-based approaches. It seems that sequence labeling approaches work well when most of the words in the text are extracted for something (e.g., turning free-text citations into BibTeX entries), but pattern-based approaches work well for needle-in-a-haystack problems.
There have been a few recent trends in IE:
- Using Wikipedia infoboxes as training data
- Trying to extract knowledge without pre-defined templates (akin to our discussion of mining world knowledge)
23 November 2010
Lecture 23: Rhetorical Structure Theory
So far we've seen flat representations of discourse. RST is an example of a hierarchical discourse representation.
Imagine some text broken into 7 "units", with a tree depicting the role each of these units plays in the text. As is implied by the title, this theory is mostly applicable to rhetoric, which is essentially persuasive language.
An RST structure is essentially a dependency tree over elementary discourse units (EDUs), where relations on the edges of the tree tell us the relationship between two EDUs. In the above example, we're saying that EDUs 1-3 provide background to EDUs 4-7. And so on down the tree.
For most relations, there is a distinction between the nucleus and satellite of that relation: basically this is just like headedness in syntax. The nucleus contains the important stuff.
The relations in RST are given by communicative intent. This is a big separation between RST and other theories of discourse.
Here is an example of the evidence relation:
- The program as published for the calendar year 1980 really works.
- In only a few minutes, I entered all the figures from my 1980 tax return and got a result which agreed with my hand calculations to the penny.
- constraints on N: R might not believe N to a degree satisfactory to W
- constraints on S: R believes S or will find it credible
- constraints on N+S: R's comprehending S increases R's belief of N
- the effect: R's belief of N is increased
- locus of effect: N
Here is an example of the concession relation:
- Concern that this material is harmful to health or the environment may be misplaced.
- Although it is toxic to certain animals,
- evidence is lacking that it has any serious long-term effect on human beings.
- constraints on N: W has positive regard for the situation presented in N
- constraints on S: W is not claiming that the situation presented in S doesn't hold
- constraints on N+S: W acknowledges a potential or apparent incompatibility between the situations presented in N and S; W regards the situations presented in N and S as compatible; recognizing the compatibility between the situations presented in N and S increases R's positive regard for the situation presented in N
- the effect: R's positive regard for the situation presented in N is increased
- locus of effect: N and S
The full set of relations defined in RST includes:
- Circumstance
- Solutionhood
- Elaboration
- Background
- Enablement and Motivation
- Evidence and Justify
- Relations of Cause
- Antithesis and Concession
- Condition and Otherwise
- Interpretation and Evaluation
- Restatement and Summary
- Sequence
- Contrast
Many of these relations are often signaled by particular cue words or phrases. For instance, Concession is often identified by the word "although." "Evidence" is often identified by "for instance." "Elaboration" is often identified by "and." And so on.
One clever idea a few years ago was to try to mine lexical relations that are indicative of discourse structure. For example, I can find all sentences that begin "for example" and look at that sentence, and the preceding sentence. I can assume that this is an example of Evidence, and then look at features of those two sentences to try to figure out why this is an evidence relation. Then, in the future, when I see sentences that don't have this lexical cue, I can apply whatever I've learned.
The hope is that if you mine contrast relations, you can find contrasting pairs like love/hate or iPhone/Android or whatever. As was shown in the assigned paper for today, that didn't work particularly well. My feeling is that lexical information is not "deep" enough to really get you to discourse except in very simple cases (see that post I linked to before).
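Here's a rough sketch of the mining step just described. The cue list, the relation labels and the toy sentences are made up for illustration; they are not the cues used in the assigned paper.

```python
# Mine (previous sentence, cue sentence) pairs as noisy training examples
# for discourse relations, using sentence-initial cue phrases.
CUES = {"for example": "Evidence", "although": "Concession"}

def mine_pairs(sentences):
    pairs = []
    for prev, cur in zip(sentences, sentences[1:]):
        for cue, relation in CUES.items():
            if cur.lower().startswith(cue):
                # strip the cue so a later classifier can't just memorize it
                pairs.append((prev, cur[len(cue):].lstrip(" ,"), relation))
    return pairs

sents = ["I love this phone.", "For example, the battery lasts two days."]
print(mine_pairs(sents))
# [('I love this phone.', 'the battery lasts two days.', 'Evidence')]
```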
22 November 2010
P3 grading updated, deadline Wed 24th, 5p
Looks like I underestimated the difficulty of the gender classification. I've adjusted the scoring to be easier on you. The new scoring is:
- 35 < e < 37 : 10%
- 34 < e < 35 : 25%
- 33 < e < 34 : 32%
- 32 < e < 33 : 34%
- 31.5 < e < 32 : 36%
- 31 < e < 31.5 : 37%
- 30.5 < e < 31.0 : 38%
- 30 < e < 30.5 : 39%
- e < 30: 40%
18 November 2010
Lecture 22: Document Coherence
Documents are not just collections of sentences. Sentences serve purposes, and a well-written document puts its sentences together well.
There are (at least) two notions of "well":
- Coherence: This is what makes a text meaningful. This is the idea that the semantic interpretation of the sentences in a text should fit together into a larger picture.
- Cohesion: This is what makes a text hang together. This is the idea that sentences help you understand the relationship between what came before and what is coming later.
This is a lot like syntax/semantics. A document can be cohesive but still make no sense. This is like syntax. On the other hand, a document can be meaningful, but still not feel like the sentences go together properly. This is like semantics.
TextTiling is an algorithm/model for discovering cohesive subparts (despite what the original paper says). This is basically a partitioning problem. We're given a text as a sequence of sentences, and we want to break it into pieces, each of which is internally cohesive. The algorithm looks roughly like:
- For each sentence, collect all the word (stems) that occur in that sentence.
- Define the similarity between any two token sequences (subsequences of the text) as the cosine similarity between their stem vectors.
- For each position i in the text, compute the similarity between the block i-21:i-1 and i:i+20. (20 is arbitrary.)
We then define the splits as the points where the similarities drop below some threshold.
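Here's a minimal sketch of that computation. Stemming is skipped (words are used as-is), and the block size of 20 is the arbitrary choice from above; the toy sentences are made up.

```python
# TextTiling-style gap similarities: compare the bag of words before each
# sentence gap with the bag of words after it, via cosine similarity.
import math
from collections import Counter

def cosine(c1, c2):
    dot = sum(c1[w] * c2[w] for w in c1 if w in c2)
    norm = math.sqrt(sum(v * v for v in c1.values())) * \
           math.sqrt(sum(v * v for v in c2.values()))
    return dot / norm if norm > 0 else 0.0

def gap_similarities(sentences, block=20):
    sims = []
    for i in range(1, len(sentences)):
        left = Counter(w for s in sentences[max(0, i - block):i] for w in s)
        right = Counter(w for s in sentences[i:i + block] for w in s)
        sims.append((i, cosine(left, right)))
    return sims   # split wherever the similarity dips below a threshold

sents = [s.split() for s in ["the cat sat", "the cat purred",
                             "stocks fell sharply", "the markets fell"]]
print(gap_similarities(sents, block=2))   # the dip at i=2 marks the boundary
```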
One major (linguistic) weakness of this approach is its inability to handle synonymy, etc. When we talk about lexical repetition for cohesion, we also typically include things like "train / car / engine" even though they're not the same word.
Argumentative Zoning is more of a coherence-type model (though it's not precisely clear that it's only in one of the two categories). It is primarily a model for research papers, where we talk about the role that each sentence plays in a paper. Like TextTiling, the model is also flat: a document is just a sequence of sentences and we're going to assign each sentence to a class. The three major classes are { Background, Other, Own }.
The original approach to AZ is to annotate data, and then train a classifier. The features used break down into a handful of categories:
- explicit structure (where is this sentence in the document)
- relative location (what %age through the document is this sentence)
- citation (to self versus to others)
- syntax (tense, aspect, voice, negation, etc.)
- semantic (type of verb, etc.)
- content (word overlap with title, etc.)
There has been a lot of interesting work in this area since then, including things like sentiment analysis of citations, better features, better learning, etc...
16 November 2010
Lecture 21: Local Discourse Context
Coreference resolution means figuring out which expressions in a text refer to the same entity. For pronouns, think "John saw him in the mirror" versus "John saw himself in the mirror" -- that's anaphora resolution. More generally, in "John saw Bill and said 'My friend, you look lovely today,'" we need to figure out that My=John and friend=Bill.
Generally we care about three "mention" types:
- Pronouns (he, she, it, I, etc...)
- Proper nouns (John, Bill, Congress, the White House, etc...)
- Common nouns (the soccer player, the man, etc.)
There are lots of signals of coreference:
- Apposition: "John, mayor of Tinyville, is happy." Here, it's pretty clear that John=mayor
- Syntactic structure with pronouns/-self. "John saw him" versus "... himself"
- Gender / number / type agreement. "He" is probably not "Mary" and "him" is probably not "players". Similarly, "he" is probably not "the White House". (Mr. and Mrs. are helpful here.)
- Locality. "He" probably refers back to a nearby referent, not something far away.
- Discourse focus. New entities are likely to transition from being objects to being subjects, not the other way around.
- Sub/super-string matching. "Bill Clinton" and "Clinton" are likely coreferent.
- New entities are often introduced with indefinite NPs ("a player") and then referred to later with definite NPs
- World knowledge: sometimes we just know that Obama is the President
Roughly, how hard resolution is depends on the types of the two mentions involved:
- Named to named: very easy, 95% accuracy just using substring matching
- Named to pronoun: pretty easy, 85-90% using Hobbs' heuristics, locality and gender
- Named to common: very hard (often need world knowledge)
- Common to common: very hard (often need lexical semantics)
- Common to pronoun: don't bother -- do common to named and named to pronoun
A standard approach is to treat coreference as binary classification over pairs of mentions (see the sketch after these lists), which raises a few issues:
- After classifying pairs, you need to go back and fix things up to make sure that transitivity is enforced; or use left-to-right search
- Number of -ve examples >> number of +ve examples; often subsample the negatives
- These decisions aren't independent
Finally, some phenomena make coreference especially hard:
- Metonymy: when we say "the White House" meaning "Obama"
- Common nouns that aren't mentions of an entity: "This life is wonderful"
- Quantified pronouns: "Every man loves his mother" (might or might not be coreferent)
- World knowledge
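Here's a minimal sketch of the mention-pair setup mentioned above: every pair of mentions becomes a binary example (coreferent or not), and we subsample the far more numerous negatives. The feature function (exact match, substring) and the toy mentions are stand-ins, not a real feature set.

```python
# Generate mention-pair training examples with negative subsampling.
import random

def pair_features(m1, m2):
    return {"exact_match": m1.lower() == m2.lower(),
            "substring": m1.lower() in m2.lower() or m2.lower() in m1.lower()}

def make_pairs(mentions, gold_entity, neg_rate=0.1):
    """mentions: list of strings; gold_entity: parallel list of entity ids."""
    examples = []
    for i in range(len(mentions)):
        for j in range(i + 1, len(mentions)):
            label = gold_entity[i] == gold_entity[j]
            if label or random.random() < neg_rate:   # keep all +, some -
                examples.append((pair_features(mentions[i], mentions[j]), label))
    return examples

ments = ["Bill Clinton", "Clinton", "the White House"]
print(make_pairs(ments, gold_entity=[1, 1, 2], neg_rate=1.0))
```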
15 November 2010
P2 is mostly graded...
If you'd like your partial score, please send me an email with the ID of the person who handed it in. I would love to just spam you all by sending your report to the hander-in at their id, but I don't know how to do that automatically. I tried sending an email to myid@umd.edu but it bounced... if anyone knows how to do that instead, let me know.
Overall people seemed to do quite well.
14 November 2010
TURKERS Final Project
Looks like there's no time tomorrow (Monday) that works for everyone. Please vote on your time preference, trying to be as flexible as possible, and figuring that we'll spend 30 minutes. Also, please list your email address in your signup so I can contact you! :)
11 November 2010
HW10 posted
and HW09 deadline pushed back to Sunday night
Lecture 20: World Knowledge from Text
Today we'll talk about the DIRT algorithm and related ideas. There are actually tons of names for problems very similar to what we're looking at: discovering inference rules, (automatic) paraphrasing, semantic parsing, etc. (The last one is a bit of a stretch, IMO.)
But the whole idea is to identify cases of substitutability. For example, knowing that "X wrote Y" means roughly the same thing as "X is the author of Y", we can potentially do a better job answering questions like "Who wrote the Star Spangled Banner?"
The main "magic" we'll employ to get this to work is the Distributional Hypothesis, which is one of those things that no one cites anymore because everyone knows it. (The citation is: Harris, Z. 1985. Distributional Structure. In: Katz, J. J. (ed.) The Philosophy of Linguistics. New York: Oxford University Press. pp. 26-47.... Zellig Harris is a famous linguist!) DH states that words in similar contexts tend to have similar meanings, and is often paraphrased as "you shall know a word by its friends."
Before talking about what DIRT does, we can talk about previous approaches that take a very "neighborly" approach to similarity.
- For each word w in your vocabulary (eg., words that appear >= 100 times)
- Find all contexts in which w appears, where a context is just two words to the left and two words to the right.
- For each context, a b w c d, increment counters as follows:
- c(w,a,left2) ++
- c(w,b,left1) ++
- c(w,c,right1) ++
- c(w,d,right2) ++
- Do this for all words and all counters
- Now, we have a representation of each word in terms of its contexts
- Define the similarity between two words w and w' as, for example, the cosine similarity between their context representations:
- sim(w,w') = <c(w), c(w')> / ( ||c(w)|| ||c(w')|| )
- where <x,y> means dot product, ||x|| means norm (= sqrt(sum_i x_i^2))
- and c(w) is the context vector obtained by looking at all c(w,?,?)
- Say that two words are similar if their similarity crosses some threshold
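Here's a minimal runnable sketch of that context counting and similarity computation. The toy corpus is made up, and no frequency cutoff is applied.

```python
# Two-words-left / two-words-right context vectors, compared with cosine.
import math
from collections import defaultdict

def context_vectors(sentences):
    ctx = defaultdict(lambda: defaultdict(float))
    for sent in sentences:
        for i, w in enumerate(sent):
            for offset, slot in [(-2, "left2"), (-1, "left1"),
                                 (1, "right1"), (2, "right2")]:
                j = i + offset
                if 0 <= j < len(sent):
                    ctx[w][(sent[j], slot)] += 1.0
    return ctx

def sim(c1, c2):
    dot = sum(v * c2.get(k, 0.0) for k, v in c1.items())
    norm = math.sqrt(sum(v * v for v in c1.values())) * \
           math.sqrt(sum(v * v for v in c2.values()))
    return dot / norm if norm else 0.0

corpus = [["the", "cat", "chased", "the", "mouse"],
          ["the", "dog", "chased", "the", "ball"]]
vecs = context_vectors(corpus)
print(sim(vecs["cat"], vecs["dog"]))   # similar contexts -> high similarity
```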
Once we have a dependency tree, the idea is very similar to what we talked about in semantic role labeling: pick two words in a sentence and look at the path that connects them. There's a good example in the paper on p2.
The DIRT algorithm now works by a process similar to the above.
- For each dependency path p from w to w' in a sentence
- Increment the count of (p, X, w) and (p, Y, w'), indicating that w is playing the role of X in "X wrote Y" and w' is playing the role of Y
- Compute the similarity between any two paths by Eqs (1), (2) and (3) in the paper. Basically first compute similarities between slots and fillers, then between slots and slots, and then between paths and paths.
- Do some heuristicy things to make it run efficiently.
09 November 2010
Lecture 19: Structured Perceptron
Classification (eg., with decision trees or perceptron) is great when we're trying to predict a bit. It's not so great when we're trying to predict a structured object, like a POS sequence or parse tree or Arabic sentence (eg., in translation).
The field of predicting complicated structured outputs is called one of {structured prediction, structured output prediction, predicting structured outputs, etc}. I prefer the first term.
One of the simplest structured prediction problems is sequence labeling. I'm given a sequence of inputs and I want to predict a corresponding sequence of outputs, where there is a one-to-one matching between inputs and outputs. We've already seen POS tagging as an example of this task.
Another example of this task is named entity recognition. Here, the input is a sentence, and the goal is to spot all mentions of "named entities" in the text. Typically we also want to label them from a finite set { Person, Organization, Location, etc... }.
Since names can span more than one token (unlike POS tags), we need a way of encoding this as a sequence labeling problem. A popular encoding (though definitely not the only one) is the BIO encoding: Begin/In/Out. This is probably most clear by an example or two:
- Hal_B-PER is_O at_O UMD_B-ORG today_O
- Barack_B-PER Obama_I-PER is_O President_O of_O the_O United_B-GPE States_I-GPE of_I-GPE America_I-GPE
- College_B-LOC Park_I-LOC Maryland_B-LOC is_O nice_O
Obviously we could attempt to solve this problem with an HMM. The problem is that we might want to include a bunch of additional features about words, their contexts and their spellings, into the model. It's difficult, if not impossible, to do this with an HMM.
So instead, we will use a structured variant of a perceptron.
The high-level idea is fairly straightforward. Instead of extracting features over just the input, we'll extract features over the input and output together. Call this feature vector f(x,y), where x is a complete sentence and y is a complete output. For instance, we might have a feature: how many times is Maryland paired with B-LOC? Or how many times is a capitalized word paired with O? Or how many times did I see B-LOC followed by I-LOC?
The perceptron algorithm will hinge on being able to solve the "argmax" problem. Namely, for a given x (sentence) and given weight vector w, we want to find the y that maximizes the dot product between w and f(x,y). Note that in general this is VERY HARD. However, if all our features obey Markov assumptions, we can use dynamic programming.
Now we can sketch the algorithm:
- Initialize w=<0 0 0 0 ... 0>
- For each input/output example (x,y)
- Compute current prediction: y' = argmax_y w*f(x,y)
- If y == y', celebrate
- otherwise:
- Update w = w + f(x,y) - f(x,y')
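Here's that sketch as minimal runnable code, with simple emission and transition features so the argmax can be computed with a Viterbi-style dynamic program. The feature set and toy data are illustrative assumptions, not the full feature set from lecture.

```python
# A minimal structured perceptron for sequence labeling (BIO-style tags).
from collections import defaultdict

def features(words, tags):
    """Count emission (word, tag) and transition (prev_tag, tag) features."""
    f = defaultdict(float)
    prev = "<s>"
    for w, t in zip(words, tags):
        f[("emit", w, t)] += 1.0
        f[("trans", prev, t)] += 1.0
        prev = t
    return f

def viterbi(words, w, tagset):
    """argmax_y w . f(x, y) under first-order Markov features."""
    V = [{}]   # V[i][tag] = (best score, backpointer to previous tag)
    for t in tagset:
        V[0][t] = (w[("emit", words[0], t)] + w[("trans", "<s>", t)], None)
    for i in range(1, len(words)):
        V.append({})
        for t in tagset:
            best = max((V[i-1][p][0] + w[("trans", p, t)], p) for p in tagset)
            V[i][t] = (best[0] + w[("emit", words[i], t)], best[1])
    tag = max(tagset, key=lambda t: V[-1][t][0])
    out = [tag]
    for i in range(len(words) - 1, 0, -1):   # follow backpointers
        tag = V[i][tag][1]
        out.append(tag)
    return list(reversed(out))

def train(data, tagset, iterations=5):
    w = defaultdict(float)
    for _ in range(iterations):
        for words, gold in data:
            pred = viterbi(words, w, tagset)
            if pred != gold:                 # made an error: update
                for k, v in features(words, gold).items():
                    w[k] += v
                for k, v in features(words, pred).items():
                    w[k] -= v
    return w

toy = [(["Hal", "is", "at", "UMD"], ["B-PER", "O", "O", "B-ORG"])]
tags = ["B-PER", "I-PER", "B-ORG", "I-ORG", "O"]
weights = train(toy, tags)
print(viterbi(["Hal", "is", "at", "UMD"], weights, tags))
```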
All of the nice properties of perceptron are maintained for this algorithm. Plus, for sequence labeling, we can solve the argmax problem efficiently using a variant of the Viterbi algorithm.
If you want to know more about structured prediction, I'm giving a guest lecture in Lise Getoor's ML class on Nov 30, 3:30-4:45, in CSIC 1121.
08 November 2010
04 November 2010
Lecture 18: Linear Models for Learning
Decision trees are easy to understand, but they often die because they keep splitting up the training data.
Another idea is to think of a classification as being a weighted linear combination of features. In other words, each feature (word) gets a weight. Highly (positive) weighted words are really indicative of class 1, and highly negative weighted words are really indicative of class -1. We add up the weights for all the words that occur and use the sign of this as a prediction.
This is the idea behind the perceptron algorithm, which dates back to the 1960s. It is online, which means that it processes one example at a time. For each example, it makes a prediction. If this prediction is right, it does nothing. If it's wrong, it makes a small update:
- For each example (x,y):
- Compute prediction: y' = sum_{words in x} weight(word)
- If sign(y') = y, celebrate and continue
- otherwise we've made an error:
- For each word in x:
- Set weight(word) := weight(word) + y
- (In other words, if y is negative, reduce the weight; if y is positive, increase the weight)
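Here's the same update loop as a minimal runnable sketch, for bag-of-words binary classification with labels in {-1, +1}. The toy documents are made up.

```python
# A minimal bag-of-words perceptron.
from collections import defaultdict

def predict(weights, words):
    return sum(weights[w] for w in words)

def train_perceptron(data, iterations=10):
    weights = defaultdict(float)
    for _ in range(iterations):
        for words, y in data:
            y_hat = 1 if predict(weights, words) > 0 else -1
            if y_hat != y:              # error: nudge word weights toward y
                for w in words:
                    weights[w] += y
    return weights

data = [("this movie was great fun".split(), +1),
        ("terrible boring waste of time".split(), -1)]
w = train_perceptron(data)
print(predict(w, "great fun".split()) > 0)   # True
```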
It turns out that if your data is such that there is a set of weights that would enable us to get perfect classifications on the training data, the perceptron will eventually find it! (Of course by that time it may have overfit, so you'll probably want to stop it before it's converged.)
There is a linear algebra interpretation of the perceptron: it's optimizing a weight vector so that w*x is >0 for positive examples and w*x is <0 for negative examples. Here, "*" means "dot product." If you interpret dot products as projections, then it's trying to find a weight vector so that when the data points are projected onto this vector, the positive points all lie to one side, and the negative points all lie to the other side. (Note that the usual pictures you see for this are not representative of NLP applications, where our vectors are usually corners of a very high dimensional hypercube.)
Once you think of it like this, you can think about trying to directly optimize w according to some loss function, which leads to a whole host of models, like linear regression, ridge regression, logistic regression, support vector machines, boosting, etc., depending on how you choose your loss function.
The basic idea is to define a cost, something like "cost = 1 if the prediction is wrong, and cost = 0 if the prediction is right." Now, try to find a w that minimizes this cost. Unfortunately, this cost is discrete, which makes it hard (actually NP-hard) to optimize. So we relax it and say something like "cost gets bigger as the prediction gets more wrong" which is now continuous and easy to optimize.
The problem is that big weights are bad news: it means that our classification decision is really sensitive to just a few features. We'd like to keep the weights small. This thinking leads to regularization by penalizing large weights. The standard penalization is to penalize the sum of squared weights, basically because this is continuous and differentiable.
So we're left with a cost that looks like "sum of costs on training examples + lambda * regularizer", where lambda controls the overfitting/underfitting tradeoff.
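To make the shape of that objective concrete, here's a tiny gradient-descent sketch using squared loss and an L2 penalty. The choice of loss, the step size and the two data points are all made up for illustration; they aren't the specific loss functions discussed above.

```python
# Minimize: sum_i (w.x_i - y_i)^2 + lam * ||w||^2 by gradient descent.
def objective(w, data, lam):
    loss = sum((sum(wi * xi for wi, xi in zip(w, x)) - y) ** 2 for x, y in data)
    return loss + lam * sum(wi ** 2 for wi in w)

def gradient_step(w, data, lam, step=0.01):
    grad = [2 * lam * wi for wi in w]            # gradient of the regularizer
    for x, y in data:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for i, xi in enumerate(x):
            grad[i] += 2 * err * xi              # gradient of the squared loss
    return [wi - step * gi for wi, gi in zip(w, grad)]

data = [([1.0, 0.0], 1.0), ([0.0, 1.0], -1.0)]
w = [0.0, 0.0]
for _ in range(100):
    w = gradient_step(w, data, lam=0.1)
print(w, objective(w, data, lam=0.1))
```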
02 November 2010
Lecture 17: Classification with Decision Trees
Machine learning is all about predicting the future given the past.
A particular type of machine learning problem that crops up in all sorts of NLP applications is classification. We get an object and try to put it into one of K buckets. For example, spam filtering (K=2) or sentiment detection (K=2) or news versus sports versus entertainment (K=3). When K=2 it's called binary classification. Otherwise, multiclass.
The general setup is:
- Someone gives us a bunch of examples (x) together with their correct labels (y) -- this is our training data.
- We learn some function h (for "hypothesis", maybe I'll call it f) that maps x to y
- Then at "test time" someone gives us a bunch more (x) and we have to predict the corresponding (y)s
- We are evaluated based on how accurate we were at this prediction
Key question #1: Why don't I just memorize the training data and look it up at test time?
Key question #2: Why don't I just figure out what the most popular class (y) is and always report that?
Bad news: In general, in theory, learning is impossible.
Good news: In general, in practice, learning is EASY and works even when you might not expect it to.
Key concept is a feature vector. We take our object and extract features of that object, which is the representation that the learning will use. Good "feature engineering" is often worth more than good machine learning. (I say that even as a machine learning person!)
A very simple learning algorithm is a decision tree. It's like playing "20 questions." The idea is to pretend that you have data, and get to pick only one feature to look at. You pick that feature (let's pretend it's binary) and then split your data based on that feature. We now have two data sets. For each of these, we recurse and pick a new feature (it need not be the same on the two branches). At the end, we'll be down to a "leaf" (only one data point left) in which case we don't need to ask any more questions. (Or we'll have run out of features.)
Running the decision tree all the way to the leaves is like storing the examples in a database (why?). Running a zero-level decision tree is like remembering the most popular class (why?).
In general, we want something in between, and to choose the depth heuristically.
To do that, we'll set aside a bit of our training data as "development data" (or "validation data") in order to pick the "best" depth. (Or other stopping criteria.) Important point: why can't I just use the training data?
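For concreteness, here's a minimal sketch of picking the depth on development data using scikit-learn (assuming a reasonably recent version is available). The random features and labels are just stand-ins for real feature vectors.

```python
# Pick the decision-tree depth that does best on held-out dev data.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.rand(500, 10)                          # stand-in feature vectors
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)      # stand-in labels

X_train, X_dev, y_train, y_dev = train_test_split(X, y, test_size=0.2,
                                                  random_state=0)
best_depth, best_acc = None, -1.0
for depth in range(1, 21):
    tree = DecisionTreeClassifier(max_depth=depth).fit(X_train, y_train)
    acc = tree.score(X_dev, y_dev)             # accuracy on dev data
    if acc > best_acc:
        best_depth, best_acc = depth, acc
print(best_depth, best_acc)
```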
01 November 2010
Owen Rambow talk today IN AVW 3165!!!! 11am.
In the 80s and 90s of the last century, in subdisciplines such as planning, text generation, and dialog systems, there was considerable interest in modeling the cognitive states of interacting autonomous agents. Theories such as Speech Act Theory (Austin 1962), the belief-desire-intentions model of Bratman (1987), and Rhetorical Structure Theory (Mann and Thompson 1988) together provide a framework in which to link cognitive state with language use. However, in general natural language processing (NLP), little use was made of such theories, presumably because of the difficulty at the time of some underlying tasks (such as syntactic parsing). In this talk, I propose that it is time to again think about the explicit modeling of cognitive state for participants in discourse. In fact, that is the natural way to formulate what NLP is all about. The perspective of cognitive state can provide a context in which many disparate NLP tasks can be classified and related. I will present two NLP projects at Columbia which relate to the modeling of cognitive state: Discourse participants need to model each other's cognitive states, and language makes this possible by providing special morphological, syntactic, and lexical markers. I present results in automatically determining the degree of belief of a speaker in the propositions in his or her utterance.
Bio: PhD from University of Pennsylvania, 1994, working on German syntax. My office mate was Philip Resnik. I worked at CoGentex, Inc (a small company) and AT&T Labs -- Research until 2002, and since then at Columbia as a Research Scientist. My research interests cover both the nuts-and-bolts of languages, specifically syntax, and how language is used in context.
29 October 2010
28 October 2010
Lecture 16: Computational Semantics
A recent development in NLP-land is the interest in "semantic role labeling." I tend to think of this as "Parsing++", but it's basically a halfway point between "syntax" (which we've already passed a bit) and "semantics" (which we don't know how to do).
The idea in SRL is to find predicates (mostly verbs) and identify the arguments of these verbs (eg., Agents, Patients, Experiencers, Themes, etc.). For example:
- John ate a sandwich
- For the predicate ate, John is the agent and sandwich is the patient
- A sandwich was eaten by John
- Same as above.
- Bill saw John eat a sandwich
- For the predicate ate, it's the same as above
- For the predicate saw, the agent is Bill
The general approach to doing semantic role labeling is a pipeline of "first parse, second find verbs, third find the arguments to those verbs." (There's some argument as to whether parsing is strictly necessary, but it's clearly at least helpful.)
The biggest challenge here is finding the arguments. Basically we'll go over the parse tree, find potential arguments (basically internal nodes in the tree), and then try to use statistical information to figure out which nodes correspond to which arguments (or none).
- John saw a man with a telescope.
First, we'll identify all potential argument candidates:
- John
- a
- man
- a man
- with
- a
- telescope
- a telescope
- with a telescope
- John
- man
- a man
- telescope
- a telescope
- with a telescope
A more sane approach is to use instead (or in addition) tree paths. This idea has come up in a lot of places, and often works pretty well. If we look at the parse of this sentence:
- (S (NP John) (VP (V saw) (NP (DT a) (NN man)) (PP (P with) (NP (DT a) (NN telescope)))))
There are lots of ways to represent this path, but a baseline might be to say that the path looks like:
- up:VP down:3:PP down:2:NP down:2:NN
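Here's a small sketch of extracting exactly that path with nltk (assuming nltk is installed); the up:/down: encoding follows the example above, and the helper names are my own:

import nltk

parse = nltk.Tree.fromstring(
    "(S (NP John) (VP (V saw) (NP (DT a) (NN man)) "
    "(PP (P with) (NP (DT a) (NN telescope)))))")

def tree_path(tree, src_pos, dst_pos):
    # longest common prefix of the two tree positions = lowest common ancestor
    k = 0
    while k < min(len(src_pos), len(dst_pos)) and src_pos[k] == dst_pos[k]:
        k += 1
    steps = []
    # walk up from the source to the common ancestor...
    for depth in range(len(src_pos) - 1, k - 1, -1):
        steps.append("up:" + tree[src_pos[:depth]].label())
    # ... then down to the destination, recording 1-based child indices
    for depth in range(k + 1, len(dst_pos) + 1):
        child = dst_pos[depth - 1] + 1
        steps.append("down:%d:%s" % (child, tree[dst_pos[:depth]].label()))
    return " ".join(steps)

# positions of the predicate (V saw) and a candidate argument head (NN telescope)
verb = next(p for p in parse.treepositions()
            if isinstance(parse[p], nltk.Tree) and parse[p].label() == "V")
tele = next(p for p in parse.treepositions()
            if isinstance(parse[p], nltk.Tree) and parse[p].leaves() == ["telescope"])
print(tree_path(parse, verb, tele))  # up:VP down:3:PP down:2:NP down:2:NN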
However we encode the path, we now need to estimate something like P(means | path). A naive approach would be to Bayes' rule this to get P(means) * P(path | means) and then we could model the path with a Markov model. Or, if you've taken machine learning, you can use the path as a collection of features in some learning algorithm (or if you're being fancy, turn it into a kernel!). We'll talk about this more when we talk about learning.
As a final note, Matthew Gerber and Joyce Chai got best paper at ACL 2010 for their paper Beyond NomBank: A Study of Implicit Arguments for Nominal Predicates, where they look at similar things for nominal predicates rather than verbal predicates. In this case, I agree: this is a great paper.
26 October 2010
Lecture 15: Computational Lexical Semantics
Today's topic is word sense disambiguation. This is a very "heated" topic and we'll discuss in class (not on the public blog!) why this might be!
Here are some examples of the verb "drive"... which are different senses?
- Can you drive this four-wheel truck?
- We drove to the university every morning.
- She drove me to school every day.
- He drives me mad.
- She is driven by her passion.
- They drove back the invaders.
- She finally drove him to change jobs.
- He drove a nail into the wall.
- The player drove the ball far out into the field.
- She is driving away at her doctoral thesis.
- What are you driving at?
- My new truck drives well.
- He drives for the taxi company in DC.
- The car drove around the corner.
- The farmers drove the cows into the barn.
- We drive the beltway to work.
- He drove a golf ball across the street.
- I hit the ball with a bat and drove it into left field.
- The workers drove a tunnel through the hill.
- These engines are driven by steam.
- The hunter drove the forest looking for deer.
- After finding deer, he drove it into open ground.
Two of the most important concepts in WSD are:
- One sense per discourse (which may or may not be true...)
- One sense per collocation (which also may or may not be true...)
21 October 2010
HW07 is posted
(I'm trying to get better about posting these early, like in the beginning of the semester when I wasn't as behind schedule on everything!)
Lecture 14: Linguistic Challenges
Final project details:
- Groups, just like normal projects.
- Scope: something that with an extra month of work could turn into a paper at a major conference (ACL, EMNLP, NAACL, etc....)
- Four styles of projects:
- Reimplement some cool, interesting system from the past 2 or 3 years in an area of NLP/CL that you think is interesting. (+1 month would be to figure out what it does wrong, propose some new idea, and then get a paper out of it.)
- Take some problem you care about from your own research and look at how NLP can help it. (You're a better judge than I am about turning this into a paper, but at least you care about the topic!)
- Pick some interesting and under-studied linguistic phenomenon (like the things we're talking about today) and do a study of this phenomenon. Things not in English are totally allowed, but it would help if you knew the language you were working with. If you want to do data collection using mechanical turk, I have a budget of $200 for this: we can see how many people want to do this and then try to divide fairly (note: fairly is not the same as evenly).
- Something else you can convince me is interesting...
- You will be graded entirely on your writeup, which should look more or less like a 4-6 page ACL workshop paper (you're encouraged to use the ACL style files).
- Metaphor. This can be narrowly construed ("John devoured his steak like a lion") or broadly construed ("John drove Mary crazy"). The usual theory is that there are concrete, physical verbs/nouns; and then when we want to talk about something abstract, we ground it in one of these physical verbs. So a metaphor will usually consist of one abstract thing and one physical thing. There are interesting things to be done both on interpreting metaphors as well as generating metaphors. I have a small amount of data.
- Quantifier degree. I can say "Most men are jerks" and I can say "Most jerks drink beer" but does this mean that most men drink beer (or rather, that I think that most men drink beer?). What about other quantifiers? How consistent is this across different statements ("Most men are jerks" versus "Most animals are invertebrates" -- is the meaning of "most" the same)? I don't have any data, but you could collect it.
- Metonymy. This is when I say "The White House" and really mean "Obama" (or at least I mean "someone from the White House"). This combines with ideas of selectional preferences, where we might be able to figure out that the agent (or subject) of "said" should be animate, and might know that the White House is not animate, and so it must be metonymic. Can we identify metonymic usages? Can we resolve them? I have a reasonable amount of data.
- Quantifier scope. This is a classic semantics problem. "All the women on the island have a husband." Does this mean that each woman has her own husband, or that there's one man who is married to all of the women simultaneously? Similarly, "Each man entered the bar with a dog" -- how many dogs are there? I have no data.
- Negation scope. This is perhaps particularly interesting in the case of sentiment analysis. "I did not like the plot but liked the acting." -- Does "not" scope over just "like the plot" or over the full "like the plot but liked the acting"? There is also an interaction with quantifiers ("No man should marry a leprechaun"), but this is both rare and hard. I have no data, but I think we could find some.
- Language change. We have English documents going back 500 years, but English back then looks a lot different than English now. Moreover, our part of speech taggers and parsers that work well now won't work well on Shakespeare (I would guess :P). Can we track changes back in time, and use this information to get today's models to do a good job way back when? I have a lot of data.
- Traces and Ellipsis. "John went to the store but Jerry didn't ______" -- what was it that Jerry didn't do. This is a form of a trace. The Penn Treebank comes with some but not all traces spelled out. Try to figure out where they should be and what they refer to.
- Exaggeration. No one has ever looked at this problem. Okay that's not true, but only because it's an overstatement. Can we automatically identify overstatements in text? Can we tone them down or at least identify the degree of overstatement? I have no data, but it shouldn't be too hard to find some.
- Emotion from text. Can we identify characters' emotions by looking at what they say, for instance in novels? I have data of quoted speech, paired with emotional strength -- can we predict emotional strength given the quoted speech and perhaps the context in which it was spoken? I have lots of data.
- Language and X. Often we can get language paired with objects in another medium, for instance images and captions. Can we identify correspondences? Or perhaps generate captions from images? We have (or can get) similar data for comments and code, or formal specifications (variant of first order logic) and descriptions. Or database queries and text. I have some data, depending on what you want "X" to be.
- Shannon game (word/character prediction) with domain knowledge.... The Shannon game is a way of measuring entropy of a language. Perhaps make an online game (like hangman?) where we can get people to play this game, and then can also tell them things like "this is a news story" or "this is a children's book" or "this is a medical text" or something and see how it affects their ability to guess the next word.
- "Translating" from tweets to language that your parents could understand. We have lots of SMS/twitter data, and lots of normal text. A lot of this is abbreviation expansion, but some is not. Can we figure out which terms in tweets should be expanded, and then try to find their expansions in normal text? There are at least two types ("something" -> "sth" or "laugh out loud" -> "lol"), but maybe more. We can get lots of data easily.
- Identifying character relationships in literature. We can track character interactions in literature, but the relationship between these characters is not known. And perhaps it changes over time. Can we figure out how two characters in a novel feel about each other on the basis of what they do/say to each other? I have some data and we can get a lot more easily.
- Many other languages have phenomena that English does not. For example, subject dropping happens in lots of languages (eg., Japanese) when it's clear from context. Also stuff in the Semantic Analysis chapter of J+M is pretty much up for grabs.
19 October 2010
Lecture 13: Interpretation as Abduction
"Interpretation as Abduction" is one of the more infuential computational semantics papers from its era (and really, ever).
The basic claim is that through (weighted) abduction, one can explain a whole host of complex linguistic phenomena:
- Reference resolution ("John_1 saw himself_1 in the mirror" -- where "John" and "himself" co-refer)
- Interpretation of compound nominals ("computer science" versus "cs department")
- Syntactic ambiguity (eg., PP attachment)
- Metonymy ("The White House announced that..." really means that someone in the White House announced something)
- Etc....
Abduction is the type of inference we try to do in science. We observe something, and we try to posit what caused this to happen. For instance, we may observe that apples fall from trees. We can posit this is due to gravity. Or we can posit that this is due to little elves that hang out in apples and little pots of gold in the earth and the elves are trying to get to the pots of gold. One of these is a more plausible explanation than the other, since it requires us to invent fewer things. (Though as far as I know, no one really knows how gravity works... so maybe it's a bad example... maybe it is elves/gold.) This is often referred to as Occam's Razor: of all explanations, choose the "simplest."
The key idea to use abduction to do sentence interpretation is:
- Prove the logical form of the sentence
- together with the constraints that predicates impose on their arguments (eg., the subject of "run" must be animate)
- allowing for coercions (eg., we might coerce someone who works at the White House into the White House)
- Merging redundancies where possible
- Making assumptions where necessary
In weighted abduction, we assume that there is a cost for coercions and making assumptions, and that we want to find the set of things we can assume that have the least possible cost.
The idea is that our database might contain entries like:
- If P and Q then R
- If P / w1 and Q / w2 then R
If we have redundancy... eg., something like "blah blah q(x) and q(y) blah blah" and at some point we derive x=y, then we can merge this to just "blah blah q(x) blah blah" and the cost of q(x) is the minimum of the costs of q(x) and q(y).
For example, we might have:
- (forall x) car(x) / 0.8 AND notop(x) / 0.4 IMPLIES convertible(x)
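Here's a toy illustration (my own, not from the paper) of how such a weighted rule gets used: each antecedent is either already proven (free) or must be assumed at weight times the goal's cost.

# score one candidate proof of convertible(x): antecedents already in the
# database cost nothing; anything else is assumed at weight * goal_cost
def abduction_cost(goal_cost, antecedents, known):
    total = 0.0
    for literal, weight in antecedents:
        if literal not in known:
            total += weight * goal_cost
    return total

rule = [("car(x)", 0.8), ("notop(x)", 0.4)]

# nothing known: assuming both antecedents costs 1.2 * goal cost, which is
# worse than just assuming convertible(x) itself at cost 1.0, so don't bother
print(abduction_cost(1.0, rule, known=set()))       # 1.2 (modulo floating point)
# but if car(x) is already in the database, we only pay for notop(x)
print(abduction_cost(1.0, rule, known={"car(x)"}))  # 0.4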
One kind of cool outcome of this is that you can make "reverse" inferences fairly easily. For instance, maybe I want to be able to assume that sometimes when people talk about animals, they're actually talking about cats. But they're being vague. I can just add a rule:
- (forall x) animal(x) AND etc(x) IMPLIES cat(x)
It turns out (see section 6 in the paper) that you can actually handle all of syntax, semantics and pragmatics in this single framework!
15 October 2010
P2 deadline and ML midterm
I've been informed that there's a midterm in Lise's ML class on the 19th, which coincides with the P2 deadline.
To try to help you not lose too much sleep, I've pushed the deadline for P2 back to Friday night at 10p. However, to be fair to those who don't have ML, or who have planned their schedule around the deadline on the 19th, I'll give you 15% extra credit if you hand it in by the original deadline (19th). If you do hand it in early, please make a note in your writeup that you've done so. Hopefully 15% is enough to balance out anything that you would lose by not being able to compete up to the last minute with your classmates on the competition parts.
Also, a note on pruning:
This is now in the .pdf writeup, but:
- Be sure that you prune all cells, including the right-most ones. Your labeled F measure for unpruned should be about 78%.
- Don't forget to debinarize before you evaluate!
- If you prune every cell, including the root, with K=2, your labeled F measure should be about 23%. You should have about 83 unparsable sentences.
- If you prune every cell except the root (which is a totally sensible thing to do -- thanks to several of you who have pointed this out), your labeled F should be about 59%, and you should have about 38 unparsable sentences. (As you can see, it's a really good idea not to prune the root, especially since it doesn't make things any slower -- In fact, it might actually make things faster!)
- You can implement it either of the above two ways -- just be sure to document which you're doing.
- Be sure to prune before applying unary rules. Thus, even with K=2, you will often end up with more than 2 items in a chart cell. However, only 2 of these will have binary derivations; all the rest should have unary derivations.
14 October 2010
Lecture 12: First Order Logic
So far we've talked a lot about syntax. Of course, syntax isn't the end-all be-all of language processing: we want to "understand" language. Often this means mapping natural language sentences to some form of formal language sentences. This means we have to decide on a formal language. A common choice is first order logic (FOL).
Expressions in FOL take one of the following forms:
- Atomic symbols like "Hal"
- Relation symbols like "Eats(.,.)", for instance in "Eats(Hal, Sandwich)". Relations can have arbitrary arity (i.e., take arbitrarily many arguments)
- Variables like "a" (see later for quantification)
- Combinations via and, or, not, implies, iff, etc. Eg., Eats(Hal, Sandwich) implies not OnThePlate(Sandwich)
- Existential and universal quantifiers. For example, "(exists a) Eats(Hal, a)" (which roughly means "Hal eats something.")
- Equality symbol. For example, "(exists a) Eats(Hal, a) and (a = Sandwich)"
FOL is a nice language to map into because we know a lot about how to reason in FOL. Reasoning is, of course, hard (take an AI class), but sort of do-able.
The general way in which FOL systems work is:
- We have a bunch of background knowledge expressed as "axioms" -- basically a database of FOL sentences
- We get a text, from which we derive new FOL sentences
- Someone issues a query, which we will try to "prove" given our collection of sentences. A proof of a query is an answer to the question
- Someone might query "Who shot Lincoln?"
- We map this to something like "(exists a) Shot(a, Lincoln)"
- We try to prove this statement given our database, which might include some statement like "Shot(Booth, Lincoln)". Thus, "a=Booth" is a "proof" of our query, and gives us the answer
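Here's a minimal sketch of that query-as-proof idea, with facts as tuples and query variables written as strings starting with "?". It's toy pattern matching, not a real theorem prover:

FACTS = {("Shot", "Booth", "Lincoln"),
         ("Eats", "Hal", "Sandwich")}

def match(query, fact):
    # return a variable binding if query unifies with fact, else None
    if len(query) != len(fact):
        return None
    binding = {}
    for q, f in zip(query, fact):
        if q.startswith("?"):            # variable: bind it (consistently)
            if binding.get(q, f) != f:
                return None
            binding[q] = f
        elif q != f:                     # constant mismatch
            return None
    return binding

def prove(query):
    return [b for fact in FACTS if (b := match(query, fact)) is not None]

# "Who shot Lincoln?"  ->  (exists a) Shot(a, Lincoln)
print(prove(("Shot", "?a", "Lincoln")))  # [{'?a': 'Booth'}]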
The common assumption (which is not always true) is that language is compositional. Namely, the meaning of a sentence is derived from the meanings of its parts. For examples:
- Hal ate a sandwich with mayo.
- Parses to (Hal (ate (a sandwich (with mayo))))
- Bottom up, we can look at (mayo) and derive "(exists a) Mayo(a)"
- Then (sandwich (with mayo)) and derive "(exists b) Sandwich(b) and Contains(b, ___)" where ___ comes from the interpretation of (with mayo), yielding "(exists a,b) Mayo(a) and Sandwich(b) and Contains(b,a)"
- Then (a sandwich ...), takes what we had before and makes "b" unique, usually written as "(exists a, exists!b) ..."
- Then we take Hal and get "Hal"
- And finally, "x ate y" will give us something like "(exists e) Eat(e, x, y) and InThePast(e)" which when we plug in with x and y gives the final interpretation:
(exists a, exists !b, exists e) Mayo(a) and Sandwich(b) and Contains(b,a) and Eat(e, Hal, b) and InThePast(e)
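A toy sketch of the compositionality idea, where each constituent's meaning is a small function and the sentence meaning falls out of applying them bottom-up (the helper names and exact strings are just illustrative, not a real semantic grammar):

# each constituent's meaning is a little function; applying them bottom-up
# builds the logical form
def mayo():
    return "Mayo(a)"

def sandwich_with(modifier):
    # "sandwich with X": the sandwich b contains whatever the modifier describes
    return "Sandwich(b) and Contains(b,a) and " + modifier

def ate(subject, object_form):
    return ("(exists a, b, e) " + object_form +
            " and Eat(e, " + subject + ", b) and InThePast(e)")

print(ate("Hal", sandwich_with(mayo())))
# (exists a, b, e) Sandwich(b) and Contains(b,a) and Mayo(a) and Eat(e, Hal, b) and InThePast(e)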
In order to make this happen, we usually need a semantic grammar that tells us how to get from syntax to semantics. The grammar will consist of rules that are akin to the type of rules we applied to do the semantic analysis of the above example sentence.
Key challenges with FOL:
- What are the right sets of symbols and relations?
- Where does the necessary background knowledge come from?
- You can try to get humans to do it for you: see the Cyc project
- Or you can try to extract it automatically: see ReadTheWeb or ODIE or TextRunner, for example
- How to get the semantic grammar
- You can hand code it (see verbnet and the unified verb index for a verb-centric view).
- Or you can try to learn it from examples (see Ray Mooney's talk yesterday)
- Sometimes need more than FOL
- How to handle "Many students love NLP"?
- Or "John did something yesterday" -- for this we may need second order logic, where you can quantify over relations -- "(exists R, e) R(e, John) and InThePast(e)"
12 October 2010
Project 2 warning on pruning results
I had a bug in my pruning, and wasn't pruning the right-most cells. (It was an off-by-one error.) Since English is mostly right-branching, the cells that touch the right edge (i.e., end of sentence) are incredibly important, so not pruning them makes life a lot better than otherwise. I'm working on fixing it right now and checking to make sure there aren't other bugs. I'll post a comment to this post when it's done and I've updated the results in the .pdf, but your pruning results should be much worse than what is in the writeup (i.e., with K=2, you should get no parse on 80-something sentences, and your f-score should be in the 10s or 20s).
Project 2 stuff...
I fixed the P2 pdf (just the one linked on the web page, not the one in the .tar.gz file) so that it includes unary productions in the examples. This is only relevant for part 2 -- nothing else changed.
Also, I'm not sure if I included them previously, but here are the links for the online test scripts:
HW05 is posted
(Sorry for the delay -- there's just one question :P)
Lecture 11: Incorporating Context
Many languages express some form of agreement. In English it is mostly Subject/Verb agreement (i.e., "He eats a sandwich" versus "They eat a sandwich"). In other languages agreement can be more widespread: Adjective/Noun, Determiner/Noun, Verb/Object, Noun/Relative Clause, etc.
Suppose we want to capture this with a (P)CFG.... the only real way to do this is to expand our set of non-terminals. I.e., we might start with a grammar like:
S -> NP VP
NP -> Pro
NP -> Det Noun
VP -> Verb NP
Pro -> he
Pro -> they
Verb -> eat
Verb -> eats
Det -> a
Which will generate our good sentences, but also bad sentences like "He eat a sandwich" and "They eats a sandwich". We can fix this by adding labels to our non-terminals:
S -> NP+sing VP+sing
S -> NP+plur VP+plur
NP+sing -> Pro+sing
NP+plur -> Pro+plur
NP+sing -> Det Noun+sing
NP+plur -> Det Noun+plur
VP+sing -> Verb+sing NP
VP+plur -> Verb+plur NP
Pro+sing -> he
Pro+plur -> they
Verb+sing -> eats
Verb+plur -> eat
Det -> a
Now life is great: we will no longer produce our bogus sentences. (Note that we would also have to account for "a" being singular, etc...)
Of course there are other phenomena we might want to capture, for instance pronouns sitting in object position have to change case: "A sandwich eats him" not "A sandwich eats he". If we ignore the singular/plural distinction, we can solve this separately.
S -> NP+nom VP
VP -> Verb NP+acc
NP+nom -> Pro+nom
NP+acc -> Pro+acc
NP -> Det Noun
Pro+nom -> he
Pro+acc -> him
Pro+nom -> they
Pro+acc -> them
And so on...
We've now solved the nominative/accusative issue, so we can combine the sing/plur grammar with the nom/acc grammar. The problem is exponential blowup: handling one roughly doubles the size of the grammar, and handling the other doubles it again. In general, for every agreement-like phenomenon we want to handle, we'll double the size of our grammar again. This is bad (a) because it makes our grammar huge and (b) because we'll need tons and tons and tons of data to get reliable estimates of these probabilities from a treebank!
(Other English phenomena you might want to capture are selectional preferences -- which are really more semantics issues than syntactic issues; or determiner/noun agreement.)
One solution to this problem is to augment our grammar with features. That is, basically do what linguists do and write "+sing" on the pre-terminal, but not split the grammar into bits. Instead, we need to come up with ways of propagating these features up the tree, and checking for disagreement. This process is known as "unification".
The idea of unification is that everything has a bunch of features with values. Values can be atomic (number=sing) or disjunctions (number={sing,plur}) or variables (number=A). Two values unify basically if they have a non-empty intersection (it's actually more complicated than this, but this is the basic idea). So sing and sing unify and produce sing. Similarly sing and {sing,plur} unify and produce sing. A and sing unify and produce sing with a note off on the side of the page that A=sing. sing and plur don't unify. Our grammar will now encode what unifications happen at each step (eg., when producing an S, you want to unify the number feature of the NP and VP; or when producing a VP, you want to unify the object NP with "case=accusative" and likewise for subject NPs).
Our grammar is now a set of context free productions and unification rules. One nice thing is that chart parsing basically still works, but our parse cells become more complicated because they have to remember the values of features. But at least our grammars don't blow up!
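Here's a minimal sketch of that unification operation on feature values, with atomic values, disjunctions (as frozensets), and variables (uppercase strings) resolved through a bindings dictionary; the real machinery is more involved, but this captures the examples above:

def is_var(v):
    return isinstance(v, str) and v.isupper()   # e.g. "A" is a variable

def unify(a, b, bindings):
    # unify two feature values; return the result, or None if they clash
    if is_var(a):
        a = bindings.get(a, a)
    if is_var(b):
        b = bindings.get(b, b)
    if is_var(a) and is_var(b):
        bindings[b] = a
        return a
    if is_var(a):
        bindings[a] = b
        return b
    if is_var(b):
        bindings[b] = a
        return a
    # both concrete: treat values as sets of possibilities and intersect
    sa = a if isinstance(a, frozenset) else frozenset([a])
    sb = b if isinstance(b, frozenset) else frozenset([b])
    merged = sa & sb
    if not merged:
        return None
    return next(iter(merged)) if len(merged) == 1 else merged

bindings = {}
print(unify("sing", "sing", bindings))                       # sing
print(unify("sing", frozenset({"sing", "plur"}), bindings))  # sing
print(unify("A", "sing", bindings), bindings)                # sing {'A': 'sing'}
print(unify("sing", "plur", bindings))                       # None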
07 October 2010
Lecture 10: Statistical Parsing
Context free grammars are great, but as with any X, if X is great, then probabilistic X is greater still!
A probabilistic context free grammar (PCFG) is a CFG with "weights" attached to each of the rules. These weights must be non-negative. For any left-hand-side, the sum of weights must be one. The following is a small PCFG:
S 0.5 -> NP VP
S 0.5 -> VP
NP 0.7 -> Det Noun
NP 0.2 -> Proper
NP 0.1 -> NP PP
VP 0.6 -> Verb NP
VP 0.3 -> Verb
VP 0.1 -> VP PP
Note that for any given LHS (eg., "NP") the sum of weights (probabilities) is one.
The probability of a tree is just the product of probabilities of rules used in that tree.
We can generate random sentences with a PCFG by starting at "S" and then picking a rule at random (weighted by the probabilities) and then recursively generating the corresponding children.
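A small sketch of that generation procedure for the toy PCFG above (I've added a PP rule and made-up lexical rules so the example bottoms out in actual words):

import random

PCFG = {
    "S":      [(0.5, ["NP", "VP"]), (0.5, ["VP"])],
    "NP":     [(0.7, ["Det", "Noun"]), (0.2, ["Proper"]), (0.1, ["NP", "PP"])],
    "VP":     [(0.6, ["Verb", "NP"]), (0.3, ["Verb"]), (0.1, ["VP", "PP"])],
    "PP":     [(1.0, ["Prep", "NP"])],      # added so PPs expand to something
    "Det":    [(1.0, ["the"])],             # made-up lexical rules
    "Noun":   [(1.0, ["sandwich"])],
    "Proper": [(1.0, ["Hal"])],
    "Verb":   [(1.0, ["ate"])],
    "Prep":   [(1.0, ["with"])],
}

def generate(symbol="S"):
    if symbol not in PCFG:                  # it's a word, not a non-terminal
        return [symbol]
    probs, rhss = zip(*PCFG[symbol])
    rhs = random.choices(rhss, weights=probs, k=1)[0]   # pick a rule at random
    return [word for child in rhs for word in generate(child)]

print(" ".join(generate()))   # e.g. "the sandwich ate Hal"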
The probability of a sentence under this grammar is obtained by summing (marginalizing) out all the trees that can generate that sentence.
To parse with a PCFG, we just parse with the corresponding CFG, but then weigh items in the chart by their probabilities. If we find a higher-probability derivation of the same item in the chart, we throw out the old one and replace it with the new one. In CKY this is easy.
A treebank is a collection of human-labeled trees. Given a treebank, we can extract the corresponding PCFG by counting productions and normalizing. Most trees are not binary so you need to binarize your rules.
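Counting and normalizing is only a few lines with nltk's Tree and productions() helpers (assuming nltk is installed); the two-tree "treebank" here is obviously made up:

from collections import Counter, defaultdict
import nltk

treebank = [   # two made-up trees standing in for a real treebank
    nltk.Tree.fromstring("(S (NP (D the) (N man)) (VP (V ate) (NP (D a) (N sandwich))))"),
    nltk.Tree.fromstring("(S (NP (D the) (N sandwich)) (VP (V likes) (NP (D the) (N man))))"),
]

counts = Counter(prod for tree in treebank for prod in tree.productions())
lhs_totals = defaultdict(int)
for prod, c in counts.items():
    lhs_totals[prod.lhs()] += c

# relative frequency estimate: count and normalize per left-hand side
pcfg = {prod: c / lhs_totals[prod.lhs()] for prod, c in counts.items()}
for prod, p in sorted(pcfg.items(), key=str):
    print("%.2f  %s" % (p, prod))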
Note on smoothing: you have to smooth if you want your grammar to be able to parse out-of-sample trees. Smoothing can be done just like in HMMs, but you want to be careful not to compute stuff that is irrelevant.
Note on efficiency: The grammars we produce are huge (tens or hundreds of thousands of rules). As a result, paying O(N^3|G|) is often prohibitive, even for N around 30 or 40. A simple solution is pruning: after you process each cell in the chart, throw out its low probability items. You can either do count-based pruning (only keep the top 5 in each cell), ratio-based pruning (only keep the ones within a factor of 10 probability of the best one), or probability-based pruning (only keep the ones with probability at least 10e-6). Usually the last sucks, but the first two work reasonably well.
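Here's what the three pruning strategies look like on a single chart cell, represented as a dictionary from label to probability (the cell contents and thresholds are just illustrative):

def prune_count(cell, k=5):            # keep only the top k items
    return dict(sorted(cell.items(), key=lambda kv: -kv[1])[:k])

def prune_ratio(cell, factor=10.0):    # keep items within a factor of the best
    best = max(cell.values())
    return {label: p for label, p in cell.items() if p * factor >= best}

def prune_prob(cell, threshold=1e-6):  # keep items above an absolute threshold
    return {label: p for label, p in cell.items() if p >= threshold}

cell = {"NP": 0.02, "VP": 0.015, "S": 1e-4, "PP": 3e-7}
print(prune_count(cell, k=2))   # {'NP': 0.02, 'VP': 0.015}
print(prune_ratio(cell))        # NP and VP survive
print(prune_prob(cell))         # everything except PP survives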
04 October 2010
No class Tue 05 Oct 2010
Due to Hal being sick again. Sigh.
So here's a puzzle to entertain you while you're not learning all about treebanking and PCFGs....
Suppose Hal presents symptoms that make it 90% certain that he has some illness X. And suppose that the rate of transmission of X is 5%. Suppose that the average rate of incidence of X in a general population is 0.01%. What is the probability that you would get sick if we had class tomorrow (and you showed up)? There are 45 students in class: what's the probability that at least one would get sick? Finally, suppose that your utility function is defined as U(learn about pcfgs and don't get sick) = +10, U(don't learn and don't get sick) = 0, U(learn and get sick) = -20, U(don't learn and get sick) = -100. (How annoying would that be.) What is the expected utility if I do teach class? What is the expected utility if I cancel class? Am I doing the right thing?
Feel free to post solutions :).
01 October 2010
Project 2 is posted, due Oct 19
It's all about parsing. The writeup is really long, but don't be too intimidated by it. The project is all about grammars and parsing, so go off and do syntax!
28 September 2010
P1: free token for everyone!
Everyone should have one extra token, as a token (hahahahahahaha) of good will.
Remember that P1 is due at 2:20pm, and that HW04 is due next Tuesday before class.
27 September 2010
Lecture 9: Dynamic programming and the CKY algorithm
We will begin by briefly attempting to recover from my confusing explanation of X-bar theory, and talk about specifiers, complements and adjuncts. We'll then finish up the DP versus NP distinction and IP for S, and then movement. We'll then get to the computational side...
Parsing as search
At a basic level, the parsing problem is to find a tree rooted at S (or IP...) that spans the entire sentence. We can think of this as a (bottom-up) search problem over partial hypotheses (eg., hypotheses that span an initial segment of the sentence).
- At any given time, we've built a partial tree over the first n words
- Want to extend it to include the n+1st word
- A search step involves:
- Choose a rule from the grammar
- Choose existing constituents that match that rule
- Create a new constituent
Dynamic Programming
Search can be expensive because we end up re-creating the same constituents over and over again. This suggests dynamic programming. For the following, we will assume our grammar has been binarized, so rules always look like "A -> B C". Unary rules (eg., "A -> B") are do-able, but are a simple extension.
We will write sentences as "0 The 1 man 2 ate 3 a 4 sandwich 5" where the numbers indicate positions. A partial hypothesis has a label (eg., "NP") and a span (eg, "from 0 to 2"). We will write this as NP[0,2].
The key insight in dynamic programming is to store a chart. A chart is a two-dimensional array of (eg.) sets. Chart[i,j] will store all possible constituents that can span the range from i to j.
Initialization:
- For each position i=1..|sentence|
- For each possible part of speech P for word i
- Insert P into Chart[i-1,i]
- (This means that we've found a way to cover i-1 to i using non-terminal P)
- For phrase lengths l=2..|sentence|
- For each start position i=0..|sentence|-l
- Set end position k=i+l
- For each split point j=i+1..k-1
- For each grammar rule X -> Y Z
- If we can find a Y in Chart[i,j] and a Z in Chart[j,k]:
- Add X to Chart[i,k] with backpointers ala Viterbi
- (This means we've found a way to cover i to k using the non-terminal X)
Time complexity is O(|sentence|^3 * |Grammar|).
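Here's the chart-filling pseudocode above turned into a runnable CKY recognizer for a binarized grammar; the tiny grammar and lexicon are invented, and backpointers are left out to keep it short:

from collections import defaultdict

BINARY = {                    # binarized rules X -> Y Z, keyed by (Y, Z)
    ("NP", "VP"): {"S"},
    ("D", "N"): {"NP"},
    ("V", "NP"): {"VP"},
}
LEXICON = {                   # possible parts of speech for each word
    "the": {"D"}, "man": {"N"}, "sandwich": {"N"}, "ate": {"V"},
}

def cky(words):
    n = len(words)
    chart = defaultdict(set)              # chart[(i, k)] = labels spanning i..k
    for i, w in enumerate(words):         # initialization
        chart[(i, i + 1)] |= LEXICON.get(w, set())
    for length in range(2, n + 1):        # phrase lengths
        for i in range(0, n - length + 1):    # start positions
            k = i + length                    # end position
            for j in range(i + 1, k):         # split points
                for y in chart[(i, j)]:
                    for z in chart[(j, k)]:
                        chart[(i, k)] |= BINARY.get((y, z), set())
    return chart

chart = cky("the man ate the sandwich".split())
print("S" in chart[(0, 5)])   # True: the whole sentence parses as an S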
23 September 2010
Lecture 8: Context Free Grammars
Context free grammars:
* strictly more powerful way of expressing languages than a regular expression
* its languages are called context free languages
* many human languages are approximately context-free (there are always exceptions), while most are not regular
Context free grammars consist of:
* An alphabet (the "words" in the language)
* A set of non-terminals
* A "start" non-terminal
* A set of rewrite rules for getting from non-terminals to either other non-terminals or terminals
Simple context free grammar:
S -> NP VP
NP -> D N
VP -> V NP
D -> "the"
D -> "a"
N -> "man"
N -> "sandwich"
V -> "ate"
V -> "likes"
This is actually a finite language (what are the sentences in this language)?
Derivations are "proofs" that a sentence belongs to the language by giving a/the sequence of productions that leads to that sentence.
We'll usually draw trees as proofs, but other formats are more useful as text, for instance: (S (NP (D the) (N sandwich)) (VP (V ate) (NP (D the) (N man))))
In such grammars, we will often call the things just above the terminals (parts of speech) "pre-terminals".
We can make this grammar more fun by adding rules:
NP -> NP PP
VP -> VP PP
PP -> P NP
P -> "with"
Now we have an infinite grammar that can generate strings like "the man with the man with the sandwich with the man with the ..."
Note also that this grammar is now ambiguous. "the man likes the man with the sandwich" could either be (S (NP the man) (VP (V likes) (NP (NP the man) (PP (P with) (NP the sandwich))))) or (S (NP the man) (VP (VP (V likes) (NP the man)) (PP (P with) (NP the sandwich)))). The first one has the "correct" meaning, while the second would mean that he uses the sandwich to like the man.
A simple context free language that is not regular is:
S -> a S b
S -> ab
Which describes the language "{ a^n b^n : n > 0}" of strings of equal numbers of as and bs. You can show this language is not regular formally, or you can recognize that an FSM would kind of have to have N-many states to recognize the finite version of this language up to sequences of length N.
A popular approach for context free models of human language is "X-bar" theory. Here, we have base categories (X0 = N,V,A,P), then "projections" of these categories: X' (aka X-bar), and X'' (aka X-double-bar aka XP). The general form of a phrase is (XP (YP (X' (X0 ZP)))), where YP is the "specifier", X0 is the head, and ZP is the complement (argument). Only full projections can fit in as adjuncts, specifiers or complements.
The idea is that the arguments fit in the "bar" slots, and the adjuncts fit in the "phrase" slots. Roughly, an argument is something that is sort-of necessary and an adjunct is sort-of optional. Usually you can have arbitrarily many adjuncts, but only a fixed number of arguments. For instance in "the man ate a sandwich of turkey with lettuce on a plate with gusto at the restaurant" the verb "ate" has one argument (sandwich) and two adjuncts ("with gusto" and "at the restaurant") while "sandwich" has three adjuncts ("of turkey", "with lettuce" and "on a plate").
Here are some example sentences in X-bar style analyses:
- (PP (DP (D' (D so))) (P' (ADVP (ADV' (ADV completely))) (P' (P in) (NP (N' (DP (D' (D the))) (N wrong))))))
- (IP (NP Poirot) (I' (Infl will) (VP (V' (V abandon)) (NP (N' (DP (D' (D the))) (N investigation))))))
- Identical structure for the second half of "I did not expect Poirot to abandon the investigation."
- Under some theories, (IP (NP Poirot) (I' (Infl +ed) (VP (V abandon) (NP the investigation)))) is appropriate
(Warning: by now, lots of people no longer think NPs are really headed by nouns, but are rather headed by determiners, so that what we usually think of as an "NP" is really a "DP." That is, instead of "(NP (N' (DP (D' (D the))) (N man)))" you would have "(DP (D' (D the) (NP (N' (N man)))))". In this class, we'll always use NP and not DP, but you should know that this is slightly outdated.)
21 September 2010
HW3: you may ignore the part about Chomsky normal form
If you've already done it, or choose to do it anyway, I'll give you a tiny bit of extra credit. I thought we'd be a bit further along by now when I wrote that homework a few weeks ago. (You should still make a semi-valiant effort on the first part of the CFG question).
16 September 2010
Lecture 6: Part of Speech Tagging
Woohoo, some linguistics!
Parts of speech are useful for several things:
- Knowing how to pronounce a word (object noun vs verb)
- Knowing what words come around it (personal pronouns vs possessive pronouns)
- Knowing how it behaves syntactically -- most verbs behave the same and are "substitutable" from a syntactic perspective (though probably not from a semantic perspective)
Note: not all parts of speech exist in every language, and if a word is in one part of speech in language X, its translation isn't necessarily the same part of speech in language Y. (Eg., beautiful in English/Korean, or green in English/Japanese).
Some major open classes:
- Nouns: things. Concrete (ship, chair), abstract (happiness) or verb-like terms "gerunds" (running), proper "Hal"
- Verbs: actions. Eat, drink, sleep; English has auxiliaries but these are usually considered closed class (see below)
- Adjectives: mostly modify nouns (blue, big, old)
- Adverbs: mostly modify verbs (slowly, very) but also other weird things (every, unfortunately, not)
Some closed class:
- Prepositions: the bunny ran ___ the hill
- Determiners: articles: a, an, the; others: this blog
- Pronouns: (personal) she, I, who, (possessive) my, her, (wh) what, how
- Conjunctions: and, but, either/or
- Auxiliary verbs: would, could, can, to -- but not the preposition "to"
- Particles: like prepositions, but go with verbs (put down, raise up, etc.)
- Numerals: one, two, three, first, second, third, etc.
Rule based approaches usually consist of:
- A small lexicon that covers some common open-class words and most common closed-class words
- A set of rules for disambiguation
- A set of heuristics for guessing the tag of unknown (open class) words
- Does it end in "ed"? "ly"? "ing"? "s"?
- Is it capitalized? (Doesn't apply to P1)
Rules get more complicated...
- Noun/Verb disambiguation is usually the biggest issue
- Is there a determiner or adjective to my left? How far?
- Am I at the end of a sentence?
- Is there an auxiliary to my left? An adverb?
- There's also Verb/Aux disambiguation
- Verb Aux is a rare order, but Aux Verb is common
- Is there a verb anywhere in the sentence?
- Noun/WH-complementizer disambiguation
- John saw that book
- John saw that book burn up
- Is there a verb to the right, nearby?
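Here's a toy version of this rule-based recipe, with a tiny lexicon, suffix heuristics for unknown words, and one contextual disambiguation rule (all of the word lists and rules are just for illustration, not a real tagger):

LEXICON = {"the": "DET", "a": "DET", "he": "PRO", "they": "PRO",
           "man": "NOUN", "sandwich": "NOUN", "eats": "VERB", "ate": "VERB"}

def guess_unknown(word):              # suffix heuristics for unknown words
    if word.endswith("ly"):
        return "ADV"
    if word.endswith("ing") or word.endswith("ed"):
        return "VERB"
    return "NOUN"                     # default open-class guess

def tag(words):
    tags = [LEXICON.get(w.lower(), guess_unknown(w.lower())) for w in words]
    # one contextual rule: a "verb" right after a determiner is probably a noun
    for i in range(1, len(tags)):
        if tags[i] == "VERB" and tags[i - 1] == "DET":
            tags[i] = "NOUN"
    return list(zip(words, tags))

print(tag("the man eats a sandwich slowly".split()))
# [('the', 'DET'), ('man', 'NOUN'), ('eats', 'VERB'), ('a', 'DET'),
#  ('sandwich', 'NOUN'), ('slowly', 'ADV')]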
15 September 2010
Slightly updated notes.sty file
If you're annoyed with the "solution" environment in the notes.sty file that I've provided, please download the new one off the web page. It is much friendlier and will let you have better equations, line breaks, etc., inside solutions. (Thanks to your fellow student, Ke Wu, for the patch!)
14 September 2010
Lecture 5: N-gram models
Today we'll talk first about estimation of:
- Bernoulli random variables (coin flips)
- p(Heads) ~= Count(Heads) / Count(total flips)
- Multinomial random variables (die rolls)
- p(die comes up "4") ~= Count("4"s) / Count(total rolls)
- Now think about a giant die with |Vocabulary|-many sides!
We'll then run through the chain rule to break p(s) = p(w_1 w_2 ... w_N) = prod_n p(w_n | w_{1..n-1}) and then apply a Markov assumption to the terms in the product to arrive at the n-gram language model.
This language model will be a product of Multinomial random variables, so we can apply everything we know about estimation.
Finally, we'll talk about:
- Smoothing (adding fake counts)
- Structural zeros versus statistical zeros
(Warning: the treatment we give of out of vocabulary -- "OOV" -- terms will not be very detailed!)
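Here's a small sketch of a bigram version with add-one smoothing, i.e. p(w_n | w_{n-1}) ~= (count(w_{n-1} w_n) + 1) / (count(w_{n-1}) + V); the two-sentence corpus and the <s>/</s> boundary markers are made up for illustration:

from collections import Counter
import math

corpus = [["the", "man", "ate", "a", "sandwich"],
          ["the", "man", "likes", "the", "sandwich"]]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    toks = ["<s>"] + sent + ["</s>"]
    unigrams.update(toks[:-1])                  # contexts (everything but </s>)
    bigrams.update(zip(toks[:-1], toks[1:]))
V = len(set(w for s in corpus for w in s)) + 2  # vocabulary size incl. markers

def logprob(sentence):
    toks = ["<s>"] + sentence + ["</s>"]
    lp = 0.0
    for prev, cur in zip(toks[:-1], toks[1:]):
        lp += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + V))
    return lp

print(logprob(["the", "man", "ate", "the", "sandwich"]))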
Slight change to P1 "tagger" script
The old rules were that if you ran the online tagger at time X, you would have to wait until time X+24 hours to run it again.
To be nice to people who have work schedules, this has been "softened." The new policy is that every 24 hours you earn one "token." You spend tokens by running the tagger. However, you can only keep a total of three tokens at any one time.
The effect of this policy is that you could continue doing one handin every 24 hours, or you could do two every 48 hours, or three every 72 hours. But not more than that. So on average it's the same, but you get a bit more flexibility.
The script is a bit more complicated now (the logic is harder), so please let me know if you run into any bugs.
Also, if you run the handin and just enter your ID, but not a file to upload, it will tell you how many tokens you have left and, if it's less than three, the date/times that those tokens were used so you can figure out how long until you get a new one (you get a new one when the oldest one "dies" -- i.e., 72 hours pass since the oldest one).
Hopefully this is clear :).
13 September 2010
HW03 is posted
(You'll need to look at it for the HMM part of P1.)
10 September 2010
Project matchmaking
If you need, or want, one or more partners for the first project (you're free to switch up between projects if you'd like), please email me now (before you forget!). We'll have a 5-10 minute matchmaking session at the end of class on Tuesday. When you email me, please give me your name and your department (CS, EE, Linguistics, etc...) and whether you have a team at all (eg., a team of two who wants one more) or are currently alone (i.e., need one or two people to make a team).
09 September 2010
Lecture 4: Probability and statistics
You should be sure that you are comfortable with the basic rules of probability:
- Conditional probabilities: what does p(a|b) mean?
- Marginalization: p(a) = sum_b p(a,b)
- Chain rule: p(a,b) = p(a) p(b|a) = p(b) p(a|b)
- Iterated chain rule: p(a,b,c,d,...,z) = p(a) p(b|a) p(c|a,b) ... p(z|a,b,c,...,y)
- Bayes' rule: p(a|b) = p(a) p(b|a) / p(b)
- Generalized Bayes' rule: how to do BR on p(a|b,c) or p(a,b|c)
- Expectations: E_p[f(x)] = sum_x p(x) f(x)
You should also be comfortable with some basic statistics:
- Maximum likelihood estimates
- = relative frequencies
- = counting and normalizing
- The difference between
- Structural zeros
- Statistical zeros
- The importance of smoothing
07 September 2010
Project 1 is posted, due 30 Sept
Use this space to ask questions / discuss the project.
[Update 14 Sep] Several people have asked what tools/data are allowed for the rule-based tagger and for the "do as well as you can" part. Here is an attempt to clarify. My goal isn't to make life hard on you, but to not give unfair advantages to some subsets of the class.
- For the rule-based tagger, any rules are fair game, but it would be better if you made your tagger WITHOUT looking at the labeled training data. That is, feel free to look at the sentences, but pretend that you don't have access to the tags. After all, that's how it would be in the real world. Yes, this means that you'll have to use some linguistics. That's not a terrible thing :).
- For the last part of the project, where you have to build the best tagger possible, people have asked about using external tools. For this project, I would ask that anything you use beyond normal library / unix tools stuff, you implement yourself. Yes, you could probably do better using well-engineered tools, but that probably won't help you learn as much as you will if you do something slightly less good, but do it on your own.
- (Note that this final rule will not be the case for the final class project: there, anything is fair game!)
06 September 2010
How to get a little extra credit...
Students always ask me about extra credit opportunities at the end of the semester. This is your opportunity to be pro-active. The CL lab here has a regular colloquium series, which has a shared calendar here. Roughly every other week we have an external speaker coming through. The first few are Roger Levy (psycho/cognitive computational linguistics from UCSD), William Cohen (machine learning and information extraction, CMU) and Eugene Charniak (parsing and everything else NLP, Brown). (Eugene is actually my great-grand-advisor!). I highly encourage all of these talks, and they should all be quite different. We'll have more folks posted on the schedule as the semester moves on. At any rate, if you go to the talk, and write a "response" that's approximately one page long, I'll give you some extra credit. You'll get (up to) 3 points (one homework) for the first one you do, (up to) 2 points for the second and third, and 1 point for the fourth and fifth. That's 9 points of possible extra credit. You won't get further credit for more than five, but of course I encourage you to keep going to the seminars :).