28 October 2010
Lecture 16: Computational Semantics
A recent development in NLP-land is the interest in "semantic role labeling." I tend to think of this as "Parsing++", but it's basically a halfway point between "syntax" (which we've already passed a bit) and "semantics" (which we don't know how to do).
The idea in SRL is to find predicates (mostly verbs) and identify the arguments of these verbs (eg., Agents, Patients, Experiencers, Themes, etc.). For example:
- John ate a sandwich
- For the predicate ate, John is the agent and sandwich is the patient
- A sandwich was eaten by John
- Same as above.
- Bill saw John eat a sandwich
- For the predicate eat, it's the same as above
- For the predicate saw, the agent is Bill
The general approach to doing semantic role labeling is a pipeline of "first parse, second find verbs, third find the arguments to those verbs." (There's some argument as to whether parsing is strictly necessary, but it's clearly at least helpful.)
The biggest challenge here is finding the arguments. Basically we'll go over the parse tree, find potential arguments (basically internal nodes in the tree), and then try to use statistical information to figure out which nodes correspond to which arguments (or none).
- John saw a man with a telescope.
First, we'll identify all potential argument candidates:
- John
- a
- man
- a man
- with
- a
- telescope
- a telescope
- with a telescope
- John
- man
- a man
- telescope
- a telescope
- with a telescope
A more sane approach is to use instead (or in addition) tree paths. This idea has come up in a lot of places, and often works pretty well. If we look at the parse of this sentence:
- (S (NP John) (VP (V saw) (NP (DT a) (NN man)) (PP (P with) (NP (DT a) (NN telescope)))))
There are lots of ways to represent the path between the predicate (saw) and a candidate argument (say, the NN over telescope), but a baseline might be to say that the path looks like:
- up:VP down:3:PP down:2:NP down:2:NN
However we encode the path, we now need to estimate something like P(Means | path) -- the probability that the candidate fills a particular role (here, the Means/Instrument role) given its path to the predicate. A naive approach would be to Bayes' rule this to get P(Means) * P(path | Means) and then we could model the path with a Markov model. Or, if you've taken machine learning, you can use the path as a collection of features in some learning algorithm (or if you're being fancy, turn it into a kernel!). We'll talk about this more when we talk about learning.
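To make the counting concrete, here is a tiny sketch (in Python) of the Bayes'-rule version. It treats the whole path string as a single categorical value rather than a Markov chain over its steps, and all of the training pairs below are invented for illustration.

```python
from collections import Counter, defaultdict

# invented (path, role) training pairs; path strings use the up:/down: encoding above
train = [
    ("up:VP down:3:PP down:2:NP down:2:NN", "Means"),
    ("up:VP down:3:PP down:2:NP down:2:NN", "Means"),
    ("up:S down:1:NP down:1:NNP", "Agent"),
    ("up:VP down:2:NP down:2:NN", "Patient"),
]

role_counts = Counter(role for _, role in train)
path_counts = defaultdict(Counter)            # role -> counts of paths seen with that role
for path, role in train:
    path_counts[role][path] += 1

def p_role_given_path(path, alpha=0.1):
    """P(role | path) via Bayes' rule: P(role) * P(path | role), renormalized.
    Add-alpha smoothing keeps unseen (role, path) pairs from getting zero mass."""
    paths_seen = {p for counts in path_counts.values() for p in counts} | {path}
    scores = {}
    for role, n in role_counts.items():
        prior = n / len(train)
        likelihood = (path_counts[role][path] + alpha) / (n + alpha * len(paths_seen))
        scores[role] = prior * likelihood
    z = sum(scores.values())
    return {role: s / z for role, s in scores.items()}

print(p_role_given_path("up:VP down:3:PP down:2:NP down:2:NN"))
```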
As a final note, Matthew Gerber and Joyce Chai got best paper at ACL 2010 for their paper Beyond NomBank: A Study of Implicit Arguments for Nominal Predicates, where they look at similar things for nominal predicates rather than verbal predicates. In this case, I agree: this is a great paper.
26 October 2010
Lecture 15: Computational Lexical Semantics
Today's topic is word sense disambiguation. This is a very "heated" topic and we'll discuss in class (not on the public blog!) why this might be!
Here are some examples of the verb "drive"... which are different senses?
- Can you drive this four-wheel truck?
- We drove to the university every morning.
- She drove me to school every day.
- He drives me mad.
- She is driven by her passion.
- They drove back the invaders.
- She finally drove him to change jobs.
- He drove a nail into the wall.
- The player drove the ball far out into the field.
- She is driving away at her doctoral thesis.
- What are you driving at?
- My new truck drives well.
- He drives for the taxi company in DC.
- The car drove around the corner.
- The farmers drove the cows into the barn.
- We drive the beltway to work.
- He drove a golf ball across the street.
- I hit the ball with a bat and drove it into left field.
- The workers drove a tunnel through the hill.
- These engines are driven by steam.
- The hunter drove the forest looking for deer.
- After finding deer, he drove it into open ground.
Two of the most important concepts in WSD are:
- One sense per discourse (which may or may not be true...)
- One sense per collocation (which also may or may not be true...)
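As a tiny illustration of the second heuristic, here's a sketch of a "one sense per collocation" baseline. Everything here is invented for illustration: it assumes a small sense-tagged corpus, uses a single collocation feature (the word immediately to the right of the target), and backs off to the overall majority sense when the collocation is unseen.

```python
from collections import Counter, defaultdict

tagged = [  # made-up (collocation, sense) pairs for the target verb "drive"
    ("truck", "operate-vehicle"), ("to", "travel-by-vehicle"),
    ("me", "cause-state"), ("a", "operate-vehicle"),
    ("the", "propel"), ("nail", "propel"),
]

by_colloc = defaultdict(Counter)   # collocation -> sense counts
overall = Counter()                # overall sense counts (for backoff)
for colloc, sense in tagged:
    by_colloc[colloc][sense] += 1
    overall[sense] += 1

def predict(colloc):
    """Majority sense for this collocation; back off to the global majority."""
    counts = by_colloc.get(colloc)
    return (counts or overall).most_common(1)[0][0]

print(predict("nail"))    # -> "propel"
print(predict("horse"))   # -> the overall majority sense (unseen collocation)
```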
21 October 2010
HW07 is posted
(I'm trying to get better about posting these early, like in the beginning of the semester when I wasn't as behind schedule on everything!)
Lecture 14: Linguistic Challenges
Final project details:
- Groups, just like normal projects.
- Scope: something that with an extra month of work could turn into a paper at a major conference (ACL, EMNLP, NAACL, etc....)
- Four styles of projects:
- Reimplement some cool, interesting system from the past 2 or 3 years in an area of NLP/CL that you think is interesting. (+1 month would be to figure out what it does wrong, propose some new idea, and then get a paper out of it.)
- Take some problem you care about from your own research and look at how NLP can help it. (You're a better judge than I am about turning this into a paper, but at least you care about the topic!)
- Pick some interesting and under-studied linguistic phenomenon (like the things we're talking about today) and do a study on it. Things not in English are totally allowed, but it would help if you knew the language you were working with. If you want to do data collection using mechanical turk, I have a budget of $200 for this: we can see how many people want to do this and then try to divide fairly (note: fairly is not the same as evenly).
- Something else you can convince me is interesting...
- You will be graded entirely on your writeup, which should look more or less like a 4-6 page ACL workshop paper (you're encouraged to use the ACL style files).
- Metaphor. This can be narrowly construed ("John devoured his steak like a lion") or broadly construed ("John drove Mary crazy"). The usual theory is that there are concrete, physical verbs/nouns; and then when we want to talk about something abstract, we ground it in one of these physical verbs. So a metaphor will usually consist of one abstract thing and one physical thing. There are interesting things to be done both on interpreting metaphors as well as generating metaphors. I have a small amount of data.
- Quantifier degree. I can say "Most men are jerks" and I can say "Most jerks drink beer" but does this mean that most men drink beer (or rather, that I think that most men drink beer?). What about other quantifiers? How consistent is this across different statements ("Most men are jerks" versus "Most animals are invertebrates" -- is the meaning of "most" the same)? I don't have any data, but you could collect it.
- Metonymy. This is when I say "The White House" and really mean "Obama" (or at least I mean "someone from the White House"). This combines with ideas of selectional preferences, where we might be able to figure out that the agent (or subject) of "said" should be animate, and might know that the White House is not animate, and so it must be metonymic. Can we identify metonymic usages? Can we resolve them? I have a reasonable amount of data.
- Quantifier scope. This is a classic semantics problem. "All the women on the island have a husband." Does this mean that each woman has her own husband, or that there's one man who is married to all of the women simultaneously? Similarly, "Each man entered the bar with a dog" -- how many dogs are there? I have no data.
- Negation scope. This is perhaps particularly interesting in the case of sentiment analysis. "I did not like the plot but liked the acting." -- Does "not" scope over just "like the plot" or over the full "like the plot but liked the acting"? There is also an interaction with quantifiers ("No man should marry a leprechaun"), but this is both rare and hard. I have no data, but I think we could find some.
- Language change. We have English documents going back 500 years, but English back then looks a lot different than English now. Moreover, our part of speech taggers and parsers that work well now won't work well on Shakespeare (I would guess :P). Can we track changes back in time, and use this information to get today's models to do a good job way back when? I have a lot of data.
- Traces and Ellipsis. "John went to the store but Jerry didn't ______" -- what was it that Jerry didn't do? This is a form of a trace. The Penn Treebank comes with some but not all traces spelled out. Try to figure out where they should be and what they refer to.
- Exaggeration. No one has ever looked at this problem. Okay that's not true, but only because it's an overstatement. Can we automatically identify overstatements in text? Can we tone them down or at least identify the degree of overstatement? I have no data, but it shouldn't be too hard to find some.
- Emotion from text. Can we identify characters' emotions by looking at what they say, for instance in novels? I have data of quoted speech, paired with emotional strength -- can we predict emotional strength given the quoted speech and perhaps the context in which it was spoken? I have lots of data.
- Language and X. Often we can get language paired with objects in another medium, for instance images and captions. Can we identify correspondences? Or perhaps generate captions from images? We have (or can get) similar data for comments and code, or formal specifications (variant of first order logic) and descriptions. Or database queries and text. I have some data, depending on what you want "X" to be.
- Shannon game (word/character prediction) with domain knowledge.... The Shannon game is a way of measuring entropy of a language. Perhaps make an online game (like hangman?) where we can get people to play this game, and then can also tell them things like "this is a news story" or "this is a children's book" or "this is a medical text" or something and see how it affects their ability to guess the next word.
- "Translating" from tweets to language that your parents could understand. We have lots of SMS/twitter data, and lots of normal text. A lot of this is abbreviation expansion, but some is not. Can we figure out which terms in tweets should be expanded, and then try to find their expansions in normal text? There are at least two types ("something" -> "sth" or "laugh out loud" -> "lol"), but maybe more. We can get lots of data easily.
- Identifying character relationships in literature. We can track character interactions in literature, but the relationship between these characters is not known. And perhaps it changes over time. Can we figure out how two characters in a novel feel about each other on the basis of what they do/say to each other? I have some data and we can get a lot more easily.
- Many other languages have phenomena that English does not. For example, subject dropping happens in lots of languages (eg., Japanese) when it's clear from context. Also stuff in the Semantic Analysis chapter of J+M is pretty much up for grabs.
19 October 2010
Lecture 13: Interpretation as Abduction
"Interpretation as Abduction" is one of the more influential computational semantics papers from its era (and really, ever).
The basic claim is that through (weighted) abduction, one can explain a whole host of complex linguistic phenomena:
- Reference resolution ("John_1 saw himself_1 in the mirror" -- where "John" and "himself" co-refer)
- Interpretation of compound nominals ("computer science" versus "cs department")
- Syntactic ambiguity (eg., PP attachment)
- Metonymy ("The White House announced that..." really means that someone in the White House announced something)
- Etc....
Abduction is the type of inference we try to do in science. We observe something, and we try to posit what caused this to happen. For instance, we may observe that apples fall from trees. We can posit this is due to gravity. Or we can posit that this is due to little elves that hang out in apples and little pots of gold in the earth and the elves are trying to get to the pots of gold. One of these is a more plausible explanation than the other, since it requires us to invent fewer things. (Though as far as I know, no one really knows how gravity works... so maybe it's a bad example... maybe it is elves/gold.) This is often referred to as Occam's Razor: of all explanations, choose the "simplest."
The key idea to use abduction to do sentence interpretation is:
- Prove the logical form of the sentence
- together with the constraints that predicates impose on their arguments (eg., the subject of "run" must be animate)
- allowing for coercions (eg., we might coerce someone who works at the White House into the White House)
- Merging redundancies where possible
- Making assumptions where necessary
In weighted abduction, we assume that there is a cost for coercions and making assumptions, and that we want to find the set of things we can assume that have the least possible cost.
The idea is that our database might contain entries like:
- If P and Q then R
- If P / w1 and Q / w2 then R (where the weights say how much it costs to assume each antecedent: if establishing R is worth some cost c, then P can be assumed for w1*c and Q for w2*c)
If we have redundancy... eg., something like "blah blah q(x) and q(y) blah blah" and at some point we derive x=y, then we can merge this to just "blah blah q(x) blah blah" and the cost of q(x) is the minimum of the costs of q(x) and q(y).
For example, we might have:
- (forall x) car(x) / 0.8 AND notop(x) / 0.4 IMPLIES convertible(x)
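To make the weights concrete, here's a rough sketch of the cost computation in the spirit of the paper (this is the core intuition, not the paper's actual algorithm): to establish a goal whose assumption cost is c, you either assume it outright and pay c, or back-chain through a rule and pay w_i * c for each antecedent, recursively.

```python
# rule database: consequent -> list of antecedent lists, each antecedent with its weight
RULES = {
    "convertible": [[("car", 0.8), ("notop", 0.4)]],   # the rule from above
}

def cost(goal, c, depth=3):
    """Cheapest cost of establishing `goal` whose direct assumption cost is c."""
    best = c                                           # option 1: just assume the goal
    if depth > 0:                                      # option 2: back-chain through a rule
        for antecedents in RULES.get(goal, []):
            total = sum(cost(p, w * c, depth - 1) for p, w in antecedents)
            best = min(best, total)
    return best

# With assumption cost $10 for convertible(x): assuming car(x) costs $8 and
# notop(x) costs $4, i.e. $12 total, so here it is cheaper ($10) to assume the
# goal directly -- unless car(x) is already proven elsewhere (cost ~0), in
# which case back-chaining wins. That is the point of the weights.
print(cost("convertible", 10.0))
```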
One kind of cool outcome of this is that you can make "reverse" inferences fairly easily. For instance, maybe I want to be able to assume that sometimes when people talk about animals, they're actually talking about cats. But they're being vague. I can just add a rule:
- (forall x) animal(x) AND etc(x) IMPLIES cat(x)
It turns out (see section 6 in the paper) that you can actually handle all of syntax, semantics and pragmatics in this single framework!
15 October 2010
P2 deadline and ML midterm
I've been informed that there's a midterm in Lise's ML class on the 19th, which coincides with the P2 deadline.
To try to help you not lose too much sleep, I've pushed the deadline for P2 back to Friday night at 10p. However, to be fair to those who don't have ML, or who have planned their schedule around the deadline on the 19th, I'll give you 15% extra credit if you hand it in by the original deadline (19th). If you do hand it in early, please make a note in your writeup that you've done so. Hopefully 15% is enough to balance out anything that you would lose by not being able to compete up to the last minute with your classmates on the competition parts.
Also, a note on pruning:
This is now in the .pdf writeup, but:
- Be sure that you prune all cells, including the right-most ones. Your labeled F measure for unpruned should be about 78%.
- Don't forget to debinarize before you evaluate!
- If you prune every cell, including the root, with K=2, your labeled F measure should be about 23%. You should have about 83 unparsable sentences.
- If you prune every cell except the root (which is a totally sensible thing to do -- thanks to several of you who have pointed this out), your labeled F should be about 59%, and you should have about 38 unparsable sentences. (As you can see, it's a really good idea not to prune the root, especially since it doesn't make things any slower -- in fact, it might actually make things faster!)
- You can implement it either of the above two ways -- just be sure to document which you're doing.
- Be sure to prune before applying unary rules. Thus, even with K=2, you will often end up with more than 2 items in a chart cell. However, only 2 of these will have binary derivations; all the rest should have unary derivations.
14 October 2010
Lecture 12: First Order Logic
So far we've talked a lot about syntax. Of course, syntax isn't the end-all be-all of language processing: we want to "understand" language. Often this means mapping natural language sentences to some form of formal language sentences. This means we have to decide on a formal language. A common choice is first order logic (FOL).
Expressions in FOL take one of the following forms:
- Atomic symbols like "Hal"
- Relation symbols like "Eats(.,.)", for instance in "Eats(Hal, Sandwich)". Relations can have arbitrary arity (i.e., take arbitrarily many arguments)
- Variables like "a" (see later for quantification)
- Combinations via and, or, not, implies, iff, etc. Eg., Eats(Hal, Sandwich) implies not OnThePlate(Sandwich)
- Existential and universal quantifiers. For example, "(exists a) Eats(Hal, a)" (which roughly means "Hal eats something.")
- Equality symbol. For example, "(exists a) Eats(Hal, a) and (a = Sandwich)"
FOL is a nice language to map into because we know a lot about how to reason in FOL. Reasoning is, of course, hard (take an AI class), but sort of do-able.
The general way in which FOL systems work is:
- We have a bunch of background knowledge expressed as "axioms" -- basically a database of FOL sentences
- We get a text, from which we derive new FOL sentences
- Someone issues a query, which we will try to "prove" given our collection of sentences. A proof of a query is an answer to the question
- Someone might query "Who shot Lincoln?"
- We map this to something like "(exists a) Shot(a, Lincoln)"
- We try to prove this statement given our database, which might include some statement like "Shot(Booth, Lincoln)". Thus, "a=Booth" is a "proof" of our query, and gives us the answer
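To make the "a proof of the query is an answer" idea concrete, here's a tiny sketch of matching an existential query against a database of ground facts. It only handles one-step proofs (direct lookup with variable binding); all the facts and names are illustrative.

```python
FACTS = [
    ("Shot", "Booth", "Lincoln"),
    ("Shot", "Oswald", "Kennedy"),
    ("PresidentOf", "Lincoln", "USA"),
]

def answers(query):
    """Yield variable bindings (symbols starting with '?') that make the query
    match some fact; each successful match is a one-step 'proof'."""
    pred, *args = query
    for fact in FACTS:
        if fact[0] != pred or len(fact) - 1 != len(args):
            continue
        bindings = {}
        for q, f in zip(args, fact[1:]):
            if q.startswith("?"):                 # variable: bind it (consistently)
                if bindings.setdefault(q, f) != f:
                    break
            elif q != f:                          # constant: must match exactly
                break
        else:
            yield bindings

# "Who shot Lincoln?"  ~  (exists a) Shot(a, Lincoln)
print(list(answers(("Shot", "?a", "Lincoln"))))   # -> [{'?a': 'Booth'}]
```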
The common assumption (which is not always true) is that language is compositional. Namely, the meaning of a sentence is derived from the meanings of its parts. For examples:
- Hal ate a sandwich with mayo.
- Parses to (Hal (ate (a sandwich (with mayo))))
- Bottom up, we can look at (mayo) and derive "(exists a) Mayo(a)"
- Then (sandwich (with mayo)) and derive "(exists b) Sandwich(b) and Contains(b, ___)" where ___ comes from the interpretation of (with mayo), yielding "(exists a,b) Mayo(a) and Sandwich(b) and Contains(b,a)"
- Then (a sandwich ...) takes what we had before and makes "b" unique, usually written as "(exists a, exists!b) ..."
- Then we take Hal and get "Hal"
- And finally, "x ate y" will give us something like "(exists e) Eat(e, x, y) and InThePast(e)" which when we plug in with x and y gives the final interpretation:
(exists a, exists !b, exists e) Mayo(a) and Sandwich(b) and Contains(b,a) and Eat(e, Hal, b) and InThePast(e)
In order to make this happen, we usually need a semantic grammar that tells us how to get from syntax to semantics. The grammar will consist of rules that are akin to the type of rules we applied to do the semantic analysis of the above example sentence.
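Here's a very small sketch of what such a semantic grammar might look like in code, for just the rules needed by the sentence above. Meanings are (variable, list-of-conjuncts) pairs, all variables are implicitly existentially quantified, and the uniqueness of "a" is ignored; the function names and argument conventions are mine, not a standard.

```python
import itertools

_fresh = itertools.count()
def new_var(prefix="x"):
    return f"{prefix}{next(_fresh)}"

def sem_noun(pred):                      # "mayo" -> (a, [Mayo(a)])
    v = new_var()
    return v, [f"{pred}({v})"]

def sem_pp(prep_pred, np_sem):           # "with mayo" -> a modifier awaiting its head
    v, conjuncts = np_sem
    return lambda head: conjuncts + [f"{prep_pred}({head},{v})"]

def sem_np_pp(noun_sem, pp_sem):         # "sandwich with mayo"
    v, conjuncts = noun_sem
    return v, conjuncts + pp_sem(v)

def sem_s(subj, verb_pred, obj_sem):     # "Hal ate <object>", with an event variable
    e = new_var("e")
    v, conjuncts = obj_sem
    return conjuncts + [f"{verb_pred}({e},{subj},{v})", f"InThePast({e})"]

mayo = sem_noun("Mayo")
sandwich_with_mayo = sem_np_pp(sem_noun("Sandwich"), sem_pp("Contains", mayo))
print(" and ".join(sem_s("Hal", "Eat", sandwich_with_mayo)))
# e.g. Sandwich(x1) and Mayo(x0) and Contains(x1,x0) and Eat(e2,Hal,x1) and InThePast(e2)
```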
Key challenges with FOL:
- What are the right sets of symbols and relations?
- Where does the necessary background knowledge come from?
- You can try to get humans to do it for you: see the Cyc project
- Or you can try to extract it automatically: see ReadTheWeb or ODIE or TextRunner, for example
- How to get the semantic grammar
- You can hand code it (see verbnet and the unified verb index for a verb-centric view).
- Or you can try to learn it from examples (see Ray Mooney's talk yesterday)
- Sometimes need more than FOL
- How to handle "Many students love NLP"?
- Or "John did something yesterday" -- for this we may need second order logic, where you can quantify over relations -- "(exists R, e) R(e, John) and InThePast(e)"
12 October 2010
Project 2 warning on pruning results
I had a bug in my pruning, and wasn't pruning the right-most cells. (It was an off-by-one error.) Since English is mostly right-branching, the cells that touch the right edge (i.e., end of sentence) are incredibly important, so not pruning them makes life a lot better than otherwise. I'm working on fixing it right now and checking to make sure there aren't other bugs. I'll post a comment to this post when it's done and I've updated the results in the .pdf, but your pruning results should be much worse than what is in the writeup (i.e., with K=2, you should get no parse on 80-something sentences, and your f-score should be in the 10s or 20s).
Project 2 stuff...
I fixed the P2 pdf (just the one linked on the web page, not the one in the .tar.gz file) so that it includes unary productions in the examples. This is only relevant for part 2 -- nothing else changed.
Also, I'm not sure if I included them previously, but here are the links for the online test scripts:
HW05 is posted
(Sorry for the delay -- there's just one question :P)
Lecture 11: Incorporating Context
Many languages express some form of agreement. In English it is mostly Subject/Verb agreement (i.e., "He eats a sandwich" versus "They eat a sandwich"). In other languages agreement can be more widespread: Adjective/Noun, Determiner/Noun, Verb/Object, Noun/Relative Clause, etc.
Suppose we want to capture this with a (P)CFG.... the only real way to do this is to expand our set of non-terminals. I.e., we might start with a grammar like:
- S -> NP VP
- NP -> Pro
- NP -> Det Noun
- VP -> Verb NP
- Pro -> he
- Pro -> they
- Verb -> eat
- Verb -> eats
- Det -> a
Which will generate our good sentences, but also bad sentences like "He eat a sandwich" and "They eats a sandwich". We can fix this by adding labels to our non-terminals:
- S -> NP+sing VP+sing
- S -> NP+plur VP+plur
- NP+sing -> Pro+sing
- NP+plur -> Pro+plur
- NP+sing -> Det Noun+sing
- NP+plur -> Det Noun+plur
- VP+sing -> Verb+sing NP
- VP+plur -> Verb+plur NP
- Pro+sing -> he
- Pro+plur -> they
- Verb+sing -> eats
- Verb+plur -> eat
- Det -> a
Now life is great: we will no longer produce our bogus sentences. (Note that we would also have to account for "a" being singular, etc...)
Of course there are other phenomena we might want to capture, for instance pronouns sitting in object position have to change case: "A sandwich eats him" not "A sandwich eats he". If we ignore the singular/plural distinction, we can solve this separately.
- S -> NP+nom VP
- VP -> Verb NP+acc
- NP+nom -> Pro+nom
- NP+acc -> Pro+acc
- NP -> Det Noun
- Pro+nom -> he
- Pro+acc -> him
- Pro+nom -> they
- Pro+acc -> them
And so on...
We've now solved the nominative/accusative issue, so we can combine the sing/plur grammar with the nom/acc grammar. The problem is exponential blowup: handling one roughly doubles the size of the grammar, and handling the other doubles it again. In general, for every agreement-like phenomenon we want to handle, we'll double the size of our grammar again. This is bad (a) because it makes our grammar huge and (b) because we'll need tons and tons and tons of data to get reliable estimates of these probabilities from a treebank!
(Other English phenomena you might want to capture are selectional preferences -- which are really more semantics issues than syntactic issues; or determiner/noun agreement.)
One solution to this problem is to augment our grammar with features. That is, basically do what linguists do and write "+sing" on the pre-terminal, but not split the grammar into bits. Instead, we need to come up with ways of propagating these features up the tree, and checking for disagreement. This process is known as "unification".
The idea of unification is that everything has a bunch of features with values. Values can be atomic (number=sing) or disjunctions (number={sing,plur}) or variables (number=A). Two values unify basically if they have a non-empty intersection (it's actually more complicated than this, but this is the basic idea). So sing and sing unify and produce sing. Similarly sing and {sing,plur} unify and produce sing. A and sing unify and produce sing with a note off on the side of the page that A=sing. sing and plur don't unify. Our grammar will now encode what unifications happen at each step (eg., when producing a S, you want to unify the number feature of the NP and VP; or when producing a VP, you want to unify the object NP with "case=accusative" and likewise for subject NPs).
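Here's a minimal sketch of that unification operation on single feature values. Values are atomic strings, Python sets (for disjunctions), or variables (strings starting with "?"); this representation is mine, and real unification grammars unify whole feature structures, not just single values.

```python
def unify(a, b, bindings=None):
    """Unify two feature values; return (result, bindings) or (None, bindings) on failure."""
    bindings = dict(bindings or {})
    # resolve variables that are already bound
    if isinstance(a, str) and a.startswith("?") and a in bindings:
        a = bindings[a]
    if isinstance(b, str) and b.startswith("?") and b in bindings:
        b = bindings[b]
    # an unbound variable unifies with anything and gets bound to it
    if isinstance(a, str) and a.startswith("?"):
        bindings[a] = b
        return b, bindings
    if isinstance(b, str) and b.startswith("?"):
        bindings[b] = a
        return a, bindings
    # treat atomic values as singleton sets, then intersect
    sa = a if isinstance(a, (set, frozenset)) else {a}
    sb = b if isinstance(b, (set, frozenset)) else {b}
    common = sa & sb
    if not common:
        return None, bindings                 # unification fails
    result = next(iter(common)) if len(common) == 1 else common
    return result, bindings

print(unify("sing", "sing"))                  # ('sing', {})
print(unify("sing", {"sing", "plur"}))        # ('sing', {})
print(unify("?A", "sing"))                    # ('sing', {'?A': 'sing'})
print(unify("sing", "plur"))                  # (None, {})
```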
Our grammar is now a set of context free productions and unification rules. One nice thing is that chart parsing basically still works, but our parse cells become more complicated because they have to remember the values of features. But at least our grammars don't blow up!
07 October 2010
Lecture 10: Statistical Parsing
Context free grammars are great, but as with any X, if X is great, then probabilistic X is greater still!
A probabilistic context free grammar (PCFG) is a CFG with "weights" attached to each of the rules. These weights must be non-negative. For any left-hand-side, the sum of weights must be one. The following is a small PCFG:
- S 0.5 -> NP VP
- S 0.5 -> VP
- NP 0.7 -> Det Noun
- NP 0.2 -> Proper
- NP 0.1 -> NP PP
- VP 0.6 -> Verb NP
- VP 0.3 -> Verb
- VP 0.1 -> VP PP
Note that for any given LHS (eg., "NP") the sum of weights (probabilities) is one.
The probability of a tree is just the product of probabilities of rules used in that tree.
We can generate random sentences with a PCFG by starting at "S" and then picking a rule at random (weighted by the probabilities) and then recursively generating the corresponding children.
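Here's a quick sketch of that sampling procedure. The grammar is the one above, extended with a PP rule and made-up terminal rules so the recursion can bottom out in actual words; the extra rules and words are purely illustrative.

```python
import random

GRAMMAR = {  # LHS -> list of (probability, RHS)
    "S":      [(0.5, ["NP", "VP"]), (0.5, ["VP"])],
    "NP":     [(0.7, ["Det", "Noun"]), (0.2, ["Proper"]), (0.1, ["NP", "PP"])],
    "VP":     [(0.6, ["Verb", "NP"]), (0.3, ["Verb"]), (0.1, ["VP", "PP"])],
    "PP":     [(1.0, ["Prep", "NP"])],
    "Det":    [(1.0, ["a"])],
    "Noun":   [(0.5, ["sandwich"]), (0.5, ["telescope"])],
    "Proper": [(1.0, ["Hal"])],
    "Verb":   [(1.0, ["ate"])],
    "Prep":   [(1.0, ["with"])],
}

def sample(symbol="S"):
    if symbol not in GRAMMAR:              # terminal: emit the word
        return [symbol]
    probs, rhss = zip(*GRAMMAR[symbol])
    rhs = random.choices(rhss, weights=probs)[0]   # pick a rule, weighted by probability
    return [w for child in rhs for w in sample(child)]

print(" ".join(sample()))   # e.g. "Hal ate a sandwich with a telescope"
```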
The probability of a sentence under this grammar is obtained by summing (marginalizing) out all the trees that can generate that sentence.
To parse with a PCFG, we just parse with the corresponding CFG, but then weigh items in the chart by their probabilities. If we find a higher-probability derivation of the same item in the chart, we throw out the old one and replace it with the new one. In CKY this is easy.
A treebank is a collection of human-labeled trees. Given a treebank, we can extract the corresponding PCFG by counting productions and normalizing. Most trees are not binary so you need to binarize your rules.
Note on smoothing: you have to smooth if you want your grammar to be able to parse out-of-sample trees. Smoothing can be done just like in HMMs, but you want to be careful not to compute stuff that is irrelevant.
Note on efficiency: The grammars we produce are huge (tens or hundreds of thousands of rules). As a result, paying O(N^3|G|) is often prohibitive, even for N around 30 or 40. A simple solution is pruning: after you process each cell in the chart, throw out its low-probability items. You can either do count-based pruning (only keep the top 5 in each cell), ratio-based pruning (only keep the ones within a factor of 10 probability of the best one), or probability-based pruning (only keep the ones with probability at least 10e-6). Usually the last sucks, but the first two work reasonably well.
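Here's a small sketch of those three pruning strategies applied to a single chart cell, represented simply as a dict from nonterminal to probability; the numbers in the example cell are made up and the exact thresholds are illustrative.

```python
def prune_count(cell, k=5):
    """Keep only the k most probable items in the cell."""
    return dict(sorted(cell.items(), key=lambda kv: kv[1], reverse=True)[:k])

def prune_ratio(cell, ratio=10.0):
    """Keep items within a factor of `ratio` of the best item."""
    best = max(cell.values())
    return {nt: p for nt, p in cell.items() if p >= best / ratio}

def prune_threshold(cell, floor=1e-6):
    """Keep items whose probability is at least `floor`."""
    return {nt: p for nt, p in cell.items() if p >= floor}

cell = {"NP": 0.02, "S": 0.0004, "VP": 0.015, "PP": 1e-7, "Nom": 0.001}
print(prune_count(cell, k=2))      # {'NP': 0.02, 'VP': 0.015}
print(prune_ratio(cell))           # keeps NP and VP (within 10x of 0.02)
print(prune_threshold(cell))       # drops only PP
```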
04 October 2010
No class Tue 05 Oct 2010
Due to Hal being sick again. Sigh.
So here's a puzzle to entertain you while you're not learning all about treebanking and PCFGs....
Suppose Hal presents symptoms that make it 90% certain that he has some illness X. And suppose that the rate of transmission of X is 5%. Suppose that the average rate of incidence of X in a general population is 0.01%. What is the probability that you would get sick if we had class tomorrow (and you showed up)? There are 45 students in class: what's the probability that at least one would get sick? Finally, suppose that your utility function is defined as U(learn about pcfgs and don't get sick) = +10, U(don't learn and don't get sick) = 0, U(learn and get sick) = -20, U(don't learn and get sick) = -100. (How annoying would that be.) What is the expected utility if I do teach class? What is the expected utility if I cancel class? Am I doing the right thing?
Feel free to post solutions :).
01 October 2010
Project 2 is posted, due Oct 19
It's all about parsing. The writeup is really long, but don't be too intimidated by it. It's long because I tried to give a lot of debugging help. The project is all about grammars and parsing, so go off and do syntax!