22 December 2010

Summer internships at JHU/COE

... in case anyone is reading this, I just got the following email. I know this program and it's good.

Please share with graduate and undergraduate students looking for
summer internships.

The Johns Hopkins University Human Language Technology Center of
Excellence (COE) is seeking candidates for our summer internship
program as part of SCALE, our yearly summer workshop (Summer Camp for
Advanced Language Exploration). Interns will work on research in
speech, text and graph processing as part of a larger workshop team.

Internship positions are fully funded, including travel, living
expenses and stipend.

This summer's workshop topic is "Combining Speech, Text and Graphs,"
which will draw from a number of research areas:

*** Natural Language Processing ***
- machine translation and NLP on informal texts
- opinion extraction from informal texts
- topic models
- information extraction
- domain adaptation and transfer learning
- multi-lingual learning

*** Speech ***
- topic detection
- keyword spotting
- resource constrained speech processing
- robust speech processing

*** Graphs ***
- finding communities in social networks
- anomaly detection
- learning on large scale graphs
- graph-based topic modeling

Candidates should be currently enrolled in an undergraduate or
graduate degree program. Applications submitted by Friday Jan 14, 2011
will receive full consideration.

For more information: http://web.jhu.edu/HLTCOE/scaleinterns2011.html

Applicants will be required to obtain a US security clearance, which
requires US citizenship.  If you do not already have a clearance, we
will work with you to obtain one.

21 December 2010

Grades (Almost) Posted, Semester Over

Hi all --

I figure you're likely to read to the end to find out about grades, so before I get to that, let me just take this chance to say that I really enjoyed this class this semester.  You were all great.  Everyone did awesome on both the final exam and final projects, and I'm really thrilled.  If you couldn't tell already, I love language stuff and I encourage you all to continue on and learn everything there is to know.  Philip is teaching the follow-on course in the Spring, which should be awesome.  I'm also running an unofficial seminar on "Large Data" stuff in the spring; you can get more info here (sign up for the mailing list if you're interested).  Anyway, I had a great time teaching; I hope you had fun in class.

Regarding grades, I need to submit them by midnight tonight.  And since I don't plan on staying up until midnight, this really means tonight by about 11p.

I've posted "unofficial" grades on grades.cs.umd.edu, so you can see what your grade is.  Note that the "total" column on that spreadsheet is completely misleading, since it doesn't include all the weirdo grading rules (dropping of worst projects/homeworks, inclusion of extra credit, etc.).  I have all the numbers in a separate spreadsheet, so if something looks odd to you and you'd like the full computation, please let me know.  It's of course possible to change grades later, but it's a pain, so I'd rather hear about any issues now.

That's it.  Have a great break and I hope to see some of you again in the Spring!

 -h


p.s., per the second comment below, I added an extra column, MyOverall. The encoding is as follows:

98 = A+
95 = A
92 = A-
88 = B+
85 = B
82 = B-
78 = C+
75 = C
72 = C-

Note that your score in that column will be exactly one of these numbers: this is just my way of encoding your letter grade, not your actual numeric score :).
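
In case you want to decode that column programmatically, it's just a table lookup (a minimal Python sketch of the mapping above):

    # MyOverall encodes letter grades as fixed numbers:
    GRADE_CODE = {98: "A+", 95: "A", 92: "A-",
                  88: "B+", 85: "B", 82: "B-",
                  78: "C+", 75: "C", 72: "C-"}

    print(GRADE_CODE[95])  # prints "A"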

12 December 2010

Interested in Language Science?

Just got the following email from Colin Phillips in Linguistics / Cognitive Neuroscience regarding language science. Please see below, and feel free to email me if you have questions:

I'm hoping that you can help us to reach students in the CS/iSchool universe who might be interested in taking advantage of our unique interdisciplinary language science opportunities. We're particularly interested in reaching 1st and 2nd year students. We'll be holding an informational meeting for students tomorrow at 1pm in 1108B Marie Mount, but I'd be happy to meet at another time with anybody who is interested but not available at that time. We'll tell students about the opportunities and benefits, and also talk about the resources that are available to help them, including new plans to help them to develop interdisciplinary training plans that are both innovative and feasible. Csilla Kajtar already circulated a message about this, but we know that people often just ignore messages sent to mailing lists.

As you know, the closer integration of NLP and cognitive work in language is right at the top of our list of opportunities-that-we'd-be-idiots-not-to-pursue, and student training is one of the best ways to achieve this.

09 December 2010

Final Exam, Due Dec 17, 3:30pm

Here's a copy of the final exam as well as the source LaTeX.  Please feel free to either print it and do it by hand, or to do it in LaTeX and print the solution.  You may turn it in any time between now and 3:30pm on Dec 17 (our official exam time is 1:30-3:30pm that day).  Please hand it in in one of three ways: (1) give it to me in person, in my office or otherwise; (2) slide it completely under my office door (AVW 3227); (3) give it to Amit in person.

If you have any clarification questions, please post them here.

06 December 2010

Last seminar of the semester: Michael Paul Dec 8, 11am

December 8: Michael Paul: Summarizing Contrastive Viewpoints in Opinionated Text
AVW 2120

Performing multi-document summarization of opinionated text has unique
challenges because it is important to recognize that the same
information may be presented in different ways from different
viewpoints. In this talk, we will present a special kind of
contrastive summarization approach intended to highlight this
phenomenon and to help users digest conflicting opinions. To do this,
we introduce a new graph-based algorithm, Comparative LexRank, to
score sentences in a summary based on a combination of both
representativeness of the collection and comparability between
opposing viewpoints. We then address the issue of how to automatically
discover and extract viewpoints from unlabeled text, and we experiment
with a novel two-dimensional topic model for the task of unsupervised
clustering of documents by viewpoint. Finally, we discuss how these
two stages can be combined to both automatically extract and summarize
viewpoints in an interesting way. Results are presented on two
political opinion data sets.

This project was joint work with ChengXiang Zhai and Roxana Girju.

Bio: Michael Paul is a first-year Ph.D. student of Computer Science at
the Johns Hopkins University and a member of the Center for Language
and Speech Processing. He earned a B.S. from the University of
Illinois at Urbana-Champaign in 2009. He is currently a Graduate
Research Fellow of the National Science Foundation and a Dean's Fellow
of the Whiting School of Engineering.

05 December 2010

P4: Small error in example numbers....

There are some sentences in the training data that contain a "TAB" character.  The reasonable thing to do would be just to treat it as whitespace.  For some reason I didn't do this in my example of DF computation, which somewhat changes all the remaining numbers.

Instead of rerunning everything, I'll just tell you what the updated top-frequency words are if you do it "properly."  In general, for this assignment, don't worry if your numbers are slightly different from mine -- it may have to do with how you handle the non-ASCII characters that appear once in a while in the data.

   2999 .
   2999 ,
   2998 of
   2997 the
   2997 and
   2994 in
   2989 to
   2988 a
   2885 as
   2875 by
   2862 for
   2860 )
   2859 (
   2836 with
   2832 that
   2801 ''
   2788 ``
   2759 on
   2717 from
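
If you want to sanity-check your own numbers, here's a minimal sketch of the DF computation in Python (the file name and one-document-per-line format are just placeholders; the point is that str.split() with no arguments treats TABs like any other whitespace):

    from collections import Counter

    def document_frequencies(docs):
        """Count how many documents each token appears in.

        str.split() with no arguments splits on any run of whitespace,
        so TAB characters are handled the same as spaces.
        """
        df = Counter()
        for doc in docs:
            for token in set(doc.split()):  # set(): count each doc once
                df[token] += 1
        return df

    # Placeholder usage: one document per line in "train.txt".
    with open("train.txt") as f:
        df = document_frequencies(f)
    for word, count in df.most_common(19):
        print(f"{count:7d} {word}")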

02 December 2010

Lecture 25: Mapping Text to Actions

There has been a bunch of work recently on trying to automatically find relationships between language and the "real world", where "real world" actually often means some sort of simulated environment.  Here are a few papers along these lines:
There are others, of course, but these five form a fairly diverse example set.  There's not much work on trying to use the real world, but robotics people like Nick Roy (at MIT) are trying to make headway on this problem.

In the first paper, which is the one we'll talk about most, the key idea is that of hierarchical plans, represented as a PCFG.  For instance, we might have a rule "OfferCup -> PickUpCup MoveCup ReleaseCup", where each of the subactions might either be atomic (corresponding to actual muscle movements) or might itself be broken down further.  (Question: how context-free is this problem?)
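
To make this concrete, here's a toy sketch in Python -- a made-up grammar, not the one from the paper -- of how such a plan hierarchy might be represented and sampled down to atomic actions:

    import random

    # Toy plan grammar (made up for illustration): each nonterminal action
    # expands to a sequence of sub-actions, with a probability per alternative.
    ACTION_PCFG = {
        "OfferCup": [(1.0, ["PickUpCup", "MoveCup", "ReleaseCup"])],
        "PickUpCup": [(0.7, ["ReachForCup", "GraspCup"]),
                      (0.3, ["GraspCup"])],  # hand may already be at the cup
        # anything without a rule is atomic (an actual muscle movement)
    }

    def expand(action):
        """Recursively expand an action into a flat sequence of atomic actions."""
        if action not in ACTION_PCFG:
            return [action]
        rules = ACTION_PCFG[action]
        _, subactions = random.choices(rules, weights=[p for p, _ in rules])[0]
        return [atom for sub in subactions for atom in expand(sub)]

    print(expand("OfferCup"))
    # e.g. ['ReachForCup', 'GraspCup', 'MoveCup', 'ReleaseCup']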

The key ambiguity is due to the fact that actions do not select for exactly one interpretation, as in the Blicket example.

In this paper, they hand-constructed a PCFG for actions, and the key learning question was whether you could figure out the level of ambiguity automatically.  The basic idea is to look at relative frequencies of co-occurrence between lexical items and nodes in the PCFG tree for the actions.
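
Roughly, that idea looks something like the following sketch (hypothetical data, and a plain conditional relative frequency rather than the paper's exact estimator):

    from collections import Counter

    def association_scores(pairs):
        """Estimate P(word | action node) from (utterance, plan-nodes) pairs."""
        joint, node_total = Counter(), Counter()
        for words, nodes in pairs:
            for node in set(nodes):
                node_total[node] += 1
                for word in set(words):
                    joint[(word, node)] += 1
        return {(w, n): c / node_total[n] for (w, n), c in joint.items()}

    # Hypothetical paired observations of language and action structure:
    data = [(["offer", "the", "cup"], ["OfferCup", "PickUpCup", "MoveCup"]),
            (["pick", "up", "the", "cup"], ["PickUpCup", "GraspCup"])]
    scores = association_scores(data)
    print(scores[("offer", "OfferCup")])  # 1.0: "offer" always with OfferCup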

01 December 2010

P4, grading rules

So P4 has been posted for a while.  It is "optional" in the sense that your project grades will be based on your best three out of four grades.  In particular, here's what will happen.

Suppose that your grades on P1, ..., P4 are a,b,c,d.  (If you don't do P4, then d=0.)

Let x = [ (a + b + c + d) - min { a, b, c, d } ] / 3

Then x is your average grade on your best three projects.


We will use x as your overall project grade (i.e., since each project is weighted equally, it will be like you got a score of x on all FOUR of them).
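
In code, the rule is just the following (a quick sketch):

    def project_grade(a, b, c, d=0.0):
        """Average of the best three of four project scores (d=0 if you skip P4)."""
        scores = [a, b, c, d]
        return (sum(scores) - min(scores)) / 3

    print(project_grade(90, 80, 85))       # no P4: the 0 is dropped -> 85.0
    print(project_grade(90, 80, 85, 95))   # P4 helps: the 80 is dropped -> 90.0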