UMD CMSC 723: Computational Linguistics I: P3 is posted, due Nov 23

04 November 2010

P3 is posted, due Nov 23

22 comments:

Chris 04 November, 2010 15:50
Is it acceptable for us to use ML tools other than the ones you've provided, e.g. WEKA?
inin 05 November, 2010 17:45
I've installed megam and it gave me a slightly different result: Error rate = 75 / 400 = 0.1875. Is there a problem?
Ben 05 November, 2010 18:02
For tie-breaking in the grid search over sample size and regularization params, you say to choose the "more regularized" value. In the case of LR, this means higher values; but in the case of DT, is this lower values?
hal 08 November, 2010 12:56
@Chris: I assume you mean for the final section. For that, I'd prefer if you used megam or FastDT... I want your improvements coming from better features, not from over-tweaking the ML algorithm.

@inin: I think it's just a versioning difference. Don't worry about it.

@Ben: that sounds reasonable to me :).
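
The tie-breaking rule discussed above can be sketched in Python as follows. This is only an illustration under stated assumptions: `pick_best` is a hypothetical helper, the error rates are made up, the LR parameter is assumed to regularize more as it grows (as Ben describes), and the DT parameter is assumed to be a maximum tree depth, so a smaller value counts as more regularized.

```python
def pick_best(results, higher_is_more_regularized):
    """Pick a parameter value from (param_value, dev_error) pairs.

    The lowest error wins; among ties, the more regularized value wins.
    """
    best_error = min(err for _, err in results)
    tied = [param for param, err in results if err == best_error]
    return max(tied) if higher_is_more_regularized else min(tied)


# Hypothetical dev-set error rates, made up for illustration only.
lr_results = [(0.1, 0.20), (1.0, 0.1875), (10.0, 0.1875)]  # regularization strength
dt_results = [(2, 0.21), (4, 0.19), (8, 0.19)]             # assumed: max tree depth

print(pick_best(lr_results, higher_is_more_regularized=True))   # 10.0
print(pick_best(dt_results, higher_is_more_regularized=False))  # 4
```

The same function covers the sample-size grid as long as you decide up front which direction counts as "more regularized" for each parameter.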
Tandeep 08 November, 2010 23:26
Hal - The link to the gender data in section 3 is broken.

http://www.umiacs.umd.edu/~hal/courses/2010F_CL1/out/gender.tar.gz
hal 09 November, 2010 07:17
@Tandeep: fixed, sorry!
Unknown 11 November, 2010 16:26
For those Mac users who don't want to install OCaml, I've posted my binaries of FastDT and megam.opt at [link missing] and [link missing]. Both are Mach-O 64-bit executables.

Use those binaries at your own risk. Or, you could just compile them yourself. Here's a brief how-to:

(1) Install OCaml, either via your favorite package manager or with the dmg at [link missing].
(2) FastDT should compile with just "ocamlopt str.cmxa bigarray.cmxa FastDT.ml -o FastDT".
(3) To compile megam, open the Makefile, find -lstr, and replace it with -lcamlstr; then find WITHCLIBS and change the include path to your actual one (try "find OcamlDirectory -name bigarray.h").

This how-to is not Mac OS X specific. In fact you need step (3) on any system with OCaml 3.12.
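
For anyone who would rather script step (3) than edit by hand, here is a rough Python sketch. It is not part of megam: `patch_megam_makefile` is a hypothetical helper, the sample Makefile fragment is illustrative rather than copied from the megam sources, and the include directory is whatever the `find` command above reports on your machine.

```python
import re


def patch_megam_makefile(text, caml_include_dir):
    """Apply the two edits from step (3) to Makefile text and return the result."""
    # -lstr -> -lcamlstr; the lookahead leaves longer names (e.g. -lstring) alone.
    text = re.sub(r"-lstr(?![A-Za-z0-9_])", "-lcamlstr", text)
    # Point WITHCLIBS at the directory that actually contains bigarray.h.
    # A function is used as the replacement so the path is inserted literally.
    text = re.sub(
        r"(?m)^WITHCLIBS\s*=.*$",
        lambda match: "WITHCLIBS =-I " + caml_include_dir,
        text,
    )
    return text


# Illustrative fragment only; check it against your own Makefile.
sample = (
    "WITHCLIBS =-I /usr/lib/ocaml/3.09.2/caml\n"
    "WITHSTR =str.cma -cclib -lstr\n"
)
patched = patch_megam_makefile(sample, "/opt/local/lib/ocaml/caml")
print(patched)
```

Running the function a second time on its own output changes nothing, so it is safe to re-run after a failed build.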
Unknown 11 November, 2010 20:26
Weird. Blogspot has eaten all my links.

FastDT is at http://dl.dropbox.com/u/14660178/FastDT

megam is at http://dl.dropbox.com/u/14660178/megam.opt

OCaml installer from its official site is at http://caml.inria.fr/pub/distrib/ocaml-3.12/ocaml-3.12.0-intel.dmg
Puuj 14 November, 2010 23:40
For a less agnostic how-to on compiling megam on a Mac, assuming you have MacPorts:

1. sudo port install ocaml
2. Unpack the sources and cd into megam-0.92
3. Change the Makefile so that your WITHCLIBS reads:
WITHCLIBS =-I /usr/lib/ocaml/caml -I /opt/local/lib/ocaml/caml
4. make
Unknown 16 November, 2010 23:06
Are we allowed to use other options in the MegaM classifier, like the feature selection option? (Must we know in detail how it works?)
Unknown 17 November, 2010 01:17
The test.txt in gender.tar.gz only has 2000 lines. I think it's supposed to be 10,000. I combined the male and female with something like: cat female.txt male.txt | sort -R > test.txt
Unknown 17 November, 2010 01:20
Actually, I still can't get the defaultExtractor.pl example to work right (it ends up with Error rate = 0 / 1000 = 0)
hal 17 November, 2010 08:14
@All: thanks for all the posts about macs ;).

@Teo: Like above, I'd prefer if the gains came from better features rather than better learning. That said, I've never had huge success with the feature selection, so if you can get it to work, that might be moderately interesting :).
hal 17 November, 2010 08:18
@Nish: no, test.txt is the test data. male.txt and female.txt are the training data. there are 10000 training points but only 2000 test points.
hal 17 November, 2010 08:19
@Nish: hrm... can you tell me the commands you're running? i just reran it myself and it worked fine.
Unknown 17 November, 2010 09:09
Try editing the defaultExtractor.pl code. Changing if(0) to if(1) will get it to extract 10,000 lines from the full data.
hal 17 November, 2010 10:46
@Teo: right, in one setting it generates training data... in the other setting it generates test data. Of course, other than by submitting the test predictions, you won't be able to tell how well it's doing, since you don't know the true test labels.
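
Since the true test labels are hidden, one common way to estimate error locally is to hold out part of the labeled training data as a development set. The Python sketch below is an assumption-laden illustration, not part of the assignment code: `split_train_dev`, the "M"/"F" labels, and the stand-in posts are all hypothetical, and in practice the two input lists would come from reading male.txt and female.txt one post per line.

```python
import random


def split_train_dev(male_posts, female_posts, dev_size, seed=0):
    """Shuffle the labeled posts and hold out dev_size of them.

    Returns (train, dev), each a list of (post, label) pairs.
    """
    labeled = [(post, "M") for post in male_posts]
    labeled += [(post, "F") for post in female_posts]
    random.Random(seed).shuffle(labeled)  # fixed seed keeps the split repeatable
    return labeled[dev_size:], labeled[:dev_size]


# Stand-in data so the sketch runs without the real files.
male = ["male post %d" % i for i in range(50)]
female = ["female post %d" % i for i in range(50)]
train, dev = split_train_dev(male, female, dev_size=20)
print(len(train), len(dev))  # 80 20
```

Error measured on the held-out part is only an estimate, but it avoids spending submission tokens on every feature change.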
Anonymous 17 November, 2010 16:35
I found that the "male.parsed" and "female.parsed" files have one sentence per line, while "male.txt" and "female.txt" have one blog post per line. I tried to use split('\t') to split the sentences in the ".txt" files, but the number of sentences doesn't match the number of lines in the ".parsed" files... Any suggestions? Thanks~~
hal 17 November, 2010 17:40
@anonymous: that's because there are <P> lines between posts in the parsed files...

% wc -l male.parsed
53920 male.parsed
% cat male.txt | tr '\t' '\n' | wc -l
48921
% grep '^<P>$' male.parsed | wc -l
4999

and 4999+48921 = 53920
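
The same bookkeeping can be used to line the two files up in Python. This is a sketch under the assumption that a line consisting of exactly `<P>` separates consecutive posts, as the counts above suggest; `group_parsed_by_post` is a hypothetical helper and the tiny inputs are placeholders, not real parses.

```python
def group_parsed_by_post(parsed_lines):
    """Split the lines of a .parsed file into one list of sentences per post."""
    posts, current = [], []
    for line in parsed_lines:
        line = line.rstrip("\n")
        if line == "<P>":  # separator between two posts
            posts.append(current)
            current = []
        else:
            current.append(line)
    posts.append(current)  # the last post has no trailing separator
    return posts


# Placeholder data: two posts, with two sentences and one sentence.
parsed = ["parse-1\n", "parse-2\n", "<P>\n", "parse-3\n"]
txt = ["sent-1\tsent-2\n", "sent-3\n"]

groups = group_parsed_by_post(parsed)
sentences_per_post = [len(line.rstrip("\n").split("\t")) for line in txt]
print([len(group) for group in groups] == sentences_per_post)  # True
```

With the real files, 4999 separators would yield 5000 groups, and the i-th group should then hold the parses for the i-th line of the .txt file.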
Puneet 20 November, 2010 22:06
How many tokens have been assigned to each person in section 3?
Bryan 22 November, 2010 03:11
Yes, any word on tokens? Is it 4 per 72 hours (?) or just 4 total (I hope not)?
hal 22 November, 2010 08:24
it's 4 per 72 hours... but i just changed it so it's 5 per 48 hours.... i'm not sure if there was a bug or something but just in case, think of this as a gift token :).