Is it acceptable for us to use ML tools other than the ones you've provided, e.g. WEKA?
I've installed megam and it gave me a slightly different result: Error rate = 75 / 400 = 0.1875. Is there a problem?
For tie-breaking in the grid search over sample size and regularization params, you say to choose the "more regularized" value. In the case of LR, this means higher values; but in the case of DT, is this lower values?
@Chris: I assume you mean for the final section. For that, I'd prefer if you used megam or FastDT.... I want your improvements coming from better features, not from over-tweaking the ML algorithm.
@inin: I think it's just a versioning difference. Don't worry about it.
@Ben: that sounds reasonable to me :).
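A minimal sketch of the tie-breaking rule discussed above, in Python. It assumes grid-search results are available as (parameter value, error rate) pairs, that for LR a larger regularization value is "more regularized", and that for DT a smaller value (written here as a hypothetical depth-style parameter) is "more regularized". The function name, the parameter names, and the numbers are all made up for illustration.

```python
def pick_best(results, regularization_strength):
    """Return the parameter value with the lowest error rate.

    results: iterable of (param_value, error_rate) pairs.
    regularization_strength: maps a param value to a number that is
        larger when the setting is more regularized (an assumption
        the caller supplies per learner).
    Ties on error rate go to the more regularized value.
    """
    best = min(results, key=lambda r: (r[1], -regularization_strength(r[0])))
    return best[0]


# LR: assume a larger regularization value means more regularized.
lr_results = [(0.1, 0.20), (1.0, 0.18), (10.0, 0.18)]  # illustrative numbers
best_lambda = pick_best(lr_results, lambda lam: lam)

# DT: assume a smaller (hypothetical) depth value means more regularized.
dt_results = [(2, 0.25), (4, 0.21), (8, 0.21)]  # illustrative numbers
best_depth = pick_best(dt_results, lambda depth: -depth)

print(best_lambda, best_depth)
```

Passing the direction in as a function keeps the selection logic the same for both learners; only the notion of "more regularized" changes.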
Hal - The link to the gender data in section 3 is broken.
http://www.umiacs.umd.edu/~hal/courses/2010F_CL1/out/gender.tar.gz
@Tandeep: fixed, sorry!
For those Mac users who don't want to install OCaml, I've posted my binaries of FastDT and megam.opt at and . Both are Mach-O 64-bit executables.
Use those binaries at your own risk. Or, you could just compile them yourself. Here's a brief how-to:
(1) Install Ocaml, either via your favorite package manager or use the dmg at .
(2) FastDT should compile just by "ocamlopt str.cmxa bigarray.cmxa FastDT.ml -o FastDT".
(3) To compile megam, open the Makefile, find -lstr, replace it with -lcamlstr; find WITHCLIBS, make the include path your actual one (try "find OcamlDirectory -name bigarray.h").
This how-to is not Mac OS X specific. In fact you need step (3) on any system with OCaml 3.12.
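The two Makefile edits in step (3) could also be scripted. The sketch below is in Python and only transforms Makefile text; the function name and the sample Makefile lines are hypothetical, and it assumes the Makefile has a single line beginning with WITHCLIBS and uses -lstr as a standalone linker flag.

```python
import re


def patch_megam_makefile(makefile_text, caml_include_dir):
    """Apply the two edits from step (3) to the Makefile text.

    caml_include_dir is the directory that contains bigarray.h on
    your machine (find it as described in step (3)).
    """
    # Replace the -lstr linker flag with -lcamlstr.
    patched = re.sub(r"-lstr\b", "-lcamlstr", makefile_text)
    # Point WITHCLIBS at the actual include directory.
    new_line = "WITHCLIBS =-I " + caml_include_dir
    patched = re.sub(r"^WITHCLIBS\s*=.*$", lambda m: new_line,
                     patched, flags=re.MULTILINE)
    return patched


# Illustrative input only; the real Makefile will contain more lines.
sample = "WITHSTR =str.cma -cclib -lstr\nWITHCLIBS =-I /some/old/path/caml\n"
patched = patch_megam_makefile(sample, "/opt/local/lib/ocaml/caml")
print(patched)
```

To use it on the real file, read the Makefile into a string, pass it through the function, and write the result back after checking the diff by eye.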
Weird. Blogspot has eaten all my links.
FastDT is at http://dl.dropbox.com/u/14660178/FastDT
megam is at http://dl.dropbox.com/u/14660178/megam.opt
OCaml installer from its official site is at http://caml.inria.fr/pub/distrib/ocaml-3.12/ocaml-3.12.0-intel.dmg
For a less agnostic how-to on compiling megam on a Mac, assuming you have macports:
1. sudo port install ocaml
2. Unpack the sources and cd into megam-0.92
3. Change the Makefile so that your WITHCLIBS reads:
WITHCLIBS =-I /usr/lib/ocaml/caml -I /opt/local/lib/ocaml/caml
4. make
Are we allowed to use other options in the MegaM classifier? Like the feature selection option? (must we know in detail how it work?)
The test.txt in gender.tar.gz only has 2000 lines. I think it's supposed to be 10,000. I combined the male and female with something like: cat female.txt male.txt | sort -R > test.txt
Actually, I still can't get the defaultExtractor.pl example to work right (it ends up with Error rate = 0 / 1000 = 0)
@All: thanks for all the posts about macs ;).
@Teo: Like above, I'd prefer if the gains came from better features rather than better learning. That said, I've never had huge success with the feature selection, so if you can get it to work, that might be moderately interesting :).
@Nish: no, test.txt is the test data. male.txt and female.txt are the training data. there are 10000 training points but only 2000 test points.
@Nish: hrm... can you tell me the commands you're running? i just reran it myself and it worked fine.
Try editing the defaultExtractor.pl code. Changing if(0) to if(1) will get it to extract 10,000 lines from the full data.
@Teo: right, in one setting it generates training data... in the other setting it generates test data. Of course other than by submitting the test predictions you won't be able to tell how well it's doing, since you don't know the true test labels.
I found that the "male.parsed" and "female.parsed" files have one sentence per line, while "male.txt" and "female.txt" have one blog per line. I tried using split('\t') to split the sentences in the ".txt" files, but the number of sentences doesn't match the number in the ".parsed" files... Any suggestions?... Thanks~~
@anonymous: that's because there are <P> lines between posts in the parsed files...
% wc -l male.parsed
53920 male.parsed
% cat male.txt | tr '\t' '\n' | wc -l
48921
% grep '^<P>$' male.parsed | wc -l
4999
and 4999+48921 = 53920
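The counts above suggest one way to line the two files up. A rough Python sketch follows, assuming that a line consisting only of <P> separates consecutive posts in the .parsed file and that each tab-separated sentence in a .txt line corresponds to one parsed line. The function names and the tiny sample data are made up for illustration.

```python
def group_parsed_by_post(parsed_lines):
    """Split .parsed lines into one list of sentences per post.

    Assumes a line that is exactly "<P>" separates consecutive posts.
    """
    posts, current = [], []
    for line in parsed_lines:
        line = line.rstrip("\n")
        if line == "<P>":
            posts.append(current)
            current = []
        else:
            current.append(line)
    posts.append(current)  # the last post has no trailing <P>
    return posts


def sentences_per_post(txt_lines):
    """Split each .txt line (one post) into its tab-separated sentences."""
    return [line.rstrip("\n").split("\t") for line in txt_lines]


# Tiny made-up stand-ins for male.parsed and male.txt.
parsed = ["(S a)\n", "(S b)\n", "<P>\n", "(S c)\n"]
txt = ["a\tb\n", "c\n"]

parsed_posts = group_parsed_by_post(parsed)
txt_posts = sentences_per_post(txt)

# If the assumptions hold, every post has the same sentence count in both.
counts_match = [len(p) for p in parsed_posts] == [len(t) for t in txt_posts]
print(counts_match)
```

On the real files, the same check can be run by passing open file handles instead of the sample lists; any post where the counts differ would show where the assumptions break.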
How many tokens have been assigned to each person in section 3?
Yes, any word on tokens? Is it 4 per 72 hours (?) or just 4 total (I hope not)?
it's 4 per 72 hours... but i just changed it so it's 5 per 48 hours.... i'm not sure if there was a bug or something but just in case, think of this as a gift token :).