04 November 2010

P3 is posted, due Nov 23

22 comments:

  1. Is it acceptable for us to use ML tools other than the ones you've provided, e.g. WEKA?

  2. I've installed megam and it gave me a slightly different result: Error rate = 75 / 400 = 0.1875. Is there a problem?

  3. For tie-breaking in the grid search over sample size and regularization params, you say to choose the "more regularized" value. In the case of LR, this means higher values; but in the case of DT, is this lower values?

  4. @Chris: I assume you mean for the final section. For that, I'd prefer if you used megam or FastDT.... I want your improvements coming from better features, not from over tweaking the ML algorithm.

    @inin: I think it's just a versioning difference. Don't worry about it.

    @Ben: that sounds reasonable to me :).
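
The tie-breaking rule discussed above can be sketched roughly as follows. This is only an illustration, not part of the assignment code: the function name `pick_best` and the error rates are made up, and it assumes (as suggested in the thread) that a higher value means more regularization for LR and a lower value means more regularization for DT.

```python
def pick_best(errors, higher_is_more_regularized):
    """Return the hyperparameter value with the lowest error rate.

    errors: dict mapping hyperparameter value -> held-out error rate.
    Ties are broken in favor of the more regularized value.
    """
    best_error = min(errors.values())
    tied = [value for value, err in errors.items() if err == best_error]
    return max(tied) if higher_is_more_regularized else min(tied)


# Hypothetical error rates, for illustration only.
lr_errors = {0.5: 0.20, 1.0: 0.19, 2.0: 0.19}
dt_errors = {1: 0.25, 2: 0.21, 4: 0.21}

# Assumption: for LR, a larger value is more regularized.
lr_choice = pick_best(lr_errors, higher_is_more_regularized=True)
# Assumption: for DT, a smaller value is more regularized.
dt_choice = pick_best(dt_errors, higher_is_more_regularized=False)
print(lr_choice, dt_choice)
```

In both hypothetical grids two settings tie on error, and the function returns the one on the more regularized side of the tie.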

  5. Hal - The link to the gender data in section 3 is broken.

    http://www.umiacs.umd.edu/~hal/courses/2010F_CL1/out/gender.tar.gz

  6. For those Mac users who don't want to install OCaml, I've posted my binaries of FastDT and megam.opt at and . Both are Mach-O 64-bit executables.

    Use those binaries at your own risk. Or, you could just compile them yourself. Here's a brief how-to:

    (1) Install Ocaml, either via your favorite package manager or use the dmg at .
    (2) FastDT should compile just by "ocamlopt str.cmxa bigarray.cmxa FastDT.ml -o FastDT".
    (3) To compile megam, open the Makefile, find -lstr, replace it with -lcamlstr; find WITHCLIBS, make the include path your actual one (try "find OcamlDirectory -name bigarray.h").

    This how-to is not Mac OS X specific. In fact you need step (3) on any system with OCaml 3.12.

  7. Weird. Blogspot has eaten all my links.

    FastDT is at http://dl.dropbox.com/u/14660178/FastDT

    megam is at http://dl.dropbox.com/u/14660178/megam.opt

    OCaml installer from its official site is at http://caml.inria.fr/pub/distrib/ocaml-3.12/ocaml-3.12.0-intel.dmg

  8. For a less agnostic how-to on compiling megam on a Mac, assuming you have macports:

    1. sudo port install ocaml
    2. Unpack the sources and cd into megam-0.92
    3. Change the Makefile so your WITHCLIBS reads:
    WITHCLIBS =-I /usr/lib/ocaml/caml -I /opt/local/lib/ocaml/caml
    4. make

  9. Are we allowed to use other options in the MegaM classifier, like the feature selection option? (Must we know in detail how it works?)

  10. The test.txt in gender.tar.gz only has 2000 lines. I think it's supposed to be 10,000. I combined the male and female with something like: cat female.txt male.txt | sort -R > test.txt

  11. Actually, I still can't get the defaultExtractor.pl example to work right (it ends up with Error rate = 0 / 1000 = 0)

  12. @All: thanks for all the posts about macs ;).

    @Teo: Like above, I'd prefer if the gains came from better features rather than better learning. That said, I've never had huge success with the feature selection, so if you can get it to work, that might be moderately interesting :).

  13. @Nish: no, test.txt is the test data. male.txt and female.txt are the training data. there are 10000 training points but only 2000 test points.

  14. @Nish: hrm... can you tell me the commands you're running? i just reran it myself and it worked fine.

  15. Try editing the defaultExtractor.pl code. Changing if(0) to if(1) will get it to extract 10,000 lines from the full data.

  16. @Teo: right, in one setting it generates training data... in the other setting it generates test data. Of course, other than by submitting the test predictions, you won't be able to tell how well it's doing, since you don't know the true test labels.

  17. I found that the "male.parsed" and "female.parsed" files have one sentence per line, while "male.txt" and "female.txt" have one blog per line. I tried to use split('\t') to split the sentences in the ".txt" files, but the number of sentences doesn't match the number in the ".parsed" files... Any suggestions? Thanks~~

  18. @anonymous: that's because there are <P> lines between posts in the parsed files...

    % wc -l male.parsed
    53920 male.parsed
    % cat male.txt | tr '\t' '\n' | wc -l
    48921
    % grep '^<P>$' male.parsed | wc -l
    4999

    and 4999+48921 = 53920
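
For lining the two formats up, here is a rough sketch that groups the one-sentence-per-line parsed file back into posts by splitting on the `<P>` lines. The function name `group_by_post` and the tiny inputs are made up for illustration; it assumes each `<P>` sits alone on its line between posts and that there is no `<P>` after the last post, which is what the counts above suggest.

```python
def group_by_post(parsed_lines, separator="<P>"):
    """Group one-sentence-per-line input into posts, splitting on separator lines."""
    posts, current = [], []
    for line in parsed_lines:
        line = line.rstrip("\n")
        if line == separator:
            posts.append(current)
            current = []
        else:
            current.append(line)
    posts.append(current)  # assumption: the last post has no trailing separator
    return posts


# Tiny made-up stand-ins for male.parsed and male.txt.
parsed = ["(S a)", "(S b)", "<P>", "(S c)"]
txt = ["a\tb", "c"]

posts = group_by_post(parsed)
sentences_per_post = [len(line.split("\t")) for line in txt]
# If the files line up, each post has as many parsed lines as tab-separated sentences.
print([len(p) for p in posts] == sentences_per_post)
```

With the real files you would pass the lines of male.parsed and male.txt instead; if the per-post counts agree, the i-th post in one file corresponds to the i-th line in the other.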

  19. How many tokens have been assigned to each person in section 3?

  20. Yes, any word on tokens? Is it 4 per 72 hours (?) or just 4 total (I hope not)?

  21. it's 4 per 72 hours... but i just changed it so it's 5 per 48 hours.... i'm not sure if there was a bug or something but just in case, think of this as a gift token :).
