05 December 2010

P4: Small error in example numbers....

There are some sentences in the training data that contain a "TAB" character. The reasonable thing to do would be just to treat it as whitespace. For some reason I didn't do this. In my example of DF computation, though, I did, which somewhat changes all the remaining numbers.

Instead of rerunning everything, I'll just tell you what the updated top-frequency words are if you do it "properly." In general, for this assignment, don't worry if your numbers are slightly different from mine -- it may have to do with how you handle the non-ASCII characters that appear once in a while in the data.

   2999 .
   2999 ,
   2998 of
   2997 the
   2997 and
   2994 in
   2989 to
   2988 a
   2885 as
   2875 by
   2862 for
   2860 )
   2859 (
   2836 with
   2832 that
   2801 ''
   2788 ``
   2759 on
   2717 from
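For anyone comparing numbers, here is a minimal sketch of how the TAB handling can shift document-frequency counts. It is not the code used to produce the list above: the function names (`tokenize`, `document_frequency`) and the toy data are made up for illustration, and it assumes a "document" is simply a list of sentence strings.

```python
from collections import Counter

def tokenize(sentence):
    # str.split() with no argument splits on any run of whitespace,
    # so a TAB separates tokens exactly like a space does.
    return sentence.split()

def document_frequency(documents):
    # documents: list of documents, each a list of sentence strings
    # (an assumption about the data layout, not the assignment's format).
    df = Counter()
    for sentences in documents:
        seen = set()
        for sentence in sentences:
            seen.update(tokenize(sentence))
        # Each token counts at most once per document.
        df.update(seen)
    return df

# Toy data: the first sentence contains a TAB between "cat" and "sat".
toy_docs = [["the cat\tsat ."], ["the dog sat ,", "a dog ."]]
toy_df = document_frequency(toy_docs)

# Splitting on a single space instead leaves the TAB inside one token.
space_only = "the cat\tsat .".split(" ")
```

With whitespace splitting, `sat` is counted in both toy documents. With `split(" ")`, the first document contributes the single token `cat\tsat` instead, so the counts for `cat` and `sat` both drop by one. That is the kind of small discrepancy to expect.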

6 comments:

  1. Should we consider punctuation marks like . , ; as words? And the first two sentences do have the word 'series' in common, so the similarity does not appear to be 0.

  2. @Shopan: I consider every token (including punctuation) as a "word". The first two sentences *don't* have a word in common, they are:

    1. `` Main article : Animax ''

    2. Being Sony 's first attempt to offer a 24-hour anime channel in Latin America , it was thought to feature series in two formats .

  3. According to the equation, it takes into account only the words common to the two sentences, not the position of the words?

    So "I eat" and "eat I" will have similarity = 1. Is that correct?

    If a word belongs to both sentences [assume same document], will its tf-idf value be the same for both sentences?

    So sum(aw,bw) = root( square(tf-idf(eat)) + square(tf-idf(I)) )?

  4. Righto -- order is thrown out (for better or for worse)

  5. I think what Shopan meant was that the 2nd and 3rd sentences do have 'series' and ',' in common, so the similarity between those sentences is > 0.

  6. Oh. I see. It's because the first zero is the similarity between sentence 1 (the first one in the file) and everything that came before it (which is nothing, so it's zero). The second zero is the similarity between the words before the second sentence and the second sentence. Which is also zero. It's not until you get to the third sentence that you get any repetition.

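The two points from this thread (word order is ignored, and each sentence is compared against everything that came before it) can be sketched as follows. This is only an illustration under stated assumptions: it uses plain cosine similarity over raw token counts rather than the assignment's tf-idf weights, the names `cosine` and `running_similarity` are made up, and the third sentence is a hypothetical stand-in, not the actual third sentence from the data.

```python
import math
from collections import Counter

def cosine(a, b):
    # a, b: Counter mapping token -> weight (raw counts here, an assumption).
    dot = sum(weight * b[token] for token, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        # An empty bag (e.g. nothing before the first sentence) gives 0.
        return 0.0
    return dot / (norm_a * norm_b)

def running_similarity(sentences):
    # Compare each sentence to the bag of all tokens seen before it.
    history = Counter()
    sims = []
    for sentence in sentences:
        bag = Counter(sentence.split())
        sims.append(cosine(history, bag))
        history.update(bag)
    return sims

# Order is thrown out: both sentences map to the same bag of tokens.
order_sim = cosine(Counter("I eat".split()), Counter("eat I".split()))

sentences = [
    "`` Main article : Animax ''",
    "Being Sony 's first attempt to offer a 24-hour anime channel in "
    "Latin America , it was thought to feature series in two formats .",
    "The series , however , was short .",  # hypothetical third sentence
]
sims = running_similarity(sentences)
```

Here `order_sim` comes out at 1 up to floating-point rounding. The first two entries of `sims` are zero, because nothing precedes the first sentence and the second shares no token with the first. The third is positive only because the stand-in sentence repeats tokens such as `series` and `,`.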