Instead of rerunning everything, I'll just tell you what the updated top-frequency words are if you do it "properly." In general, for this assignment, don't worry if your numbers are slightly different from mine -- it may have to do with how you handle the non-ASCII characters that appear once in a while in the data.
2999 .
2999 ,
2998 of
2997 the
2997 and
2994 in
2989 to
2988 a
2885 as
2875 by
2862 for
2860 )
2859 (
2836 with
2832 that
2801 ''
2788 ``
2759 on
2717 from
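For anyone comparing numbers, here is a minimal sketch of one way to produce a list like this. It assumes the data is already tokenized with whitespace between tokens and that every token, punctuation included, is counted. The thread does not say whether the counts above are raw occurrences or the number of documents containing each token; this sketch counts raw occurrences. The function name `top_tokens` and the two-line sample are made up for illustration.

```python
from collections import Counter

def top_tokens(sentences, k=20):
    """Count every whitespace-separated token, punctuation included."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.split())
    # most_common sorts by count, highest first
    return counts.most_common(k)

# Hypothetical sample; the real input would be the assignment's data file.
sample = [
    "`` Main article : Animax ''",
    "it was thought to feature series in two formats .",
]
for token, count in top_tokens(sample, k=5):
    print(count, token)
```

Small differences from the numbers above can come from tokenization choices, such as how non-ASCII characters are split or normalized.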
Should we consider punctuation marks like . , ; as words? And the first two sentences do have the word 'series' in common, so the similarity does not appear to be 0.
@Shopan: I consider every token (including punctuation) as a "word". The first two sentences *don't* have a word in common, they are:
1. `` Main article : Animax ''
2. Being Sony 's first attempt to offer a 24-hour anime channel in Latin America , it was thought to feature series in two formats .
According to the equation, it takes into account only the words common to the two sentences, not the position of those words?
So "I eat" and "eat I" will have similarity = 1. Is that correct?
If a word belongs to both sentences [assume same document], will its tf-idf value be the same for both sentences?
So, sum(aw, bw) = root( square(tf-idf(eat)) + square(tf-idf(I)) )?
Righto -- order is thrown out (for better or for worse)
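To make the order point concrete, here is a small sketch. It assumes the similarity is the usual cosine over per-word weights, which is not spelled out in this thread. The `weight` argument is a hypothetical stand-in for whatever per-word weight (such as idf) the assignment uses; it defaults to 1.0 for every token.

```python
import math
from collections import Counter

def cosine(a_tokens, b_tokens, weight=None):
    """Cosine similarity between two bags of tokens.

    `weight` maps a token to a per-word weight (hypothetical parameter);
    tokens missing from it get weight 1.0.
    """
    weight = weight or {}
    # Building a Counter discards token order: only counts survive.
    a = {w: c * weight.get(w, 1.0) for w, c in Counter(a_tokens).items()}
    b = {w: c * weight.get(w, 1.0) for w, c in Counter(b_tokens).items()}
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0

print(cosine("I eat".split(), "eat I".split()))
print(cosine("I eat".split(), "you sleep".split()))
```

Under that assumption, "I eat" and "eat I" produce the same bag of tokens, so the numerator and the denominator both equal the sum of the squared weights and the similarity is 1 (up to floating-point rounding), whatever the weights are. Sentences with no token in common get 0.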
I think what Shopan meant was that the 2nd and 3rd sentences do have 'series' and ',' in common, and the similarity between those sentences is > 0.
Oh. I see. It's because the first zero is the similarity between sentence 1 (the first one in the file) and everything that came before it (which is nothing, so it's zero). The second zero is the similarity between the words before the second sentence and the second sentence. Which is also zero. It's not until you get to the third sentence that you get any repetition.
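Here is a rough sketch of that "sentence versus everything before it" comparison. It assumes plain token counts as weights rather than tf-idf, and cosine as the similarity. The first two sentences are the ones quoted earlier in this thread; the third is a made-up stand-in, not the actual third sentence of the file. The function names are also just for illustration.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two token-count Counters (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def similarity_to_history(sentences):
    """For each sentence, compare its tokens with all tokens seen before it."""
    history = Counter()
    scores = []
    for sentence in sentences:
        current = Counter(sentence.split())
        scores.append(cosine(current, history))
        history.update(current)  # this sentence becomes part of the history
    return scores

sentences = [
    "`` Main article : Animax ''",
    "Being Sony 's first attempt to offer a 24-hour anime channel in "
    "Latin America , it was thought to feature series in two formats .",
    "The series aired daily , in two formats .",  # hypothetical third sentence
]
print(similarity_to_history(sentences))
```

The first score is zero because the history is empty, the second is zero because sentences 1 and 2 share no token, and the third is the first one that can be positive.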