UMD CMSC 723: Computational Linguistics I: P4: Small error in example numbers....

05 December 2010

P4: Small error in example numbers....

Some sentences in the training data contain a "TAB" character. The reasonable thing to do would be to treat it as whitespace. For some reason I didn't do this in general, but in my example of DF computation I did, which somewhat changes all the remaining numbers.

Instead of rerunning everything, I'll just tell you what the updated top-frequency words are if you do it "properly." In general, for this assignment, don't worry if your numbers are slightly different from mine -- it may have to do with how you handle the non-ASCII characters that appear once in a while in the data.
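As a rough illustration of the TAB issue, here is a minimal Python sketch. It assumes tokens are whitespace-separated and that DF counts the number of sentences containing a token; the helper names `tokenize` and `document_frequency` and the toy sentences are made up for this example and are not from the assignment code.

```python
from collections import Counter

def tokenize(line):
    # str.split() with no argument splits on any run of whitespace,
    # so a TAB is treated exactly like a space.
    return line.split()

def document_frequency(sentences):
    # Assumption: DF = number of sentences in which a token occurs at least once.
    df = Counter()
    for sentence in sentences:
        df.update(set(tokenize(sentence)))
    return df

# Toy data (hypothetical); the first sentence contains a TAB.
sentences = ["the cat\tsat", "the dog sat", "a cat"]
df = document_frequency(sentences)

# Splitting on a single space only leaves the TAB inside a token,
# which is the kind of difference that shifts the counts.
naive = "the cat\tsat".split(" ")
```

With the whitespace-aware split, "cat" and "sat" are counted as separate tokens; with the space-only split they are fused into one token containing the TAB, so their counts come out lower.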

6 comments:

Shopan, 12 December 2010, 16:52
Should we consider punctuation marks like . , ; as words? And the first two sentences do have the word 'series' in common, so the similarity does not appear to be 0.
hal, 12 December 2010, 20:58
@Shopan: I consider every token (including punctuation) as a "word". The first two sentences *don't* have a word in common, they are:

1. `` Main article : Animax ''

2. Being Sony 's first attempt to offer a 24-hour anime channel in Latin America , it was thought to feature series in two formats .
Anonymous, 13 December 2010, 01:25
According to the equation, it takes into account only the words common to the two sentences, not the positions of those words?

So "I eat" and "eat I" will have similarity = 1. Is that correct?

If a word belongs to both sentences [assume same document], will its tf-idf value be the same for both sentences?

So sum(aw, bw) = root( square(tf-idf(eat)) + square(tf-idf(I)) )?
hal, 13 December 2010, 01:49
Righto -- order is thrown out (for better or for worse)
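
To make the point about word order concrete, here is a minimal sketch. It assumes the similarity is a cosine over bag-of-words vectors, with raw counts standing in for the tf-idf weights; the assignment's exact equation is not reproduced here, and `bag` and `cosine` are hypothetical helper names.

```python
import math
from collections import Counter

def bag(sentence):
    # A bag of words keeps token counts and discards positions.
    return Counter(sentence.split())

def cosine(a, b):
    # Only tokens present in both bags contribute to the dot product.
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention: an empty bag is similar to nothing
    return dot / (norm_a * norm_b)

# Same tokens in a different order give identical bags.
sim = cosine(bag("I eat"), bag("eat I"))
```

Because the two bags are identical, `sim` comes out as 1 up to floating-point rounding. Swapping in tf-idf weights would not change this, as long as each token gets the same weight in both sentences.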
D, 13 December 2010, 17:41
I think what Shopan meant was that the 2nd and 3rd sentences do have 'series' and ',' in common, so the similarity between those sentences is > 0.
hal, 13 December 2010, 17:48
Oh. I see. It's because the first zero is the similarity between sentence 1 (the first one in the file) and everything that came before it (which is nothing, so it's zero). The second zero is the similarity between the words before the second sentence and the second sentence. Which is also zero. It's not until you get to the third sentence that you get any repetition.
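
A sketch of that bookkeeping, under the same assumptions as before (cosine over raw token counts rather than the assignment's tf-idf weights; `similarity_to_history` and the toy sentences are hypothetical): each sentence is compared against the bag of all tokens seen before it, and only afterwards added to that bag.

```python
import math
from collections import Counter

def cosine(a, b):
    # Cosine between two token -> weight mappings; 0.0 if either is empty.
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def similarity_to_history(sentences):
    history = Counter()  # every token seen in earlier sentences
    sims = []
    for sentence in sentences:
        current = Counter(sentence.split())
        # Compare with what came before, then add this sentence to the history.
        sims.append(cosine(history, current))
        history.update(current)
    return sims

# Toy sentences: the first two share no tokens; the third repeats "c" and "d".
toy = ["a b c", "d e f", "c d , g"]
sims = similarity_to_history(toy)
```

Here the first two values are 0 (empty history, then no shared tokens), and the third is the first to be positive, mirroring the pattern described above.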