Two-phase clustering #1
We want two stages of clustering: first over documents, then over sentences within matching document pairs. Weight terms as

`sqrt(tf(term)) * idf(term)`

where terms are case-sensitive unigrams, having removed stopwords and punctuation. Use a top-decile cut-off. Validation data is 2003.

Extensions might be to have a per-stream IDF model, case-insensitive IDF...
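A minimal sketch of that weighting in Python, assuming pre-tokenised input and a precomputed IDF table (`weight_vector`, `cosine`, `idf` and `stopwords` are illustrative names, not gigacluster's API):

```python
import math
import string
from collections import Counter

def weight_vector(tokens, idf, stopwords):
    """sqrt(tf) * idf over case-sensitive unigrams, with stopwords
    and punctuation-only tokens removed."""
    counts = Counter(t for t in tokens
                     if t not in stopwords
                     and not all(c in string.punctuation for c in t))
    return {term: math.sqrt(tf) * idf.get(term, 0.0)
            for term, tf in counts.items()}

def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm = math.sqrt(sum(w * w for w in a.values()))
    norm *= math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0
```

A per-stream IDF model would then just swap in a different `idf` table per stream.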
Looks good. In the document case, I'd estimate the cut-off from a sample. In the second case, I'd take the top sentences per document pair.
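For the sample-based cut-off, something like this (the sampling scheme and the `sim` callback are assumptions, not repo code):

```python
import random

def estimate_cutoff(vectors, sim, n_pairs=10000, quantile=0.9, seed=0):
    """Estimate a top-decile similarity cut-off from a random sample
    of pairs instead of scoring the full quadratic cross-product."""
    rng = random.Random(seed)
    sims = sorted(sim(*rng.sample(vectors, 2)) for _ in range(n_pairs))
    return sims[int(quantile * (len(sims) - 1))]
```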
09b9b4b is a very rough first cut of this. We're using document pairs with over 0.025 match, then the top 10 sentences by overlap. The results will be in schwa07:/data1/gigacluster/doc-sentence-0.025*. I'll need a bit more time to double-check things are correct, but I wanted to get some initial results there. Have a look at the source (https://github.com/schwa-lab/gigacluster/blob/master/gigacluster/comparators.py#L34) for what's printed and in what order...
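Roughly the shape of that second stage as I understand it (the names, constants as defaults, and the Jaccard overlap measure are illustrative, not the actual comparators.py code):

```python
from itertools import product

DOC_THRESHOLD = 0.025  # minimum document similarity to compare sentences
TOP_N = 10             # sentence pairs kept per matching document pair

def sentence_overlap(s1, s2):
    """Token-set (Jaccard) overlap between two tokenised sentences."""
    a, b = set(s1), set(s2)
    return len(a & b) / len(a | b) if a | b else 0.0

def top_sentence_pairs(sents_a, sents_b, top_n=TOP_N):
    """Rank all cross-document sentence pairs by overlap, keep the top N."""
    scored = ((sentence_overlap(s1, s2), s1, s2)
              for s1, s2 in product(sents_a, sents_b))
    return sorted(scored, key=lambda x: x[0], reverse=True)[:top_n]
```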
FYI, just entering the short sha works.
0.025 sounds small (but I know cosine similarities can be small because of the norm). Maybe we want something much smaller than a decile.
Yeah, I forgot to enter #1 in the commit message, and then went to paste the whole sha 😸 The processing as-is took just under 3h (162m) for the 2003 slice of data; I'll test a bit more thoroughly now. Happy to kick it off again with a more principled cut-off for the sentence similarity. Let me know specifics, or post-process the current results.
(I meant you don't need the whole sha.)
(Must have been a Redmine hangover...) Ok, so the extra testing revealed a bug (good and bad). I can't have been concentrating when I was writing the comparison algorithm, so we were only comparing some of the documents in the streams ("the diamond", which is appropriate for clustering, but not for what we're doing -- we have streams). Happily, the fix should have us comparing more documents, and more sentences in matching documents. I've tuned against 2010-01-01 and am using thresholds of 0.029 for document similarity and 0.125 for sentence overlap. On my laptop this takes around 2m 30s. Running on schwa07 now, see ...
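To make my reading of the bug concrete: for clustering a single collection you compare each unordered pair once, but with two streams you want the full cross-product between them. A sketch of the distinction (function names and shapes are mine, not the repo's; the threshold is the one above):

```python
from itertools import combinations, product

def clustering_pairs(docs):
    """Each unordered pair once ("the diamond") -- right for clustering
    within a single collection."""
    return combinations(docs, 2)

def stream_pairs(stream_a, stream_b, sim, threshold=0.029):
    """Every document in one stream against every document in the other,
    keeping pairs over the document-similarity threshold."""
    return [(a, b) for a, b in product(stream_a, stream_b)
            if sim(a, b) >= threshold]
```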
These are finished: |