
Two-phase clustering #1

Closed · wejradford opened this issue Apr 23, 2014 · 9 comments

@wejradford (Contributor)

We want two stages of clustering:

  • Topic: weight terms by sqrt(tf(term)) * idf(term), where terms are case-sensitive unigrams after removing stopwords and punctuation. Use a top-decile cut-off (a sketch follows below).
  • Sentences: report the size of the intersection; the sizes of a and b; the size of the union; the unweighted dot product; the IDF-weighted dot product; and the norm. Use a top-quartile cut-off.

Validation data is the 2003 slice.

Extensions might be a per-stream IDF model, case-insensitive IDF, ...
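
For concreteness, a minimal sketch of the topic-phase weighting and a cosine comparison over the resulting vectors, assuming plain token lists and a precomputed IDF table (helper names are hypothetical, not the code in comparators.py):

```python
import math
from collections import Counter

def doc_vector(tokens, idf, stopwords=frozenset()):
    """Weight case-sensitive unigrams by sqrt(tf(term)) * idf(term),
    after dropping stopwords and pure-punctuation tokens."""
    counts = Counter(t for t in tokens
                     if t not in stopwords and any(c.isalnum() for c in t))
    return {t: math.sqrt(tf) * idf.get(t, 0.0) for t, tf in counts.items()}

def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0
```

The top-decile cut-off would then be applied over the pairwise similarities from `cosine`.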

wejradford self-assigned this Apr 23, 2014
@jnothman (Contributor)

Looks good.

In the document case, I'd estimate the cut-off from a sample. In the sentence case, I'd take the top matches per document pair.
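
A minimal sketch of estimating that cut-off from a sample of pairs rather than the full pairwise matrix (illustrative names, assuming a list of document vectors and a similarity function):

```python
import random

def estimate_cutoff(vectors, similarity, quantile=0.9, n_pairs=10_000, seed=0):
    """Estimate the similarity value at `quantile` (0.9 = top decile)
    from a random sample of document pairs."""
    rng = random.Random(seed)
    sims = sorted(similarity(*rng.sample(vectors, 2)) for _ in range(n_pairs))
    return sims[int(quantile * (len(sims) - 1))]
```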

@wejradford (Contributor, Author)

09b9b4b is a very rough first cut of this.

We're keeping document pairs with similarity over 0.025, then taking the top 10 sentences by overlap. The results will be in schwa07:/data1/gigacluster/doc-sentence-0.025*.
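
In pseudocode terms, that selection is roughly the following (placeholder names throughout; `doc_sim`, `sentence_overlap` and `.sentences` are illustrative, not the comparators.py API):

```python
DOC_THRESHOLD = 0.025   # document-pair similarity cut-off
TOP_SENTENCES = 10      # sentence pairs kept per matching document pair

def top_sentence_matches(doc_a, doc_b, doc_sim, sentence_overlap):
    """Keep a document pair only if it clears the similarity threshold,
    then return its highest-overlap sentence pairs."""
    if doc_sim(doc_a, doc_b) <= DOC_THRESHOLD:
        return []
    scored = [(sentence_overlap(s_a, s_b), s_a, s_b)
              for s_a in doc_a.sentences for s_b in doc_b.sentences]
    scored.sort(key=lambda triple: triple[0], reverse=True)
    return scored[:TOP_SENTENCES]
```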

I'll need a bit more time to double-check that things are correct, but I wanted to get some initial results out there.

Have a look at the source (https://github.com/schwa-lab/gigacluster/blob/master/gigacluster/comparators.py#L34) to see what's printed, and in what order...

@jnothman (Contributor)

FYI, just entering 09b9b4b5 will link the commit: 09b9b4b

@jnothman (Contributor)

0.025 sounds small (but I know cosine similarities can be small because of the norm). Maybe we want something much smaller than a decile.

@wejradford (Contributor, Author)

Yeah, I forgot to enter #1 in the commit message, and then went to paste the whole SHA 😸

The processing as-is took just under 3h (162m) for the 2003 slice of data: schwa07:/data1/gigacluster/doc-sentence-0.025.2003.clusters.txt

I'll test a bit more thoroughly now. Happy to kick it off again with a more principled cut-off for the sentence similarity. Let me know the specifics, or we can post-process the current results.

@jnothman (Contributor)

(I meant you don't need the `commit:` prefix.)


@wejradford (Contributor, Author)

(Must have been a Redmine hangover...)

OK, so the extra testing revealed a bug (good and bad). I can't have been concentrating when I wrote the comparison algorithm, so we were only comparing some of the documents in the streams ("the diamond", which is appropriate for clustering, but not for what we're doing -- we have streams).

Happily, the fix should have us comparing more documents, and more sentences in matching documents. I've tuned against 2010-01-01 and am using thresholds of 0.029 for document similarity and 0.125 for sentence overlap. On my laptop this takes around 2m 30s.

Running on schwa07 now; see /data1/gigacluster/doc-sentence-t0.029-st0.125*
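
If the goal is every cross-stream document pair, a toy version of that pairing might look like this (illustrative only; `streams` here is just a dict of stream name to document list, not the actual comparator loop):

```python
from itertools import product

def cross_stream_pairs(streams):
    """Yield every pair of documents drawn from two different streams."""
    names = sorted(streams)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            yield from product(streams[a], streams[b])
```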

@wejradford (Contributor, Author)

These are finished: /data1/gigacluster/doc-sentence-t0.029-st0.125.clusters.txt
16,612,436 sentences in ~18h.

@wejradford (Contributor, Author)

@jnothman has improved the punctuation RE in 71d639c, and notes that we should use token.norm or token.raw for overlaps.

I'm planning to run this again, perhaps integrating #2, so the next version should be a little more accurate.
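
For reference, a token-level punctuation filter in this spirit could look like the following (illustrative only; see 71d639c for the actual RE, and `token.norm`/`token.raw` come from the document representation, not this sketch):

```python
import re

# Drop tokens that consist entirely of non-word characters (punctuation/symbols).
PUNCT_RE = re.compile(r'^\W+$', re.UNICODE)

def content_tokens(tokens):
    return [t for t in tokens if not PUNCT_RE.match(t)]
```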
