
Two-phase clustering #1

Closed · wejradford opened this issue Apr 23, 2014 · 9 comments

@wejradford (Contributor)

We want two stages of clustering:

  • Topic: weight terms by sqrt(tf(term)) * idf(term), where terms are case-sensitive unigrams after removing stopwords and punctuation. Use a top-decile cut-off (a sketch follows below).
  • Sentences: report the size of the intersection; the sizes of a and b; the size of the union; the unweighted dot product; the IDF-weighted dot product; and the norm. Use a top-quartile cut-off.

Validation data is the 2003 slice.

Extensions might be a per-stream IDF model, case-insensitive IDF, ...
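
For concreteness, a minimal sketch of the topic-phase weighting and a cosine comparison over the resulting vectors, assuming plain token lists and a precomputed IDF table (helper names are hypothetical, not the code in comparators.py):

```python
import math
from collections import Counter

def doc_vector(tokens, idf, stopwords=frozenset()):
    """Weight case-sensitive unigrams by sqrt(tf(term)) * idf(term),
    after dropping stopwords and pure-punctuation tokens."""
    counts = Counter(t for t in tokens
                     if t not in stopwords and any(c.isalnum() for c in t))
    return {t: math.sqrt(tf) * idf.get(t, 0.0) for t, tf in counts.items()}

def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0
```

The top-decile cut-off would then be applied over the pairwise similarities from `cosine`.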

wejradford self-assigned this Apr 23, 2014
@jnothman (Contributor)

Looks good.

In the document case, I'd estimate the cut-off from a sample. In the sentence case, I'd take the top matches per document pair.
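
A minimal sketch of estimating that cut-off from a sample of pairs rather than the full pairwise matrix (illustrative names, assuming a list of document vectors and a similarity function):

```python
import random

def estimate_cutoff(vectors, similarity, quantile=0.9, n_pairs=10_000, seed=0):
    """Estimate the similarity value at `quantile` (0.9 = top decile)
    from a random sample of document pairs."""
    rng = random.Random(seed)
    sims = sorted(similarity(*rng.sample(vectors, 2)) for _ in range(n_pairs))
    return sims[int(quantile * (len(sims) - 1))]
```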

@wejradford (Contributor, Author)

09b9b4b is a very rough first cut of this.

We're keeping document pairs with similarity over 0.025, then taking the top 10 sentences by overlap. The results will be in schwa07:/data1/gigacluster/doc-sentence-0.025*.
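
In pseudocode terms, that selection is roughly the following (placeholder names throughout; `doc_sim`, `sentence_overlap` and `.sentences` are illustrative, not the comparators.py API):

```python
DOC_THRESHOLD = 0.025   # document-pair similarity cut-off
TOP_SENTENCES = 10      # sentence pairs kept per matching document pair

def top_sentence_matches(doc_a, doc_b, doc_sim, sentence_overlap):
    """Keep a document pair only if it clears the similarity threshold,
    then return its highest-overlap sentence pairs."""
    if doc_sim(doc_a, doc_b) <= DOC_THRESHOLD:
        return []
    scored = [(sentence_overlap(s_a, s_b), s_a, s_b)
              for s_a in doc_a.sentences for s_b in doc_b.sentences]
    scored.sort(key=lambda triple: triple[0], reverse=True)
    return scored[:TOP_SENTENCES]
```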

I'll need a bit more time to double-check that things are correct, but I wanted to get some initial results out there.

Have a look at the source (https://github.com/schwa-lab/gigacluster/blob/master/gigacluster/comparators.py#L34) to see what's printed, and in what order...

@jnothman (Contributor)

FYI, just entering 09b9b4b5 will link the commit: 09b9b4b

@jnothman (Contributor)

0.025 sounds small (but I know cosine similarities can be small because of the norm). Maybe we want something much smaller than a decile.

@wejradford (Contributor, Author)

Yeah, I forgot to enter #1 in the commit message, and then went to paste the whole SHA 😸

The processing as-is took just under 3h (162m) for the 2003 slice of data: schwa07:/data1/gigacluster/doc-sentence-0.025.2003.clusters.txt

I'll test a bit more thoroughly now. Happy to kick it off again with a more principled cut-off for the sentence similarity. Let me know the specifics, or we can post-process the current results.

@jnothman (Contributor)

(I meant you don't need the `commit:` prefix.)


@wejradford (Contributor, Author)

(Must have been a Redmine hangover...)

OK, so the extra testing revealed a bug (good and bad). I can't have been concentrating when I wrote the comparison algorithm, so we were only comparing some of the documents in the streams ("the diamond", which is appropriate for clustering, but not for what we're doing -- we have streams).

Happily, the fix should have us comparing more documents, and more sentences in matching documents. I've tuned against 2010-01-01 and am using thresholds of 0.029 for document similarity and 0.125 for sentence overlap. On my laptop this takes around 2m 30s.

Running on schwa07 now; see /data1/gigacluster/doc-sentence-t0.029-st0.125*
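
If the goal is every cross-stream document pair, a toy version of that pairing might look like this (illustrative only; `streams` here is just a dict of stream name to document list, not the actual comparator loop):

```python
from itertools import product

def cross_stream_pairs(streams):
    """Yield every pair of documents drawn from two different streams."""
    names = sorted(streams)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            yield from product(streams[a], streams[b])
```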

@wejradford (Contributor, Author)

These are finished: /data1/gigacluster/doc-sentence-t0.029-st0.125.clusters.txt
16,612,436 sentences in ~18h.

@wejradford (Contributor, Author)

@jnothman has improved the punctuation RE in 71d639c, and notes that we should use token.norm or token.raw for overlaps.

I'm planning to run this again, perhaps integrating #2, so the next version should be a little more accurate.
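
For reference, a token-level punctuation filter in this spirit could look like the following (illustrative only; see 71d639c for the actual RE, and `token.norm`/`token.raw` come from the document representation, not this sketch):

```python
import re

# Drop tokens that consist entirely of non-word characters (punctuation/symbols).
PUNCT_RE = re.compile(r'^\W+$', re.UNICODE)

def content_tokens(tokens):
    return [t for t in tokens if not PUNCT_RE.match(t)]
```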
