Finding short, simple sentences and their longer restatements #2

Open
wejradford opened this issue Apr 29, 2014 · 12 comments

@wejradford

We need to build and calibrate an overlap measure that finds short sentences, stems their tokens, and identifies longer sentences that contain as many of those tokens as possible. We may consider IDF weighting, or upweighting capitalised sequences.

This will hopefully identify useful pairs for @dominickng's work.
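
A minimal sketch of what such an overlap might look like, assuming NLTK's Porter stemmer and a precomputed document-frequency table; every name here is hypothetical, not code from this repo:

import math
from collections import Counter

from nltk.stem import PorterStemmer

STEMMER = PorterStemmer()

def stems(tokens):
    """Lowercase and Porter-stem a token sequence."""
    return [STEMMER.stem(t.lower()) for t in tokens]

def idf_overlap(short_toks, long_toks, doc_freq, n_docs):
    """Fraction of the short sentence's stem weight covered by the long one.

    doc_freq maps a stem to the number of documents containing it, so rare
    (informative) stems count for more than common ones.
    """
    short_counts = Counter(stems(short_toks))
    long_set = set(stems(long_toks))

    def idf(s):
        return math.log(n_docs / (1.0 + doc_freq.get(s, 0)))

    total = sum(idf(s) * c for s, c in short_counts.items())
    covered = sum(idf(s) * c for s, c in short_counts.items() if s in long_set)
    return covered / total if total else 0.0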

wejradford self-assigned this Apr 29, 2014
@jnothman commented May 8, 2014

/data1/gigacluster/clustering-0.4-0.4/*.clusters.stem_subsets now holds an attempt at pulling some more useful pairs for this task. With the following heuristics (sketched in code after the list), the maximum yield from the clusters processed so far is ~24000 pairs:

  • require sentence pair score >= 0.4
  • forbid one punctuation-removed, lowercased sentence being a substring of the other (avoids additions of attribution to quotes, etc.)
  • require that the set of lowercased, Porter-stemmed, stopped tokens of one sentence is a subset of the other
  • require that the ratio of number of tokens between unnormalised sentences is > 1.5
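
A rough sketch of these four filters, assuming whitespace-tokenised sentences, NLTK for stemming and stopwords, and a precomputed pair score; keep_pair and normalise are hypothetical names, not the actual script:

import string

from nltk.corpus import stopwords  # requires nltk.download('stopwords')
from nltk.stem import PorterStemmer

STEMMER = PorterStemmer()
STOPWORDS = set(stopwords.words('english'))
STRIP_PUNCT = str.maketrans('', '', string.punctuation)

def normalise(tokens):
    """Lowercase, strip punctuation, drop stopwords, Porter-stem."""
    out = set()
    for tok in tokens:
        tok = tok.lower().translate(STRIP_PUNCT)
        if tok and tok not in STOPWORDS:
            out.add(STEMMER.stem(tok))
    return out

def keep_pair(score, toks_a, toks_b):
    """Apply the four heuristics to a candidate sentence pair."""
    if score < 0.4:
        return False
    # Forbid one punctuation-removed, lowercased sentence being a
    # substring of the other (e.g. a quote with added attribution).
    flat_a = ' '.join(toks_a).lower().translate(STRIP_PUNCT)
    flat_b = ' '.join(toks_b).lower().translate(STRIP_PUNCT)
    if flat_a in flat_b or flat_b in flat_a:
        return False
    # Require a subset relation between the normalised token sets.
    norm_a, norm_b = normalise(toks_a), normalise(toks_b)
    if not (norm_a <= norm_b or norm_b <= norm_a):
        return False
    # Require the longer sentence to have > 1.5x the tokens of the shorter.
    shorter, longer = sorted([len(toks_a), len(toks_b)])
    return shorter > 0 and longer / shorter > 1.5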

Examples from 1994 include:

Berringer threw two interceptions in the scrimmage .
Berringer , who threw two interceptions in the scrimmage , took the decision in stride .
An international spokesman for Sephardic Jews , he was a world-renowned scholar on their history and interpretation of Jewish law .
Goan was an international spokesman for Sephardic Jews -- descendants of those who fled the Spanish Inquisition in 1492 -- and a world-renowned scholar on their history and interpretation of Jewish law .

WDYT, @dominickng?

@jnothman commented May 8, 2014

(one likely change is to use absolute sentence length difference, not relative)

@dominickng commented May 11, 2014

Here are a few notes:

  • Some pairs are exact copies of one another, or exact copies with newlines (forbid exact matching w/strip?)
  • Some pairs are exact copies of quotes, with varying permutations:
    • one in double quotes (") and the other in two opening backtick quotes and two closing single quotes (`` .... '')
    • one with surrounding quotes marks, and the other without
  • Some matches are the city/location lines (LOS ANGELES AFP)
  • I tried a very simple length filter and got:
    • pairs with one sentence of 10 tokens or less: 523
    • pairs with one sentence of 15 tokens or less: 1003

I need to do a more specific analysis of sentence pairs to see if there's anything useful in there.
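
A sketch of the normalisation these notes suggest, mapping PTB-style quotes to plain double quotes and stripping whitespace before an exact-match check; it only covers the variants listed above:

import re

def normalise_quotes(sentence):
    """Map `` and '' (and stray `) to ", then collapse whitespace."""
    sentence = re.sub(r"``|''", '"', sentence)
    sentence = sentence.replace('`', '"')
    return ' '.join(sentence.split())

def is_trivial_duplicate(sent_a, sent_b):
    """True if a pair differs only in whitespace, quote style, or surrounding quote marks."""
    a = normalise_quotes(sent_a).strip('" ')
    b = normalise_quotes(sent_b).strip('" ')
    return a == b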

@jnothman commented May 11, 2014

Exact copies are surprising, given the double-length heuristic...

@jnothman commented May 11, 2014

(Not double, sorry, but longer by some factor: 1.5x under the current heuristic.)

@dominickng

Some examples are:

`` The questions cannot only be solved by computers .
" The questions cannot only be solved by computers ...\n

` You should n't talk about retirement 10 minutes after the game . ''
" You should n\'t talk about retirement 10 minutes after the game . "\n

`` In fact , she is the daughter of one of our generals .
In fact , she is the daughter of one of our generals .\n

`` This man is a hero .
" This man is a hero .\n

@dominickng commented May 11, 2014

Right now the problem for me is that the pairs mostly seem to fall into these categories:

  • no useful differences between the pairs (duplication)
  • no syntactic variation between the pairs (i.e. shorter fragment has exactly the same analysis as the corresponding part of the longer pair)
  • too much difference between the pairs, which might require synonym/etc. machinery to distinguish, e.g.
Shares of Disney rose 75 cents , to $ 43.25 , on the New York Stock Exchange on Wednesday .
Disney stock finished up 75 cents at 43.25 dollars on the New York Stock Exchange

I haven't actually found a pair yet with a useful syntactic distinction and complete overlap - though this is a painfully manual checking process that's going very slowly. I imagine that even when I find a pair, it may be difficult to apply the constraints to the longer sentence without entities to ground the links.

I'm wondering whether this clustering process could be applied to Clueweb '09 with the FACC annotations. If we cluster Clueweb '09 first by entity buckets, then apply this clustering over a bucket, we might get useful-looking relations between entities that overlap across multiple sentences.
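
A very rough sketch of that two-stage idea; the input format is invented for illustration, and real FACC records would need actual parsing:

from collections import defaultdict
from itertools import combinations

def bucket_by_entity_pairs(sentences):
    """sentences: iterable of (sent_id, text, facc_entity_ids) tuples.

    A sentence mentioning two or more entities lands in one bucket per
    unordered entity pair, so each bucket collects candidate contexts
    for a relation between those entities.
    """
    buckets = defaultdict(list)
    for sent_id, text, entities in sentences:
        for pair in combinations(sorted(set(entities)), 2):
            buckets[pair].append((sent_id, text))
    return buckets

# The existing sentence-pair clustering could then run within each bucket,
# so overlap is only computed between sentences sharing an entity pair.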

@jnothman commented May 11, 2014

Are you sure you're looking at the right data? None of the examples you've cited should match these criteria: grep 'Disney stock finished up 75 cents' /data1/gigacluster/clustering-0.4-0.4/*.stem_subsets matches nothing...

@dominickng

I see; I'm looking at Will's original data set. I'll redo the analysis.

@jnothman

too much difference between the pairs, which might require synonym/etc. machinery to distinguish

Well, if that's the path you need to go down, there are probably ways to do it, as long as we are able to learn with constraints that aren't fully lexicalised.

@jnothman

I see; I'm looking at Will's original data set. I'll redo the analysis.

Not that you won't see similarly useless/futile pairs!

@jnothman

Having finally got a Py3k virtualenv up with the relevant dependencies, I've committed this script in b1e885f.
