Finding short, simple sentences and their longer restatements #2

Open
wejradford opened this issue Apr 29, 2014 · 12 comments

@wejradford

We need to build and calibrate an overlap measure that finds short sentences, stems their tokens, and identifies longer sentences that contain as many of those tokens as possible. We may consider IDF weighting, or upweighting capitalised sequences.

This will hopefully identify useful pairs for @dominickng's work.
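
A minimal sketch of what such an overlap might look like, assuming NLTK's Porter stemmer and a precomputed document-frequency table; every name here is hypothetical, not code from this repo:

import math
from collections import Counter

from nltk.stem import PorterStemmer

STEMMER = PorterStemmer()

def stems(tokens):
    """Lowercase and Porter-stem a token sequence."""
    return [STEMMER.stem(t.lower()) for t in tokens]

def idf_overlap(short_toks, long_toks, doc_freq, n_docs):
    """Fraction of the short sentence's stem weight covered by the long one.

    doc_freq maps a stem to the number of documents containing it, so rare
    (informative) stems count for more than common ones.
    """
    short_counts = Counter(stems(short_toks))
    long_set = set(stems(long_toks))

    def idf(s):
        return math.log(n_docs / (1.0 + doc_freq.get(s, 0)))

    total = sum(idf(s) * c for s, c in short_counts.items())
    covered = sum(idf(s) * c for s, c in short_counts.items() if s in long_set)
    return covered / total if total else 0.0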

wejradford self-assigned this Apr 29, 2014
@jnothman commented May 8, 2014

/data1/gigacluster/clustering-0.4-0.4/*.clusters.stem_subsets now holds an attempt at pulling some more useful pairs for this task. With the following heuristics (sketched in code after the list), the maximum yield from the clusters processed so far is ~24000 pairs:

  • require sentence pair score >= 0.4
  • forbid one punctuation-removed, lowercased sentence being a substring of the other (avoids additions of attribution to quotes, etc.)
  • require that the set of lowercased, Porter-stemmed, stopped tokens of one sentence is a subset of the other
  • require that the ratio of number of tokens between unnormalised sentences is > 1.5
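
A rough sketch of these four filters, assuming whitespace-tokenised sentences, NLTK for stemming and stopwords, and a precomputed pair score; keep_pair and normalise are hypothetical names, not the actual script:

import string

from nltk.corpus import stopwords  # requires nltk.download('stopwords')
from nltk.stem import PorterStemmer

STEMMER = PorterStemmer()
STOPWORDS = set(stopwords.words('english'))
STRIP_PUNCT = str.maketrans('', '', string.punctuation)

def normalise(tokens):
    """Lowercase, strip punctuation, drop stopwords, Porter-stem."""
    out = set()
    for tok in tokens:
        tok = tok.lower().translate(STRIP_PUNCT)
        if tok and tok not in STOPWORDS:
            out.add(STEMMER.stem(tok))
    return out

def keep_pair(score, toks_a, toks_b):
    """Apply the four heuristics to a candidate sentence pair."""
    if score < 0.4:
        return False
    # Forbid one punctuation-removed, lowercased sentence being a
    # substring of the other (e.g. a quote with added attribution).
    flat_a = ' '.join(toks_a).lower().translate(STRIP_PUNCT)
    flat_b = ' '.join(toks_b).lower().translate(STRIP_PUNCT)
    if flat_a in flat_b or flat_b in flat_a:
        return False
    # Require a subset relation between the normalised token sets.
    norm_a, norm_b = normalise(toks_a), normalise(toks_b)
    if not (norm_a <= norm_b or norm_b <= norm_a):
        return False
    # Require the longer sentence to have > 1.5x the tokens of the shorter.
    shorter, longer = sorted([len(toks_a), len(toks_b)])
    return shorter > 0 and longer / shorter > 1.5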

Examples from 1994 include:

Berringer threw two interceptions in the scrimmage .
Berringer , who threw two interceptions in the scrimmage , took the decision in stride .
An international spokesman for Sephardic Jews , he was a world-renowned scholar on their history and interpretation of Jewish law .
Goan was an international spokesman for Sephardic Jews -- descendants of those who fled the Spanish Inquisition in 1492 -- and a world-renowned scholar on their history and interpretation of Jewish law .

WDYT, @dominickng?

@jnothman commented May 8, 2014

(one likely change is to use absolute sentence length difference, not relative)

@dominickng commented May 11, 2014

Here are a few notes:

  • Some pairs are exact copies of one another, or exact copies with newlines (forbid exact matching w/strip?)
  • Some pairs are exact copies of quotes, with varying permutations:
    • one in double quotes (") and the other in two opening backtick quotes and two closing single quotes (`` .... '')
    • one with surrounding quotes marks, and the other without
  • Some matches are the city/location lines (LOS ANGELES AFP)
  • I tried a very simple length filter and got:
    • pairs with one sentence of 10 tokens or less: 523
    • pairs with one sentence of 15 tokens or less: 1003

I need to do a more specific analysis of sentence pairs to see if there's anything useful in there.
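
A sketch of the normalisation these notes suggest, mapping PTB-style quotes to plain double quotes and stripping whitespace before an exact-match check; it only covers the variants listed above:

import re

def normalise_quotes(sentence):
    """Map `` and '' (and stray `) to ", then collapse whitespace."""
    sentence = re.sub(r"``|''", '"', sentence)
    sentence = sentence.replace('`', '"')
    return ' '.join(sentence.split())

def is_trivial_duplicate(sent_a, sent_b):
    """True if a pair differs only in whitespace, quote style, or surrounding quote marks."""
    a = normalise_quotes(sent_a).strip('" ')
    b = normalise_quotes(sent_b).strip('" ')
    return a == b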

@jnothman commented May 11, 2014

Exact copies are surprising, given the double-length heuristic...

@jnothman commented May 11, 2014

(Not double, sorry, but longer by some factor: 1.5x under the current heuristic.)

@dominickng

Some examples are:

`` The questions cannot only be solved by computers .
" The questions cannot only be solved by computers ...\n

` You should n't talk about retirement 10 minutes after the game . ''
" You should n\'t talk about retirement 10 minutes after the game . "\n

`` In fact , she is the daughter of one of our generals .
In fact , she is the daughter of one of our generals .\n

`` This man is a hero .
" This man is a hero .\n

@dominickng commented May 11, 2014

Right now the problem for me is that the pairs mostly seem to fall into these categories:

  • no useful differences between the pairs (duplication)
  • no syntactic variation between the pairs (i.e. shorter fragment has exactly the same analysis as the corresponding part of the longer pair)
  • too much difference between the pairs, which might require synonym/etc. machinery to distinguish, e.g.
Shares of Disney rose 75 cents , to $ 43.25 , on the New York Stock Exchange on Wednesday .
Disney stock finished up 75 cents at 43.25 dollars on the New York Stock Exchange

I haven't actually found a pair yet with a useful syntactic distinction and complete overlap - though this is a painfully manual checking process that's going very slowly. I imagine that even when I find a pair, it may be difficult to apply the constraints to the longer sentence without entities to ground the links.

I'm wondering whether this clustering process could be applied to Clueweb '09 with the FACC annotations. If we cluster Clueweb '09 first by entity buckets, then apply this clustering over a bucket, we might get useful-looking relations between entities that overlap across multiple sentences.
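
A very rough sketch of that two-stage idea; the input format is invented for illustration, and real FACC records would need actual parsing:

from collections import defaultdict
from itertools import combinations

def bucket_by_entity_pairs(sentences):
    """sentences: iterable of (sent_id, text, facc_entity_ids) tuples.

    A sentence mentioning two or more entities lands in one bucket per
    unordered entity pair, so each bucket collects candidate contexts
    for a relation between those entities.
    """
    buckets = defaultdict(list)
    for sent_id, text, entities in sentences:
        for pair in combinations(sorted(set(entities)), 2):
            buckets[pair].append((sent_id, text))
    return buckets

# The existing sentence-pair clustering could then run within each bucket,
# so overlap is only computed between sentences sharing an entity pair.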

@jnothman commented May 11, 2014

Are you sure you're looking at the right data? None of the examples you've cited should match these criteria: grep 'Disney stock finished up 75 cents' /data1/gigacluster/clustering-0.4-0.4/*.stem_subsets matches nothing...

@dominickng

I see; I'm looking at Will's original data set. I'll redo the analysis.

@jnothman

too much difference between the pairs, which might require synonym/etc. machinery to distinguish

Well, if that's the path you need to go down, there are probably ways to do it, as long as we are able to learn with constraints that aren't fully lexicalised.

@jnothman

I see; I'm looking at Will's original data set. I'll redo the analysis.

Not that you won't see similarly useless/futile pairs!

@jnothman

Having finally got a Py3k virtualenv up with the relevant dependencies, I've committed this script in b1e885f.
