Finding short, simple sentences and their longer restatements #2
Examples from 1994 include:
WDYT, @dominickng?
(one likely change is to use absolute sentence length difference, not relative)
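To make the distinction concrete, here is a hypothetical sketch of the two candidate-pair length filters under discussion — relative (longer sentence at least some factor times the shorter) versus absolute (longer sentence at least some fixed number of tokens more). The function names and thresholds are illustrative, not taken from the actual script.

```python
def relative_filter(short_toks, long_toks, factor=2.0):
    """Keep a pair if the longer sentence is at least `factor` times the shorter."""
    return len(long_toks) >= factor * len(short_toks)

def absolute_filter(short_toks, long_toks, min_diff=5):
    """Keep a pair if the longer sentence has at least `min_diff` more tokens."""
    return len(long_toks) - len(short_toks) >= min_diff

short = "The company was sold .".split()          # 5 tokens
long_ = "The company , founded in 1990 , was sold to investors in 1994 .".split()  # 14 tokens
print(relative_filter(short, long_))  # True: 14 >= 2.0 * 5
print(absolute_filter(short, long_))  # True: 14 - 5 = 9 >= 5
```

The absolute version avoids the relative filter's bias toward very short sentences, where even a small token difference doubles the length.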
Here are a few notes:
I need to do a more specific analysis of sentence pairs to see if there's anything useful in there.
Exact copies are surprising, given the double-length heuristic...
(Not double, sorry, but times some factor)
Some examples are
Right now the problem for me is that the pairs mostly seem to fall into these categories:
I haven't actually found a pair yet with a useful syntactic distinction and complete overlap, though this is a painfully manual checking process that's going very slowly. I imagine that even when I find a pair, it may be difficult to apply the constraints to the longer sentence without entities to ground the links. I'm wondering whether this clustering process could be applied to ClueWeb09 with the FACC annotations. If we cluster ClueWeb09 first into entity buckets, then apply this clustering over a bucket, we might get useful-looking relations between entities overlapping between multiple sentences.
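The entity-bucketing idea above could be sketched as follows: group sentences by the entities they mention (as FACC-style annotations would provide), and only compare pairs within a bucket. All names, entity IDs, and data structures here are illustrative assumptions, not the actual pipeline.

```python
from collections import defaultdict
from itertools import combinations

def bucket_by_entity(sentences):
    """sentences: list of (text, set_of_entity_ids). Returns entity -> sentence indices."""
    buckets = defaultdict(list)
    for i, (_, entities) in enumerate(sentences):
        for ent in entities:
            buckets[ent].append(i)
    return buckets

def candidate_pairs(sentences):
    """Yield each index pair sharing at least one entity, deduplicated."""
    seen = set()
    for idxs in bucket_by_entity(sentences).values():
        for pair in combinations(idxs, 2):
            if pair not in seen:
                seen.add(pair)
                yield pair

sents = [
    ("Obama visited Sydney.", {"/m/02mjmr", "/m/06y57"}),
    ("Barack Obama , the US president , arrived in Sydney on Monday .", {"/m/02mjmr", "/m/06y57"}),
    ("Wheat prices rose.", set()),
]
print(sorted(candidate_pairs(sents)))  # [(0, 1)] — only the entity-sharing pair survives
```

The point of bucketing is that pairwise comparison over the whole of ClueWeb09 is infeasible; restricting comparisons to sentences that already share an entity keeps the candidate set tractable and grounds the resulting links.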
Are you sure you're looking at the right data? None of the examples you've
I see, I'm looking at Will's original data set. I'll redo the analysis.
Well, if that's the path you need to go down, there are probably ways to do it, as long as we are able to learn with constraints that aren't fully lexicalised.
Not that you won't see similarly useless/futile pairs!
Having finally got a Py3k virtualenv up with the relevant dependencies, I've committed this script in b1e885f.
We need to adjust and calibrate an overlap measure that finds short sentences, stems their tokens, and identifies longer sentences containing as many of those tokens as possible. We may consider an IDF weighting, or upweighting capitalised sequences.
This will hopefully identify useful pairs for @dominickng's work.
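A minimal sketch of that overlap measure, assuming a whitespace tokenizer and a toy suffix-stripping stemmer — the actual script in b1e885f may differ, and the IDF hook is shown only as an optional extension:

```python
from collections import Counter

def stem(token):
    """Toy stemmer: lowercase and strip a few common suffixes."""
    t = token.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if t.endswith(suffix) and len(t) > len(suffix) + 2:
            return t[: -len(suffix)]
    return t

def overlap_score(short_sent, long_sent, idf=None):
    """Fraction of the short sentence's stems covered by the long sentence,
    optionally IDF-weighted so rare stems count for more."""
    short_stems = [stem(t) for t in short_sent.split()]
    long_stems = Counter(stem(t) for t in long_sent.split())
    weight = (lambda s: idf.get(s, 1.0)) if idf else (lambda s: 1.0)
    covered = sum(weight(s) for s in short_stems if long_stems[s] > 0)
    total = sum(weight(s) for s in short_stems)
    return covered / total if total else 0.0

short = "Shares dropped sharply ."
long_ = "Shares of the company dropped sharply after the announcement in 1994 ."
print(overlap_score(short, long_))  # 1.0: every stem of the short sentence is covered
```

Ranking longer sentences by this score, then applying a length filter, would surface candidate short-sentence/restatement pairs; IDF weighting would stop function words from inflating the score.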