Refining the collation #48

ebeshero · 2018-05-12T04:13:03Z

Following discussion with @raffazizzi today, I've added more normalization to a new version of the Python script. The normalization function now does two things:

it ignores the attributes inside any element tags during the collation process (so <p/> elements with distinctly different "location markers" in their attributes can still align and be considered invariant),
it lowercases all the input that collateX works with.
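For illustration, the two normalization steps above could be sketched roughly as follows. This is a minimal sketch, not the project's actual script: the names `TAG_WITH_ATTRS`, `normalize`, and `to_token` are hypothetical, and the regex assumes attribute values never contain `<` or `>`. The `t`/`n` keys follow CollateX's pre-tokenized JSON input, where `t` is the original reading and `n` is the normalized form used for alignment.

```python
import re

# Hypothetical pattern: an opening or self-closing tag that carries attributes.
# Assumes attribute values never contain "<" or ">".
TAG_WITH_ATTRS = re.compile(r'<([A-Za-z][\w.-]*)\s+[^<>]*?(/?)>')


def normalize(token: str) -> str:
    """Return the form of a token that should be compared during collation."""
    # Step 1: drop attributes, so <p xml:id="a"/> and <p xml:id="b"/> both become <p/>.
    without_attrs = TAG_WITH_ATTRS.sub(r'<\1\2>', token)
    # Step 2: lowercase everything, so "The" and "the" are treated as invariant.
    return without_attrs.lower()


def to_token(reading: str) -> dict:
    """Pair the original reading ("t") with its normalized form ("n")."""
    return {"t": reading, "n": normalize(reading)}
```

With this shape the original reading, including any location attributes, is still available in `t` for the output, while only `n` takes part in the comparison.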

The new version of the Python script also ignores most markup. I decided for now to keep <p/> elements as inline-empty because a change in paragraphing is semantically significant. For the same reason I kept div markers, add and del. I'm thinking we might just want to output del, but suppress add.
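A rough sketch of that kind of markup filtering, under stated assumptions: the function name `drop_unkept_markup` and the regex approach are illustrative only, and the `KEEP` set simply mirrors the elements named above (p, div, add, del). The real script may well use an XML parser rather than regular expressions.

```python
import re

# Elements whose tags stay in the collation input (taken from the comment above).
KEEP = {"p", "div", "add", "del"}

# Hypothetical pattern matching any start, end, or self-closing tag.
# Assumes attribute values never contain "<" or ">".
ANY_TAG = re.compile(r'<(/?)([A-Za-z][\w.-]*)[^<>]*?(/?)>')


def drop_unkept_markup(text: str) -> str:
    """Remove every tag whose element name is not in KEEP; text content is untouched."""
    def keep_or_drop(match: re.Match) -> str:
        name = match.group(2).lower()
        return match.group(0) if name in KEEP else ''
    return ANY_TAG.sub(keep_or_drop, text)
```

Suppressing add while still outputting del would then only mean removing "add" from `KEEP`.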

The new output is coming out in the LessMarkupV2_xmlOutput directory.

I'll run a few more variations on this theme. Question: Is it necessary/useful to have "location flag" attributes appear in the collation output on the <p/> and <lb/> elements?


ebeshero · 2018-05-12T04:14:09Z

Note: I need to weave this normalization list of weirdly spelled words into the Python script, too:
#28

ebeshero · 2018-05-16T01:29:43Z

Action on @ebb: Output fresh DETAILED collation with location tags (don't suppress the <lb/> elements) now that I've properly normalized ampersands.

ebeshero · 2018-05-17T12:19:01Z

@raffazizzi @Rikkm I'm running a fresh collation this morning. As I mentioned in our meeting on Tuesday, this collation will have better alignment because it's properly normalizing ampersands and markup. Also, I've made sure the <lb> elements are present so the text locations are clearly signaled.
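The thread does not spell out what normalizing ampersands involves. One plausible reading, offered purely as an assumption, is that "&" (or its escaped form `&amp;`) is treated as equivalent to "and" so the two spellings align across witnesses. A minimal sketch with a hypothetical `normalize_ampersands` helper:

```python
import re

# Assumption: "&amp;" and a bare "&" should both compare as "and".
# The lookahead leaves other entity references (e.g. &quot;) untouched.
AMP = re.compile(r'&amp;|&(?![#\w]+;)')


def normalize_ampersands(text: str) -> str:
    """Replace escaped or bare ampersands with the word "and"."""
    return AMP.sub('and', text)
```

Applied before tokenization, this would let a witness reading "Clerval & I" align with one reading "Clerval and I".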

C-10 is freshly output already in the Full_xmlOutput directory you've been working in, Raff: https://github.com/PghFrankenstein/Pittsburgh_Frankenstein/tree/Text_Processing/collateXPrep/Full_xmlOutput

It's also in my unit-testing folder, C10_xmlOutput, which might be an easier place to work with it by itself: https://github.com/PghFrankenstein/Pittsburgh_Frankenstein/tree/Text_Processing/collateXPrep/C10_xmlOutput

(As usual I've done some reorganizing to put old collation stuff away.)

ebeshero self-assigned this May 12, 2018

ebeshero added the enhancement label May 12, 2018

ebeshero closed this as completed Mar 12, 2019
