Refining the collation #48

Closed
ebeshero opened this issue May 12, 2018 · 3 comments
ebeshero commented May 12, 2018

Following discussion with @raffazizzi today, I've added more normalization to a new version of the Python script. The normalization function now does two things:

  1. it ignores the attributes inside element tags during collation (so <p/> elements with different "location marker" attribute values can still align and be treated as invariant),
  2. it lowercases all the input that CollateX works with.
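
A minimal sketch of those two steps, assuming the normalization runs on each token string before it is handed to CollateX. The function name `normalize` and the regular expression are illustrative assumptions, not the code in the project's script:

```python
import re

# Assumed pattern: an XML start tag or empty-element tag, capturing the
# element name (group 1), any attributes (group 2), and an optional
# self-closing slash (group 3).
TAG_WITH_ATTRS = re.compile(r'<([A-Za-z_][\w:.\-]*)(\s[^<>]*?)?(/?)>')


def normalize(token: str) -> str:
    """Drop attributes from tags, then lowercase the whole token."""
    without_attrs = TAG_WITH_ATTRS.sub(r'<\1\3>', token)
    return without_attrs.lower()


# Two <p/> markers that differ only in their location attribute
# normalize to the same string, so they can align as invariant.
print(normalize('<p xml:id="C10_p3"/>The Monster'))
print(normalize('<p xml:id="C07_p9"/>the monster'))
```

Only the normalized form would be compared during alignment; the original token, with its attributes and capitalization, can still be carried through to the output.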

The new version of the Python script also ignores most markup. For now I've kept <p/> elements as inline empty elements, because a change in paragraphing is semantically significant. For the same reason I kept the div markers, add, and del. I'm thinking we might want to output only del and suppress add.
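
A rough sketch of that filtering, assuming it can be done with a regular expression over the flattened text. The names `KEPT` and `reduce_markup` are hypothetical, and this version also drops attributes from the tags it keeps:

```python
import re

# Assumed set of element names to preserve; everything else is dropped.
KEPT = frozenset({"p", "div", "add", "del"})

# Matches start, end, and empty-element tags, capturing the leading slash,
# the element name, and the trailing self-closing slash.
ANY_TAG = re.compile(r'<(/?)([A-Za-z_][\w:.\-]*)[^<>]*?(/?)>')


def reduce_markup(text: str, kept=KEPT) -> str:
    """Remove every tag whose element name is not in `kept`."""
    def keep_or_drop(match):
        closing, name, empty = match.groups()
        if name in kept:
            return f"<{closing}{name}{empty}>"
        return ""  # drop the tag itself, leave surrounding text in place
    return ANY_TAG.sub(keep_or_drop, text)


sample = '<p n="1"/>I <hi rend="i">saw</hi> <del>him</del><add>it</add>'
print(reduce_markup(sample))
```

Passing a smaller set, for example `kept={"p", "div", "del"}`, removes the add tags as well. Note that this only removes the tags; the text they enclose stays in the stream.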

The new output is coming out in the LessMarkupV2_xmlOutput directory.

I'll run a few more variations on this theme. Question: Is it necessary/useful to have "location flag" attributes appear in the collation output on the <p/> and <lb/> elements?

@ebeshero ebeshero self-assigned this May 12, 2018
@ebeshero
Member Author

Note: I need to weave this normalization list of weirdly spelled words into the Python script, too:
#28

@ebeshero
Member Author

Action on @ebb: Output fresh DETAILED collation with location tags (don't suppress the <lb/> elements) now that I've properly normalized ampersands.
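
For illustration only, here is one way ampersand normalization could look, assuming the goal is for "&", "&amp;", and "and" to be treated as the same reading. That mapping and the name `normalize_ampersands` are assumptions, not a description of what the script does:

```python
import re

# Assumed rule: the escaped form "&amp;" and any bare "&" that does not
# start another entity or character reference (such as "&lt;" or "&#38;").
AMPERSAND = re.compile(r'&amp;|&(?![#\w]+;)')


def normalize_ampersands(token: str) -> str:
    """Replace ampersands with the word 'and' for comparison purposes."""
    return AMPERSAND.sub('and', token)


print(normalize_ampersands('my father &amp; Elizabeth'))
print(normalize_ampersands('my father & Elizabeth'))
```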

@ebeshero
Member Author

@raffazizzi @Rikkm I'm running a fresh collation this morning. As I mentioned in our meeting on Tuesday, this collation will have better alignment because it's properly normalizing ampersands and markup. Also, I've made sure the <lb/> elements are present so the text locations are clearly signaled.

C-10 is freshly output already in the Full_xmlOutput directory you've been working in, Raff: https://github.com/PghFrankenstein/Pittsburgh_Frankenstein/tree/Text_Processing/collateXPrep/Full_xmlOutput

It's also sitting by itself in my unit-testing folder, C10_xmlOutput, which might be an easier place to work with it in isolation: https://github.com/PghFrankenstein/Pittsburgh_Frankenstein/tree/Text_Processing/collateXPrep/C10_xmlOutput

(As usual I've done some reorganizing to put old collation stuff away.)
