Bill detail: Preserve the formatting of legislative text #108

Closed
reginafcompton opened this issue Dec 19, 2017 · 11 comments

@reginafcompton
Contributor

The legislative text on the Bill detail page does not preserve the tabs and underlines present in Legistar.

Example:
https://laws.council.nyc.gov/legislation/int-1633-2017/
http://legistar.council.nyc.gov/LegislationDetail.aspx?ID=3066702&GUID=FF2098E2-9EC9-47AB-9131-B02FDEA71AA6

Why? We parse the "plain text" from the Legistar API, which does not contain such formatting. (See Councilmatic filter.)

Solutions?
(1) Find a way to translate the rtf_text into HTML on the Councilmatic side.
(2) Revert to scraping the full_text from the web, as we did in the past: https://github.com/opencivicdata/scrapers-us-municipal/blob/12a6309c3148b83729a1f2420b9ee1bd1bd633e2/nyc/bills.py#L127

I'd prioritize option one, if it's possible.

@reginafcompton reginafcompton added this to the Next phase milestone Dec 19, 2017
@reginafcompton
Contributor Author

reginafcompton commented Dec 28, 2017

I explored options for converting the RTF string into HTML. I could not find a Python package that does the job. However, I did identify a few options, most of which do the conversion on the client side (jumping-off point):

(1) rtf.js, though I need to explore if we can use a string-version RTF as a starting point for calling displayRTFFile: https://github.com/tbluemel/rtf.js/blob/master/samples/rtf.html#L148

(2) a couple good-looking node packages: https://github.com/iarna/rtf-to-html and https://github.com/walling/node-unrtf. NOTE - we'll need to find a way to coordinate node_modules within a Django app. I do not believe we have other Django apps within our DataMade repertoire that do this (it might be worth learning about!), so that's a consideration.

(3) Write a parser ourselves, as this Stackoverflow-er does for RTF to plain text: https://stackoverflow.com/questions/1337446/is-there-a-python-module-for-converting-rtf-to-plain-text

@fgregg
Member

fgregg commented Dec 29, 2017

I would suggest using unrtf, called through Python's subprocess module: https://docs.python.org/2/library/subprocess.html

I would suggest doing this as part of the import process

@reginafcompton
Contributor Author

thanks @fgregg - unrtf is pretty nifty (and easy to install and get started!). I also like the idea of doing the conversion on the import, rather than the view or client side.
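
For reference, the approach discussed here (shelling out to unrtf from Python during the import) could look roughly like the sketch below. It assumes the `unrtf` binary is installed and on the PATH; `rtf_to_html` and its `command` parameter are hypothetical names used for illustration, not code from django-councilmatic.

```python
import os
import subprocess
import tempfile


def rtf_to_html(rtf_text, command=("unrtf", "--html")):
    """Write rtf_text to a temporary file, run the converter on it,
    and return whatever the converter prints to stdout.

    `command` defaults to unrtf's HTML mode; it is a parameter only so
    that another converter can be swapped in (an assumption of this sketch).
    """
    with tempfile.NamedTemporaryFile("w", suffix=".rtf", delete=False) as handle:
        handle.write(rtf_text)
        path = handle.name
    try:
        # check_output raises CalledProcessError if the converter exits non-zero.
        return subprocess.check_output(list(command) + [path],
                                       universal_newlines=True)
    finally:
        # Always clean up the temporary file, even if the conversion fails.
        os.remove(path)
```

Calling `rtf_to_html(bill_rtf)` once per bill inside the import step would keep the conversion out of the view and the client, as suggested above.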

@reginafcompton
Contributor Author

reginafcompton commented Dec 29, 2017

I have an initial solution for converting rtf to html in import_data: datamade/django-councilmatic#165.

However, the resultant HTML is corrupt, i.e., it contains extraneous </div> tags that disrupt the formatting. (See screenshot below for attempt to capture formatting issue.) Question to address: is the corruption in the rtf or the result of the unrtf conversion?

[Screenshot taken 2017-12-29 at 4:30:39 pm]

It seems that we could build a custom filter (see below), but it can be difficult to predict when and where the extraneous tags occur. Another question: is there a Python parser that can facilitate the transformation of invalid HTML into valid HTML?

import re

def clean_html(text):  # hypothetical name for the custom filter
    # Collapse every run of newlines and spaces into a single space,
    # so the HTML becomes one searchable line.
    no_multiline = re.sub(r"[\n ]+", ' ', text)

    # Remove extra </div> tags and the "..Title" / "..Body" markers.
    return no_multiline.replace('</div> <br> </div>', '</div> <br>').replace('..Title', '').replace('..Body', '')

With this said, the converted html does not seem to preserve exactly the indentation given in the Legistar bill text - that is a requirement of this issue.
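
On the question of repairing invalid HTML: one possible direction, sketched here with only the standard library's `html.parser`, is to track open tags on a stack and drop any closing tag that has no matching opener. This is an illustrative assumption, not what django-councilmatic does; `StrayCloseTagRemover` and `drop_stray_closing_tags` are hypothetical names, and the void-tag list is deliberately short.

```python
from html.parser import HTMLParser

# Tags that never take a closing tag (a partial list, enough for this sketch).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class StrayCloseTagRemover(HTMLParser):
    """Re-emit HTML, skipping closing tags that were never opened."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.open_tags = []  # stack of currently open, non-void tags
        self.parts = []      # pieces of the cleaned output

    def handle_starttag(self, tag, attrs):
        self.parts.append(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: emit as-is, nothing to track.
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closing tag: drop it
        # Close everything up to and including the matching opener.
        while self.open_tags:
            top = self.open_tags.pop()
            self.parts.append("</%s>" % top)
            if top == tag:
                break

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        self.parts.append("&#%s;" % name)


def drop_stray_closing_tags(html):
    parser = StrayCloseTagRemover()
    parser.feed(html)
    parser.close()
    # Close anything still open so the result is balanced.
    while parser.open_tags:
        parser.parts.append("</%s>" % parser.open_tags.pop())
    return "".join(parser.parts)
```

Unlike the string-replace filter, this does not need to know in advance where the extra `</div>` tags appear, though it makes no attempt to restore the original indentation.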

@fgregg
Member

fgregg commented Dec 29, 2017

bummer.

Keeping it as html has significant seo and usability benefits, so let's keep pushing on this a little bit more.

I would suggest looking at pandoc or another rtf-html converter as well.

@fgregg
Member

fgregg commented Dec 29, 2017

The fallback is always to display the pdf.

@reginafcompton
Contributor Author

Right, but the legislation text report does not reside in the Legistar API. (For example, it would be here, right? https://webapi.legistar.com/v1/nyc/matters/50755/attachments?token=.....) We'd need to refashion the scraper to grab the PDF link from the web interface, or scrape the HTML itself, as we did in the past.

Also, I'd like to determine if the rtf is corrupt, before looking at another rtf-converter.

Bummer indeed.

@reginafcompton
Contributor Author

reginafcompton commented Jan 3, 2018

I checked some bills, and the RTF from Legistar seems fine: I managed to convert a couple RTF instances to valid HTML with an online tool.

However, I noticed that some of the scraped bills in the OCD API do not contain valid RTF text (but rather plain text or some variant of the plain_text). This seems to be the result of Legistar not having RTF for those bills:
https://ocd.datamade.us/ocd-bill/1ff42bd8-2919-4e2d-bb1d-adbe5d36689d/
https://ocd.datamade.us/ocd-bill/267a7123-0027-4ad3-b28f-c4b7ff866a1f/
https://ocd.datamade.us/ocd-bill/283ae440-3f91-4f84-8263-7c9ac7dcd4a2/
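
A cheap guard against this case, sketched under the assumption that the import has the raw text in hand: RTF documents begin with the `{\rtf` control group, so anything that does not start that way can be skipped or routed to a plain-text path before it reaches a converter. `looks_like_rtf` is a hypothetical helper name.

```python
def looks_like_rtf(text):
    """Return True if text starts with the RTF header group, e.g. '{\\rtf1'."""
    return bool(text) and text.lstrip().startswith("{\\rtf")
```

This only checks the header, so it would not catch RTF that is truncated or malformed further in.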

For converters...if we want to keep pushing on this:

With that said, I am happy to try out the Ted + Pandoc combo, but I am drawn towards scraping the PDF link from the web. That seems like a relatively more reliable option. (Plus, the PDFs look nice - no worries about missed formatting.) @fgregg - what do you think?

@reginafcompton
Contributor Author

reginafcompton commented Jan 9, 2018

We have a working solution for this in an open PR, also on the Councilmatic side. Currently testing it on the staging site.

@reginafcompton
Contributor Author

reginafcompton commented Jan 10, 2018

We are running the full conversion script on the production server (after determining that we needed a new version of LibreOffice and code compatible with Python 3.4.3).

Last steps

  • add a cronjob that runs after import_data
  • use flock! we should lock the same file for import_data and convert_rtf, in case the convert script takes longer than 15 minutes (which seems like a possibility, if we import a large quantity of bills)
  • merge this PR, and pin the requirements to master (as we did in the past)
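
The locking idea in the list above can be sketched in Python with `fcntl.flock`, the system call behind the `flock` command-line tool. This is an illustration of the technique on a Unix host, not the crontab setup actually deployed; `exclusive_lock` and the lock-file path are hypothetical.

```python
import contextlib
import fcntl


@contextlib.contextmanager
def exclusive_lock(lock_path):
    """Hold an exclusive, non-blocking lock on lock_path for the duration
    of the with-block. Raises BlockingIOError if another holder has it."""
    handle = open(lock_path, "a")
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
        yield handle
    finally:
        # Closing the file releases the lock.
        handle.close()
```

If both import_data and convert_rtf wrapped their work in `with exclusive_lock("/tmp/councilmatic.lock"):` (path assumed), a run that starts while the other is still going would fail fast instead of overlapping.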

@reginafcompton reginafcompton mentioned this issue Jan 10, 2018
Merged
@reginafcompton
Contributor Author

We've been able to convert RTF to HTML with a new management script in django-councilmatic. I've added it to our NYC crontasks, as well.

We also have an open PR that will allow us to pull in PDFs of Bill text, too. Marking this issue as closed!
