Bill detail: Preserve the formatting of legislative text #108

reginafcompton · 2017-12-19T17:16:18Z

The legislative text on the Bill detail does not preserve the tabs and underlines given in Legistar.

Example:
https://laws.council.nyc.gov/legislation/int-1633-2017/
http://legistar.council.nyc.gov/LegislationDetail.aspx?ID=3066702&GUID=FF2098E2-9EC9-47AB-9131-B02FDEA71AA6

Why? We parse the "plain text" from the Legistar API, which does not contain such formatting. (See Councilmatic filter.)

Solutions?
(1) Find a way to translate the rtf_text into HTML on the Councilmatic side.
(2) Revert to scraping the full_text from the web, as we did in the past: https://github.com/opencivicdata/scrapers-us-municipal/blob/12a6309c3148b83729a1f2420b9ee1bd1bd633e2/nyc/bills.py#L127

I'd prioritize option number one, if it's possible.

reginafcompton · 2017-12-28T23:07:17Z

I explored options for converting the RTF string into HTML. I could not find a Python package that does the job. However, I did identify a couple options, which involve doing the conversion on the client-side (jumping off point):

(1) rtf.js, though I need to explore if we can use a string-version RTF as a starting point for calling displayRTFFile: https://github.com/tbluemel/rtf.js/blob/master/samples/rtf.html#L148

(2) a couple good-looking node packages: https://github.com/iarna/rtf-to-html and https://github.com/walling/node-unrtf. NOTE - we'll need to find a way to coordinate node_modules within a Django app. I do not believe we have other Django apps within our DataMade repertoire that do this (it might be worth learning about!), so that's a consideration.

(3) Write a parser ourselves, as this Stackoverflow-er does for RTF to plain text: https://stackoverflow.com/questions/1337446/is-there-a-python-module-for-converting-rtf-to-plain-text

fgregg · 2017-12-29T01:33:03Z

I would suggest using unrtf, called via Python's subprocess: https://docs.python.org/2/library/subprocess.html

I would suggest doing this as part of the import process
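A minimal sketch of what calling unrtf through subprocess during the import could look like. The function names are hypothetical and this is not the code that ended up in django-councilmatic; it only assumes that the `unrtf` binary is on the PATH and accepts its `--html` option with a file path.

```python
import shutil
import subprocess
import tempfile


def build_unrtf_command(rtf_path, executable="unrtf"):
    """Return the argument list for converting one RTF file to HTML."""
    return [executable, "--html", rtf_path]


def rtf_to_html(rtf_text, executable="unrtf"):
    """Write the RTF string to a temp file and return unrtf's HTML output."""
    if shutil.which(executable) is None:
        raise RuntimeError("%s is not installed or not on the PATH" % executable)

    # unrtf is given a file path here, so the string from the API is
    # written to a temporary file first.
    with tempfile.NamedTemporaryFile("w", suffix=".rtf") as tmp:
        tmp.write(rtf_text)
        tmp.flush()
        output = subprocess.check_output(build_unrtf_command(tmp.name, executable))

    return output.decode("utf-8", errors="replace")
```

Running this inside the import step means the HTML is stored once per bill, rather than being regenerated on every page view.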

reginafcompton · 2017-12-29T03:15:07Z

thanks @fgregg - unrtf is pretty nifty (and easy to install and get started!). I also like the idea of doing the conversion on the import, rather than the view or client side.

reginafcompton · 2017-12-29T23:02:56Z

I have an initial solution for converting rtf to html in import_data: datamade/django-councilmatic#165.

However, the resultant HTML is corrupt, i.e., it contains extraneous </div> tags that disrupt the formatting. (See screenshot below for attempt to capture formatting issue.) Question to address: is the corruption in the rtf or the result of the unrtf conversion?

It seems that we could build a custom filter (see below), but it can be difficult to predict when and where the extraneous tags occur. Another question: is there a Python parser that can facilitate the transformation of invalid HTML into valid HTML?

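import re

def clean_converted_html(text):  # hypothetical name; the snippet is the body of a filter
    # Compress HTML to a searchable single-line string: each run of
    # repeated newlines or repeated spaces becomes one space.
    no_multiline = re.sub(r"([\n ])\1*", ' ', text)

    # Remove extra </div> tags and other extraneous text.
    return no_multiline.replace('</div> <br> </div>', '</div> <br>').replace('..Title', '').replace('..Body', '')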
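On the question of a parser for invalid HTML: as one possible direction, here is a rough sketch that uses only the standard library's `html.parser` to drop closing tags that have no matching opening tag, instead of guessing where they occur with string replacement. The class and function names are made up for illustration, the list of void tags is partial, and comments and doctypes are simply dropped, so treat it as a starting point rather than a full HTML repair tool.

```python
import html
from html.parser import HTMLParser

# Tags that never take a closing tag (partial list, for illustration).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class TagBalancer(HTMLParser):
    """Re-emit HTML, skipping closing tags that were never opened."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open tag names
        self.out = []        # pieces of the rebuilt document

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: emit as-is, nothing to track.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # extraneous closing tag: drop it
        # Close anything opened after the matching tag, then the tag itself.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append("</%s>" % top)
            if top == tag:
                break

    def handle_data(self, data):
        # The parser has already unescaped entities, so escape again.
        self.out.append(html.escape(data, quote=False))


def drop_unmatched_closing_tags(markup):
    parser = TagBalancer()
    parser.feed(markup)
    parser.close()
    # Close whatever is still open at the end of the input.
    for tag in reversed(parser.open_tags):
        parser.out.append("</%s>" % tag)
    return "".join(parser.out)
```

For the pattern targeted by the filter above, `drop_unmatched_closing_tags('<div>a</div> <br> </div>')` keeps the first `</div>` and discards the stray one.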

With this said, the converted html does not seem to preserve exactly the indentation given in the Legistar bill text - that is a requirement of this issue.

fgregg · 2017-12-29T23:06:22Z

bummer.

Keeping it as html has significant seo and usability benefits, so let's keep pushing on this a little bit more.

I would suggest looking at pandoc or another rtf-html converter as well.

fgregg · 2017-12-29T23:06:38Z

The fallback is always to display the pdf.

reginafcompton · 2017-12-29T23:12:23Z

Right, but the legislation text report does not reside in the Legistar API. (For example, it would be here, right? https://webapi.legistar.com/v1/nyc/matters/50755/attachments?token=.....) We'd need to refashion the scraper to grab the PDF link from the web interface, or scrape the HTML itself, as we did in the past.

Also, I'd like to determine if the rtf is corrupt, before looking at another rtf-converter.

Bummer indeed.

reginafcompton · 2018-01-03T21:34:14Z

I checked some bills, and the RTF from Legistar seems fine: I managed to convert a couple RTF instances to valid HTML with an online tool.

However, I noticed that some of the scraped bills in the OCD API do not contain valid RTF text (but rather plain text or some variant of the plain_text). This seems to happen when Legistar has no RTF for the bill:
https://ocd.datamade.us/ocd-bill/1ff42bd8-2919-4e2d-bb1d-adbe5d36689d/
https://ocd.datamade.us/ocd-bill/267a7123-0027-4ad3-b28f-c4b7ff866a1f/
https://ocd.datamade.us/ocd-bill/283ae440-3f91-4f84-8263-7c9ac7dcd4a2/
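Since some bills carry plain text where RTF is expected, a converter would probably need a guard before it runs. A small sketch, relying only on the fact that an RTF document begins with the `{\rtf` control word; the function name is hypothetical.

```python
def looks_like_rtf(text):
    """Return True if the string starts like an RTF document."""
    if not text:
        return False
    # RTF documents open with a group whose first control word is \rtf.
    return text.lstrip().startswith("{\\rtf")
```

Bills that fail this check could be skipped by the conversion step and fall back to the existing plain-text display.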

For converters, if we want to keep pushing on this:

(1) Pandoc does not support RTF to HTML, given what I can see in their documentation, but maybe we can use Ted and Pandoc together? https://stackoverflow.com/questions/30448176/how-to-convert-rtf-to-markdown-on-the-unix-osx-command-line-similar-to-pandoc
(2) This tool gets us close, but it does not correctly convert underlines, nor does it support special characters (e.g., we would lose symbols like "§"): https://github.com/lvu/rtf2html
(3) LibreOffice, as we did with the Metro merger.
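For the LibreOffice option, a headless conversion could be driven from Python roughly like this. This is a sketch under assumptions: the helper names are hypothetical, the binary is assumed to be available as `soffice`, and it is not necessarily how the Metro merger or the eventual management script invokes it.

```python
import os
import subprocess


def build_soffice_command(rtf_path, out_dir, executable="soffice"):
    """Argument list for a headless RTF to HTML conversion."""
    return [executable, "--headless", "--convert-to", "html",
            "--outdir", out_dir, rtf_path]


def expected_html_path(rtf_path, out_dir):
    """LibreOffice writes <input name>.html into the output directory."""
    stem = os.path.splitext(os.path.basename(rtf_path))[0]
    return os.path.join(out_dir, stem + ".html")


def convert_with_libreoffice(rtf_path, out_dir, executable="soffice"):
    """Run the conversion and return the generated HTML as a string."""
    subprocess.check_call(build_soffice_command(rtf_path, out_dir, executable))
    with open(expected_html_path(rtf_path, out_dir),
              encoding="utf-8", errors="replace") as handle:
        return handle.read()
```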

With that said, I am happy to try out the Ted + Pandoc combo, but I am drawn towards scraping the PDF link from the web. That seems like a relatively more reliable option. (Plus, the PDFs look nice - no worries about missed formatting.) @fgregg - what do you think?

reginafcompton · 2018-01-09T15:24:42Z

We have a working solution for this in an open PR, also on the Councilmatic side. Currently testing it on the staging site.

reginafcompton · 2018-01-10T21:29:45Z

We are running the full conversion script on the production server (after determining that we needed a new version of LibreOffice and code suitable for Python 3.4.3).

Last steps:

(1) Add a cronjob that runs after import_data.
(2) Use flock! We should lock the same file for import_data and convert_rtf, in case the convert script takes longer than 15 minutes (which seems like a possibility, if we import a large quantity of bills).
(3) Merge this PR, and pin the requirements to master (as we did in the past).
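To illustrate the locking idea from the steps above: the crontab would most likely just use the `flock` command, but the same behavior can be sketched in Python with `fcntl.flock`. The lock path and helper name are hypothetical; the point is only that both jobs take an exclusive lock on the same file, so a long conversion cannot overlap the next import.

```python
import fcntl
from contextlib import contextmanager


@contextmanager
def exclusive_lock(lock_path, blocking=True):
    """Hold an exclusive flock on lock_path for the duration of the block.

    With blocking=False, BlockingIOError is raised if another holder
    already has the lock.
    """
    flags = fcntl.LOCK_EX if blocking else fcntl.LOCK_EX | fcntl.LOCK_NB
    with open(lock_path, "w") as handle:
        fcntl.flock(handle, flags)
        try:
            yield
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)
```

A job would wrap its work in `with exclusive_lock("/tmp/councilmatic.lock"):` (path chosen only as an example), and a second job using the same path waits, or fails fast with `blocking=False`, until the first one finishes.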

reginafcompton · 2018-01-19T16:25:57Z

We've been able to convert RTF to HTML with a new management script in django-councilmatic. I've added it to our NYC crontasks, as well.

We also have an open PR that will allow us to pull in PDFs of Bill text, too. Marking this issue as closed!

reginafcompton added this to the Next phase milestone Dec 19, 2017

reginafcompton mentioned this issue Jan 10, 2018

Log #115

Merged

reginafcompton closed this as completed Jan 19, 2018
