Bill detail: Preserve the formatting of legislative text #108
Comments
I explored options for converting the RTF string into HTML. I could not find a Python package that does the job. However, I did identify a couple of options, which involve doing the conversion on the client side (jumping-off point):
(1) rtf.js, though I need to explore whether we can use a string version of the RTF as a starting point for calling it.
(2) A couple of good-looking node packages: https://github.com/iarna/rtf-to-html and https://github.com/walling/node-unrtf. NOTE - we'll need to find a way to coordinate node_modules within a Django app. I do not believe we have other Django apps within our DataMade repertoire that do this (it might be worth learning about!), so that's a consideration.
(3) Write a parser ourselves, as this Stack Overflow user does for RTF to plain text: https://stackoverflow.com/questions/1337446/is-there-a-python-module-for-converting-rtf-to-plain-text
I would suggest using Python's subprocess module: https://docs.python.org/2/library/subprocess.html. I would also suggest doing this as part of the import process.
Thanks @fgregg - unrtf is pretty nifty (and easy to install and get started!). I also like the idea of doing the conversion on the import, rather than the view or client side.
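The approach discussed above, shelling out to a converter from Python during the import, could look roughly like this. It is a minimal sketch, assuming the unrtf command-line tool is installed and that `unrtf --html <file>` writes the converted HTML to standard output. The function name `rtf_to_html` and its `command` parameter are hypothetical, not code from the project.

```python
import os
import subprocess
import tempfile


def rtf_to_html(rtf_text, command=("unrtf", "--html")):
    """Convert an RTF string to HTML by shelling out to a converter.

    `command` is the converter invocation (an assumption: unrtf by default).
    The path of a temporary file holding the RTF is appended as the last
    argument, and whatever the converter prints to stdout is returned.
    """
    # The converter is handed a file path, so write the string to disk first.
    with tempfile.NamedTemporaryFile("w", suffix=".rtf", delete=False) as handle:
        handle.write(rtf_text)
        path = handle.name
    try:
        # check_output raises CalledProcessError on a non-zero exit status.
        output = subprocess.check_output(list(command) + [path])
    finally:
        os.remove(path)
    return output.decode("utf-8", errors="replace")
```

Keeping the converter invocation as a parameter makes it easy to swap in a different tool later without touching the import code.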
I have an initial solution for converting RTF to HTML. However, the resultant HTML is corrupt, i.e., it contains extraneous tags. It seems that we could build a custom filter, but it can be difficult to predict when and where the extraneous tags occur. Another question: is there a Python parser that can facilitate the transformation of invalid HTML into valid HTML?
With this said, the converted HTML does not seem to preserve exactly the indentation given in the Legistar bill text - that is a requirement of this issue.
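On the question of turning invalid HTML into valid HTML: third-party parsers such as Beautiful Soup and lxml repair markup as they parse it. The idea can also be sketched with only the standard library. The example below is a rough illustration, not the filter used in the project: it drops closing tags that were never opened and closes tags that were left open. The names `TagBalancer`, `balance_tags`, and `VOID_TAGS` are hypothetical, and the sketch does not attempt to fix indentation.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (a partial list, for illustration).
VOID_TAGS = {"br", "hr", "img", "meta", "input", "link"}


class TagBalancer(HTMLParser):
    """Re-emit HTML, dropping stray closing tags and closing unclosed ones."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.parts = []      # pieces of the repaired document
        self.open_tags = []  # stack of currently open tag names

    def handle_starttag(self, tag, attrs):
        self.parts.append(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are passed through untouched.
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closing tag: drop it
        # Close anything still open inside this element, then the element.
        while self.open_tags:
            current = self.open_tags.pop()
            self.parts.append("</%s>" % current)
            if current == tag:
                break

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        self.parts.append("&#%s;" % name)


def balance_tags(html):
    """Return `html` with every opened tag closed and stray closers removed."""
    parser = TagBalancer()
    parser.feed(html)
    parser.close()
    # Close whatever is still open at the end of the document.
    while parser.open_tags:
        parser.parts.append("</%s>" % parser.open_tags.pop())
    return "".join(parser.parts)
```

For example, `balance_tags("<p><b>bold</p></i>")` closes the `<b>` element before the paragraph ends and discards the unmatched `</i>`.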
Bummer. Keeping it as HTML has significant SEO and usability benefits, so let's keep pushing on this a little bit more. I would suggest looking at pandoc or another RTF-to-HTML converter as well.
The fallback is always to display the PDF.
Right, but the legislation text report does not reside in the Legistar API. (For example, it would be here, right? https://webapi.legistar.com/v1/nyc/matters/50755/attachments?token=.....) We'd need to refashion the scraper to grab the PDF link from the web interface, or scrape the HTML itself, as we did in the past. Also, I'd like to determine whether the RTF is corrupt before looking at another RTF converter. Bummer indeed.
I checked some bills, and the RTF from Legistar seems fine: I managed to convert a couple of RTF instances to valid HTML with an online tool. However, I noticed that some of the scraped bills in the OCD API do not contain valid RTF text, but rather plain text or some variant. For converters, if we want to keep pushing on this:
With that said, I am happy to try out the Ted + Pandoc combo, but I am drawn towards scraping the PDF link from the web. That seems like a more reliable option. (Plus, the PDFs look nice - no worries about missed formatting.) @fgregg - what do you think?
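Since some scraped bills hold plain text rather than RTF, a cheap guard before conversion could be useful. The sketch below relies on the fact that an RTF document begins with the `{\rtf` header. It only inspects that header and does not validate the rest of the document. The name `looks_like_rtf` is hypothetical.

```python
# Every RTF document starts with this control sequence, e.g. "{\rtf1\ansi ...".
RTF_HEADER = "{\\rtf"


def looks_like_rtf(text):
    """Return True if `text` starts with the RTF header, ignoring leading whitespace."""
    return text.lstrip().startswith(RTF_HEADER)
```

Bills that fail this check could be skipped by the converter and shown as plain text or via the PDF fallback instead.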
We have a working solution for this in an open PR...also on the Councilmatic side. Currently testing it on the staging site.
We are running the full conversion script on the production server, after determining that we needed a new version of LibreOffice and code suitable for Python 3.4.3. Last steps
We've been able to convert RTF to HTML with a new management script in django-councilmatic. I've added it to our NYC crontasks, as well. We also have an open PR that will allow us to pull in PDFs of bill text, too. Marking this issue as closed!
The legislative text on the Bill detail does not preserve the tabs and underlines given in Legistar.
Example:
https://laws.council.nyc.gov/legislation/int-1633-2017/
http://legistar.council.nyc.gov/LegislationDetail.aspx?ID=3066702&GUID=FF2098E2-9EC9-47AB-9131-B02FDEA71AA6
Why? We parse the "plain text" from the Legistar API, which does not contain such formatting. (See Councilmatic filter.)
Solutions?
(1) Find a way to translate the rtf_text into HTML on the Councilmatic side.
(2) Revert to scraping the full_text from the web, as we did in the past: https://github.com/opencivicdata/scrapers-us-municipal/blob/12a6309c3148b83729a1f2420b9ee1bd1bd633e2/nyc/bills.py#L127
I'd prioritize option number one, if it's possible.