Collect additional property tax related data using LLMs #8
Comments
I've attached a sheet covering some applications that ingest documents or files and use LLM prompting to extract information from them. The options that we want will follow a template similar to the following:
As anticipated, this space is very new, and dozens of applications have been released within the past few months; many of those are difficult to vet for quality. Most of them have tiered pricing that increases with the number of documents uploaded and/or queries per month. I recommend we look first at Quivr and/or Steamship, which both have transparent codebases, thorough documentation, and a publicly identified person whom we can contact to discuss the software. If available options aren't a good fit for the scale of data we plan to use, we may be able to code our own solution using a similar template to these projects.
As discussed previously, there is at least one language model that has been trained specifically on legal text, called LEGAL-BERT. See Chalkidis et al. (2020), "LEGAL-BERT: The Muppets straight out of Law School", https://arxiv.org/abs/2010.02559. The model is available on HuggingFace (https://huggingface.co/nlpaueb), and it may be possible to "swap it in" to a preexisting LLM-based application to compare its performance against a more generalist model. Two more research papers on the tailoring of these models for legal text are:
@mbjackson-capp Great work! I agree that Quivr and Steamship are probably the right place to start, with LEGAL-BERT and roll-your-own as fallbacks. I think next steps are getting a handle on what data exists county-wide, figuring out which data elements they share/we can extract, and starting to build a small corpus of documents.
As we found this week, the technology in this field is very nascent. Before advancing, it may make sense to wait a few weeks/months until a dominant player has emerged with an inexpensive, user-friendly application (and/or until more people have gained experience coding these systems, who could advise us or work with us to meet our needs). Efforts to systematically gather PDFs in an automated manner were largely unsuccessful. With a mixture of basic web scraping and more manual downloading, I did assemble the following:
These files amount to 2.11 GB, and are stored in a location I have messaged you about separately. The vast majority of them still need OCR, though mass-OCR is feasible with Acrobat. Filenames are largely, but not fully, standardized (I changed some filenames to make them more standard). We should continue to think about scoping which fields to extract from the TIF data, and about which forms are most likely to track that information over time.
More notes about data retrieval
It seems like there's no way to scrape all the redevelopment plans and ordinances from the map page directly, but they seem to exist in a single directory that returns Forbidden (https://www.chicago.gov/content/dam/city/depts/dcd/tif/plans/).
More annual reports for TIFs across the County could be scraped from the Illinois Comptroller's Local Government Warehouse with a systematic approach to searching municipalities' names and/or iterating through the numbered code that the Comptroller uses for those municipalities: https://illinoiscomptroller.gov/constituent-services/local-government/local-government-warehouse/searchform/?SearchType=TIFSearch
Excellent, thanks for the update @mbjackson-capp. We'll restart this issue once the technology settles down a bit. In the meantime, I'll continue to put out feelers for more TIF and budget data and will manually add to what you've already collected.
Goal
Collect additional property tax related data from documents, using LLMs for parsing.
Overview
There is a significant amount of useful taxing district data currently locked in non-machine-readable formats, including: TIF ordinances, TIF redevelopment plans, municipal/district budgets, SSA info, etc. If this data can be extracted and parsed, it would be a huge boon to PTAXSIM and would likely be the first ever collection of such data.
The problem is that this data is messy. There is no standard format for something like a TIF ordinance, so each document will have a completely different format and language, depending on the municipality. Further, nearly all data of this type comes as PDF scans of legislative text, usually without any OCR applied, spanning hundreds or thousands of pages. As such, parsing this data into useful SQL tables is a massive challenge.
Fortunately, new tech may be able to help with this task. Current LLMs have proven especially capable of extracting relevant information from a large document or corpus. We may be able to use such LLMs to convert PDF scans of taxing district data into useful SQL tables.
Getting Started
The first thing we need to do is take inventory, first of data, then of LLMs. I would make spreadsheets tracking each of the relevant datapoints.
Data
We need to take stock of what data actually exists that is:
I recommend we start with the following datasets:
TIF information
Taxing district budgets
LLMs
The landscape around LLMs is changing pretty much daily right now. For this project to work, we need to take a snapshot of existing LLMs and determine their capabilities/whether they fit our needs. You'll need to do some exploration in this space. We're specifically looking for LLMs that:
Tasks
Before proceeding to coding, the following tasks should be complete:
Outline
Once the above tasks are complete, it's time to get coding. Since this will likely be a lot of data in various states of processing, I recommend making a data flow diagram + using the specific inventory (from above) to help track things. The coding can be divided into two stages: processing and package updates.
Processing
Broadly, you'll need to come up with a data collection schema that divides things into raw, processed, and completed buckets. We can create a new S3 bucket/dir you can use to store each stage. This will be the stage actually using LLMs. We can scope it out further as we get closer to this stage.
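One way to picture the raw/processed/completed split is as a fixed set of stage prefixes under which every document gets a predictable key. The sketch below is only an illustration under assumptions: the `llm-extraction` prefix, the `Document` fields, and the stage names used as path components are all hypothetical, and nothing here talks to S3.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Optional

# Stage names taken from the description above; their use as key
# prefixes is an assumption, not an agreed layout.
STAGES = ("raw", "processed", "completed")


@dataclass(frozen=True)
class Document:
    """Hypothetical minimal description of one collected file."""
    municipality: str  # e.g. "chicago"
    doc_type: str      # e.g. "tif_redevelopment_plan"
    filename: str      # e.g. "example_plan.pdf"


def stage_key(doc: Document, stage: str, prefix: str = "llm-extraction") -> str:
    """Build the object key for a document at a given stage."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage {stage!r}; expected one of {STAGES}")
    return str(PurePosixPath(prefix, stage, doc.doc_type, doc.municipality, doc.filename))


def next_stage(stage: str) -> Optional[str]:
    """Return the stage that follows `stage`, or None after the last one."""
    i = STAGES.index(stage)
    return STAGES[i + 1] if i + 1 < len(STAGES) else None
```

Keeping the stage as the first path component after the prefix means each bucket can be listed or synced on its own, which should make it easier to see how much of the corpus is still waiting on OCR or parsing.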
data-raw/, though we may not want the raw data itself there
Package updates
Once parsing is complete, the collected data needs to be added to the actual PTAXSIM database. This will be much simpler than the processing stage:
data-raw/create_db.sql to add new table definitions for your finished data
data-raw/ that pulls the processed data from S3 and loads it into the SQLite DB (via data-raw/create_db.R)
vignettes/ describing what your data is and how to use it
Additional Requirements
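To make the package-update steps listed under "Package updates" more concrete, here is a minimal sketch of the "new table definition plus load into SQLite" logic. It is written in Python purely for illustration; per the issue, the real definition would live in data-raw/create_db.sql and the real load would go through data-raw/create_db.R. The table name, columns, and key are hypothetical.

```python
import sqlite3

# Hypothetical table for extracted fields. The real definition would be
# added to data-raw/create_db.sql and may look quite different.
CREATE_SQL = """
CREATE TABLE IF NOT EXISTS tif_extracted_field (
    agency_num  TEXT    NOT NULL,
    year        INTEGER NOT NULL,
    field_name  TEXT    NOT NULL,
    field_value TEXT,
    source_file TEXT    NOT NULL,
    PRIMARY KEY (agency_num, year, field_name)
)
"""


def load_rows(conn: sqlite3.Connection, rows) -> int:
    """Create the table if needed, upsert rows, and return the row count."""
    conn.execute(CREATE_SQL)
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO tif_extracted_field VALUES (?, ?, ?, ?, ?)",
            rows,
        )
    return conn.execute("SELECT COUNT(*) FROM tif_extracted_field").fetchone()[0]


if __name__ == "__main__":
    # Placeholder values only; these are not real extracted data.
    example = [("EXAMPLE-001", 2020, "example_field", "example value", "example_plan.pdf")]
    print(load_rows(sqlite3.connect(":memory:"), example))
```

Using a composite primary key with `INSERT OR REPLACE` means re-running the load after a document is re-parsed overwrites the earlier value instead of duplicating it, which seems useful while extraction prompts are still being tuned.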