feat: Add IRS family tax credit webpages as a new dataset #178

yoomlam · 2025-01-14T21:34:11Z

Ticket

https://navalabs.atlassian.net/browse/DST-672

Changes

Add IRS web scraper for https://www.irs.gov/credits-deductions/family-dependents-and-students-credits
Add ingestion scripts to Makefile and poetry

Testing

TBD

yoomlam · 2025-01-14T21:38:18Z

My process:

Use web browser to identify patterns across webpages (e.g., heading structure, common CSS classes) to do the scraping
Test scraping in the Scrapy shell: cd src/ingestion/; scrapy shell https://www.irs.gov/credits-deductions/family-dependents-and-students-credits

    import html2text

    # Grab the title for document.name
    title = response.css("h1.pup-page-node-type-article-page__title::text").get().strip()
    # Get the element with the non-boilerplate content (first look, before decluttering)
    pup = response.css("div.pup-main-container").get()
    # Drop the sidebar in place to declutter the desired content
    response.css("div.sidebar-left").drop()
    # Re-query so that pup reflects the tree after dropping
    pup = response.css("div.pup-main-container").get()
    # Convert to markdown without hard-wrapping lines or links
    h2t = html2text.HTML2Text()
    h2t.body_width = 0
    h2t.wrap_links = False
    # Check that it has all the desired content
    print(h2t.handle(pup))

Incorporate code into a Scrapy spider
Run the spider:
- Set CLOSESPIDER_ERRORCOUNT = 1 in app/src/ingestion/scrapy_dst/settings.py so that it stops on the first error
- DEBUG_SCRAPINGS=true poetry run scrape-irs-web
- Keep an eye on the cache in src/ingestion/.scrapy/httpcache/
Update spider with assertions and logger warnings to identify webpages that don't meet expectations
- Use allow, deny, and restrict_css to limit the crawl scope of the spider
Examine irs_web_scrapings.json-pretty.json
Iterate on the above
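
When iterating on the crawl scope, it can help to try candidate patterns outside the spider first. The sketch below approximates in plain Python how `allow` and `deny` regexes narrow a set of URLs: a URL is kept if it matches at least one `allow` pattern (or `allow` is empty) and matches no `deny` pattern, with `deny` taking precedence. The patterns and the helper name `in_crawl_scope` are hypothetical and are not the values used in this PR; `restrict_css` is not modeled because it selects page regions rather than URLs.

```python
import re
from typing import Iterable

# Hypothetical patterns for illustration only; the spider's real values are not shown in this PR.
ALLOW = [r"/credits-deductions/"]
DENY = [r"^https://www\.irs\.gov/es/", r"\.pdf$"]


def in_crawl_scope(url: str, allow: Iterable[str] = ALLOW, deny: Iterable[str] = DENY) -> bool:
    """Approximate allow/deny link filtering on an absolute URL."""
    allow = list(allow)
    # An empty allow list means every URL is a candidate.
    allowed = not allow or any(re.search(pattern, url) for pattern in allow)
    # A single deny match excludes the URL, even if it was allowed.
    denied = any(re.search(pattern, url) for pattern in deny)
    return allowed and not denied
```

Calling `in_crawl_scope(url)` on a handful of links copied from the browser gives a quick read on whether a pattern is too broad or too narrow before the spider is re-run.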

Once the JSON file looks good enough:

Ingest: make ingest-irs-web DATASET_ID="IRS" BENEFIT_PROGRAM="tax credit" BENEFIT_REGION="US" FILEPATH=src/ingestion/irs_web_scrapings.json INGEST_ARGS="--skip_db"
Examine markdown files under irs_web_md/
Create and post zip file in GDrive for review: zip irs_web_md.zip -r irs_web_md/
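
Before zipping, a quick pass over the generated markdown can flag files worth a closer look. This is a minimal sketch, assuming the files under irs_web_md/ use a `.md` extension and that a useful page has a heading and more than a couple hundred characters; the function name `find_suspect_markdown` and the `min_chars` threshold are hypothetical and not part of this PR.

```python
from pathlib import Path


def find_suspect_markdown(md_dir, min_chars=200):
    """Return (path, reason) pairs for markdown files that look empty or truncated."""
    suspects = []
    for path in sorted(Path(md_dir).rglob("*.md")):
        text = path.read_text(encoding="utf-8").strip()
        if not text:
            suspects.append((path, "empty"))
        elif len(text) < min_chars:
            # Assumed threshold; tune it to the shortest legitimate page.
            suspects.append((path, f"only {len(text)} characters"))
        elif not any(line.startswith("#") for line in text.splitlines()):
            suspects.append((path, "no markdown heading"))
    return suspects
```

Running `find_suspect_markdown("irs_web_md")` and printing the result gives a short list to inspect by hand; an empty list does not prove the content is correct, only that nothing tripped these simple checks.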

github-actions · 2025-01-14T21:39:28Z

☂️ Python Coverage

current status: ❌

Overall Coverage

Lines	Covered	Coverage	Threshold	Status
2930	2621	89%	80%	🟢

New Files

File	Coverage	Status
app/src/ingest_irs_web.py	85%	🔴
TOTAL	85%	🔴

Modified Files

File	Coverage	Status
app/src/ingest_edd_web.py	90%	🟢
app/src/ingest_la_county_policy.py	94%	🟢
TOTAL	92%	🟢

updated for commit: 481ee95 by action🐍

yoomlam added 4 commits January 14, 2025 13:08

return None for prep_json_item()

7f37e12

add IRS web scraper and ingestion

3fc80e2

fix "pages/" folder

36614b2

irs_spider.py works

cfad064

yoomlam marked this pull request as draft January 14, 2025 21:52

yoomlam added 2 commits January 14, 2025 15:59

de-lint

65a4d60

add boilerplate test_ingest_irs_web.py

481ee95
