
feat: Add IRS family tax credit webpages as a new dataset #178

Draft
wants to merge 6 commits into
base: main

yoomlam
Contributor

@yoomlam yoomlam commented Jan 14, 2025

Ticket

https://navalabs.atlassian.net/browse/DST-672

Changes

Testing

TBD

Contributor Author

yoomlam commented Jan 14, 2025

My process:

    # Exploratory session; assumes a Scrapy `response` is already in scope (e.g. inside `scrapy shell <url>`)
    import html2text

    # Grab the title for document.name
    title = response.css("h1.pup-page-node-type-article-page__title::text").get().strip()
    # Inspect the element that holds the non-boilerplate content
    pup = response.css("div.pup-main-container").get()
    # Drop elements that clutter the desired content
    response.css("div.sidebar-left").drop()
    # Re-query so that `pup` reflects the dropped elements
    pup = response.css("div.pup-main-container").get()
    # Convert to markdown without hard-wrapping lines or links
    h2t = html2text.HTML2Text()
    h2t.body_width = 0
    h2t.wrap_links = False
    # Check that the output has all the desired content
    print(h2t.handle(pup))
  • Incorporate code into a Scrapy spider
  • Run the spider:
    • Set CLOSESPIDER_ERRORCOUNT = 1 in app/src/ingestion/scrapy_dst/settings.py so that it stops on the first error
    • DEBUG_SCRAPINGS=true poetry run scrape-irs-web
    • Keep an eye on the cache in src/ingestion/.scrapy/httpcache/
  • Update spider with assertions and logger warnings to identify webpages that don't meet expectations
    • Use allow, deny, and restrict_css to limit the crawl scope of the spider
  • Examine irs_web_scrapings.json-pretty.json
  • Iterate on the above
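
One way to speed up the examine-and-iterate loop is a small script that flags scraped items that don't meet expectations. This is a minimal sketch, not code from this PR: it assumes the scrapings file is a JSON list of objects, and the field names `url`, `title`, and `markdown`, as well as the `min_chars` threshold, are hypothetical.

```python
import json
import logging
from pathlib import Path

logger = logging.getLogger("check_scrapings")


def find_problems(items, min_chars=200):
    """Return (url, reason) pairs for items that look incomplete.

    The field names "url", "title", and "markdown" are assumptions for
    illustration; adjust them to whatever the spider actually emits.
    """
    problems = []
    seen_urls = set()
    for item in items:
        url = item.get("url", "<missing url>")
        if url in seen_urls:
            problems.append((url, "duplicate url"))
        seen_urls.add(url)
        if not (item.get("title") or "").strip():
            problems.append((url, "empty title"))
        if len((item.get("markdown") or "").strip()) < min_chars:
            problems.append((url, "markdown shorter than expected"))
    return problems


def check_file(path, min_chars=200):
    """Load a scrapings JSON file and log a warning for each problem found."""
    items = json.loads(Path(path).read_text(encoding="utf-8"))
    problems = find_problems(items, min_chars=min_chars)
    for url, reason in problems:
        logger.warning("%s: %s", url, reason)
    return problems
```

Calling `check_file("src/ingestion/irs_web_scrapings.json")` after each spider run would return the flagged pages, which can then guide the next round of assertions or `deny` patterns.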

Once the JSON file looks good enough:

  • Ingest: make ingest-irs-web DATASET_ID="IRS" BENEFIT_PROGRAM="tax credit" BENEFIT_REGION="US" FILEPATH=src/ingestion/irs_web_scrapings.json INGEST_ARGS="--skip_db"
  • Examine markdown files under irs_web_md/
  • Create and post zip file in GDrive for review: zip irs_web_md.zip -r irs_web_md/
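
Before zipping, a quick scan of the generated markdown can catch obviously bad files. The following is a rough sketch under assumptions rather than part of this PR: it assumes the output directory holds `*.md` files, and both the `min_chars` cutoff and the leftover-HTML check are illustrative heuristics.

```python
import re
from pathlib import Path

# Matches leftover HTML tags such as <div class="x">, </span>, or <br/>,
# but not markdown autolinks such as <https://www.irs.gov>.
HTML_TAG = re.compile(r"</?[a-zA-Z][a-zA-Z0-9]*(\s[^>]*)?/?>")


def scan_markdown_dir(directory, min_chars=200):
    """Return {relative filename: [issues]} for markdown files that look suspicious."""
    report = {}
    for path in sorted(Path(directory).rglob("*.md")):
        text = path.read_text(encoding="utf-8")
        issues = []
        if len(text.strip()) < min_chars:
            issues.append("very short")
        if HTML_TAG.search(text):
            issues.append("contains raw HTML tags")
        if issues:
            report[str(path.relative_to(directory))] = issues
    return report
```

For example, `scan_markdown_dir("irs_web_md")` returns an empty dict when no file trips either heuristic; any entries are candidates for another pass over the spider's selectors.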


github-actions bot commented Jan 14, 2025

☂️ Python Coverage

current status: ❌

Overall Coverage

| Lines | Covered | Coverage | Threshold | Status |
| --- | --- | --- | --- | --- |
| 2930 | 2621 | 89% | 80% | 🟢 |

New Files

| File | Coverage | Status |
| --- | --- | --- |
| app/src/ingest_irs_web.py | 85% | 🔴 |
| TOTAL | 85% | 🔴 |

Modified Files

| File | Coverage | Status |
| --- | --- | --- |
| app/src/ingest_edd_web.py | 90% | 🟢 |
| app/src/ingest_la_county_policy.py | 94% | 🟢 |
| TOTAL | 92% | 🟢 |

updated for commit: 481ee95 by action🐍

@yoomlam yoomlam marked this pull request as draft January 14, 2025 21:52