Test urllib fix #660

Merged
merged 15 commits into from
Jan 2, 2025
Conversation

Member

@wrridgeway wrridgeway commented Dec 3, 2024

This PR makes sure all column names are retrieved from an open data asset, and that any chunk sent to Socrata that is larger than 10,000 rows gets sub-chunked into chunks with fewer than 10,000 rows.
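
The sub-chunking described above can be sketched as follows. This is a minimal illustration, not the code from this PR: the helper name `sub_chunk` and the `MAX_ROWS` constant are hypothetical, and the sketch yields pieces of at most `max_rows` rows, so a smaller value should be passed if strictly fewer than 10,000 rows per request is required.

```python
from typing import Iterator, Sequence

# Assumed cap on rows per request, taken from the PR description.
MAX_ROWS = 10_000


def sub_chunk(rows: Sequence, max_rows: int = MAX_ROWS) -> Iterator[Sequence]:
    """Yield consecutive slices of `rows`, each holding at most `max_rows` items."""
    for start in range(0, len(rows), max_rows):
        yield rows[start:start + max_rows]


# Stand-in data: 25,000 rows split into pieces that respect the cap.
rows = list(range(25_000))
sizes = [len(piece) for piece in sub_chunk(rows)]
print(sizes)
```

Slicing keeps the rows in their original order, so each piece can be sent as its own request in sequence.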

It also uses a session object from the requests library in order to pool connections. We were getting 500 errors in the past, possibly because too many requests were being sent, and switching to connection pooling resolved that.
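
For readers unfamiliar with the pattern, a `requests.Session` reuses underlying connections across calls to the same host instead of opening a new one per request. The sketch below shows the general shape only; the helper name `make_session` and the placeholder credentials are hypothetical, and no request is actually sent.

```python
import requests


def make_session(username: str, password: str) -> requests.Session:
    """Create one session whose pooled connections and auth are shared by all requests."""
    session = requests.Session()
    # Basic auth set on the session is applied to every request made through it.
    session.auth = (username, password)
    return session


# Placeholder credentials for illustration only.
session = make_session("example-user", "example-password")

# Subsequent calls such as session.get(url) or session.post(url, data=...)
# would then go through the same connection pool.
```

Creating the session once and passing it to each upload call is what lets the pooling take effect; creating a new session per request would discard that benefit.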

Comment on lines +101 to +107
      f"https://datacatalog.cookcountyil.gov/resource/{asset_id}"
  )
- .json()[0]
- .keys()
+ .headers["X-SODA2-Fields"]
+ .replace('"', "")
+ .strip("[")
+ .strip("]")
+ .split(",")
Member Author

I was very naively pulling asset column names before. If the first row of an asset had NULL values for a column, we wouldn't get back that column header. This ensures we get all column headers.
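
To illustrate the header-based approach, the sketch below applies the same chain of string operations from the diff to a sample header value, without making a network call. The sample column names are made up, and the `json.loads` variant is an alternative that assumes the header value is a JSON-encoded array; it is not what the PR uses.

```python
import json


def parse_fields_header(header_value: str) -> list:
    """Mirror the string operations in the diff: drop quotes and brackets, split on commas."""
    return header_value.replace('"', "").strip("[").strip("]").split(",")


def parse_fields_header_json(header_value: str) -> list:
    """Alternative, assuming the header holds a JSON array of field names."""
    return json.loads(header_value)


# Hypothetical header value; real assets will list their own columns.
sample_header = '["pin","year","township_code"]'

columns = parse_fields_header(sample_header)
columns_json = parse_fields_header_json(sample_header)
print(columns)
```

Because the header lists every field of the asset, the result does not depend on which values happen to be NULL in the first row.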

@wrridgeway wrridgeway marked this pull request as ready for review December 26, 2024 19:11
@wrridgeway wrridgeway requested a review from a team as a code owner December 26, 2024 19:11
Contributor

@jeancochrane jeancochrane left a comment

Looking good! No huge concerns, but a bunch of small questions below.

socrata/socrata_upload.py
Comment on lines 16 to 19
s.auth = (
str(os.getenv("SOCRATA_USERNAME")),
str(os.getenv("SOCRATA_PASSWORD")),
)
Contributor

[Question, non-blocking] I see that we were doing this before, so I get that this auth practice is already on the main branch, but now that my memory is refreshed it's making me wonder -- is there a way to auth to the API using only a token instead of also needing to set our account username/password? Without understanding the details, it strikes me as a super weird pattern (the token should be able to perform auth on its own), and I'm a bit nervous reading and sending username/password given that a leaked username/password could cause more damage than a leaked token.

Member Author

@wrridgeway wrridgeway Jan 2, 2025

This may come down to something I don't understand about authenticating with their API, but the token alone is not sufficient. I get an authentication_required error if I don't authenticate here.
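
A side note on the quoted snippet: `str(os.getenv(...))` returns the literal string `"None"` when a variable is unset, so a missing credential would be sent as the text "None" rather than raising an error. The sketch below shows one way to fail fast instead. The helper name `require_env` and the `SOCRATA_EXAMPLE_UNSET` variable are hypothetical and not part of this PR.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Demonstrate the pitfall with a variable that is deliberately unset.
os.environ.pop("SOCRATA_EXAMPLE_UNSET", None)
silent_value = str(os.getenv("SOCRATA_EXAMPLE_UNSET"))
print(repr(silent_value))

try:
    require_env("SOCRATA_EXAMPLE_UNSET")
except RuntimeError as err:
    print(err)
```

With a helper like this, the session's auth tuple could be built from `require_env("SOCRATA_USERNAME")` and `require_env("SOCRATA_PASSWORD")`, so a misconfigured environment surfaces before any request is made.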

data=input_data,
auth=auth,
)
print(response.content)
Contributor

[Question, non-blocking] What sort of content do we expect to get back from the response? I just want to make absolutely sure that it can't contain anything sensitive.

Member Author

@wrridgeway wrridgeway Jan 2, 2025

This content is pretty harmless (I think): b'{\n "Errors" : 0,\n "Rows Created" : 0,\n "Rows Updated" : 10000,\n "Rows Deleted" : 0\n}\n'
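
If there is any concern about logging the raw body, one option is to parse it and print only the expected counters. The sketch below uses the example bytes quoted above; the helper name `summarize_upload` and the `EXPECTED_KEYS` tuple are hypothetical, and the sketch assumes the response body is always JSON of this shape.

```python
import json

# Counter names taken from the example response quoted above.
EXPECTED_KEYS = ("Errors", "Rows Created", "Rows Updated", "Rows Deleted")


def summarize_upload(content: bytes) -> dict:
    """Parse the response body and keep only the known counters."""
    body = json.loads(content)
    return {key: body.get(key) for key in EXPECTED_KEYS}


content = (
    b'{\n "Errors" : 0,\n "Rows Created" : 0,\n'
    b' "Rows Updated" : 10000,\n "Rows Deleted" : 0\n}\n'
)
summary = summarize_upload(content)
print(summary)
```

Filtering to known keys means any unexpected field in the response is dropped rather than written to the logs.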

@wrridgeway
Member Author

Thanks for the comments @jeancochrane, these were really helpful.

@wrridgeway wrridgeway merged commit b5840a7 into master Jan 2, 2025
7 checks passed
@wrridgeway wrridgeway deleted the decrease-socrata-api-chunk-size branch January 2, 2025 18:16