Test urllib fix #660
f"https://datacatalog.cookcountyil.gov/resource/{asset_id}"
)
.json()[0]
.keys()
.headers["X-SODA2-Fields"]
.replace('"', "")
.strip("[")
.strip("]")
.split(",")
I was very naively pulling asset column names before. If the first row of an asset had NULL values for a column, we wouldn't get back that column header. This ensures we get all column headers.
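As a rough illustration of the header-based approach, the sketch below turns an `X-SODA2-Fields` header value into a list of column names. It assumes the header carries a JSON array of field names, which is what the quote and bracket stripping in the diff suggests; `parse_soda2_fields` is a hypothetical helper, not a function from this PR.

```python
import json


def parse_soda2_fields(header_value: str) -> list[str]:
    """Return column names from an X-SODA2-Fields header value.

    Assumes the value is a JSON array of strings, e.g. '["pin","year"]'.
    """
    fields = json.loads(header_value)
    return [str(name) for name in fields]


# Hypothetical header value, for illustration only.
example = '["pin","year","township_code"]'
print(parse_soda2_fields(example))
```

Parsing the value as JSON rather than stripping characters by hand also tolerates whitespace between the entries, though the string-stripping version in the diff should give the same result for simple names.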
Looking good! No huge concerns, but a bunch of small questions below.
socrata/socrata_upload.py
Outdated
s.auth = (
str(os.getenv("SOCRATA_USERNAME")),
str(os.getenv("SOCRATA_PASSWORD")),
)
[Question, non-blocking] I see that we were doing this before, so I get that this auth practice is already on the main branch, but now that my memory is refreshed it's making me wonder -- is there a way to auth to the API using only a token instead of also needing to set our account username/password? Without understanding the details, it strikes me as a super weird pattern (the token should be able to perform auth on its own), and I'm a bit nervous reading and sending username/password given that a leaked username/password could cause more damage than a leaked token.
This may be down to something I don't understand about auth'ing for their API, but the token alone is not sufficient. I get an authentication_required error if I don't auth here.
socrata/socrata_upload.py
Outdated
data=input_data,
auth=auth,
)
print(response.content)
[Question, non-blocking] What sort of content do we expect to get back from the response? I just want to make absolutely sure that it can't contain anything sensitive.
This content is pretty harmless (I think): b'{\n "Errors" : 0,\n "Rows Created" : 0,\n "Rows Updated" : 10000,\n "Rows Deleted" : 0\n}\n'
Thanks for the comments @jeancochrane, these were really helpful.
This PR makes sure all column names are retrieved from an open data asset, and that any chunk sent to Socrata that is larger than 10,000 rows gets sub-chunked into chunks with fewer than 10k rows.
It also uses a session object from the requests library in order to pool connections. We were getting 500 errors in the past, possibly because too many requests were being sent, and switching to connection pooling resolved that.
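For reference, the sub-chunking described above could be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `sub_chunk` and `MAX_ROWS` are hypothetical names, and treating 10,000 as an inclusive upper bound per chunk is an assumption.

```python
from typing import Iterator, Sequence

MAX_ROWS = 10_000  # assumed per-request row limit (inclusive)


def sub_chunk(rows: Sequence, max_rows: int = MAX_ROWS) -> Iterator[Sequence]:
    """Yield consecutive slices of rows, each with at most max_rows items."""
    if max_rows < 1:
        raise ValueError("max_rows must be at least 1")
    for start in range(0, len(rows), max_rows):
        yield rows[start:start + max_rows]


# A 25,000-row chunk is split into pieces no larger than MAX_ROWS.
sizes = [len(chunk) for chunk in sub_chunk(list(range(25_000)))]
print(sizes)
```

Each yielded slice could then be posted through the shared session, so a single oversized chunk never goes out in one request.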