Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

calitp-py: to_snakecase should substitute colons #41

Open
lauriemerrell opened this issue Mar 14, 2022 · 4 comments
Open

calitp-py: to_snakecase should substitute colons #41

lauriemerrell opened this issue Mar 14, 2022 · 4 comments

Comments

@lauriemerrell
Copy link
Contributor

BigQuery cannot load column names that contain colons (:). We ran into this with the AirtableToWarehouse operator:

google.api_core.exceptions.BadRequest: 400 POST https://bigquery.googleapis.com/bigquery/v2/projects/cal-itp-data-infra/datasets/airtable/tables?prettyPrint=false: Invalid field name "product:_graas". Fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 300 characters long.

The column names in that operator are pre-processed with to_snakecase, but the list of punctuation to replace in to_snakecase is manually specified and does not include colons: https://github.com/cal-itp/calitp-py/blob/main/calitp/sql.py#L153

We should add colons to this list.

Things to check:

  • Would this break any current usage of to_snakecase? It would be desirable for Airtable but I'm not sure about other locations
  • Would then need to make sure that Airflow / data-infra get their calitp versions bumped to pick up the change (this is why I am not just doing this to address the current Airtable issue)
@lauriemerrell
Copy link
Contributor Author

@atvaccaro re: cal-itp/data-infra#1218 (review)

Right now to_snakecase does different replacement behavior for quotes (', ", replaced by nothing) than for other non-alphanumeric characters (replaced by -). So I think that's at least part of the reason that it's hard-coded and the reason that I think we need to investigate how this is actually being used before we make any of these broader changes.

I guess we could replace the quotes first and then replace anything non-alphanumeric that remains with _.

@lauriemerrell
Copy link
Contributor Author

If/once we do this, we can revert cal-itp/data-infra#1218 and maybe remove the special handling for # signs in that table too.

@atvaccaro
Copy link
Contributor

Right now to_snakecase does different replacement behavior for quotes (', ", replaced by nothing) than for other non-alphanumeric characters (replaced by -). So I think that's at least part of the reason that it's hard-coded and the reason that I think we need to investigate how this is actually being used before we make any of these broader changes.

Do we know the original reasoning for different replacement characters?

I guess we could replace the quotes first and then replace anything non-alphanumeric that remains with _.

That's a good idea.

@lauriemerrell
Copy link
Contributor Author

Do we know the original reasoning for different replacement characters?

I don't, but I suspect if we look more at the different places where it's used it might become clearer. I believe it's used in the CSVToWarehouse operator so suspect that may be part of it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants