Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CPDB: clean bins #1399

Open
sf-dcp opened this issue Jan 21, 2025 · 5 comments · May be fixed by #1405
Open

CPDB: clean bins #1399

sf-dcp opened this issue Jan 21, 2025 · 5 comments · May be fixed by #1405
Assignees
Labels
bug Something isn't working db-cpdb

Comments

@sf-dcp
Copy link
Contributor

sf-dcp commented Jan 21, 2025

This issue is the result of investigation of CPDB nightly build fail.

TLDR: There are 2 problems that need to be fixed: change pre-processing of doitt_buildingfootprints and add logic to clean cpdb/data/dcp_cpdb_agencyverified.csv source dataset.

Why the build is failing?

In this place of the build, the bin column of the dcp_cpdb_agencyverified table has invalid values that can't be casted to an integer. Example of invalid values from that column:

Image

Why has the build started failing now?

CPDB started failing after we updated doitt_buildingfootprints recipe to use ingest. This dataset has remained mostly the same except the bin string representation: library version stored bin as decimal string (ex: 123.0) and ingest - integer string (ex. 123).

Then, this dataset is cleaned with our pre-processing python script where we remove records with invalid bins here:

Image

This specific line of code doesn't act as expected. It removes ALL records if bin values are not integer strings which was the case with previous doitt_buildingfootprints dataset. The doitt_buildingfootprints table was empty by the time the join with the dcp_cpdb_agencyverified table was performed, not triggering the bin casting issue. With the new doitt_buildingfootprints version, the table is non-empty and the join is attempted.

I submitted a GH issue for pandas here.

@sf-dcp sf-dcp added db-cpdb bug Something isn't working labels Jan 21, 2025
@sf-dcp sf-dcp moved this from New to 📬 Next in Data Engineering Jan 21, 2025
@alexrichey
Copy link
Contributor

@sf-dcp For cpdb/data/dcp_cpdb_agencyverified.csv since we have that checked in, should we just manually remove the ? and any other strange characters we've found? We might also want to investigate if those have been filtered out, causing issues in previous versions.

Then for doitt_buildingfootprints we should actually cast those bins to integers in ingest? It seems like str.isnumeric() is actually doing the intended job here. (is. checking whether each character in the string is numeric)

@damonmcc
Copy link
Member

I like @alexrichey's idea to remove all ? from dcp_cpdb_agencyverified.csv. if a bin like 3076115? never exists in doitt_buildingfootprints, no need to suffer from our own invalid csv

maybe we try that, rebuild, and see how it goes

on cpdb/python/doitt_buildingfootprints.py and 01_preprocessing.sh which calls it, I'm not a fan of how it drops rows and then drops the source table. would probably help to be able to see all the bins

@alexrichey
Copy link
Contributor

I'd definitely be curious to know the impact of a bin like 3076115? in that data. What was it supposed to be, and has it caused errors in previous versions?

Also agree with @damonmcc that the preprocessing isn't ideal - it should be done in SQL, and with immutability in mind. Don't feel strongly about changing that now though.

@sf-dcp
Copy link
Contributor Author

sf-dcp commented Jan 23, 2025

Thank you both! Will work on a PR with your feedback in mind.

@sf-dcp
Copy link
Contributor Author

sf-dcp commented Jan 27, 2025

Then for doitt_buildingfootprints we should actually cast those bins to integers in ingest? It seems like str.isnumeric() is actually doing the intended job here. (is. checking whether each character in the string is numeric)

Noting here that ingest in this case doesn't solve the problem because pandas casts the column into decimal data type when reading the sql table. Note, the sql table does have the correct integer data type.

With rolling out ingest, I think we should try getting rid of python pre-processing steps across products.

@sf-dcp sf-dcp linked a pull request Jan 28, 2025 that will close this issue
@damonmcc damonmcc moved this from 📬 Next to 🏗 In progress in Data Engineering Feb 3, 2025
@damonmcc damonmcc moved this from 🏗 In progress to 🔍 In review in Data Engineering Feb 3, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working db-cpdb
Projects
Status: 🔍 In review
Development

Successfully merging a pull request may close this issue.

3 participants