CPDB: clean bins #1399
Comments
@sf-dcp For Then for
I like @alexrichey's idea to remove all maybe we try that, rebuild, and see how it goes on
I'd definitely be curious to know the impact of a bin like Also agree with @damonmcc that the preprocessing isn't ideal - it should be done in SQL, and with immutability in mind. Don't feel strongly about changing that now though.
Thank you both! Will work on a PR with your feedback in mind.
Noting here that ingest in this case doesn't solve the problem because With rolling out
This issue is the result of investigating the CPDB nightly build failure.
TLDR: There are 2 problems that need to be fixed: change the pre-processing of `doitt_buildingfootprints`, and add logic to clean the `cpdb/data/dcp_cpdb_agencyverified.csv` source dataset.

Why is the build failing?
In this place of the build, the `bin` column of the `dcp_cpdb_agencyverified` table has invalid values that can't be cast to an integer. Example of invalid values from that column:

Why has the build started failing now?
CPDB started failing after we updated the `doitt_buildingfootprints` recipe to use ingest. This dataset has remained mostly the same except for the string representation of `bin`: the library version stored `bin` as a decimal string (ex: `123.0`), while ingest stores it as an integer string (ex: `123`).

Then, this dataset is cleaned with our pre-processing python script where we remove records with invalid bins here:
This specific line of code doesn't act as expected. It removes ALL records if bin values are not integer strings, which was the case with the previous `doitt_buildingfootprints` dataset. As a result, the `doitt_buildingfootprints` table was empty by the time the join with the `dcp_cpdb_agencyverified` table was performed, so the bin casting issue was never triggered. With the new `doitt_buildingfootprints` version, the table is non-empty and the join is attempted.

I submitted a GH issue for `pandas` here.
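To illustrate both problems, here is a minimal sketch of a bin check that does not depend on whether the source stores `123.0` or `123`. It is not the project's actual pre-processing code: `normalize_bin` is a hypothetical helper, the sample values are made up, and treating negative or fractional values as invalid is an assumption about what counts as a valid bin.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional


def normalize_bin(value: object) -> Optional[str]:
    """Return a bin as an integer string, or None if it is not a whole number.

    Hypothetical helper for illustration only; not part of the CPDB codebase.
    """
    if value is None:
        return None
    text = str(value).strip()
    try:
        number = Decimal(text)
    except InvalidOperation:
        return None  # not numeric at all, e.g. "" or "12a4"
    if not number.is_finite():
        return None  # "nan" or "inf"
    # Assumption: negative or fractional values are not valid bins.
    if number < 0 or number != number.to_integral_value():
        return None
    return str(int(number))


# Made-up sample values covering both string representations and bad input.
raw_bins = ["123", "123.0", " 456 ", "12a4", "", None, "123.5", "nan"]
cleaned = [normalize_bin(b) for b in raw_bins]
valid = [b for b in cleaned if b is not None]
```

With a check like this, decimal strings and integer strings normalize to the same value, so a change in the source's representation neither empties the table nor lets non-numeric values reach an integer cast. If the cleaning stays in pandas, the same function could be applied per value with `Series.map`; the comments above suggest doing the equivalent in SQL instead.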