Postgres Writer dropping data - pushing incomplete data from parquet #242

Closed
arpit94 opened this issue Jul 10, 2024 · 1 comment · Fixed by #254
Comments


arpit94 commented Jul 10, 2024

What happens?

I have a simple parquet file with two columns (bigint and varchar[] in Postgres; INT64 and BYTE_ARRAY in parquet).

When I try to write the data to Postgres using the postgres connector, data is lost: not all of it makes it into Postgres.
I am able to query the parquet file in DuckDB itself without issue (even the CSV export works fine).

To Reproduce

ATTACH 'dbname=<dbname> port=<port> user=<user> host=<host> password=<pass>' AS db (TYPE POSTGRES);
SELECT * FROM 'https://github.com/arpit94/duckdb/raw/main/data/parquet-testing/npi.parquet' where npi = 1003000126;
┌────────────┬────────────────────┐
│    npi     │ primary_taxo_codes │
│   int64    │     varchar[]      │
├────────────┼────────────────────┤
│ 1003000126 │ [207R00000X]       │
└────────────┴────────────────────┘
CREATE OR REPLACE TABLE db.public.my_table AS FROM 'https://github.com/arpit94/duckdb/raw/main/data/parquet-testing/npi.parquet';
SELECT * FROM db.public.my_table where npi = 1003000126;
┌────────────┬────────────────────┐
│    npi     │ primary_taxo_codes │
│   int64    │     varchar[]      │
├────────────┼────────────────────┤
│ 1003000126 │                    │
└────────────┴────────────────────┘

The same data round-trips correctly through CSV:

COPY (SELECT * FROM 'https://github.com/arpit94/duckdb/raw/main/data/parquet-testing/npi.parquet') TO 'output.csv' (HEADER, DELIMITER ',');
SELECT * FROM 'output.csv' WHERE npi = 1003000126;
┌────────────┬────────────────────┐
│    npi     │ primary_taxo_codes │
│   int64    │      varchar       │
├────────────┼────────────────────┤
│ 1003000126 │ [207R00000X]       │
└────────────┴────────────────────┘

OS:

Ubuntu

DuckDB Version:

1.0.0

DuckDB Client:

CLI tool

Full Name:

Arpit Aggarwal

Affiliation:

Candor Health

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a stable release

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • Yes, I have

Mytherin (Contributor) commented Sep 3, 2024

Thanks for the report! I've pushed a fix in #254 - the issue was that we were not resetting an intermediate state correctly, leading to additional NULL values creeping in.

Mytherin added a commit that referenced this issue Sep 3, 2024
Fix #242: correctly reset varchar chunk when casting data in CopyChunk
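
For illustration, here is a minimal, self-contained C++ sketch of this class of bug. The types, function names, and data below are hypothetical stand-ins, not the actual DuckDB postgres extension code: an intermediate varchar chunk is reused across batches, and its validity mask is only ever cleared (never restored) by the cast, so stale NULL flags from one batch leak into the next unless the chunk is reset first.

#include <cstdio>
#include <string>
#include <vector>

// Hypothetical stand-in for the reusable intermediate chunk of varchar
// values (illustrative names, not DuckDB's actual types).
struct VarcharChunk {
    std::vector<std::string> values;
    std::vector<bool> valid; // false == NULL

    void Reset() { // back to "empty, all rows valid"
        values.clear();
        valid.clear();
    }
};

// Cast one batch to varchar. Like many vectorized kernels, this only
// *clears* validity bits for NULL inputs and assumes the mask starts
// out all-valid.
void CastToVarchar(const std::vector<const char *> &input, VarcharChunk &out) {
    if (out.valid.size() < input.size()) {
        out.values.resize(input.size());
        out.valid.resize(input.size(), true);
    }
    for (size_t i = 0; i < input.size(); i++) {
        if (input[i] == nullptr) {
            out.valid[i] = false; // a stale 'false' from a previous batch is never undone
        } else {
            out.values[i] = input[i];
        }
    }
}

void CopyChunk(const std::vector<const char *> &input, VarcharChunk &intermediate) {
    intermediate.Reset(); // <-- the fix: clear state left over from the previous batch
    CastToVarchar(input, intermediate);
    for (size_t i = 0; i < input.size(); i++) {
        std::printf("%s\n", intermediate.valid[i] ? intermediate.values[i].c_str() : "NULL");
    }
}

int main() {
    VarcharChunk chunk;
    CopyChunk({"[207R00000X]", nullptr}, chunk);        // second row is genuinely NULL
    CopyChunk({"[207R00000X]", "[208D00000X]"}, chunk); // without Reset(), row 2 would print NULL again
}

With the Reset() call removed, the second batch's second row prints NULL even though its input is valid, which matches the missing primary_taxo_codes values shown in the reproduction above.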