Postgres Writer dropping data - pushing incomplete data from parquet #242

Closed
arpit94 opened this issue Jul 10, 2024 · 1 comment · Fixed by #254
Comments


arpit94 commented Jul 10, 2024

What happens?

I have a simple parquet file with two columns (bigint and varchar[] in Postgres; INT64 and BYTE_ARRAY in parquet).

When I try to write the data to Postgres using the postgres connector, data is lost: not all of it makes it into Postgres.
I am able to query the parquet file in DuckDB itself without issue (even the CSV export works fine).

To Reproduce

ATTACH 'dbname=<dbname> port=<port> user=<user> host=<host> password=<pass>' AS db (TYPE POSTGRES);
SELECT * FROM 'https://github.com/arpit94/duckdb/raw/main/data/parquet-testing/npi.parquet' where npi = 1003000126;
┌────────────┬────────────────────┐
│    npi     │ primary_taxo_codes │
│   int64    │     varchar[]      │
├────────────┼────────────────────┤
│ 1003000126 │ [207R00000X]       │
└────────────┴────────────────────┘
CREATE OR REPLACE TABLE db.public.my_table AS FROM 'https://github.com/arpit94/duckdb/raw/main/data/parquet-testing/npi.parquet';
SELECT * FROM db.public.my_table where npi = 1003000126;
┌────────────┬────────────────────┐
│    npi     │ primary_taxo_codes │
│   int64    │     varchar[]      │
├────────────┼────────────────────┤
│ 1003000126 │                    │
└────────────┴────────────────────┘

The same data round-trips correctly through CSV:

COPY (SELECT * FROM 'https://github.com/arpit94/duckdb/raw/main/data/parquet-testing/npi.parquet') TO 'output.csv' (HEADER, DELIMITER ',');
SELECT * FROM 'output.csv' WHERE npi = 1003000126;
┌────────────┬────────────────────┐
│    npi     │ primary_taxo_codes │
│   int64    │      varchar       │
├────────────┼────────────────────┤
│ 1003000126 │ [207R00000X]       │
└────────────┴────────────────────┘

OS:

Ubuntu

DuckDB Version:

1.0.0

DuckDB Client:

CLI tool

Full Name:

Arpit Aggarwal

Affiliation:

Candor Health

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a stable release

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • Yes, I have

Mytherin (Contributor) commented Sep 3, 2024

Thanks for the report! I've pushed a fix in #254 - the issue was that we were not resetting an intermediate state correctly, leading to additional NULL values creeping in.

Mytherin added a commit that referenced this issue Sep 3, 2024
Fix #242: correctly reset varchar chunk when casting data in CopyChunk
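
For illustration, here is a minimal, self-contained C++ sketch of this class of bug. The types, function names, and data below are hypothetical stand-ins, not the actual DuckDB postgres extension code: an intermediate varchar chunk is reused across batches, and its validity mask is only ever cleared (never restored) by the cast, so stale NULL flags from one batch leak into the next unless the chunk is reset first.

#include <cstdio>
#include <string>
#include <vector>

// Hypothetical stand-in for the reusable intermediate chunk of varchar
// values (illustrative names, not DuckDB's actual types).
struct VarcharChunk {
    std::vector<std::string> values;
    std::vector<bool> valid; // false == NULL

    void Reset() { // back to "empty, all rows valid"
        values.clear();
        valid.clear();
    }
};

// Cast one batch to varchar. Like many vectorized kernels, this only
// *clears* validity bits for NULL inputs and assumes the mask starts
// out all-valid.
void CastToVarchar(const std::vector<const char *> &input, VarcharChunk &out) {
    if (out.valid.size() < input.size()) {
        out.values.resize(input.size());
        out.valid.resize(input.size(), true);
    }
    for (size_t i = 0; i < input.size(); i++) {
        if (input[i] == nullptr) {
            out.valid[i] = false; // a stale 'false' from a previous batch is never undone
        } else {
            out.values[i] = input[i];
        }
    }
}

void CopyChunk(const std::vector<const char *> &input, VarcharChunk &intermediate) {
    intermediate.Reset(); // <-- the fix: clear state left over from the previous batch
    CastToVarchar(input, intermediate);
    for (size_t i = 0; i < input.size(); i++) {
        std::printf("%s\n", intermediate.valid[i] ? intermediate.values[i].c_str() : "NULL");
    }
}

int main() {
    VarcharChunk chunk;
    CopyChunk({"[207R00000X]", nullptr}, chunk);        // second row is genuinely NULL
    CopyChunk({"[207R00000X]", "[208D00000X]"}, chunk); // without Reset(), row 2 would print NULL again
}

With the Reset() call removed, the second batch's second row prints NULL even though its input is valid, which matches the missing primary_taxo_codes values shown in the reproduction above.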