-
Notifications
You must be signed in to change notification settings - Fork 609
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
question: how do i handle this buggy csv file? #10477
Comments
Interestingly ibis can read it into a table but pandas cannot in your Backend.execute you use Pandas 1349 through 1367 |
There's not a ton Ibis can do beyond whatever options are exposed by the backend here. You're also going to be limited by the correctness of your CSV file, which sounds like it's missing a quote. The reason it errors on the Fundamentally, with incorrect data, you cannot make any piece of software give you a "correct" count without defining what "correct" means. If a correct count means including lines that are incorrect with respect to the thing doing the parsing, then you're out of luck and you need to fix the data. |
Is Ibis mostly abstracting pandas or is it more advanced and using more Arrow? For context I am trialing ibis as a pandas alternative. |
Hey @pauljsymonds -- Ibis is abstracting tabular query engines. It's also useful to think about Ibis as "only" an interface (a dataframe interface). In fact, you can't use Ibis by itself and compute anything, you need to pair it with a backend engine. By default, we use DuckDB for this and that's a pretty good combination. |
DuckDB offers some convenient functionality for Reading Faulty CSV Files that may help you determine where the errors are. Here's an example of using Ibis with the DuckDB backend to demonstrate this behavior. Notice the tab character in Bob's row instead of the comma, which would raise an In [1]: from ibis.interactive import *
In [2]: data = """name,amount
...: Alice,100
...: Bob\t200
...: Charlie,300"""
In [3]: with open("/tmp/example.csv", "w") as f:
...: f.write(data)
...:
In [4]: con = ibis.duckdb.connect()
In [5]: con.read_csv("/tmp/example.csv", sep=",", ignore_errors=True, store_rejects=True)
Out[5]:
┏━━━━━━━━━┳━━━━━━━━┓
┃ name ┃ amount ┃
┡━━━━━━━━━╇━━━━━━━━┩
│ string │ int64 │
├─────────┼────────┤
│ Alice │ 100 │
│ Charlie │ 300 │
└─────────┴────────┘
In [6]: con.table("reject_errors")
Out[6]:
┏━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ scan_id ┃ file_id ┃ line ┃ line_byte_position ┃ byte_position ┃ column_idx ┃ column_name ┃ error_type ┃ csv_line ┃ error_message ┃
┡━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━┩
│ uint64 │ uint64 │ uint64 │ uint64 │ uint64 │ uint64 │ string │ string │ string │ string │
├─────────┼─────────┼────────┼────────────────────┼───────────────┼────────────┼─────────────┼─────────────────┼──────────┼────────────────────────┤
│ 7 │ 0 │ 3 │ 23 │ 30 │ 1 │ amount │ MISSING COLUMNS │ Bob\t200 │ Expected Number of Co… │
└─────────┴─────────┴────────┴────────────────────┴───────────────┴────────────┴─────────────┴─────────────────┴──────────┴────────────────────────┘ |
I am using locally with the default assumption is DuckDb. Is there a discord where I can ask questions, I have a few on various things I have tried with ibis using the docs. |
Ibis uses Zulip for its community chat; we'd be happy to help answer your questions there. Here's a post with more information that may help if you are new to Zulip: Announcing Zulip for Ibis community chat. |
What happened?
I am reading CSV files into ibis using read_csv, this is working fine. however when I then do a .count() the error appears.
it mentions I can do a ignore_errors = True to ignore the line, but then my count would be incorrect.
What version of ibis are you using?
9.0.0
What backend(s) are you using, if any?
DuckDb
Relevant log output
Value with undetermined quote found. when using .count()
Code of Conduct
The text was updated successfully, but these errors were encountered: