-
Notifications
You must be signed in to change notification settings - Fork 998
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reset dec='\0' to detect numbers while dec is undecided #6445
Conversation
Generated via commit a042c90 Download link for the artifact containing the test results: ↓ atime-results.zip Time taken to finish the standard R installation steps: 3 minutes and 22 seconds Time taken to run |
Okay, I just looked at the discussions of the standard, and it actually allows the comma separator (which is surprising). My concern is data that potentially mixes separators, because even preferring t <- fread(text="x\tDatetime\n1,1\t2023-10-12T06:53:53,123Z", sep="\t")
fwrite(t, file="", dec=",", sep=";")
So this fix (which, if I understand correctly, is trying to reparse with the other separator if one fails) should be OK? Here's a sample of my real data that tripped this issue, which was output by data.table's
P.S. Datetimes are weird. Were you aware that ISO 8601 allows fractional values other than seconds? E.g. |
Oh, nice example. I think that's a follow-up issue for
Line 1922 in 80c46a0
I think this is "good enough". We could possibly support
Yea, no thanks 😂 TBH it would not be that hard to support (switch to double parsing here, and don't fail for missing some components), but I doubt it's common enough to be worth maintaining: Line 1032 in 80c46a0
Some nuance, this fix is only about the "auto-detect" phase of |
From #6454, this point is moot from data.table -- |
Checking in @ben-schwen, got time for a review? Would be nice to get this last part of the 1.16.2 patch submitted. |
fread(text="t\n2023-10-12T06:53:53.123Z", colClasses="nanotime") # works
fread(text="t\n2023-10-12T06:53:53,123Z", colClasses="nanotime", sep=";") # does not |
cdebfe4
to
a042c90
Compare
Got another "refuse to merge unrelated histories" thing here, strange. Reinstated the changes but lost the history again :\ |
That's an R side issue: as.nanotime("2023-10-12T06:53:53,123Z")
# Error: Parse error on 2023-10-12T06:53:53,123Z But I'm reminded of #6107, there may be a way to get it returned to R as
Out of scope here though, filing for follow-up: #6500 |
LGTM, ty! |
Closes #6440.
dec='\0'
is reset each time when moving on to the next column:data.table/src/fread.c
Line 1250 in 80c46a0
The issue here really is isolated to timestamps with milliseconds: within a column, once
dec=','
is tried & fails to parse the timestamp (as a real number),dec=','
is left dangling --> the timestamp fails to parse. This only matters for "higher" types than real numbers where double parsing is attempted, i.e. only timestamps.Because
IS_DEC_TYPE()
excludes timestamp on master, and hence whether a timestamp parses with,
doesn't count to the "race" vs.dec='.'
, the timestamp also fails to parse with,
as the microsecond separator:But this can be nudged to work in files with fields recognized as real numbers on
master
:One incidental change of the PR as of now is that ISO timestamps will parse correctly with
,
as the millisecond separator; @kav2k confirmed this is OK and in fact recognized by the standard.