Defer as.Date() coercion to R level #6107
Generated via commit 92c8f08 Download link for the artifact containing the test results: ↓ atime-results.zip Time taken to finish the standard R installation steps: 11 minutes and 23 seconds Time taken to run |
Yes, let's see revdeps
.ci/atime/tests.R
setup = {
  DT = data.table(date=.Date(sample(20000, N, replace=TRUE)))
  tmp_csv = tempfile()
  fwrite(DT, tmp_csv)
@jangorecki I think we should run fread(DT) once here in the setup because of caching, right? Or do we need to run in a subprocess? Here the benchmark is really about what happens after the fread.c code is run; we only care about differences emerging (1) in freadR.c and/or (2) in fread.R post-processing. WDYT?
(regardless, we see substantial improvement in all 3 cases already)
src/freadR.c
@@ -335,7 +335,7 @@ bool userOverride(int8_t *type, lenOff *colNames, const char *anchor, const int
         type[i]=CT_STRING; // e.g. CT_ISO8601_DATE changed to character here so that as.POSIXct treats the date-only as local time in tests 1743.122 and 2150.11
         SET_STRING_ELT(colClassesAs, i, tt);
       }
-    } else {
+    } else if (type[i] != CT_ISO8601_DATE || tt != char_Date) {
       type[i] = typeEnum[w-1]; // freadMain checks bump up only not down
       if (w==NUT) SET_STRING_ELT(colClassesAs, i, tt);
     }
What is the consequence of a dangling else here? (As in, if the new condition evaluates to false, lines 339-340 would previously be evaluated.) I think a comment for this line would be useful here.
Thanks for the call-out, this is something I wasn't understanding well myself.
The key is we just skip updating type[i] for that case. We might consider rewriting this then as:
type[i] = (type[i] == CT_ISO8601_DATE && tt == char_Date) ? type[i] : typeEnum[w-1];
The "trouble" comes in the next line, because w==NUT in the relevant case here (i.e., "Date" is not a known class for C-side coercion). So with the simple fix, we're back to R-side as.Date() coercion.
As noted, we might prefer that (retaining fully back-compatible Date output columns), but it'll require a corresponding change in the colClasses=list() branch.
So it's down again to whether the result of this PR should be "cols parsed as IDate and requested as Date are returned as IDate, fully avoiding any coercion" or "we now skip the middle step in IDate->char->Date coercion; for maximum efficiency, request IDate instead of Date".
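The difference between the two formulations can be shown in a minimal, standalone sketch. Everything here is a simplified stand-in rather than the actual freadR.c code: the enum values are hypothetical, `requested` stands in for typeEnum[w-1], `wants_date` for tt == char_Date, `is_nut` for w==NUT, and the boolean return value for whether SET_STRING_ELT(colClassesAs, i, tt) would run (i.e. whether R-level coercion is recorded).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified type codes; the real enum in fread has more
   members and different values. */
enum { CT_INT32 = 1, CT_ISO8601_DATE = 2, CT_STRING = 3 };

/* Variant A: the guarded else-branch from the diff. If the column was
   detected as a date and "Date" was requested, nothing happens at all,
   so no R-level coercion is recorded. */
static bool override_guarded(int8_t *type, int8_t requested,
                             bool wants_date, bool is_nut)
{
    assert(type != NULL);
    if (*type != CT_ISO8601_DATE || !wants_date) {
        *type = requested;
        return is_nut;   /* coercion recorded only when w==NUT */
    }
    return false;        /* branch skipped entirely */
}

/* Variant B: the ternary rewrite. The detected type is preserved in the
   date case, but the w==NUT check still runs, so R-level coercion is
   still recorded. */
static bool override_ternary(int8_t *type, int8_t requested,
                             bool wants_date, bool is_nut)
{
    assert(type != NULL);
    *type = (*type == CT_ISO8601_DATE && wants_date) ? *type : requested;
    return is_nut;
}
```

Under these assumptions, both variants leave a detected date column's type untouched when "Date" is requested; they differ only in whether the coercion is recorded, which is the IDate-versus-Date output question raised above.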
My recollection is this was ready to submit besides the issue of tidying up the atime part (deferred to follow-up now), and we held off on the potential breaking change near release; submitting now.
Closes #6105
As seen in the tests, this might be a breaking change if any downstreams depend on the output being specifically Date (and not IDate). I am not sure it's possible -- inherits(., "Date") is still TRUE, and relevant methods should just fall back to Date methods if not available for IDate. I lean towards just going ahead with this change unless revdeps finds anything concerning, WDYT @jangorecki? cc also @HughParsonage who may have some extra insights.
One alternative I looked into that won't work is adding "Date" as an alias for "IDate" in the type hierarchy:
data.table/src/freadR.c, lines 27 to 29 in d19bfef
The problem is that there are files where our rudimentary date parser doesn't detect the date, but as.Date() does, meaning fread.c returns a string --> the hierarchy is broken as there's an attempt to cast string->int32. See in particular this test: https://github.com/Rdatatable/data.table/blob/d19bfef7026f25bb2d36c17879187d09bcb2b2c3/inst/tests/tests.Rraw#L11043
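To illustrate why the alias breaks, here is a minimal sketch of a "bump up only" type check. The enum ordering and the name `can_bump` are assumptions made for illustration; they are not the definitions used in fread.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical ordering from lowest to highest; the real hierarchy has
   more types. */
enum { SK_INT32 = 1, SK_STRING = 2 };

/* A user-requested type is honoured only if it sits at or above the type
   the parser detected; moving down the hierarchy is refused. */
static bool can_bump(int8_t detected, int8_t requested)
{
    return requested >= detected;
}
```

In this sketch, a column the parser could only read as a string (`SK_STRING`) cannot be turned into an int32-backed date (`SK_INT32`), so `can_bump(SK_STRING, SK_INT32)` is false, whereas going from int32 to string is allowed.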