Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Defer as.Date() coercion to R level #6107

Merged
merged 7 commits into from
Sep 8, 2024
Merged

Conversation

MichaelChirico
Copy link
Member

@MichaelChirico MichaelChirico commented Apr 29, 2024

Closes #6105

As seen in the tests, this might be a breaking change if any downstreams depend on the output being specifically Date (and not IDate). I am not sure it's possible -- inherits(., "Date") is still TRUE, and relevant methods should just back up to Date methods if not available for IDate. I lean towards just going ahead with this change unless revdeps finds anything concerning, WDYT @jangorecki?

cc also @HughParsonage who may have some extra insights.


One alternative I looked into that won't work is adding "Date" as an alias for "IDate" in the type hierarchy:

data.table/src/freadR.c

Lines 27 to 29 in d19bfef

static int typeSxp[NUT] = {NILSXP, LGLSXP, LGLSXP, LGLSXP, LGLSXP, LGLSXP, INTSXP, REALSXP, REALSXP, REALSXP, REALSXP, INTSXP, REALSXP, STRSXP, REALSXP, STRSXP };
static char typeRName[NUT][10]={"NULL", "logical", "logical", "logical", "logical", "logical", "integer", "integer64", "double", "double", "double", "IDate", "POSIXct", "character", "numeric", "CLASS" };
static int typeEnum[NUT] = {CT_DROP, CT_EMPTY, CT_BOOL8_N, CT_BOOL8_U, CT_BOOL8_T, CT_BOOL8_L, CT_INT32, CT_INT64, CT_FLOAT64, CT_FLOAT64_HEX, CT_FLOAT64_EXT, CT_ISO8601_DATE, CT_ISO8601_TIME, CT_STRING, CT_FLOAT64, CT_STRING};

The problem is that there are files where our rudimentary date parser doesn't detect the date, but as.Date() does, meaning fread.c returns a string --> the hierarchy is broken as there's an attempt to cast string->int32. See in particular this test:

https://github.com/Rdatatable/data.table/blob/d19bfef7026f25bb2d36c17879187d09bcb2b2c3/inst/tests/tests.Rraw#L11043

Copy link
Member Author

MichaelChirico commented Apr 29, 2024

This stack of pull requests is managed by Graphite. Learn more about stacking.

Join @MichaelChirico and the rest of your teammates on Graphite Graphite

@MichaelChirico MichaelChirico changed the title C-level changes needed Defer as.Date() coercion to R level Apr 29, 2024
@MichaelChirico MichaelChirico requested a review from tdhock April 29, 2024 21:53
@MichaelChirico

This comment was marked as outdated.

@MichaelChirico MichaelChirico force-pushed the fread-date-idate-impl branch from 39b6bfb to 0aa201d Compare May 1, 2024 06:08
Copy link

github-actions bot commented May 1, 2024

Comparison Plot

Generated via commit 92c8f08

Download link for the artifact containing the test results: ↓ atime-results.zip

Time taken to finish the standard R installation steps: 11 minutes and 23 seconds

Time taken to run atime::atime_pkg on the tests: 6 minutes and 27 seconds

@jangorecki
Copy link
Member

Yes, let's see revdeps

@MichaelChirico MichaelChirico marked this pull request as ready for review May 3, 2024 19:21
.ci/atime/tests.R Outdated Show resolved Hide resolved
.ci/atime/tests.R Outdated Show resolved Hide resolved
setup = {
DT = data.table(date=.Date(sample(20000, N, replace=TRUE)))
tmp_csv = tempfile()
fwrite(DT, tmp_csv)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jangorecki I think we should run fread(DT) once here in the setup because of cacheing, right? Or do we need to run in a subprocess? Here the benchmark is really about what happens after the fread.c code is run, only care about differences emerging (1) in freadR.c and/or (2) fread.R post-processing. WDYT?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(regardless, we see substantial improvement in all 3 cases already)

@MichaelChirico MichaelChirico force-pushed the fread-date-idate-impl branch 3 times, most recently from ca6c94d to e58781f Compare May 4, 2024 06:51
src/freadR.c Outdated
@@ -335,7 +335,7 @@ bool userOverride(int8_t *type, lenOff *colNames, const char *anchor, const int
type[i]=CT_STRING; // e.g. CT_ISO8601_DATE changed to character here so that as.POSIXct treats the date-only as local time in tests 1743.122 and 2150.11
SET_STRING_ELT(colClassesAs, i, tt);
}
} else {
} else if (type[i] != CT_ISO8601_DATE || tt != char_Date) {
type[i] = typeEnum[w-1]; // freadMain checks bump up only not down
if (w==NUT) SET_STRING_ELT(colClassesAs, i, tt);
}
Copy link
Member

@HughParsonage HughParsonage May 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the consequence of a dangling else here? (As in, if the new condition evaluates to false, lines 339-340 would previously be evaluated.) I think a comment for this line would be useful here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the call-out, this is something I wasn't understanding well myself.

The key is we just skip updating type[i] for that case. We might consider rewriting this then as:

type[i] = (type[i] == CT_ISO8601_DATE && tt == char_Date) ? type[i] : typeEnum[w-1];

The "trouble" comes in the next line, because w==NUT in the relevant case here (i.e., "Date" is not a known class for C-side coercion).

So with the simple fix, we're back to R-side as.Date() coercion.

As noted, we might prefer that (retaining fully back-compatible Date output columns), but it'll require a corresponding change in the colClasses=list() branch.

So it's down again to whether the result of this PR should be "cols parsed as IDate and requested as Date are returned as IDate, fully avoiding any coercion" or "we now skip the middle step in IDate->char->Date coercion; for maximum efficiency, request IDate instead of Date".

Anirban166

This comment was marked as outdated.

MichaelChirico added a commit that referenced this pull request Jun 1, 2024
Towards #6105. This PR is a simple refactor to reduce code nesting / improve readability. It makes the actual implementation of #6105, #6107, much simpler.
Base automatically changed from fread-date-idate to master June 1, 2024 06:28
@MichaelChirico
Copy link
Member Author

My recollection is this was ready to submit besides the issue of tidying up the atime part (deferred to follow-up now), and we held off on the potential breaking change near release; submitting now.

@MichaelChirico MichaelChirico merged commit a77e8c2 into master Sep 8, 2024
9 checks passed
@MichaelChirico MichaelChirico deleted the fread-date-idate-impl branch September 8, 2024 04:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

fread: performance difference with/without specifying colClasses
5 participants