-
Notifications
You must be signed in to change notification settings - Fork 116
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Get rid of TableRegressionModel by storing formula directly in model #339
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is looking pretty good actually. I'm pleased to see how minimal the required changes are!
The idea is that we're replacing ModelFrame
and ModelMatrix
with modelmatrix
right? And requiring that models store the returned formula? That seems like a good "separation of concerns": you're splitting out the convenience of the one-step schema-apply-modelcols stuff that happens in the ModelFrame
/ModelMatrix
constructors from the structural assumptions of those as structs. That feel like a good division to make, to me: it doesn't lock anyone into using modelmatrix
(e.g., if a package like MixedModels needs more control over what happens in that pipeline) but still provides a convenient fallback.
I think the one thing I'm still not sure about is how you'd handle the "missing mask" that ModelFrame
currently stores. Committing to storing the formula means that we'd have to find a way to incorporate that kind of information into the formula itself; I'm in favor of that approach in principle but I haven't really worked out the consequences of that choice.
Yes, actually it doesn't appear that Regarding the missing mask, that indeed raises an issue with the current state of the PR: it doesn't store the mask (nor the data). That means functions like There's also a related issue with the But is the formula the best place to store the missing mask? That sounds like a slight abuse, since that's not an information that can be known statically. |
My philosophy is that terms should be self-contained table-to-array transformers, including |
Right, but does that say anything about where the missingness mask should live? Terms that can generate missing values (like For example,
I agree modeling packages shouldn't have to deal with missing values by default, especially since it can get tricky with |
And as far as making |
In Econometrics.jl and similar packages there are multiple intermediate models (e.g., instrumental variable models with two stages, or transformations such as features absorption). A formula also has multiple parts since there are multiple model components (response, exogenous features, endogenous features, instruments, high-dimensional features to absorb, etc.). At least in my package, I don't rely on
|
32c1b9c
to
b670a14
Compare
Codecov ReportBase: 87.39% // Head: 88.75% // Increases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## master #339 +/- ##
==========================================
+ Coverage 87.39% 88.75% +1.35%
==========================================
Files 7 8 +1
Lines 952 1040 +88
==========================================
+ Hits 832 923 +91
+ Misses 120 117 -3
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
I've rebased and cleaned the PR. For now I propose we just merge this to get rid of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is looking great, thanks for resurrecting it. I do want to see the coverage to make sure that all the new fiddly bits are actually covered by the tests. I don't see it in the checks, but from manually looking up the PR in codecov (https://app.codecov.io/gh/JuliaStats/GLM.jl/compare/339/diff) it looks like one place where I think I found a bug isn't covered (predictions with non-nothing
intervals) so that at least needs some tests.
src/glmfit.jl
Outdated
l::Link=canonicallink(d); | ||
offset::Union{AbstractVector, Nothing} = nothing, | ||
wts::Union{AbstractVector, Nothing} = nothing, | ||
dofit::Bool = true, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if they're calling fit
instead of a constructor, I kinda feel like we shouldn't add the complexity of dofit
kwarg.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that doesn't make a lot of sense, but it's supported currently so I'd say we'd better keep supporting it. I've added deprecations so that we can remove them at some point.
$PREDICT_COMMON | ||
""" | ||
function predict!(res::Union{AbstractVector, | ||
NamedTuple{(:prediction, :lower, :upper), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we allow Table.ColumnTable
? Then we have an intermediate method that takes res::Any
and just called Table.columntable(res)
to invoke this method making it possible to pass e.g. DataFrames
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could do that, but TBH I don't expect people to use this method a lot (if at all). I mainly added it because it makes the design more composable and because I needed it internally. So I'd be inclined to wait and see whether people request it before making the code more complex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
very close, but some additional nitpicking
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think in general this is great, thanks for doing all this! my only quibble is that there are a few places where inconsistencies with the old behavior might sneak in:
- printing does not print the formula anymore
apply_schema
previously woudl get teh ACTUAL type of model being fit; now it just getLinPredModel
which may cause issues if someone has defined some special overloads for a particular subtype. instead, I'd suggest passing the type parameerM
tomodelframe
as a fourth argument.
Co-authored-by: Dave Kleinschmidt <[email protected]>
Good catch. While we're at it, I propose that we no longer print type parameters, which are not super useful and obscure the interesting part of the output. See last commit. If we support switching between QR and Cholesky in the future, we may print only the name of the decomposition without all the parameters. I've also changed |
For standard display I think we should print as little as possible. In my lectures I always need to tell my students "ignore this type info as it is not relevant to you". |
This is an exploration of what is needed to get rid of
TableRegressionModel
(in GLM to help designing the correct API in StatsModels (JuliaStats/StatsModels.jl#32). Overall changes are incredibly limited.The approach taken here is to have packages define
fit(::Type{M}, f::FormulaTerm, data, ...)
for their model typeM
and callmodelmatrix(f::FormulaTerm, data, contrasts)
to get the model matrix from that. Model objects then have to store the formula, which can be retrieved using theformula
function which is added to theStatisticalModel
API.fit
can still allow passing matrices directly without any formula, in which case it isnothing
.Thanks to the
formula
generic function, fallback definitions can be provided in StatsModels for many functions, likecoefnames
orpredict
. The latter is the most tricky case as missing values have to be handled: the StatsModels fallback method allocates vectors and passes them topredict!
(which has to be defined by the package), and returns them.For simplicity, the PR currently includes a few functions which should defined in StatsModels once we have agreed on the API.
modelmatrix
already exists there and just needs to handle missing values.For reference, here are notes @kleinschmidt wrote after we discussed this issue:
TableStatisticalModel
wrapperModelFrame
as a data transformer that keeps track of, e.g., which rows in the model matrix have missing values (in order to map the missings-free array rows onto the underlying table rows).modelcols
is calledIdentityTransformer
will be a placeholder. (Or maybeNothing
).transformer(::MyModel)
which will return the transformer (ModelFrame
,IdentityTransformer
, etc.). This then allows StatsModels.jl to define generic methods likecoeftable
andcoefnames
that are sensitive to the presence of a model frame and the formula it contains.And an updated summary:
apply_data
, which annotates each term with a pointer to the underlying data, which are then coalesced into array-likes withmodelcols
orTables.matrix
or somethingModelFrame
struct