Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Get rid of TableRegressionModel by storing formula directly in model #339

Merged
merged 18 commits into from
Nov 20, 2022

Conversation

nalimilan
Copy link
Member

@nalimilan nalimilan commented Oct 18, 2019

This is an exploration of what is needed to get rid of TableRegressionModel (in GLM to help designing the correct API in StatsModels (JuliaStats/StatsModels.jl#32). Overall changes are incredibly limited.

The approach taken here is to have packages define fit(::Type{M}, f::FormulaTerm, data, ...) for their model type M and call modelmatrix(f::FormulaTerm, data, contrasts) to get the model matrix from that. Model objects then have to store the formula, which can be retrieved using the formula function which is added to the StatisticalModel API. fit can still allow passing matrices directly without any formula, in which case it is nothing.

Thanks to the formula generic function, fallback definitions can be provided in StatsModels for many functions, like coefnames or predict. The latter is the most tricky case as missing values have to be handled: the StatsModels fallback method allocates vectors and passes them to predict! (which has to be defined by the package), and returns them.

For simplicity, the PR currently includes a few functions which should defined in StatsModels once we have agreed on the API. modelmatrix already exists there and just needs to handle missing values.


For reference, here are notes @kleinschmidt wrote after we discussed this issue:

  1. Remove the TableStatisticalModel wrapper
  2. Repurpose ModelFrame as a data transformer that keeps track of, e.g., which rows in the model matrix have missing values (in order to map the missings-free array rows onto the underlying table rows).
  3. By including the formula in the model frame, this allows for removing rows from the arrays that are introduced by terms like lead and lag, as long as the filtering of missing values happens after modelcols is called
  4. Models are responsible for holding onto the model frame. When the input data is not in the form of a table when the model is constructed, a fallback IdentityTransformer will be a placeholder. (Or maybe Nothing).
  5. Part of the api will be an accessor function transformer(::MyModel) which will return the transformer (ModelFrame, IdentityTransformer, etc.). This then allows StatsModels.jl to define generic methods like coeftable and coefnames that are sensitive to the presence of a model frame and the formula it contains.

And an updated summary:

  1. maybe we want to think of terms as "lazy views" into the underlying data. with this philosophy, we could think of having an additional step of apply_data, which annotates each term with a pointer to the underlying data, which are then coalesced into array-likes with modelcols or Tables.matrix or something
  2. then we can think of missing-dropping as another lazy view that only views rows that have no missings. we can apply this to the original table OR to the "table" that's generated on the output end. this would be implemented as a wrapper around the table
  3. treating the missing-dropping as a view has the advantage of allowing us to naturally write predicted values into the corresponding rows in the original table (or at least in a way that aligns with those rows).
  4. this also has the advantage of making the formula (and even individual terms) completely self-contained data transformers which eliminates the need for a ModelFrame struct

Copy link
Member

@kleinschmidt kleinschmidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is looking pretty good actually. I'm pleased to see how minimal the required changes are!

The idea is that we're replacing ModelFrame and ModelMatrix with modelmatrix right? And requiring that models store the returned formula? That seems like a good "separation of concerns": you're splitting out the convenience of the one-step schema-apply-modelcols stuff that happens in the ModelFrame/ModelMatrix constructors from the structural assumptions of those as structs. That feel like a good division to make, to me: it doesn't lock anyone into using modelmatrix (e.g., if a package like MixedModels needs more control over what happens in that pipeline) but still provides a convenient fallback.

I think the one thing I'm still not sure about is how you'd handle the "missing mask" that ModelFrame currently stores. Committing to storing the formula means that we'd have to find a way to incorporate that kind of information into the formula itself; I'm in favor of that approach in principle but I haven't really worked out the consequences of that choice.

@nalimilan
Copy link
Member Author

Yes, actually it doesn't appear that ModelFrame and ModelMatrix are needed now. So indeed packages are free to avoid modelmatrix and call individual steps instead. Though it would be good to make modelmatrix as general as possible so that most packages can use it: that makes it easier to change the pipeline if needed without breaking too many packages.

Regarding the missing mask, that indeed raises an issue with the current state of the PR: it doesn't store the mask (nor the data). That means functions like fitted (not sure if there are others) return the fitted values without the missing values from the original data.

There's also a related issue with the predict/predict! system I described above: in that approach, predict includes missing values, but predict! skips them. That sounds inconsistent/arbitrary. We should probably find another mechanism, like predict_nonmissing! or something using dispatch. The same approach could be used to define a fallback for fitted if we stored the missing mask in the formula.

But is the formula the best place to store the missing mask? That sounds like a slight abuse, since that's not an information that can be known statically.

@kleinschmidt
Copy link
Member

My philosophy is that terms should be self-contained table-to-array transformers, including FormulaTerm. Whether or not things like the missing mask should be part of that transformation I'm not so clear on. But the missing mask probably at least should be aware of the formula so that things like lead/lag can be supported. Then again, you could also leave it up to the modeling package what to do with missing values in the arrays and make it part of the API that modelcols can return arrays with missing values. But that seems a little undesirable to me.

@nalimilan
Copy link
Member Author

Right, but does that say anything about where the missingness mask should live? Terms that can generate missing values (like lead and lag) just need to mark corresponding observations as missing in the mask, and either no matrix row would be generated for them (if we can sync with other terms), or they would be dropped after the fact. In both cases, that operation can happen without storing the mask in the formula, right?

For example, modelmatrix could return the missingness mask. But that wouldn't be convenient for model implementations since they would have to pass it to all functions, and defining formula wouldn't be enough to get the nice fallback definitions to work. So I see two solutions:

  • store the mask in the formula
  • store it in a structure which behaves like a formula but also stores the missingness mask

I agree modeling packages shouldn't have to deal with missing values by default, especially since it can get tricky with lead and lag and because we could implement optimizations in common cases.

@kleinschmidt
Copy link
Member

And as far as making modelmatrix as general as possible, I think returning a tuple of arrays is the right way to go. That's all MixedModels needs I think. I'm not sure about the econometrics models though (@Nosferican or @matthieugomez could weigh in...)

@Nosferican
Copy link
Contributor

Nosferican commented Oct 19, 2019

In Econometrics.jl and similar packages there are multiple intermediate models (e.g., instrumental variable models with two stages, or transformations such as features absorption). A formula also has multiple parts since there are multiple model components (response, exogenous features, endogenous features, instruments, high-dimensional features to absorb, etc.). At least in my package, I don't rely on ModelFrame or ModelMatrix and just have separate calls to modelcols for each parsed/split Formula. The models store the Formulas as part of the model. I would love to see tools in the panel/temporal context to make the model building less involved in each package. It is in my to-do list to work out a nice predict(model, newdata) which I believe it should be doable, but haven't implement it just yet... Some term names change a bit, especially for multi-response models like mlogit and I have to keep track of dropped terms (i.e., linearly deficient). As for modelmatrix(model::RegressionModel) I think it should return a Matrix{<:Real}.
It might be good to have in the modelcols or related,

  • eterms I believe there were some time ago to keep track of the features involved (important for doing marginal effects such as with logistic regression)
  • mask are nice especially after lead/lag/diff and similar operators.
    These are fine to obtain I think, mask by checking any missing in rows in each model term. eterm I 🤔 Formula had it, but not sure it that was only ModelFrame... that one would be good to keep.

@nalimilan nalimilan force-pushed the nl/tableregressionmodel branch from 32c1b9c to b670a14 Compare June 19, 2022 17:35
@codecov-commenter
Copy link

codecov-commenter commented Jun 19, 2022

Codecov Report

Base: 87.39% // Head: 88.75% // Increases project coverage by +1.35% 🎉

Coverage data is based on head (5c6bf20) compared to base (9dc4d6b).
Patch coverage: 97.52% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #339      +/-   ##
==========================================
+ Coverage   87.39%   88.75%   +1.35%     
==========================================
  Files           7        8       +1     
  Lines         952     1040      +88     
==========================================
+ Hits          832      923      +91     
+ Misses        120      117       -3     
Impacted Files Coverage Δ
src/GLM.jl 50.00% <ø> (ø)
src/ftest.jl 100.00% <ø> (ø)
src/glmfit.jl 81.79% <94.28%> (+0.77%) ⬆️
src/lm.jl 94.26% <97.43%> (+0.93%) ⬆️
src/deprecated.jl 100.00% <100.00%> (ø)
src/linpred.jl 91.17% <100.00%> (+5.56%) ⬆️
src/negbinfit.jl 82.66% <100.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@nalimilan nalimilan marked this pull request as ready for review June 19, 2022 20:43
@nalimilan
Copy link
Member Author

I've rebased and cleaned the PR. For now I propose we just merge this to get rid of TableRegressionModel while keeping feature parity with the current release. AFAICT, we don't have to find a solution to the issue of how to store the missingness mask for that. Later we can try to improve GLM (e.g. by allowing predict to insert missing values where appropriate) and StatsModels (by moving some convenience methods like predict and a simpler modelmatrix from this PR there).

@nalimilan nalimilan requested review from kleinschmidt and palday June 19, 2022 20:43
Copy link
Member

@kleinschmidt kleinschmidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is looking great, thanks for resurrecting it. I do want to see the coverage to make sure that all the new fiddly bits are actually covered by the tests. I don't see it in the checks, but from manually looking up the PR in codecov (https://app.codecov.io/gh/JuliaStats/GLM.jl/compare/339/diff) it looks like one place where I think I found a bug isn't covered (predictions with non-nothing intervals) so that at least needs some tests.

src/linpred.jl Outdated Show resolved Hide resolved
src/lm.jl Outdated Show resolved Hide resolved
src/lm.jl Outdated Show resolved Hide resolved
src/glmfit.jl Outdated Show resolved Hide resolved
src/glmfit.jl Outdated Show resolved Hide resolved
src/linpred.jl Outdated Show resolved Hide resolved
src/deprecated.jl Outdated Show resolved Hide resolved
src/glmfit.jl Outdated Show resolved Hide resolved
src/glmfit.jl Outdated Show resolved Hide resolved
src/glmfit.jl Outdated
l::Link=canonicallink(d);
offset::Union{AbstractVector, Nothing} = nothing,
wts::Union{AbstractVector, Nothing} = nothing,
dofit::Bool = true,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if they're calling fit instead of a constructor, I kinda feel like we shouldn't add the complexity of dofit kwarg.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that doesn't make a lot of sense, but it's supported currently so I'd say we'd better keep supporting it. I've added deprecations so that we can remove them at some point.

$PREDICT_COMMON
"""
function predict!(res::Union{AbstractVector,
NamedTuple{(:prediction, :lower, :upper),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we allow Table.ColumnTable? Then we have an intermediate method that takes res::Any and just called Table.columntable(res) to invoke this method making it possible to pass e.g. DataFrames

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could do that, but TBH I don't expect people to use this method a lot (if at all). I mainly added it because it makes the design more composable and because I needed it internally. So I'd be inclined to wait and see whether people request it before making the code more complex.

Copy link
Member

@palday palday left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very close, but some additional nitpicking

src/glmfit.jl Outdated Show resolved Hide resolved
src/linpred.jl Outdated Show resolved Hide resolved
src/lm.jl Outdated Show resolved Hide resolved
src/lm.jl Show resolved Hide resolved
test/runtests.jl Outdated Show resolved Hide resolved
test/runtests.jl Show resolved Hide resolved
@nalimilan nalimilan requested a review from palday October 19, 2022 20:08
Copy link
Member

@kleinschmidt kleinschmidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think in general this is great, thanks for doing all this! my only quibble is that there are a few places where inconsistencies with the old behavior might sneak in:

  • printing does not print the formula anymore
  • apply_schema previously woudl get teh ACTUAL type of model being fit; now it just get LinPredModel which may cause issues if someone has defined some special overloads for a particular subtype. instead, I'd suggest passing the type parameer M to modelframe as a fourth argument.

test/runtests.jl Outdated Show resolved Hide resolved
nalimilan and others added 2 commits October 21, 2022 09:40
@nalimilan
Copy link
Member Author

Good catch. While we're at it, I propose that we no longer print type parameters, which are not super useful and obscure the interesting part of the output. See last commit. If we support switching between QR and Cholesky in the future, we may print only the name of the decomposition without all the parameters.

I've also changed model_frame to take the model type as you suggest.

@bkamins
Copy link
Contributor

bkamins commented Oct 21, 2022

I propose that we no longer print type parameters

For standard display I think we should print as little as possible. In my lectures I always need to tell my students "ignore this type info as it is not relevant to you".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants