Cooks Distance tests #415
Conversation
For reference, I used the following program to generate the test data with SAS:

```sas
proc reg data=work.cook_test;
    model Y = XA XB / vif;
    output out=rout_cd_col P=Pred cookd=CooksD;
run;

proc reg data=work.cook_test;
    model Y = XA;
    output out=rout_cd_col P=Pred cookd=CooksD;
run;

proc reg data=work.cook_test;
    model Y = XA / noint;
    output out=rout_cd_col P=Pred cookd=CooksD;
run;
```
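The `cookd=` values those SAS runs emit follow the standard closed form for OLS, which can be cross-checked against a brute-force leave-one-out refit. A minimal Python sketch (assuming NumPy; the function name and data below are mine, not from the PR):

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's distance for each observation of an OLS fit.

    D_i = (e_i^2 / (p * s^2)) * h_ii / (1 - h_ii)^2,
    where h_ii is the leverage and s^2 the mean squared error.
    """
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    # hat-matrix diagonal via the thin QR factor: H = Q Q', so h_ii = sum_j Q_ij^2
    Q, _ = np.linalg.qr(X)
    h = np.sum(Q ** 2, axis=1)
    s2 = resid @ resid / (n - p)
    return resid ** 2 / (p * s2) * h / (1 - h) ** 2
```

The closed form is algebraically identical to refitting without each observation and comparing the fitted values, which makes a good test oracle.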
So, looking further into the first issue:
Adding the following alias method using StatsModels:

```julia
function cooksdistance(obj::StatsModels.TableRegressionModel)
    cooksdistance(obj.model)
end
```

manages to get rid of the problems. I am not sure this is the way to go, as it introduces a dependency on StatsModels, or is that alright? For the weighted version, it appears that the results differ between SAS and the implementation. I will look into it.
Yeah, this is a wart of the way that GLM integrates with StatsModels at the moment. See JuliaStats/StatsModels.jl#32 for some discussion. The short version is that GLM doesn't at the moment know anything about the formula syntax. In the long term the plan is to get rid of that altogether and have packages like GLM take a dependency on StatsModels, but I'm afraid that keeps getting pushed back because there are some design issues. See #339 for @nalimilan's progress in re-working GLM along those lines. I'm afraid it's not totally clear to me where that leaves otherwise good contributions like this. I'd hate to see this PR languish for these upstream reasons, but I don't see a way around it at this point.
Okay, that is disappointing. I will still look into the Cook's distance with the weights scenario to complete this task and amend the PR accordingly. This is my first PR, so you might see some weird issues.

I am struck that so much effort is spent establishing an architecture supporting different models. To take the example of the formula discussed in the conversation above: it matters little to me as a user if I use a different formula syntax between GLM, LM, or MixedModels (for which I will probably still need to do something special to define the fixed and random effects). For instance, the syntax in SAS/PROC REG (see above) is different, but moving from one to the other is a shallow learning effort, in contrast to missing features (such as the Cook's distance), which will dictate which tools I use. Maybe the MixedModels folks have found solutions?

```julia
"""
    feL(m::LinearMixedModel)

Return the lower Cholesky factor for the fixed-effects parameters, as a `LowerTriangular`
`p × p` matrix.
"""
function feL(m::LinearMixedModel)
    XyL = m.L[end]
    k = size(XyL, 1)
    inds = Base.OneTo(k - 1)
    LowerTriangular(view(XyL, inds, inds))
end
```

I will keep looking further.
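The `feL` snippet above just slices the leading (k-1)×(k-1) lower triangle out of the last block of the blocked Cholesky factor, dropping the response row and column. A NumPy rendering of that slicing (my translation for illustration, not MixedModels code):

```python
import numpy as np

def fe_lower_factor(XyL):
    """Leading (k-1)x(k-1) lower-triangular block of the last Cholesky
    block XyL; the final row/column (the response part) is dropped,
    mirroring what MixedModels' feL returns as a LowerTriangular view."""
    k = XyL.shape[0]
    return np.tril(XyL[: k - 1, : k - 1])
```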
Why not just add
Happy Monday everyone! So I made some changes:
Hence I think this is good to go. Please let me know your comments or if you want me to change something.
I'm afraid I won't be much help here! I am no expert, I just put my original PR together from the formulas I found online. I had tested it against the most common Python implementation and got identical results IIRC. I ended up not needing the functionality as I moved in a different direction, but I do think it is good to have something like this available somewhere. Also, no need to cite me in the docstring, thanks for thinking of me though 😄
Co-authored-by: Milan Bouchet-Valat <[email protected]>
only on my local machine
Tests updated, awaiting further feedback.
The following points remain open, as I do not know how to fix them:
- It turns out that the "F test for model comparison" is broken, although I think this is unrelated, as it was broken from the beginning.
- "NegativeBinomial NegativeBinomialLink Fixed θ" is broken as well, but only on Julia nightly. It seems unrelated to the Cook's distance.
Yes, don't worry about the failures; they are due to unrelated printing changes in Julia 1.6.
We definitely want `cooksdistance` to be a method of `StatsBase.cooksdistance`, so this requires that the compat entry for StatsBase is updated in Project.toml:

```toml
StatsBase = "0.33.5"
```

Also need to require the latest StatsModels, as tests don't pass without it.
reverting removal
Co-authored-by: Milan Bouchet-Valat <[email protected]>
reflect dependency with StatsBase
add dependency on latest version of StatsModels
@ericqu Sorry for the terse and incomplete reviews; I'm taking quick looks during coffee breaks at the day job. So, as annoying as many review rounds can be, I hope getting feedback sooner is nonetheless helpful.
implementing @palday suggestion
@palday thank you for taking the time to review and comment. I implemented the quick changes about the
Getting very close. I'll let the other reviewers comment on style.
@nalimilan I'll add my QR decomposition trick in a follow-up PR. I can specialize even further for `cholpred` to use the existing Cholesky decomposition instead of doing a new QR.
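For context on the "QR decomposition trick": the leverages Cook's distance needs are the diagonal of the hat matrix, and a thin QR yields them without forming the explicit inverse of X'X that the review flagged. A sketch (assuming NumPy; both function names are mine):

```python
import numpy as np

def leverage_via_inverse(X):
    # explicit inversion of X'X: correct, but numerically fragile
    return np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

def leverage_via_qr(X):
    # thin QR gives H = Q Q', so h_ii is just the squared row norm of Q
    Q, _ = np.linalg.qr(X)
    return np.sum(Q ** 2, axis=1)
```

Both agree for full-rank X, but the QR path avoids squaring the condition number of the design matrix.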
Co-authored-by: Phillip Alday <[email protected]>
Co-authored-by: Phillip Alday <[email protected]>
Co-authored-by: Phillip Alday <[email protected]>
added sources of values used for testing.
Co-authored-by: Phillip Alday <[email protected]>
@palday or @nalimilan I made the change to address that. However, it keeps being presented as an issue for the PR. Is there anything that could be done?
Codecov Report

```diff
@@            Coverage Diff             @@
##           master     #415      +/-   ##
==========================================
+ Coverage   81.08%   83.81%   +2.73%
==========================================
  Files           7        6       -1
  Lines         703      766      +63
==========================================
+ Hits          570      642      +72
+ Misses        133      124       -9
```

Continue to review full report at Codecov.
CI failures seem unrelated.
I want to get that explicit matrix inversion changed as soon as possible, but have that flagged in an issue.
I'm happy when @nalimilan is happy.
Thanks!
I added some tests and an export, and used pretty much the original code.
I wanted to try to do a PR and learn something along the way.
Unfortunately, there are a few errors; I am hoping you can help me with them.
I did `dev GLM` in the package manager, and then tried `test GLM`, but got some errors in the relevant test. The first error is:
Which I believe is a problem with the function signature.
I added a call to `r2`, as it also takes an `obj::LinearModel` argument and does not throw an error. I thought about not forcing a type, but I guessed we want to make sure this is a linear model and not a generalized one.
Hence, I think I am missing something obvious, although I'm not sure what yet.
The second error is:
Here it seems the problem occurs with weights, when calling `stderror` (`stderror(x::LinPredModel) = sqrt.(diag(vcov(x)))`). I suppose some entries in the diagonal of vcov are negative. Could it be related to the weights?
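On the weighted `stderror` puzzle: for well-formed analytic weights, the covariance s² (X'WX)⁻¹ has a nonnegative diagonal by construction, so negative entries usually point at how the weights enter vcov rather than at the sqrt itself. A reference weighted-least-squares computation to compare against (assuming NumPy; a sketch, not GLM's code):

```python
import numpy as np

def wls_stderror(X, y, w):
    """Coefficient standard errors for weighted least squares:
    vcov = s^2 * (X' W X)^{-1}, with s^2 = weighted SSR / (n - p).
    The diagonal of (X' W X)^{-1} is positive for full-rank X and
    positive weights, so sqrt is always safe here."""
    n, p = X.shape
    XtWX_inv = np.linalg.inv(X.T @ (w[:, None] * X))  # (X' W X)^{-1}
    beta = XtWX_inv @ X.T @ (w * y)                   # WLS coefficients
    resid = y - X @ beta
    s2 = (w * resid ** 2).sum() / (n - p)             # weighted MSE
    return np.sqrt(np.diag(s2 * XtWX_inv))
```

Comparing this against the package's weighted output on the same data should show where the two diverge.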