Cooks Distance tests #415

Merged: 30 commits merged into JuliaStats:master on Apr 28, 2021
Conversation

@ericqu (Contributor) commented Mar 26, 2021

I added some tests and an export, and used pretty much the original code.

I wanted to try to do a PR and learn something along the way.

Unfortunately, there are a few errors; I am hoping you can help me with them.

I did dev GLM in the package manager and then ran test GLM, but got some errors in the relevant tests.

The first error is:

r2(t_lm) = 0.9532076162447004
Linear Model Cooks Distance - refers to PR #368 and issue #414: Error During Test at /home/eric/.julia/dev/GLM/test/runtests.jl:63
  Test threw exception
  Expression: isapprox(st_df.CooksD, cooksdistance(t_lm))
  MethodError: no method matching cooksdistance(::StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,CholeskyPivoted{Float64,Array{Float64,2}}}},Array{Float64,2}})
  Closest candidates are:
    cooksdistance(!Matched::LinearModel) at /home/eric/.julia/dev/GLM/src/lm.jl:251
  Stacktrace:
   [1] top-level scope at /home/eric/.julia/dev/GLM/test/runtests.jl:63
   [2] top-level scope at /home/eric/Repos/julia_install/julia-1.5.3/share/julia/stdlib/v1.5/Test/src/Test.jl:1115
   [3] top-level scope at /home/eric/.julia/dev/GLM/test/runtests.jl:53

Which I believe is a problem with the function signature.
I added a call to r2 as it also takes an obj::LinearModel argument and does not throw an error.
I thought about not forcing a type, but I guessed we want to make sure this is a linear model and not a generalized one.

Hence, I think I am missing something obvious, although I am not sure what yet.

The second error is:

r2(t_lm_w) = 0.9525895777180596
Linear Model Cooks Distance - refers to PR #368 and issue #414: Error During Test at /home/eric/.julia/dev/GLM/test/runtests.jl:52
  Got exception outside of a @test
  DomainError with -1.8726905285161664:
  sqrt will only return a complex result if called with a complex argument. Try sqrt(Complex(x)).
  Stacktrace:
   [1] throw_complex_domainerror(::Symbol, ::Float64) at ./math.jl:33
   [2] sqrt at ./math.jl:573 [inlined]
   [3] _broadcast_getindex_evalf at ./broadcast.jl:648 [inlined]
   [4] _broadcast_getindex at ./broadcast.jl:621 [inlined]
   [5] getindex at ./broadcast.jl:575 [inlined]
   [6] macro expansion at ./broadcast.jl:932 [inlined]
   [7] macro expansion at ./simdloop.jl:77 [inlined]
   [8] copyto! at ./broadcast.jl:931 [inlined]
   [9] copyto! at ./broadcast.jl:886 [inlined]
   [10] copy at ./broadcast.jl:862 [inlined]
   [11] materialize at ./broadcast.jl:837 [inlined]
   [12] stderror(::LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,CholeskyPivoted{Float64,Array{Float64,2}}}}) at /home/eric/.julia/dev/GLM/src/linpred.jl:224
   [13] coeftable(::LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,CholeskyPivoted{Float64,Array{Float64,2}}}}; level::Float64) at /home/eric/.julia/dev/GLM/src/lm.jl:195
   [14] coeftable(::LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,CholeskyPivoted{Float64,Array{Float64,2}}}}) at /home/eric/.julia/dev/GLM/src/lm.jl:194
   [15] coeftable(::StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,CholeskyPivoted{Float64,Array{Float64,2}}}},Array{Float64,2}}; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/eric/.julia/packages/StatsModels/FJnLj/src/statsmodel.jl:174
   [16] coeftable at /home/eric/.julia/packages/StatsModels/FJnLj/src/statsmodel.jl:174 [inlined]
   [17] show(::IOContext{Base.GenericIOBuffer{Array{UInt8,1}}}, ::StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,CholeskyPivoted{Float64,Array{Float64,2}}}},Array{Float64,2}}) at /home/eric/.julia/packages/StatsModels/FJnLj/src/statsmodel.jl:185
   [18] show_delim_array(::IOContext{Base.GenericIOBuffer{Array{UInt8,1}}}, ::Tuple{StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,CholeskyPivoted{Float64,Array{Float64,2}}}},Array{Float64,2}}}, ::Char, ::Char, ::Char, ::Bool, ::Int64, ::Int64) at ./show.jl:776
   [19] show_delim_array at ./show.jl:761 [inlined]
   [20] show(::IOContext{Base.GenericIOBuffer{Array{UInt8,1}}}, ::Tuple{StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,CholeskyPivoted{Float64,Array{Float64,2}}}},Array{Float64,2}}}) at ./show.jl:794
   [21] _show_default(::IOContext{Base.GenericIOBuffer{Array{UInt8,1}}}, ::Any) at ./show.jl:406
   [22] show_default at ./show.jl:389 [inlined]
   [23] show(::IOContext{Base.GenericIOBuffer{Array{UInt8,1}}}, ::Any) at ./show.jl:384
   [24] sprint(::Function, ::MethodError; context::Pair{Symbol,Bool}, sizehint::Int64) at ./strings/io.jl:103
   [25] Test.Error(::Any, ::Any, ::Any, ::Any, ::Any) at /home/eric/Repos/julia_install/julia-1.5.3/share/julia/stdlib/v1.5/Test/src/Test.jl:162
   [26] do_test(::Test.ExecutionResult, ::Any) at /home/eric/Repos/julia_install/julia-1.5.3/share/julia/stdlib/v1.5/Test/src/Test.jl:518
   [27] top-level scope at /home/eric/.julia/dev/GLM/test/runtests.jl:66
   [28] top-level scope at /home/eric/Repos/julia_install/julia-1.5.3/share/julia/stdlib/v1.5/Test/src/Test.jl:1115
   [29] top-level scope at /home/eric/.julia/dev/GLM/test/runtests.jl:53
   [30] include(::String) at ./client.jl:457
   [31] top-level scope at none:6
   [32] eval(::Module, ::Any) at ./boot.jl:331
   [33] exec_options(::Base.JLOptions) at ./client.jl:272
   [34] _start() at ./client.jl:506

Here it seems that the failure happens with weights, when stderror is called (stderror(x::LinPredModel) = sqrt.(diag(vcov(x)))); I suppose some entries of diag(vcov(x)) are negative. Could it be related to the weights?
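
For illustration, here is a minimal Julia reproduction of that failure mode (not GLM code; v is a made-up vector containing the offending value from the stack trace):

# Minimal illustration (not GLM code) of the DomainError above.
v = [0.5, -1.8726905285161664]
sqrt.(v)            # throws DomainError: sqrt of a negative Float64
sqrt.(complex.(v))  # the workaround the message suggests; returns complex values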

@ericqu (Contributor, Author) commented Mar 26, 2021

For reference, I used the following SAS programs to generate the expected values for the test data.

proc reg data=work.cook_test;
	model Y = XA XB / vif ;
	output out=rout_cd_col P=Pred cookd=CooksD;
run;

proc reg data=work.cook_test;
	model Y = XA ;
	output out=rout_cd_col P=Pred cookd=CooksD;
run;

proc reg data=work.cook_test;
	model Y = XA / noint ;
	output out=rout_cd_col P=Pred cookd=CooksD;
run;
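
For comparison, the corresponding fits in GLM.jl would look roughly like this (a sketch; cook_test is assumed to be the same SAS dataset loaded as a DataFrame, and cooksdistance is the function added by this PR):

using GLM, DataFrames

# Rough GLM.jl equivalents of the three SAS fits above (a sketch).
m1 = lm(@formula(Y ~ XA + XB), cook_test)  # model Y = XA XB
m2 = lm(@formula(Y ~ XA), cook_test)       # model Y = XA
m3 = lm(@formula(Y ~ 0 + XA), cook_test)   # model Y = XA / noint

cooksdistance(m1)  # compare against the CooksD column from SAS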

@ericqu marked this pull request as draft on March 26, 2021
@ericqu (Contributor, Author) commented Mar 26, 2021

Looking further into the first issue:
it appears that calling the lm function to do the fit eventually calls fit in StatsModels, which in turn redirects calls to r2 and adjr2 to the implementation in LinearModel when appropriate (which explains why my call to r2 was working).
This suggests that implementing cooksdistance (or any additional method) would require some changes in StatsModels as well. While that seems useful for some "generic" functions, for specific ones it would be useful to short-circuit it, and I am not sure how to do that adequately.
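
For illustration, the delegation amounts to forwarding a method from the wrapper to the wrapped model; a minimal hand-written sketch of the idea (StatsModels itself generates such methods with a macro over a fixed list of functions):

using StatsModels
import StatsBase: r2

# Sketch of the delegation idea: forward the call from the formula
# wrapper to the wrapped model stored in the .model field.
r2(m::StatsModels.TableRegressionModel) = r2(m.model)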

@ericqu mentioned this pull request on Mar 26, 2021
@ericqu (Contributor, Author) commented Mar 26, 2021

Adding the following lines and alias method in GLM/lm.jl:

using StatsModels

# Forward cooksdistance from the formula-based wrapper to the fitted model.
function cooksdistance(obj::StatsModels.TableRegressionModel)
    cooksdistance(obj.model)
end

gets rid of the problem. I am not sure this is the way to go, as it introduces a dependency on StatsModels; is that alright?

For the weighted version, it appears that the results differ between SAS and the implementation. I will look into it.

@kleinschmidt (Member) commented

Yeah, this is a wart of the way that GLM integrates with StatsModels at the moment. See JuliaStats/StatsModels.jl#32 for some discussion. The short version is that GLM doesn't at the moment know anything about the formula syntax (y ~ 1 + x); it only knows about a y vector and an X matrix. The original idea way back in the day was that StatsModels (formerly part of DataFrames) would hijack the fit(::RegressionModel, ::Formula, ::DataFrame, ...) methods, generate the arrays needed to fit the model, and wrap the whole thing in its own struct (now TableRegressionModel). The now-obvious shortcoming of that method is the one you're seeing now: only functions which are "blessed" in StatsModels itself work with that wrapper.

In the long term the plan is to get rid of that altogether and have packages like GLM take a dependency on StatsModels, but I'm afraid that keeps getting pushed back because there are some design issues. See #339 for @nalimilan's progress in re-working GLM along those lines.

I'm afraid that it's not totally clear to me where that leaves otherwise good contributions like this. I'd hate to see this PR languish for these upstream reasons but I don't see a way around it at this point.

@ericqu (Contributor, Author) commented Mar 27, 2021

Okay, that is disappointing.

I will still look into the Cook's distance with weights scenario to complete this task and amend the PR accordingly. This is my first PR, so you might see some weird issues, such as the stray .vscode/settings.json, which I will fix once I know more about GitHub.

I am struck that so much effort is spent establishing an architecture supporting different models. To take the example of the formula syntax discussed above: as a user, it matters little to me if the formula syntax differs between GLM's lm and MixedModels (for which I will probably still need to do something special to define the fixed and random effects). For instance, the syntax in SAS PROC REG (see above) is different, but moving from one to the other is a shallow learning effort. In contrast, a missing feature (such as Cook's distance) will dictate which tool I use.
While I understand there needs to be a tradeoff between the inefficiencies of completely independent models (say, code duplication) and the architecture needed to alleviate those costs, from my user perspective I balance that against the features that interested parties could develop with a lower barrier to entry.

Maybe the MixedModels folks have found solutions?
I see in https://github.com/JuliaStats/MixedModels.jl/blob/main/src/linearmixedmodel.jl that the following uses a similar signature and is not part of StatsModels:

"""
    feL(m::LinearMixedModel)
Return the lower Cholesky factor for the fixed-effects parameters, as an `LowerTriangular`
`p × p` matrix.
"""
function feL(m::LinearMixedModel)
    XyL = m.L[end]
    k = size(XyL, 1)
    inds = Base.OneTo(k - 1)
    LowerTriangular(view(XyL, inds, inds))
end

I will keep looking further.

@nalimilan (Member) commented

> I'm afraid that it's not totally clear to me where that leaves otherwise good contributions like this. I'd hate to see this PR languish for these upstream reasons but I don't see a way around it at this point.

Why not just add cooksdistance to the list of methods that StatsModels delegates to the underlying model? We can't stop adding features while waiting for the switch to the new framework, given the absence of progress on that front.

@ericqu (Contributor, Author) commented Mar 29, 2021

Happy Monday everyone!

So I made some changes:

  • I added a wrapper so that, when called with a StatsModels.TableRegressionModel, the underlying LinearModel is used.
  • I tested the Cook's distance with weights against results from SAS and R, and unfortunately the existing code did not give accurate results. It turns out that calculating the Cook's distance when the weights are normalized is relatively simple, as only the residuals need to be weighted (see the sketch after this comment).
  • I added only a couple of tests so that the testing time isn't too long (I could add more if you want).
  • It turns out that the "F test for model comparison" test is broken, although I think it is unrelated, as it was broken from the beginning.
  • "NegativeBinomial NegativeBinomialLink Fixed θ" is broken as well, but only on Julia nightly. It seems unrelated to the Cook's distance.
  • I am not sure, but I may have added a .gitignore file; let me know if I should delete it.

Hence, I think this is good to go. Please let me know your comments or if you want me to change something.
I struggled with git/GitHub; it seems OK to me now. If it isn't, please let me know as well.
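
A hedged sketch of the weighted computation described above (illustrative only, not the PR's actual code; cooks_sketch and its arguments are made-up names):

using LinearAlgebra

# Illustrative sketch: Cook's distance for a weighted least-squares fit
# with normalized weights w, design matrix X, and response y. Only the
# residuals and the MSE pick up the weights.
function cooks_sketch(X::AbstractMatrix, y::AbstractVector, w::AbstractVector)
    W = Diagonal(w)
    beta = (X' * W * X) \ (X' * W * y)        # weighted least squares
    r = y .- X * beta                         # raw residuals
    Xw = Diagonal(sqrt.(w)) * X
    h = diag(Xw * ((Xw' * Xw) \ Xw'))         # leverages of the weighted fit
    p = size(X, 2)
    mse = sum(w .* r .^ 2) / (length(y) - p)  # weighted mean squared error
    return (w .* r .^ 2 ./ (p * mse)) .* h ./ (1 .- h) .^ 2
end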

@ericqu marked this pull request as ready for review on March 29, 2021
@tbeason commented Mar 29, 2021

I'm afraid I won't be much help here! I am no expert; I just put my original PR together from the formulas I found online. I had tested it against the most common Python implementation and got identical results, IIRC. I ended up not needing the functionality as I moved in a different direction, but I do think it is good to have something like this available somewhere.

Also, no need to cite me in the docstring, thanks for thinking of me though 😄

ericqu and others added 3 commits March 29, 2021 22:04
@ericqu (Contributor, Author) left a comment

Tests updated, awaiting further feedback.

The following points remain open, as I do not know how to fix them.

  • It turns out that the "F test for model comparison" test is broken, although I think it is unrelated, as it was broken from the beginning.
  • "NegativeBinomial NegativeBinomialLink Fixed θ" is broken as well, but only on Julia nightly. It seems unrelated to the Cook's distance.

@nalimilan (Member) commented

Yes, don't worry about the failures, they are due to unrelated printing changes in Julia 1.6.

@palday (Member) left a comment

We definitely want cooksdistance to be a method of StatsBase.cooksdistance, so this requires that the compat entry for StatsBase is updated in Project.toml:

StatsBase = "0.33.5"
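
For reference, in Project.toml that entry sits under the [compat] section:

[compat]
StatsBase = "0.33.5"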

@nalimilan (Member) commented

Also, we need to require the latest StatsModels, as tests don't pass without it.

ericqu and others added 4 commits April 26, 2021 11:07
reverting removal
Co-authored-by: Milan Bouchet-Valat <[email protected]>
reflect dependency with StatsBase
add dependency on latest version of StatsModel
@ericqu requested review from nalimilan and palday on April 26, 2021
@palday (Member) commented Apr 26, 2021

@ericqu Sorry for the terse and incomplete reviews; I'm taking quick looks during coffee breaks at the day job. As annoying as many review rounds can be, I hope getting feedback sooner is nonetheless helpful.

implementing @palday suggestion
@ericqu (Contributor, Author) commented Apr 26, 2021

@palday thank you for taking the time to review and comment. I implemented the quick changes about crossmodelmatrix and the project requirements.
Probably because of my lack of experience (this is my first PR), it is rather difficult for me to test the implemented changes locally: I used pkg> dev GLM, but this doesn't push the changes back to my own fork of GLM. I am not sure how to make changes to additional packages while still being able to test the results locally before committing to the PR.
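
(A common workflow for this, sketched here with an illustrative fork URL: dev a clone of your own fork rather than the registered package.)

pkg> dev https://github.com/ericqu/GLM.jl   # clones into ~/.julia/dev/GLM
pkg> test GLM                               # tests run against the dev'd checkout

# ~/.julia/dev/GLM is an ordinary git clone, so changes can be committed
# and pushed from there to update the fork (and hence the PR).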

@palday (Member) left a comment

Getting very close. I'll let the other reviewers comment on style.

@nalimilan I'll add my QR decomposition trick in a follow-up PR. I can specialize even further for the choldpred to use the existing Cholesky decomposition instead of doing a new QR.

ericqu and others added 4 commits April 27, 2021 10:50
Co-authored-by: Phillip Alday <[email protected]>
Co-authored-by: Phillip Alday <[email protected]>
Co-authored-by: Phillip Alday <[email protected]>
added sources of values used for testing.
@ericqu requested a review from palday on April 27, 2021
Co-authored-by: Phillip Alday <[email protected]>
@ericqu (Contributor, Author) commented Apr 27, 2021

> We definitely want cooksdistance to be a method of StatsBase.cooksdistance, so this requires that the compat entry for StatsBase is updated in Project.toml:
>
> StatsBase = "0.33.5"

@palday or @nalimilan I made the change to address that. However, it keeps being flagged as an issue on the PR. Is there anything that can be done?

@codecov-commenter commented Apr 28, 2021

Codecov Report

Merging #415 (86b6a35) into master (2692f3c) will increase coverage by 2.73%.
The diff coverage is 89.47%.

❗ Current head 86b6a35 differs from pull request most recent head fc70f79. Consider uploading reports for the commit fc70f79 to get more accurate results

@@            Coverage Diff             @@
##           master     #415      +/-   ##
==========================================
+ Coverage   81.08%   83.81%   +2.73%     
==========================================
  Files           7        6       -1     
  Lines         703      766      +63     
==========================================
+ Hits          570      642      +72     
+ Misses        133      124       -9     
Impacted Files     Coverage Δ
src/linpred.jl     76.36% <75.00%> (+5.65%) ⬆️
src/lm.jl          94.82% <93.33%> (+1.49%) ⬆️
src/ftest.jl       98.52% <0.00%> (+0.06%) ⬆️
src/negbinfit.jl   82.19% <0.00%> (+0.50%) ⬆️
src/glmtools.jl    81.14% <0.00%> (+0.63%) ⬆️
src/glmfit.jl      80.14% <0.00%> (+3.69%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@palday (Member) left a comment

CI failures seem unrelated.

I want to get that explicit matrix inversion changed as soon as possible, but have that flagged in an issue.

I'm happy when @nalimilan is happy.

@nalimilan (Member) left a comment

Thanks!

@nalimilan merged commit 1ea0456 into JuliaStats:master on Apr 28, 2021
@nalimilan mentioned this pull request on May 2, 2021