Cooks Distance tests #415

Merged: 30 commits merged into JuliaStats:master on Apr 28, 2021
Conversation

@ericqu (Contributor) commented Mar 26, 2021

I added some tests and an export, and used pretty much the original code.

I wanted to try to do a PR and learn something along the way.

Unfortunately, there are a few errors; I am hoping you can help me with them.

I did dev GLM in the package manager and then ran test GLM, but got some errors in the relevant tests.

The first error is:

r2(t_lm) = 0.9532076162447004
Linear Model Cooks Distance - refers to PR #368 and issue #414: Error During Test at /home/eric/.julia/dev/GLM/test/runtests.jl:63
  Test threw exception
  Expression: isapprox(st_df.CooksD, cooksdistance(t_lm))
  MethodError: no method matching cooksdistance(::StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,CholeskyPivoted{Float64,Array{Float64,2}}}},Array{Float64,2}})
  Closest candidates are:
    cooksdistance(!Matched::LinearModel) at /home/eric/.julia/dev/GLM/src/lm.jl:251
  Stacktrace:
   [1] top-level scope at /home/eric/.julia/dev/GLM/test/runtests.jl:63
   [2] top-level scope at /home/eric/Repos/julia_install/julia-1.5.3/share/julia/stdlib/v1.5/Test/src/Test.jl:1115
   [3] top-level scope at /home/eric/.julia/dev/GLM/test/runtests.jl:53

Which I believe is a problem with the function signature.
I added a call to r2 as it also takes an obj::LinearModel argument and does not throw an error.
I thought about not forcing a type, but I guessed we want to make sure this is a linear model and not a generalized one.

Hence, I think I am missing something obvious, although I am not sure what yet.

The second error is:

r2(t_lm_w) = 0.9525895777180596
Linear Model Cooks Distance - refers to PR #368 and issue #414: Error During Test at /home/eric/.julia/dev/GLM/test/runtests.jl:52
  Got exception outside of a @test
  DomainError with -1.8726905285161664:
  sqrt will only return a complex result if called with a complex argument. Try sqrt(Complex(x)).
  Stacktrace:
   [1] throw_complex_domainerror(::Symbol, ::Float64) at ./math.jl:33
   [2] sqrt at ./math.jl:573 [inlined]
   [3] _broadcast_getindex_evalf at ./broadcast.jl:648 [inlined]
   [4] _broadcast_getindex at ./broadcast.jl:621 [inlined]
   [5] getindex at ./broadcast.jl:575 [inlined]
   [6] macro expansion at ./broadcast.jl:932 [inlined]
   [7] macro expansion at ./simdloop.jl:77 [inlined]
   [8] copyto! at ./broadcast.jl:931 [inlined]
   [9] copyto! at ./broadcast.jl:886 [inlined]
   [10] copy at ./broadcast.jl:862 [inlined]
   [11] materialize at ./broadcast.jl:837 [inlined]
   [12] stderror(::LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,CholeskyPivoted{Float64,Array{Float64,2}}}}) at /home/eric/.julia/dev/GLM/src/linpred.jl:224
   [13] coeftable(::LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,CholeskyPivoted{Float64,Array{Float64,2}}}}; level::Float64) at /home/eric/.julia/dev/GLM/src/lm.jl:195
   [14] coeftable(::LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,CholeskyPivoted{Float64,Array{Float64,2}}}}) at /home/eric/.julia/dev/GLM/src/lm.jl:194
   [15] coeftable(::StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,CholeskyPivoted{Float64,Array{Float64,2}}}},Array{Float64,2}}; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/eric/.julia/packages/StatsModels/FJnLj/src/statsmodel.jl:174
   [16] coeftable at /home/eric/.julia/packages/StatsModels/FJnLj/src/statsmodel.jl:174 [inlined]
   [17] show(::IOContext{Base.GenericIOBuffer{Array{UInt8,1}}}, ::StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,CholeskyPivoted{Float64,Array{Float64,2}}}},Array{Float64,2}}) at /home/eric/.julia/packages/StatsModels/FJnLj/src/statsmodel.jl:185
   [18] show_delim_array(::IOContext{Base.GenericIOBuffer{Array{UInt8,1}}}, ::Tuple{StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,CholeskyPivoted{Float64,Array{Float64,2}}}},Array{Float64,2}}}, ::Char, ::Char, ::Char, ::Bool, ::Int64, ::Int64) at ./show.jl:776
   [19] show_delim_array at ./show.jl:761 [inlined]
   [20] show(::IOContext{Base.GenericIOBuffer{Array{UInt8,1}}}, ::Tuple{StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,CholeskyPivoted{Float64,Array{Float64,2}}}},Array{Float64,2}}}) at ./show.jl:794
   [21] _show_default(::IOContext{Base.GenericIOBuffer{Array{UInt8,1}}}, ::Any) at ./show.jl:406
   [22] show_default at ./show.jl:389 [inlined]
   [23] show(::IOContext{Base.GenericIOBuffer{Array{UInt8,1}}}, ::Any) at ./show.jl:384
   [24] sprint(::Function, ::MethodError; context::Pair{Symbol,Bool}, sizehint::Int64) at ./strings/io.jl:103
   [25] Test.Error(::Any, ::Any, ::Any, ::Any, ::Any) at /home/eric/Repos/julia_install/julia-1.5.3/share/julia/stdlib/v1.5/Test/src/Test.jl:162
   [26] do_test(::Test.ExecutionResult, ::Any) at /home/eric/Repos/julia_install/julia-1.5.3/share/julia/stdlib/v1.5/Test/src/Test.jl:518
   [27] top-level scope at /home/eric/.julia/dev/GLM/test/runtests.jl:66
   [28] top-level scope at /home/eric/Repos/julia_install/julia-1.5.3/share/julia/stdlib/v1.5/Test/src/Test.jl:1115
   [29] top-level scope at /home/eric/.julia/dev/GLM/test/runtests.jl:53
   [30] include(::String) at ./client.jl:457
   [31] top-level scope at none:6
   [32] eval(::Module, ::Any) at ./boot.jl:331
   [33] exec_options(::Base.JLOptions) at ./client.jl:272
   [34] _start() at ./client.jl:506

Here it seems that the failure happens with weights, when stderror is called (stderror(x::LinPredModel) = sqrt.(diag(vcov(x)))); I suppose some entries of diag(vcov(x)) are negative. Could it be related to the weights?
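
For illustration, here is a minimal Julia reproduction of that failure mode (not GLM code; v is a made-up vector containing the offending value from the stack trace):

# Minimal illustration (not GLM code) of the DomainError above.
v = [0.5, -1.8726905285161664]
sqrt.(v)            # throws DomainError: sqrt of a negative Float64
sqrt.(complex.(v))  # the workaround the message suggests; returns complex values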

@ericqu (Contributor, Author) commented Mar 26, 2021

For reference, I used the following SAS programs to generate the expected values for the test data.

proc reg data=work.cook_test;
	model Y = XA XB / vif ;
	output out=rout_cd_col P=Pred cookd=CooksD;
run;

proc reg data=work.cook_test;
	model Y = XA ;
	output out=rout_cd_col P=Pred cookd=CooksD;
run;

proc reg data=work.cook_test;
	model Y = XA / noint ;
	output out=rout_cd_col P=Pred cookd=CooksD;
run;
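
For comparison, the corresponding fits in GLM.jl would look roughly like this (a sketch; cook_test is assumed to be the same SAS dataset loaded as a DataFrame, and cooksdistance is the function added by this PR):

using GLM, DataFrames

# Rough GLM.jl equivalents of the three SAS fits above (a sketch).
m1 = lm(@formula(Y ~ XA + XB), cook_test)  # model Y = XA XB
m2 = lm(@formula(Y ~ XA), cook_test)       # model Y = XA
m3 = lm(@formula(Y ~ 0 + XA), cook_test)   # model Y = XA / noint

cooksdistance(m1)  # compare against the CooksD column from SAS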

@ericqu marked this pull request as draft on March 26, 2021
@ericqu (Contributor, Author) commented Mar 26, 2021

Looking further into the first issue:
it appears that calling the lm function to do the fit eventually calls fit in StatsModels, which in turn redirects calls to r2 and adjr2 to the implementation in LinearModel when appropriate (which explains why my call to r2 was working).
This suggests that implementing cooksdistance (or any additional method) would require some changes in StatsModels as well. While that seems useful for some "generic" functions, for specific ones it would be useful to short-circuit it, and I am not sure how to do that adequately.
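
For illustration, the delegation amounts to forwarding a method from the wrapper to the wrapped model; a minimal hand-written sketch of the idea (StatsModels itself generates such methods with a macro over a fixed list of functions):

using StatsModels
import StatsBase: r2

# Sketch of the delegation idea: forward the call from the formula
# wrapper to the wrapped model stored in the .model field.
r2(m::StatsModels.TableRegressionModel) = r2(m.model)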

@ericqu mentioned this pull request on Mar 26, 2021
@ericqu (Contributor, Author) commented Mar 26, 2021

Adding the following lines and alias method in GLM/lm.jl:

using StatsModels

# Forward cooksdistance from the formula-based wrapper to the fitted model.
function cooksdistance(obj::StatsModels.TableRegressionModel)
    cooksdistance(obj.model)
end

gets rid of the problem. I am not sure this is the way to go, as it introduces a dependency on StatsModels; is that alright?

For the weighted version, it appears that the results differ between SAS and the implementation. I will look into it.

@kleinschmidt (Member) commented

Yeah, this is a wart of the way that GLM integrates with StatsModels at the moment. See JuliaStats/StatsModels.jl#32 for some discussion. The short version is that GLM doesn't at the moment know anything about the formula syntax (y ~ 1 + x); it only knows about a y vector and an X matrix. The original idea way back in the day was that StatsModels (formerly part of DataFrames) would hijack the fit(::RegressionModel, ::Formula, ::DataFrame, ...) methods, generate the arrays needed to fit the model, and wrap the whole thing in its own struct (now TableRegressionModel). The now-obvious shortcoming of that method is the one you're seeing now: only functions which are "blessed" in StatsModels itself work with that wrapper.

In the long term the plan is to get rid of that altogether and have packages like GLM take a dependency on StatsModels, but I'm afraid that keeps getting pushed back because there are some design issues. See #339 for @nalimilan's progress in re-working GLM along those lines.

I'm afraid that it's not totally clear to me where that leaves otherwise good contributions like this. I'd hate to see this PR languish for these upstream reasons but I don't see a way around it at this point.

@ericqu (Contributor, Author) commented Mar 27, 2021

Okay, that is disappointing.

I will still look into the Cook's distance with weights scenario to complete this task and amend the PR accordingly. This is my first PR, so you might see some weird issues, such as the stray .vscode/settings.json, which I will fix once I know more about GitHub.

I am struck that so much effort is spent establishing an architecture supporting different models. To take the example of the formula syntax discussed above: as a user, it matters little to me if the formula syntax differs between GLM's lm and MixedModels (for which I will probably still need to do something special to define the fixed and random effects). For instance, the syntax in SAS PROC REG (see above) is different, but moving from one to the other is a shallow learning effort. In contrast, a missing feature (such as Cook's distance) will dictate which tool I use.
While I understand there needs to be a tradeoff between the inefficiencies of completely independent models (say, code duplication) and the architecture needed to alleviate those costs, from my user perspective I balance that against the features that interested parties could develop with a lower barrier to entry.

Maybe the MixedModels folks have found solutions?
I see in https://github.com/JuliaStats/MixedModels.jl/blob/main/src/linearmixedmodel.jl that the following uses a similar signature and is not part of StatsModels:

"""
    feL(m::LinearMixedModel)
Return the lower Cholesky factor for the fixed-effects parameters, as an `LowerTriangular`
`p × p` matrix.
"""
function feL(m::LinearMixedModel)
    XyL = m.L[end]
    k = size(XyL, 1)
    inds = Base.OneTo(k - 1)
    LowerTriangular(view(XyL, inds, inds))
end

I will keep looking further.

@nalimilan (Member) commented

> I'm afraid that it's not totally clear to me where that leaves otherwise good contributions like this. I'd hate to see this PR languish for these upstream reasons but I don't see a way around it at this point.

Why not just add cooksdistance to the list of methods that StatsModels delegates to the underlying model? We can't stop adding features while waiting for the switch to the new framework, given the absence of progress on that front.

@ericqu (Contributor, Author) commented Mar 29, 2021

Happy Monday everyone!

So I made some changes:

  • I added a wrapper so that, when called with a StatsModels.TableRegressionModel, the underlying LinearModel is used.
  • I tested the Cook's distance with weights against results from SAS and R, and unfortunately the existing code did not give accurate results. It turns out that calculating the Cook's distance when the weights are normalized is relatively simple, as only the residuals need to be weighted (see the sketch after this comment).
  • I added only a couple of tests so that the testing time isn't too long (I could add more if you want).
  • It turns out that the "F test for model comparison" test is broken, although I think it is unrelated, as it was broken from the beginning.
  • "NegativeBinomial NegativeBinomialLink Fixed θ" is broken as well, but only on Julia nightly. It seems unrelated to the Cook's distance.
  • I am not sure, but I may have added a .gitignore file; let me know if I should delete it.

Hence, I think this is good to go. Please let me know your comments or if you want me to change something.
I struggled with git/GitHub; it seems OK to me now. If it isn't, please let me know as well.
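
A hedged sketch of the weighted computation described above (illustrative only, not the PR's actual code; cooks_sketch and its arguments are made-up names):

using LinearAlgebra

# Illustrative sketch: Cook's distance for a weighted least-squares fit
# with normalized weights w, design matrix X, and response y. Only the
# residuals and the MSE pick up the weights.
function cooks_sketch(X::AbstractMatrix, y::AbstractVector, w::AbstractVector)
    W = Diagonal(w)
    beta = (X' * W * X) \ (X' * W * y)        # weighted least squares
    r = y .- X * beta                         # raw residuals
    Xw = Diagonal(sqrt.(w)) * X
    h = diag(Xw * ((Xw' * Xw) \ Xw'))         # leverages of the weighted fit
    p = size(X, 2)
    mse = sum(w .* r .^ 2) / (length(y) - p)  # weighted mean squared error
    return (w .* r .^ 2 ./ (p * mse)) .* h ./ (1 .- h) .^ 2
end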

@ericqu marked this pull request as ready for review on March 29, 2021
@tbeason commented Mar 29, 2021

I'm afraid I won't be much help here! I am no expert; I just put my original PR together from the formulas I found online. I had tested it against the most common Python implementation and got identical results, IIRC. I ended up not needing the functionality as I moved in a different direction, but I do think it is good to have something like this available somewhere.

Also, no need to cite me in the docstring, thanks for thinking of me though 😄

ericqu and others added 3 commits March 29, 2021 22:04
@ericqu (Contributor, Author) left a comment

Tests updated, awaiting further feedback.

The following points remain open, as I do not know how to fix them.

  • It turns out that the "F test for model comparison" test is broken, although I think it is unrelated, as it was broken from the beginning.
  • "NegativeBinomial NegativeBinomialLink Fixed θ" is broken as well, but only on Julia nightly. It seems unrelated to the Cook's distance.

@nalimilan (Member) commented

Yes, don't worry about the failures, they are due to unrelated printing changes in Julia 1.6.

@palday (Member) left a comment

We definitely want cooksdistance to be a method of StatsBase.cooksdistance, so this requires that the compat entry for StatsBase is updated in Project.toml:

StatsBase = "0.33.5"
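
For reference, in Project.toml that entry sits under the [compat] section:

[compat]
StatsBase = "0.33.5"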

@nalimilan (Member) commented

Also, we need to require the latest StatsModels, as tests don't pass without it.

ericqu and others added 4 commits April 26, 2021 11:07
reverting removal
Co-authored-by: Milan Bouchet-Valat <[email protected]>
reflect dependency with StatsBase
add dependency on latest version of StatsModel
@ericqu requested review from nalimilan and palday on April 26, 2021
@palday (Member) commented Apr 26, 2021

@ericqu Sorry for the terse and incomplete reviews; I'm taking quick looks during coffee breaks at the day job. As annoying as many review rounds can be, I hope getting feedback sooner is nonetheless helpful.

implementing @palday suggestion
@ericqu (Contributor, Author) commented Apr 26, 2021

@palday thank you for taking the time to review and comment. I implemented the quick changes about crossmodelmatrix and the project requirements.
Probably because of my lack of experience (this is my first PR), it is rather difficult for me to test the implemented changes locally: I used pkg> dev GLM, but this doesn't push the changes back to my own fork of GLM. I am not sure how to make changes to additional packages while still being able to test the results locally before committing to the PR.
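
(A common workflow for this, sketched here with an illustrative fork URL: dev a clone of your own fork rather than the registered package.)

pkg> dev https://github.com/ericqu/GLM.jl   # clones into ~/.julia/dev/GLM
pkg> test GLM                               # tests run against the dev'd checkout

# ~/.julia/dev/GLM is an ordinary git clone, so changes can be committed
# and pushed from there to update the fork (and hence the PR).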

@palday (Member) left a comment

Getting very close. I'll let the other reviewers comment on style.

@nalimilan I'll add my QR decomposition trick in a follow-up PR. I can specialize even further for the choldpred to use the existing Cholesky decomposition instead of doing a new QR.

ericqu and others added 4 commits April 27, 2021 10:50
Co-authored-by: Phillip Alday <[email protected]>
Co-authored-by: Phillip Alday <[email protected]>
Co-authored-by: Phillip Alday <[email protected]>
added sources of values used for testing.
@ericqu requested a review from palday on April 27, 2021
Co-authored-by: Phillip Alday <[email protected]>
@ericqu (Contributor, Author) commented Apr 27, 2021

> We definitely want cooksdistance to be a method of StatsBase.cooksdistance, so this requires that the compat entry for StatsBase is updated in Project.toml:
>
> StatsBase = "0.33.5"

@palday or @nalimilan I made the change to address that. However, it keeps being flagged as an issue on the PR. Is there anything that can be done?

@codecov-commenter commented Apr 28, 2021

Codecov Report

Merging #415 (86b6a35) into master (2692f3c) will increase coverage by 2.73%.
The diff coverage is 89.47%.

❗ Current head 86b6a35 differs from pull request most recent head fc70f79. Consider uploading reports for the commit fc70f79 to get more accurate results

@@            Coverage Diff             @@
##           master     #415      +/-   ##
==========================================
+ Coverage   81.08%   83.81%   +2.73%     
==========================================
  Files           7        6       -1     
  Lines         703      766      +63     
==========================================
+ Hits          570      642      +72     
+ Misses        133      124       -9     
Impacted Files     Coverage Δ
src/linpred.jl     76.36% <75.00%> (+5.65%) ⬆️
src/lm.jl          94.82% <93.33%> (+1.49%) ⬆️
src/ftest.jl       98.52% <0.00%> (+0.06%) ⬆️
src/negbinfit.jl   82.19% <0.00%> (+0.50%) ⬆️
src/glmtools.jl    81.14% <0.00%> (+0.63%) ⬆️
src/glmfit.jl      80.14% <0.00%> (+3.69%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@palday (Member) left a comment

CI failures seem unrelated.

I want to get that explicit matrix inversion changed as soon as possible, but have that flagged in an issue.

I'm happy when @nalimilan is happy.

@nalimilan (Member) left a comment

Thanks!

@nalimilan merged commit 1ea0456 into JuliaStats:master on Apr 28, 2021
@nalimilan mentioned this pull request on May 2, 2021