From 1972616aa64c3513fdf565b50be938a5b2219f5d Mon Sep 17 00:00:00 2001
From: George Ho <19851673+eigenfoo@users.noreply.github.com>
Date: Fri, 28 Jun 2019 13:59:58 +0800
Subject: [PATCH] ENH: onboard lindeloevs feedback

---
 index.html            | 54 +++++++++++++++++-----------------
 tests-as-linear.ipynb | 68 +++++++++++++++++++++----------------------
 2 files changed, 61 insertions(+), 61 deletions(-)

diff --git a/index.html b/index.html
index 9509c82..93e33df 100644
--- a/index.html
+++ b/index.html
@@ -13167,7 +13167,7 @@

 Common statistical tests are linear models: Python port
-Last updated: June 27, 2019
+Last updated: June 28, 2019

@@ -13439,7 +13439,7 @@

 3.0.2 Theory: rank-transformation
 3.0.3 Python code: Pearson correlation
 It couldn't be much simpler to run these models with statsmodels (smf.ols) or scipy (scipy.stats.pearsonr). They yield identical slopes, p and t values, but there's a catch: smf.ols gives you the slope and even though that is usually much more interpretable and informative than the correlation coefficient $r$, you may still want $r$. Luckily, the slope becomes $r$ if x and y have a standard deviation of exactly 1. You can do this by scaling the data: data /= data.std().
-Notice how scipy.stats.pearsonr and smf.ols (scaled) have the same slopes, $p$ and $t$ values.
+Notice how scipy.stats.pearsonr and smf.ols (scaled) have the same slopes, $p$ and $t$ values. Also note that statistical functions from scipy.stats do not provide confidence intervals, while performing the linear regression with smf.ols does.

@@ -13451,7 +13451,7 @@ 3.0.3 Python code: Pearson correlation
 correlated = pd.DataFrame()
 correlated["x"] = np.linspace(0, 1)
-correlated["y"] = 5 * correlated.x + 2 * np.random.randn(len(correlated.x))
+correlated["y"] = 1.5 * correlated.x + 2 * np.random.randn(len(correlated.x))

 scaled = correlated / correlated.std()
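The claim in the prose above — that the OLS slope equals Pearson's $r$ once both columns are scaled to unit standard deviation — is easy to check outside the patch. A minimal sketch, not part of the diff; the seed is arbitrary and the toy data mirror the snippet above:

```python
# Sketch: Pearson's r equals the OLS slope after scaling each column
# to unit standard deviation.
import numpy as np
import pandas as pd
import scipy.stats
import statsmodels.formula.api as smf

np.random.seed(1618)  # arbitrary seed, for reproducibility only
correlated = pd.DataFrame()
correlated["x"] = np.linspace(0, 1)
correlated["y"] = 1.5 * correlated.x + 2 * np.random.randn(len(correlated.x))

r, p = scipy.stats.pearsonr(correlated.x, correlated.y)
scaled = correlated / correlated.std()
slope = smf.ols("y ~ 1 + x", data=scaled).fit().params["x"]

assert np.isclose(r, slope)  # identical up to floating-point error
```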
 
@@ -13518,27 +13518,27 @@ 3.0.3 Python code: Pearson correlation
 scipy.stats.pearsonr
-0.649620
-3.321709e-07
+0.249694
+0.080332
 NaN
 NaN
 NaN
 smf.ols
-5.012744
-3.321709e-07
-5.91995
-3.310230
-6.715258
+1.512744
+0.080332
+1.78652
+-0.189770
+3.215258
 smf.ols (scaled)
-0.649620
-3.321709e-07
-5.91995
-0.428985
-0.870255
+0.249694
+0.080332
+1.78652
+-0.031324
+0.530712
@@ -13628,19 +13628,19 @@ 3.0.4 Python code: Spearman correlation
 scipy.stats.spearmanr
-0.634958
-7.322277e-07
+0.233421
+0.102803
 NaN
 NaN
 NaN
 smf.ols (ranked)
-0.634958
-7.322277e-07
-5.694307
-0.410757
-0.859159
+0.233421
+0.102803
+1.663134
+-0.048772
+0.515615
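The Spearman hunk above rests on the same equivalence with one extra step: Spearman's rho is simply Pearson's r computed on rank-transformed data. A hedged sketch with invented toy data, not from the patch:

```python
# Sketch: Spearman's rho is Pearson's r on the rank-transformed data.
import numpy as np
import pandas as pd
import scipy.stats

np.random.seed(1618)  # arbitrary
x = np.random.randn(50)
y = 1.5 * x + 2 * np.random.randn(50)

rho, p = scipy.stats.spearmanr(x, y)
r_ranked, _ = scipy.stats.pearsonr(pd.Series(x).rank(), pd.Series(y).rank())
assert np.isclose(rho, r_ranked)
```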

@@ -15234,7 +15234,7 @@ 6.2 Two-way ANOVA
 6.2.2 Python code: Two-way ANOVA
 Note on Python port:
-Unfortunately, scipy.stats does not have any function to perform a two-way ANOVA, so we can't verify that the linear model gives the same results as some other Python statistical function. Nevertheless, we'll go through the motions of performing the linear regression.
+Unfortunately, scipy.stats does not have a dedicated function to perform two-way ANOVA, so we cannot demonstrate directly that it is fundamentally a linear model. Nevertheless, we will write the code to perform the linear regression.
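The note above still runs the two-way ANOVA as a plain linear model. A sketch of what that regression looks like; the data frame and column names (group, mood, y) are invented for illustration:

```python
# Sketch: two-way ANOVA as a linear model with an interaction term.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

np.random.seed(1618)  # arbitrary
df = pd.DataFrame({
    "group": np.repeat(["a", "b"], 30),
    "mood": np.tile(np.repeat(["happy", "sad"], 15), 2),
})
df["y"] = np.random.randn(len(df))

# y ~ group * mood expands to both main effects plus the interaction.
res = smf.ols("y ~ group * mood", data=df).fit()
print(anova_lm(res))  # F-tests for group, mood, and group:mood
```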

@@ -15326,7 +15326,7 @@ 6.3 ANCOVA
 Note on Python port:
-Unfortunately, scipy.stats does not have any function to perform ANCOVA, so again, we can't verify that the linear model gives the same results as some other Python statistical function. Nevertheless, we'll go through the motions of performing the linear regression.
+Unfortunately, scipy.stats does not have a dedicated function to perform ANCOVA, so again, we cannot demonstrate directly that it is fundamentally a linear model. Nevertheless, we will write the code to perform the linear regression.
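ANCOVA has the same linear-model shape: one categorical factor plus a continuous covariate in a single smf.ols formula. A minimal sketch with invented data and column names:

```python
# Sketch: ANCOVA as a linear model (factor + continuous covariate).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

np.random.seed(1618)  # arbitrary
df = pd.DataFrame({
    "group": np.repeat(["a", "b", "c"], 20),
    "age": np.random.uniform(20, 60, 60),
})
df["y"] = 0.05 * df.age + np.random.randn(len(df))

# Test the group effect while adjusting for the covariate.
res = smf.ols("y ~ group + age", data=df).fit()
print(anova_lm(res))
```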
Note on Python port: - Unfortunately, scipy.stats does not have any function to perform ANCOVA, so again, we can't verify that the linear model gives the same results as some other Python statistical function. Nevertheless, we'll go through the motions of performing the linear regression. + Unfortunately, scipy.stats does not have a dedicated function to perform ANCOVA, so again, we cannot demonstrate directly that it is fundamentally a linear model. Nevertheless, we will write the code to perform the linear regression.
@@ -15442,7 +15442,7 @@ 7.1.3 Python code: Goodness of fit
 Note that smf.ols does not support GLMs: we need to use sm.GLM. While sm.GLM does not have a patsy-formula interface, we can still use patsy.dmatrices to get the endog and exog design matrices, and then feed that into sm.GLM.
 Note on Python port:
-Unfortunately, statsmodels does not currently support performing a one-way ANOVA test on GLMs (the anova_lm function only works for linear models), so while we can perform the GLM, there is no support for computing the F-statistic or its p-value. Nevertheless, we'll go through the motions of performing the generalized linear regression.
+Unfortunately, statsmodels does not currently support performing a one-way ANOVA test on GLMs (the anova_lm function only works for linear models), so while we can perform the GLM, there is no support for computing the F-statistic or its p-value. Nevertheless, we will write the code to perform the generalized linear regression.
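The patsy.dmatrices → sm.GLM workflow this hunk describes looks roughly like the sketch below; the Poisson family and the toy counts data are assumptions for illustration, not from the patch:

```python
# Sketch: fitting a GLM via patsy design matrices, per the note above.
import numpy as np
import pandas as pd
import patsy
import statsmodels.api as sm

np.random.seed(1618)  # arbitrary
df = pd.DataFrame({"mood": np.repeat(["happy", "sad", "meh"], 20)})
df["counts"] = np.random.poisson(10, size=len(df))  # toy count outcome

# patsy.dmatrices turns the formula into endog (y) and exog (X) design
# matrices, which sm.GLM accepts directly.
endog, exog = patsy.dmatrices("counts ~ mood", data=df)
res = sm.GLM(endog, exog, family=sm.families.Poisson()).fit()
print(res.summary())
```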
@@ -15738,7 +15738,7 @@ 10 Limitations
 robust models would be preferable, but fail to show the equivalences.
-• Several named tests are still missing from the list and may be added at a later time. This includes the Sign test (require large N to be reasonably approximated by a linear model), Friedman as RM-ANOVA on rank(y), McNemar, and Binomial/Multinomial. See stuff on these in the section on links to further equivalences. If you think that they should be included here, feel free to submit "solutions" to the GitHub repo (https://github.com/lindeloev/tests-as-linear/) of this doc!
+• Several named tests are still missing from the list and may be added at a later time. This includes the Sign test (require large N to be reasonably approximated by a linear model), Friedman as RM-ANOVA on rank(y), McNemar, and Binomial/Multinomial. See stuff on these in the section on links to further equivalences. If you think that they should be included here, feel free to submit "solutions" to the GitHub repo (https://github.com/eigenfoo/tests-as-linear/) of this doc!

@@ -15749,7 +15749,7 @@ 10 Limitations
 11 License
 Creative Commons License
-Common statistical tests are linear models: Python port by Jonas Kristoffer Lindeløv and George Ho is licensed under a Creative Commons Attribution 4.0 International License.
+Common statistical tests are linear models: Python port by George Ho and Jonas Kristoffer Lindeløv is licensed under a Creative Commons Attribution 4.0 International License.
 Based on a work at https://lindeloev.github.io/tests-as-linear/.
 Permissions beyond the scope of this license may be available at https://github.com/eigenfoo/tests-as-linear.

diff --git a/tests-as-linear.ipynb b/tests-as-linear.ipynb
index 37d321d..f3c74fa 100644
--- a/tests-as-linear.ipynb
+++ b/tests-as-linear.ipynb
@@ -21,7 +21,7 @@
   {
    "data": {
     "text/markdown": [
-     "Last updated: June 27, 2019"
+     "Last updated: June 28, 2019"
    ],
    "text/plain": [
     ""
    ]
@@ -266,7 +266,7 @@
    "\n",
    "It couldn't be much simpler to run these models with `statsmodels` ([`smf.ols`](https://www.statsmodels.org/stable/example_formulas.html#ols-regression-using-formulas)) or `scipy` ([`scipy.stats.pearsonr`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html)). They yield identical slopes, `p` and `t` values, but there's a catch: `smf.ols` gives you the *slope* and even though that is usually much more interpretable and informative than the _correlation coefficient_ $r$, you may still want $r$. Luckily, the slope becomes $r$ if `x` and `y` have a standard deviation of exactly 1. You can do this by scaling the data: `data /= data.std()`.\n",
    "\n",
-   "Notice how `scipy.stats.pearsonr` and `smf.ols (scaled)` have the same slopes, $p$ and $t$ values."
+   "Notice how `scipy.stats.pearsonr` and `smf.ols (scaled)` have the same slopes, $p$ and $t$ values. Also note that statistical functions from `scipy.stats` do not provide confidence intervals, while performing the linear regression with `smf.ols` does."
   ]
  },
  {
@@ -277,7 +277,7 @@
   "source": [
    "correlated = pd.DataFrame()\n",
    "correlated[\"x\"] = np.linspace(0, 1)\n",
-   "correlated[\"y\"] = 5 * correlated.x + 2 * np.random.randn(len(correlated.x))\n",
+   "correlated[\"y\"] = 1.5 * correlated.x + 2 * np.random.randn(len(correlated.x))\n",
    "\n",
    "scaled = correlated / correlated.std()\n",
    "\n",
@@ -324,37 +324,37 @@
    "    \n",
    "    \n",
    "      scipy.stats.pearsonr\n",
-   "      0.649620\n",
-   "      3.321709e-07\n",
+   "      0.249694\n",
+   "      0.080332\n",
    "      NaN\n",
    "      NaN\n",
    "      NaN\n",
    "    \n",
    "    \n",
    "      smf.ols\n",
-   "      5.012744\n",
-   "      3.321709e-07\n",
-   "      5.91995\n",
-   "      3.310230\n",
-   "      6.715258\n",
+   "      1.512744\n",
+   "      0.080332\n",
+   "      1.78652\n",
+   "      -0.189770\n",
+   "      3.215258\n",
    "    \n",
    "    \n",
    "      smf.ols (scaled)\n",
-   "      0.649620\n",
-   "      3.321709e-07\n",
-   "      5.91995\n",
-   "      0.428985\n",
-   "      0.870255\n",
+   "      0.249694\n",
+   "      0.080332\n",
+   "      1.78652\n",
+   "      -0.031324\n",
+   "      0.530712\n",
    "    \n",
    "    \n",
    "\n",
    ""
   ],
   "text/plain": [
-   "                         value      p-values  t-values  0.025 CI  0.975 CI\n",
-   "scipy.stats.pearsonr  0.649620  3.321709e-07       NaN       NaN       NaN\n",
-   "smf.ols               5.012744  3.321709e-07   5.91995  3.310230  6.715258\n",
-   "smf.ols (scaled)      0.649620  3.321709e-07   5.91995  0.428985  0.870255"
+   "                         value  p-values  t-values  0.025 CI  0.975 CI\n",
+   "scipy.stats.pearsonr  0.249694  0.080332       NaN       NaN       NaN\n",
+   "smf.ols               1.512744  0.080332   1.78652 -0.189770  3.215258\n",
+   "smf.ols (scaled)      0.249694  0.080332   1.78652 -0.031324  0.530712"
   ]
  },
  "execution_count": 8,
@@ -427,28 +427,28 @@
    "    \n",
    "    \n",
    "      scipy.stats.spearmanr\n",
-   "      0.634958\n",
-   "      7.322277e-07\n",
+   "      0.233421\n",
+   "      0.102803\n",
    "      NaN\n",
    "      NaN\n",
    "      NaN\n",
    "    \n",
    "    \n",
    "      smf.ols (ranked)\n",
-   "      0.634958\n",
-   "      7.322277e-07\n",
-   "      5.694307\n",
-   "      0.410757\n",
-   "      0.859159\n",
+   "      0.233421\n",
+   "      0.102803\n",
+   "      1.663134\n",
+   "      -0.048772\n",
+   "      0.515615\n",
    "    \n",
    "    \n",
    "\n",
    ""
   ],
   "text/plain": [
-   "                          value      p-values  t-values  0.025 CI  0.975 CI\n",
-   "scipy.stats.spearmanr  0.634958  7.322277e-07       NaN       NaN       NaN\n",
-   "smf.ols (ranked)       0.634958  7.322277e-07  5.694307  0.410757  0.859159"
+   "                          value  p-values  t-values  0.025 CI  0.975 CI\n",
+   "scipy.stats.spearmanr  0.233421  0.102803       NaN       NaN       NaN\n",
+   "smf.ols (ranked)       0.233421  0.102803  1.663134 -0.048772  0.515615"
   ]
  },
 "execution_count": 10,
@@ -1941,7 +1941,7 @@
    "\n",
    "\n",
    "    Note on Python port:\n",
-   "    Unfortunately, scipy.stats does not have any function to perform a two-way ANOVA, so we can't verify that the linear model gives the same results as some other Python statistical function. Nevertheless, we'll go through the motions of performing the linear regression.\n",
+   "    Unfortunately, scipy.stats does not have a dedicated function to perform two-way ANOVA, so we cannot demonstrate directly that it is fundamentally a linear model. Nevertheless, we will write the code to perform the linear regression.\n",
    ""
   ]
  },
@@ -2014,7 +2014,7 @@
  "source": [
    "\n",
    "    Note on Python port:\n",
-   "    Unfortunately, scipy.stats does not have any function to perform ANCOVA, so again, we can't verify that the linear model gives the same results as some other Python statistical function. Nevertheless, we'll go through the motions of performing the linear regression.\n",
+   "    Unfortunately, scipy.stats does not have a dedicated function to perform ANCOVA, so again, we cannot demonstrate directly that it is fundamentally a linear model. Nevertheless, we will write the code to perform the linear regression.\n",
    ""
  ]
 },
@@ -2135,7 +2135,7 @@
    "\n",
    "\n",
    "    Note on Python port:\n",
-   "    Unfortunately, statsmodels does not currently support performing a one-way ANOVA test on GLMs (the anova_lm function only works for linear models), so while we can perform the GLM, there is no support for computing the F-statistic or its p-value. Nevertheless, we'll go through the motions of performing the generalized linear regression.\n",
+   "    Unfortunately, statsmodels does not currently support performing a one-way ANOVA test on GLMs (the anova_lm function only works for linear models), so while we can perform the GLM, there is no support for computing the F-statistic or its p-value. Nevertheless, we will write the code to perform the generalized linear regression.\n",
    ""
  ]
 },
@@ -2429,7 +2429,7 @@
    "\n",
    "3. I have not discussed inference. I am only including p-values in the comparisons as a crude way to show the equivalences between the underlying models since people care about p-values. Parameter estimates will show the same equivalence. How to do *inference* is another matter. Personally, I'm a Bayesian, but going Bayesian here would render it less accessible to the wider audience. Also, doing [robust models](https://en.wikipedia.org/wiki/Robust_statistics) would be preferable, but fail to show the equivalences.\n",
    "\n",
-   "4. Several named tests are still missing from the list and may be added at a later time. This includes the Sign test (require large N to be reasonably approximated by a linear model), Friedman as RM-ANOVA on `rank(y)`, McNemar, and Binomial/Multinomial. See stuff on these in [the section on links to further equivalences](#8-Sources-and-further-equivalences). If you think that they should be included here, feel free to submit \"solutions\" to [the GitHub repo](https://github.com/lindeloev/tests-as-linear/) of this doc!"
+   "4. Several named tests are still missing from the list and may be added at a later time. This includes the Sign test (require large N to be reasonably approximated by a linear model), Friedman as RM-ANOVA on `rank(y)`, McNemar, and Binomial/Multinomial. See stuff on these in [the section on links to further equivalences](#8-Sources-and-further-equivalences). If you think that they should be included here, feel free to submit \"solutions\" to [the GitHub repo](https://github.com/eigenfoo/tests-as-linear/) of this doc!"
  ]
 },
 {
@@ -2440,7 +2440,7 @@
    "\n",
    "\"Creative\n",
    "\n",
-   "_Common statistical tests are linear models_: Python port by [Jonas Kristoffer Lindeløv and George Ho](https://eigenfoo.xyz/tests-as-linear/) is licensed under a [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/).\n",
+   "_Common statistical tests are linear models_: Python port by [George Ho and Jonas Kristoffer Lindeløv](https://eigenfoo.xyz/tests-as-linear/) is licensed under a [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/).\n",
    "\n",
    "Based on a work at https://lindeloev.github.io/tests-as-linear/.\n",
    "\n",