From 1972616aa64c3513fdf565b50be938a5b2219f5d Mon Sep 17 00:00:00 2001
From: George Ho <19851673+eigenfoo@users.noreply.github.com>
Date: Fri, 28 Jun 2019 13:59:58 +0800
Subject: [PATCH] ENH: onboard lindeloevs feedback

---
 index.html            | 54 +++++++++++++++++-----------------
 tests-as-linear.ipynb | 68 +++++++++++++++++++++----------------------
 2 files changed, 61 insertions(+), 61 deletions(-)

diff --git a/index.html b/index.html
index 9509c82..93e33df 100644
--- a/index.html
+++ b/index.html
@@ -13167,7 +13167,7 @@

 Common statistical tests are linear models: Python port
-Last updated: June 27, 2019
+Last updated: June 28, 2019

@@ -13439,7 +13439,7 @@

 3.0.2 Theory: rank-transformation
 3.0.3 Python code: Pearson correlation
 It couldn't be much simpler to run these models with statsmodels (smf.ols) or scipy (scipy.stats.pearsonr). They yield identical slopes, p and t values, but there's a catch: smf.ols gives you the slope and even though that is usually much more interpretable and informative than the correlation coefficient $r$, you may still want $r$. Luckily, the slope becomes $r$ if x and y have a standard deviation of exactly 1. You can do this by scaling the data: data /= data.std().
-Notice how scipy.stats.pearsonr and smf.ols (scaled) have the same slopes, $p$ and $t$ values.
+Notice how scipy.stats.pearsonr and smf.ols (scaled) have the same slopes, $p$ and $t$ values. Also note that statistical functions from scipy.stats do not provide confidence intervals, while performing the linear regression with smf.ols does.

@@ -13451,7 +13451,7 @@ 3.0.3 Python code: Pearson correlation
 correlated = pd.DataFrame()
 correlated["x"] = np.linspace(0, 1)
-correlated["y"] = 5 * correlated.x + 2 * np.random.randn(len(correlated.x))
+correlated["y"] = 1.5 * correlated.x + 2 * np.random.randn(len(correlated.x))

 scaled = correlated / correlated.std()
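The claim in the prose above — that the OLS slope equals Pearson's $r$ once both columns are scaled to unit standard deviation — is easy to check outside the patch. A minimal sketch, not part of the diff; the seed is arbitrary and the toy data mirror the snippet above:

```python
# Sketch: Pearson's r equals the OLS slope after scaling each column
# to unit standard deviation.
import numpy as np
import pandas as pd
import scipy.stats
import statsmodels.formula.api as smf

np.random.seed(1618)  # arbitrary seed, for reproducibility only
correlated = pd.DataFrame()
correlated["x"] = np.linspace(0, 1)
correlated["y"] = 1.5 * correlated.x + 2 * np.random.randn(len(correlated.x))

r, p = scipy.stats.pearsonr(correlated.x, correlated.y)
scaled = correlated / correlated.std()
slope = smf.ols("y ~ 1 + x", data=scaled).fit().params["x"]

assert np.isclose(r, slope)  # identical up to floating-point error
```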
 
@@ -13518,27 +13518,27 @@ 3.0.3 Python code: Pearson correlation
 scipy.stats.pearsonr
-0.649620
-3.321709e-07
+0.249694
+0.080332
 NaN
 NaN
 NaN
 smf.ols
-5.012744
-3.321709e-07
-5.91995
-3.310230
-6.715258
+1.512744
+0.080332
+1.78652
+-0.189770
+3.215258
 smf.ols (scaled)
-0.649620
-3.321709e-07
-5.91995
-0.428985
-0.870255
+0.249694
+0.080332
+1.78652
+-0.031324
+0.530712
@@ -13628,19 +13628,19 @@ 3.0.4 Python code: Spearman correlation
 scipy.stats.spearmanr
-0.634958
-7.322277e-07
+0.233421
+0.102803
 NaN
 NaN
 NaN
 smf.ols (ranked)
-0.634958
-7.322277e-07
-5.694307
-0.410757
-0.859159
+0.233421
+0.102803
+1.663134
+-0.048772
+0.515615
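The Spearman hunk above rests on the same equivalence with one extra step: Spearman's rho is simply Pearson's r computed on rank-transformed data. A hedged sketch with invented toy data, not from the patch:

```python
# Sketch: Spearman's rho is Pearson's r on the rank-transformed data.
import numpy as np
import pandas as pd
import scipy.stats

np.random.seed(1618)  # arbitrary
x = np.random.randn(50)
y = 1.5 * x + 2 * np.random.randn(50)

rho, p = scipy.stats.spearmanr(x, y)
r_ranked, _ = scipy.stats.pearsonr(pd.Series(x).rank(), pd.Series(y).rank())
assert np.isclose(rho, r_ranked)
```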

@@ -15234,7 +15234,7 @@ 6.2 Two-way ANOVA
 6.2.2 Python code: Two-way ANOVA
 Note on Python port:
-Unfortunately, scipy.stats does not have any function to perform a two-way ANOVA, so we can't verify that the linear model gives the same results as some other Python statistical function. Nevertheless, we'll go through the motions of performing the linear regression.
+Unfortunately, scipy.stats does not have a dedicated function to perform two-way ANOVA, so we cannot demonstrate directly that it is fundamentally a linear model. Nevertheless, we will write the code to perform the linear regression.
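The note above still runs the two-way ANOVA as a plain linear model. A sketch of what that regression looks like; the data frame and column names (group, mood, y) are invented for illustration:

```python
# Sketch: two-way ANOVA as a linear model with an interaction term.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

np.random.seed(1618)  # arbitrary
df = pd.DataFrame({
    "group": np.repeat(["a", "b"], 30),
    "mood": np.tile(np.repeat(["happy", "sad"], 15), 2),
})
df["y"] = np.random.randn(len(df))

# y ~ group * mood expands to both main effects plus the interaction.
res = smf.ols("y ~ group * mood", data=df).fit()
print(anova_lm(res))  # F-tests for group, mood, and group:mood
```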

@@ -15326,7 +15326,7 @@ 6.3 ANCOVA
 Note on Python port:
-Unfortunately, scipy.stats does not have any function to perform ANCOVA, so again, we can't verify that the linear model gives the same results as some other Python statistical function. Nevertheless, we'll go through the motions of performing the linear regression.
+Unfortunately, scipy.stats does not have a dedicated function to perform ANCOVA, so again, we cannot demonstrate directly that it is fundamentally a linear model. Nevertheless, we will write the code to perform the linear regression.
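ANCOVA has the same linear-model shape: one categorical factor plus a continuous covariate in a single smf.ols formula. A minimal sketch with invented data and column names:

```python
# Sketch: ANCOVA as a linear model (factor + continuous covariate).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

np.random.seed(1618)  # arbitrary
df = pd.DataFrame({
    "group": np.repeat(["a", "b", "c"], 20),
    "age": np.random.uniform(20, 60, 60),
})
df["y"] = 0.05 * df.age + np.random.randn(len(df))

# Test the group effect while adjusting for the covariate.
res = smf.ols("y ~ group + age", data=df).fit()
print(anova_lm(res))
```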
Note on Python port: - Unfortunately, scipy.stats does not have any function to perform ANCOVA, so again, we can't verify that the linear model gives the same results as some other Python statistical function. Nevertheless, we'll go through the motions of performing the linear regression. + Unfortunately, scipy.stats does not have a dedicated function to perform ANCOVA, so again, we cannot demonstrate directly that it is fundamentally a linear model. Nevertheless, we will write the code to perform the linear regression.
@@ -15442,7 +15442,7 @@ 7.1.3 Python code: Goodness of fit
 Note that smf.ols does not support GLMs: we need to use sm.GLM. While sm.GLM does not have a patsy-formula interface, we can still use patsy.dmatrices to get the endog and exog design matrices, and then feed that into sm.GLM.
 Note on Python port:
-Unfortunately, statsmodels does not currently support performing a one-way ANOVA test on GLMs (the anova_lm function only works for linear models), so while we can perform the GLM, there is no support for computing the F-statistic or its p-value. Nevertheless, we'll go through the motions of performing the generalized linear regression.
+Unfortunately, statsmodels does not currently support performing a one-way ANOVA test on GLMs (the anova_lm function only works for linear models), so while we can perform the GLM, there is no support for computing the F-statistic or its p-value. Nevertheless, we will write the code to perform the generalized linear regression.
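The patsy.dmatrices → sm.GLM workflow this hunk describes looks roughly like the sketch below; the Poisson family and the toy counts data are assumptions for illustration, not from the patch:

```python
# Sketch: fitting a GLM via patsy design matrices, per the note above.
import numpy as np
import pandas as pd
import patsy
import statsmodels.api as sm

np.random.seed(1618)  # arbitrary
df = pd.DataFrame({"mood": np.repeat(["happy", "sad", "meh"], 20)})
df["counts"] = np.random.poisson(10, size=len(df))  # toy count outcome

# patsy.dmatrices turns the formula into endog (y) and exog (X) design
# matrices, which sm.GLM accepts directly.
endog, exog = patsy.dmatrices("counts ~ mood", data=df)
res = sm.GLM(endog, exog, family=sm.families.Poisson()).fit()
print(res.summary())
```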
@@ -15738,7 +15738,7 @@ 10 Limitations
 robust models would be preferable, but fail to show the equivalences.
-• Several named tests are still missing from the list and may be added at a later time. This includes the Sign test (require large N to be reasonably approximated by a linear model), Friedman as RM-ANOVA on rank(y), McNemar, and Binomial/Multinomial. See stuff on these in the section on links to further equivalences. If you think that they should be included here, feel free to submit "solutions" to the GitHub repo (https://github.com/lindeloev/tests-as-linear/) of this doc!
+• Several named tests are still missing from the list and may be added at a later time. This includes the Sign test (require large N to be reasonably approximated by a linear model), Friedman as RM-ANOVA on rank(y), McNemar, and Binomial/Multinomial. See stuff on these in the section on links to further equivalences. If you think that they should be included here, feel free to submit "solutions" to the GitHub repo (https://github.com/eigenfoo/tests-as-linear/) of this doc!

@@ -15749,7 +15749,7 @@ 10 Limitations
 11 License
 Creative Commons License
-Common statistical tests are linear models: Python port by Jonas Kristoffer Lindeløv and George Ho is licensed under a Creative Commons Attribution 4.0 International License.
+Common statistical tests are linear models: Python port by George Ho and Jonas Kristoffer Lindeløv is licensed under a Creative Commons Attribution 4.0 International License.
 Based on a work at https://lindeloev.github.io/tests-as-linear/.
 Permissions beyond the scope of this license may be available at https://github.com/eigenfoo/tests-as-linear.

diff --git a/tests-as-linear.ipynb b/tests-as-linear.ipynb
index 37d321d..f3c74fa 100644
--- a/tests-as-linear.ipynb
+++ b/tests-as-linear.ipynb
@@ -21,7 +21,7 @@
   {
    "data": {
     "text/markdown": [
-     "Last updated: June 27, 2019"
+     "Last updated: June 28, 2019"
    ],
    "text/plain": [
     ""
    ]
@@ -266,7 +266,7 @@
    "\n",
    "It couldn't be much simpler to run these models with `statsmodels` ([`smf.ols`](https://www.statsmodels.org/stable/example_formulas.html#ols-regression-using-formulas)) or `scipy` ([`scipy.stats.pearsonr`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html)). They yield identical slopes, `p` and `t` values, but there's a catch: `smf.ols` gives you the *slope* and even though that is usually much more interpretable and informative than the _correlation coefficient_ $r$, you may still want $r$. Luckily, the slope becomes $r$ if `x` and `y` have a standard deviation of exactly 1. You can do this by scaling the data: `data /= data.std()`.\n",
    "\n",
-   "Notice how `scipy.stats.pearsonr` and `smf.ols (scaled)` have the same slopes, $p$ and $t$ values."
+   "Notice how `scipy.stats.pearsonr` and `smf.ols (scaled)` have the same slopes, $p$ and $t$ values. Also note that statistical functions from `scipy.stats` do not provide confidence intervals, while performing the linear regression with `smf.ols` does."
   ]
  },
  {
@@ -277,7 +277,7 @@
   "source": [
    "correlated = pd.DataFrame()\n",
    "correlated[\"x\"] = np.linspace(0, 1)\n",
-   "correlated[\"y\"] = 5 * correlated.x + 2 * np.random.randn(len(correlated.x))\n",
+   "correlated[\"y\"] = 1.5 * correlated.x + 2 * np.random.randn(len(correlated.x))\n",
    "\n",
    "scaled = correlated / correlated.std()\n",
    "\n",
@@ -324,37 +324,37 @@
    "    \n",
    "    \n",
    "      scipy.stats.pearsonr\n",
-   "      0.649620\n",
-   "      3.321709e-07\n",
+   "      0.249694\n",
+   "      0.080332\n",
    "      NaN\n",
    "      NaN\n",
    "      NaN\n",
    "    \n",
    "    \n",
    "      smf.ols\n",
-   "      5.012744\n",
-   "      3.321709e-07\n",
-   "      5.91995\n",
-   "      3.310230\n",
-   "      6.715258\n",
+   "      1.512744\n",
+   "      0.080332\n",
+   "      1.78652\n",
+   "      -0.189770\n",
+   "      3.215258\n",
    "    \n",
    "    \n",
    "      smf.ols (scaled)\n",
-   "      0.649620\n",
-   "      3.321709e-07\n",
-   "      5.91995\n",
-   "      0.428985\n",
-   "      0.870255\n",
+   "      0.249694\n",
+   "      0.080332\n",
+   "      1.78652\n",
+   "      -0.031324\n",
+   "      0.530712\n",
    "    \n",
    "    \n",
    "\n",
    ""
   ],
   "text/plain": [
-   "                         value      p-values  t-values  0.025 CI  0.975 CI\n",
-   "scipy.stats.pearsonr  0.649620  3.321709e-07       NaN       NaN       NaN\n",
-   "smf.ols               5.012744  3.321709e-07   5.91995  3.310230  6.715258\n",
-   "smf.ols (scaled)      0.649620  3.321709e-07   5.91995  0.428985  0.870255"
+   "                         value  p-values  t-values  0.025 CI  0.975 CI\n",
+   "scipy.stats.pearsonr  0.249694  0.080332       NaN       NaN       NaN\n",
+   "smf.ols               1.512744  0.080332   1.78652 -0.189770  3.215258\n",
+   "smf.ols (scaled)      0.249694  0.080332   1.78652 -0.031324  0.530712"
   ]
  },
  "execution_count": 8,
@@ -427,28 +427,28 @@
    "    \n",
    "    \n",
    "      scipy.stats.spearmanr\n",
-   "      0.634958\n",
-   "      7.322277e-07\n",
+   "      0.233421\n",
+   "      0.102803\n",
    "      NaN\n",
    "      NaN\n",
    "      NaN\n",
    "    \n",
    "    \n",
    "      smf.ols (ranked)\n",
-   "      0.634958\n",
-   "      7.322277e-07\n",
-   "      5.694307\n",
-   "      0.410757\n",
-   "      0.859159\n",
+   "      0.233421\n",
+   "      0.102803\n",
+   "      1.663134\n",
+   "      -0.048772\n",
+   "      0.515615\n",
    "    \n",
    "    \n",
    "\n",
    ""
   ],
   "text/plain": [
-   "                          value      p-values  t-values  0.025 CI  0.975 CI\n",
-   "scipy.stats.spearmanr  0.634958  7.322277e-07       NaN       NaN       NaN\n",
-   "smf.ols (ranked)       0.634958  7.322277e-07  5.694307  0.410757  0.859159"
+   "                          value  p-values  t-values  0.025 CI  0.975 CI\n",
+   "scipy.stats.spearmanr  0.233421  0.102803       NaN       NaN       NaN\n",
+   "smf.ols (ranked)       0.233421  0.102803  1.663134 -0.048772  0.515615"
   ]
  },
 "execution_count": 10,
@@ -1941,7 +1941,7 @@
    "\n",
    "\n",
    "    Note on Python port:\n",
-   "    Unfortunately, scipy.stats does not have any function to perform a two-way ANOVA, so we can't verify that the linear model gives the same results as some other Python statistical function. Nevertheless, we'll go through the motions of performing the linear regression.\n",
+   "    Unfortunately, scipy.stats does not have a dedicated function to perform two-way ANOVA, so we cannot demonstrate directly that it is fundamentally a linear model. Nevertheless, we will write the code to perform the linear regression.\n",
    ""
   ]
  },
@@ -2014,7 +2014,7 @@
  "source": [
    "\n",
    "    Note on Python port:\n",
-   "    Unfortunately, scipy.stats does not have any function to perform ANCOVA, so again, we can't verify that the linear model gives the same results as some other Python statistical function. Nevertheless, we'll go through the motions of performing the linear regression.\n",
+   "    Unfortunately, scipy.stats does not have a dedicated function to perform ANCOVA, so again, we cannot demonstrate directly that it is fundamentally a linear model. Nevertheless, we will write the code to perform the linear regression.\n",
    ""
  ]
 },
@@ -2135,7 +2135,7 @@
    "\n",
    "\n",
    "    Note on Python port:\n",
-   "    Unfortunately, statsmodels does not currently support performing a one-way ANOVA test on GLMs (the anova_lm function only works for linear models), so while we can perform the GLM, there is no support for computing the F-statistic or its p-value. Nevertheless, we'll go through the motions of performing the generalized linear regression.\n",
+   "    Unfortunately, statsmodels does not currently support performing a one-way ANOVA test on GLMs (the anova_lm function only works for linear models), so while we can perform the GLM, there is no support for computing the F-statistic or its p-value. Nevertheless, we will write the code to perform the generalized linear regression.\n",
    ""
  ]
 },
@@ -2429,7 +2429,7 @@
    "\n",
    "3. I have not discussed inference. I am only including p-values in the comparisons as a crude way to show the equivalences between the underlying models since people care about p-values. Parameter estimates will show the same equivalence. How to do *inference* is another matter. Personally, I'm a Bayesian, but going Bayesian here would render it less accessible to the wider audience. Also, doing [robust models](https://en.wikipedia.org/wiki/Robust_statistics) would be preferable, but fail to show the equivalences.\n",
    "\n",
-   "4. Several named tests are still missing from the list and may be added at a later time. This includes the Sign test (require large N to be reasonably approximated by a linear model), Friedman as RM-ANOVA on `rank(y)`, McNemar, and Binomial/Multinomial. See stuff on these in [the section on links to further equivalences](#8-Sources-and-further-equivalences). If you think that they should be included here, feel free to submit \"solutions\" to [the GitHub repo](https://github.com/lindeloev/tests-as-linear/) of this doc!"
+   "4. Several named tests are still missing from the list and may be added at a later time. This includes the Sign test (require large N to be reasonably approximated by a linear model), Friedman as RM-ANOVA on `rank(y)`, McNemar, and Binomial/Multinomial. See stuff on these in [the section on links to further equivalences](#8-Sources-and-further-equivalences). If you think that they should be included here, feel free to submit \"solutions\" to [the GitHub repo](https://github.com/eigenfoo/tests-as-linear/) of this doc!"
  ]
 },
 {
@@ -2440,7 +2440,7 @@
    "\n",
    "\"Creative\n",
    "\n",
-   "_Common statistical tests are linear models_: Python port by [Jonas Kristoffer Lindeløv and George Ho](https://eigenfoo.xyz/tests-as-linear/) is licensed under a [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/).\n",
+   "_Common statistical tests are linear models_: Python port by [George Ho and Jonas Kristoffer Lindeløv](https://eigenfoo.xyz/tests-as-linear/) is licensed under a [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/).\n",
    "\n",
    "Based on a work at https://lindeloev.github.io/tests-as-linear/.\n",
    "\n",