---
title : "Machine Learning for Scientists"
subtitle : "An Introduction"
author : "Jonathan Dursi"
job : "Scientific Associate/Software Engineer, Informatics and Biocomputing, OICR"
framework : io2012 # {io2012, html5slides, shower, dzslides, ...}
highlighter : highlight.js # {highlight.js, prettify, highlight}
hitheme : tomorrow #
widgets : [mathjax] # {mathjax, quiz, bootstrap}
mode : selfcontained # {standalone, draft}
knit : slidify::knit2slides
logo : oicr-trans.png
license : by-nc-sa
github :
user: ljdursi
repo: ML-for-scientists
---
<style type="text/css">
.title-slide {
background-color: #EDEDED;
}
article p {
text-align:left;
}
p {
text-align:left;
}
img {
max-width: 90%
}
</style>
<script type="text/javascript" src="http://ajax.aspnetcdn.com/ajax/jQuery/jquery-1.7.min.js"></script>
<script type="text/javascript">
$(function() {
$("p:has(img)").addClass('centered');
});
</script>
## Purpose of This Course
You should leave here today:
* Having some basic familiarity with key terms,
* Having used a few standard fundamental methods, and having a grounding in the underlying theory,
* Having developed some familiarity with the python package `scikit-learn`
* Understanding some basic concepts with broad applicability.
We'll cover, and you'll use, most or all of the following methods:
| | Supervised | Unsupervised|
|---------------:|:--------------|:---------------|
| CONTINUOUS | **Regression**: OLS, Lowess, Lasso | **Variable Selection**: PCA |
| DISCRETE | **Classification**: Logistic Regression, kNN, Decision Trees, Naive Bayes, Random Forest | **Clustering**: k-Means, Hierarchical Clustering |
--- &twocol
## Techniques, Concepts
... but more importantly, we'll look in some depth at these concepts:
*** =left
* Bias-Variance
* Resampling methods
* Bootstrapping
* Cross-Validation
* Permutation tests
*** =right
* Model Selection
* Variable Selection
* Multiple Hypothesis Testing
* Geometric Methods
* Tree-based Methods
* Probabilistic methods
---
## ML and Scientific Data Analysis
Machine Learning is in some ways very similar to day-to-day scientific data analysis:
* Machine learning is model fitting.
* First, data has to be:
* put into appropriate format for tools,
* quickly summarized/visualized as sanity check ("data exploration"),
* cleaned
* Then some model is fit and parameters extracted.
* Conclusions are drawn.
---
## ML vs Scientific Data Analysis
Many scientific problems being analyzed are already very well characterized.
* Model already defined, typically very solidly;
* Estimating a small number of parameters within that framework.
Machine learning is model fitting...
* But the model may be implicit,
* ... and disposable.
* The model exists to explore the data, or make improved predictions.
* Generally not holding out for some foundational insight into the fundamental nature of our world.
---
## Scientific Data Analysis with ML
But ML approaches can be very useful for science, particularly at the beginning of a research programme:
* Exploring data;
* Uncovering patterns;
* Testing models.
Having said that, there are some potential gotchas:
* Different approaches and techniques than are common in scientific data analysis; takes some people-learning.
* When not just the parameters but the model itself is up for grabs, one has to take care not to lead oneself astray.
* Getting "a good fit" is not in question when you have all possible models devised by human minds at your disposal. But does it mean anything?
---
## Types of problems
Broadly, data analysis problems fall into two categories:
* **Supervised**: already have some data labelled with (approximately) the right answer.
* *e.g.*, curve fitting
* For prediction. "Train" the model on known data, predict on new unlabelled data.
* **Unsupervised**: discover patterns in data.
* What groups of items in this data are similar? Different?
* For exploration, evaluation, and prediction if useful.
---
## Types of Data
And we will be working on two broad classes of variables:
* **Continuous**: real numbers.
* **Discrete**:
* Binary: True/False, On/Off.
* Categorical: Item falls into category A, category B...
* Ordinal: Discrete, but has an intrinsic order. (Low, Med, High; S, M, L, XL).
Others are possible too -- intervals, temporal or spatial continuous data -- but we won't be considering those today.
--- .segue .dark
## Regression
---
## Regression
We're going to spend this morning discussing regression.
* It's the most familiar to most of us; so
* It's a good place to introduce some concepts we'll find useful through the rest of the day.
In regression problems,
* Data comes in as a set of $n$ observations.
* Each of which has $p$ "features"; here we'll consider all-continuous input features, though that isn't necessarily the case in general.
* And the initial data we have also has a known "endogenous" value, the $y$-value we're trying to fit.
* Goal is to learn a function $y = \hat{f}(x_1, x_2, \dots, x_p)$ for predicting new values.
---
## Ordinary Least Squares (OLS)
As we learned at our grandmother's knee, a good way to fit a functional
form $\hat{y} = \hat{f}(\vec{x};\theta)$ to some data $(\vec{x},y)$
is to minimize the squared error. We assume the data is generated
by some true function
$$
y = f(\vec{x}) + \epsilon
$$
where $\epsilon$ is some irreducible error (here assumed constant), and we choose $\theta$:
$$
\hat{\theta} = \mathrm{argmin}_\theta \sum_i \left ( y_i - \hat{f} (x_i, \theta) \right )^2.
$$
---
## Note on Loss Functions
Here we're going to be using least squares error as the function
to minimize. But this is not the only *loss* *function* one could
usefully optimize.
In particular, least-squares has a number of very nice mathematical
properties, but it puts a lot of weight on outliers. If the residual
$r_i = y_i - \hat{y}_i$ and the loss function is $l = \sum_i
\rho(r_i)$, some "robust regression" methods use different methods:
$$
\begin{eqnarray*}
\rho_\mathrm{LS}(r_i) & = & r_i^2 \\
\rho_\mathrm{Huber}(r_i) & = & \left \{ \begin{array}{ll} r_i^2 & \mathrm{if\ } |r_i| \le c; \\ c(2 |r_i| - c) & \mathrm{if\ } |r_i| > c. \\ \end{array} \right . \\
\rho_\mathrm{Tukey}(r_i) & = & \left \{ \begin{array}{ll} \frac{c^2}{6} \left ( 1 - \left ( 1 - \left ( \frac{r_i}{c} \right )^2 \right )^3 \right ) & \mathrm{if\ } |r_i| \le c; \\ \frac{c^2}{6} & \mathrm{if\ } |r_i| > c. \\ \end{array} \right . \\
\end{eqnarray*}
$$
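As a concrete sketch (not from the course scripts; the function names and the default tuning constant `c` below are illustrative choices), the least-squares and Huber losses could be written as:
```{r, engine='python', eval=FALSE}
import numpy

def rho_ls(r):
    # ordinary least-squares loss: grows quadratically everywhere
    return r**2

def rho_huber(r, c=1.345):
    # quadratic near zero, linear in |r| beyond c, so outliers count for less
    r = numpy.abs(r)
    return numpy.where(r <= c, r**2, c*(2.*r - c))

residuals = numpy.array([0.1, -0.5, 3.0])
print(numpy.sum(rho_huber(residuals)))   # the total loss l = sum_i rho(r_i)
```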
---
## Note on Loss Functions
Even crazier loss functions are possible --- for your particular
application, it may be worse to over-predict than under-predict
(for instance), so the loss function needn't necessarily be symmetric
around $r_i = 0$.
The "right" loss function is problem-dependent.
For now, we'll just use least-squares, as it's familiar and the
mechanics don't change with other loss functions. However, different
loss functions may require different algorithms;
many algorithms explicitly use the nice mathematical properties of least
squares.
---
## Polynomial Regression
Let's do some linear regression of a noisy, tilted sinusoid:
```{r engine='python'}
import scripts.regression.biasvariance as bv
import numpy
import matplotlib.pylab as plt
x,y = bv.noisyData(npts=40)   # noisy, tilted sinusoid
p = numpy.polyfit(x, y, 1)    # fit a straight line (degree-1 polynomial)
fitfun = numpy.poly1d(p)      # turn the coefficients into a callable
plt.plot(x,y,'ro')
plt.plot(x,fitfun(x),'g-')
plt.show()
```
--- &twocol
## Polynomial Regression
*** =left
We should have something that looks like the figure on the right.
Questions we should be asking ourselves whenever we're doing something like this:
* How is this likely to perform on new data?
* How is the accuracy likely to be in the centre ($x = 0$)? At $x = 2$?
* How robust is our prediction - how variable is it likely to be if we had gotten different data?
*** =right
![](outputs/bias-variance/linear-fit.png)
--- &twocol
## Polynomial Regression
*** =left
Repeat the same steps with degree 20.
We can get much better accuracy at our points if we use a higher-order polynomial.
Ask ourselves the same questions:
* How is this likely to perform on new data?
* How is the accuracy likely to be in the centre ($x = 0$)? At $x = 2$?
* How robust is our prediction - how variable is it likely to be if we had gotten different data?
*** =right
![](outputs/bias-variance/twentyth-fit.png)
--- &twocol
## Polynomial Regression - In Sample Error
*** =left
We can generate our fit and then calculate the error on the data we've used to train:
$$
E = \sum_i \left ( y_i - \hat{f}(x_i) \right )^2
$$
and if we plot the results, we'll see the error monotonically going down with the degree of polynomial we use. So should we use high-order polynomials?
*** =right
![](outputs/bias-variance/in-sample-error-vs-degree.png)
---
## Bias-Variance Decomposition
Consider the squared error we get from fitting some regression model $\hat{f}(x)$ to some given set of data $x$:
$$
\begin{eqnarray*}
E \left [ (y-\hat{y})^2 \right ] & = & E \left [ ( f + \epsilon - \hat{f})^2 \right ] \\
& = & E \left [ \left ( f - \hat{f} \right )^2 \right ] + \sigma_\epsilon^2 \\
& = & E \left [ \left ( (f - E[\hat{f}]) - (\hat{f} - E[\hat{f}]) \right )^2 \right ] + \sigma_\epsilon^2 \\
& = & E \left [ \left ( f - E[\hat{f}] \right )^2 \right ] - 2 E \left [ ( f - E[\hat{f}]) (\hat{f} - E[\hat{f}]) \right ] + E \left [ \left (\hat{f} - E[\hat{f}] \right )^2 \right ] + \sigma_\epsilon^2 \\
& = & \left ( E[f] - E[\hat{f}] \right )^2 + E \left [ \left ( \hat{f} - E[\hat{f}] \right )^2 \right ] + \sigma_\epsilon^2 \\
& = & \mathrm{Bias}^2 + \mathrm{Var} + \sigma_\epsilon^2
\end{eqnarray*}
$$
---
## Bias-Variance Decomposition
$$
\mathrm{MSE} = \left ( E[f] - E[\hat{f}] \right )^2 + E \left [ \left ( \hat{f} - E[\hat{f}] \right )^2 \right ] + \sigma_\epsilon^2 = \mathrm{Bias}^2 + \mathrm{Var} + \sigma_\epsilon^2
$$
* Last term: intrinsic noise in the problem. Can't do anything about it; we won't consider it any further right now.
* First term: **bias** squared of our estimate.
* Is the expectation value of our regression estimate $\hat{f}$ equal to the expectation value of $f$?
* Second term: **variance** of our estimate.
* How robust/variable is our regression estimate, $\hat{f}$?
* Obvious: mean squared error has contribution from both of these terms.
* Less obvious: there is almost always a tradeoff between bias and variance.
--- &twocol
## Bias, Variance in Polynomial Fitting
*** =left
Because this is synthetic data, we can examine bias and variance in our regression estimates:
* Generate many sets of data from the model
* For each one,
* Generate fit with given degree
* Plot fit for visualization
* Generate predictions at a given point (say, $x=0$)
* Examine bias, variance of predictions around zero.
*** =right
![](outputs/bias-variance/lin-bias-variance.png)
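A minimal sketch of that procedure, reusing the course's `bv.noisyData()` helper (the degree, number of points, and number of trials below are arbitrary illustrative choices):
```{r, engine='python', eval=FALSE}
import numpy
import scripts.regression.biasvariance as bv

degree, npts, ntrials = 7, 40, 500
preds = numpy.zeros(ntrials)
for i in range(ntrials):
    x, y = bv.noisyData(npts=npts)     # a fresh synthetic data set each time
    p = numpy.polyfit(x, y, degree)    # fit a polynomial of the chosen degree
    preds[i] = numpy.poly1d(p)(0.)     # prediction at x = 0
print(numpy.mean(preds))   # compare with the true f(0) to estimate the bias
print(numpy.var(preds))    # spread of the predictions: the variance
```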
---
## Polynomial Fitting - Constant
![](outputs/bias-variance/const-bias-variance.png)
---
## Polynomial Fitting - Linear
![](outputs/bias-variance/lin-bias-variance.png)
---
## Polynomial Fitting - Seventh Order
![](outputs/bias-variance/seventh-bias-variance.png)
---
## Polynomial Fitting - Tenth Order
![](outputs/bias-variance/tenth-bias-variance.png)
---
## Polynomial Fitting - Twentieth Order
![](outputs/bias-variance/twentyth-bias-variance.png)
---
## Bias and Consistency
Bias is a measure of how consistent the model is with the true behaviour.
Very generally, models that are too simple can't capture all of the
behaviour of the system, and so estimators based on these simple models
have higher bias.
As the complexity of model increases, bias tends to go down; the model
can capture whatever behaviour is seen.
---
## Variance and Generalization
Variance is a measure of how sensitive the estimated model is to the particular
set of data it sees.
It is very closely tied to the idea of **generalization**. Is the model learning
trends that will generalize to a new set of data? Or is it just overfitting the
noise of this particular data set?
As the complexity of a model increases, it tends to have higher variance; simple
models typically have very low variance.
Note too that variance typically gets smaller as sample sizes increase; a model
which shows large variance with different 100-point data sets will likely be
much better behaved on 1,000,000-point data sets.
--- &twocol
## Bias-Variance Tradeoff
*** =left
For our polynomial example, if we compare the error in the computed
model with the "true" model (not generally available to us!), we can
plot the error vs the degree of the polynomial:
* For small degrees, the dominant term is bias; simpler models can't capture the true behaviour of the system.
* For larger degrees, the dominant term is variance; more complex models are generalizing poorly and overfitting noise.
There's some sweet spot where the two effects are comparably low.
*** =right
![](outputs/bias-variance/error-vs-degree.png)
---
## Choosing the Right Model
(Or in this case, the "metaparameters" of the model)
In general, we get some dataset - we can't generate more data on a
whim, and we certainly can't just compare to the true model.
So what do we do to choose our model? If we simply fit with higher
order polynomials and calculate the error on the same data set,
we'll find that more complex models are always "better".
How do we do out-of-sample testing when we only have one set of samples?
---
## Training vs Validation
The solution is to hold out some of the data. The bulk of the data is used for training
the model; the remainder is used for validating the trained model against "new" data.
Once the model is chosen, you then train the selected model on the entire training+validation
data set.
In some cases, you may need still further data; e.g., you may need to both choose your
model, and end a paper or report with a sentence like "the final model achieved 80% accuracy...".
This still can't be done on the data the model was trained on (train+validation); in that case,
a second chunk of data must be held out, for testing.
In the case of Training:Validation:Testing, a common breakdown of the data sizes might be
50%:25%:25% of the initial set. If you don't need a test data set, 2/3:1/3 is common.
Note that the data sets should be chosen randomly!
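A minimal sketch of such a random 50%:25%:25% split, assuming `x` and `y` hold the full data set:
```{r, engine='python', eval=FALSE}
import numpy
import numpy.random

idx = numpy.arange(len(x))
numpy.random.shuffle(idx)                          # assign points to the sets at random
ntrain, nval = len(x)//2, len(x)//4                # 50%:25%:25%
train, validate, test = numpy.split(idx, [ntrain, ntrain + nval])
train_x, train_y = x[train], y[train]
val_x,   val_y   = x[validate], y[validate]
test_x,  test_y  = x[test], y[test]
```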
---
## Training and Validation Hold Out Data
Once a model has "touched" a hold-out data set, it's done.
If you iterate on models or model selection having touched all the
data, you're doing exactly the same thing as fitting a 50th-order
polynomial to 52 data points and saying "See? Hardly any error!"
* Once a model (family of models) has seen the validation data set, it's done; can't tune it any more.
* Otherwise, risk eternal torment, gnashing of teeth, etc.
* Once a set of models (set of family of models) have been compared to on the test data set, they're done; no more model selection.
* Eternal torment, gnashing of teeth, etc.
---
## Hands-On: Model Selection
Generate a noisy data set, and use this validation-hold-out approach to choose the degree of polynomial to fit to the
entire data set. The routines `numpy.random.shuffle()` and `numpy.array_split()` may be helpful.
```{r engine='python'}
import scripts.regression.biasvariance as bv
x,y = bv.noisyData(npts=100)
import numpy
import numpy.random
a = numpy.arange(10)
numpy.random.shuffle(a)
print a
print numpy.array_split(a, 2)
```
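One possible sketch of the exercise (the 50:50 split and the range of candidate degrees are arbitrary choices, not the only reasonable ones):
```{r, engine='python', eval=FALSE}
import numpy
import numpy.random
import scripts.regression.biasvariance as bv

x, y = bv.noisyData(npts=100)
idx = numpy.arange(len(x))
numpy.random.shuffle(idx)
train, val = numpy.array_split(idx, 2)               # simple 50:50 train/validation split
errs = []
for d in range(21):                                  # candidate polynomial degrees
    fitf = numpy.poly1d(numpy.polyfit(x[train], y[train], d))
    errs.append(sum((y[val] - fitf(x[val]))**2))     # error on the held-out points only
best = int(numpy.argmin(errs))
finalfit = numpy.poly1d(numpy.polyfit(x, y, best))   # refit the chosen degree on all the data
print(best)
```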
---
## $k$-fold Cross Validation
There are some downsides to the approach we've taken for validation hold-out. What if most negative outliers happened to be in the training set?
Ideally, would do several partitions, average over results.
$k$-fold Cross Validation:
* Partition data set (randomly) into $k$ sets.
* For each set:
* Train on the remaining $k-1$ sets
* Validate on the held-out set
* Average results
Makes very efficient use of the data set, easily automated.
---
## $k$-fold Cross Validation
How to choose $k$?
* $k$ too large - the different training sets are very highly correlated (almost all of their points are the same).
* $k$ too small - don't get very much advantage of averaging.
In practice, 10 is a very commonly-used value for $k$.
---
## $k$-fold Cross Validation
Let's look at `scripts/regression/crossvalidation.py`:
```{r, engine='python', eval=FALSE}
# test: boolean mask of length n; idxes: shuffled indices (set up earlier in the script)
err = 0.
for i in range(kfolds):
    startv = n*i//kfolds; endv = n*(i+1)//kfolds
    test[idxes[startv:endv]] = True               # hold out the i'th fold
    test_x = x[test];   test_y = y[test]
    train_x = x[~test]; train_y = y[~test]        # train on everything else
    p = numpy.polyfit(train_x, train_y, d)
    fitf = numpy.poly1d(p)
    err = err + sum((test_y - fitf(test_x))**2)   # accumulate held-out error
    test[:] = False
```
--- &twocol
## $k$-fold Cross Validation
*** =left
Running gives:
```{r, engine='python', eval=FALSE}
from scripts.regression import crossvalidation
crossvalidation.chooseDegree(50)
```
This chooses the degree to fit 50 data points using
cross validation, and tests the results on a new 50 points.
The error is estimated for each degree, and the minimum is chosen.
In practice, you may not want to greedily go for the minimum; the
simplest model that is "close enough" to the minimum is generally
a good choice.
*** =right
![](outputs/crossvalidation/CV-polynomial.png)
---
## Resampling Aside #1
### Bootstrapping
Cross-validation is closely related to a more fundamental method, bootstrapping.
Let's say you want to find how some function of your data would behave - say, the range of sample means of your data, or
a mean and standard deviation of an estimation error for a given model (as with CV).
You'd like new sets of data that you could calculate your statistic on, and then look at the distribution of those.
---
## Parametric Bootstrapping
If you _know_ the form of the distribution that describes your data, you can simulate new data sets:
* Fit the distribution to the data;
* Generate synthetic data sets from the now-known distribution to your heart's content;
* Calculate the statistic on these synthetic data sets, and get their distribution.
This works perfectly well if you know a model that will correctly describe your data; and indeed if you do know that,
it would be madness *not* to make use of it in your analysis.
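For instance, a minimal parametric-bootstrap sketch under an assumed normal model (`data` stands in for the observed sample; the choice of the median as the statistic is purely illustrative):
```{r, engine='python', eval=FALSE}
import numpy
import numpy.random

# data: the observed sample; assume (rightly or wrongly!) that it is normal
mu, sigma = numpy.mean(data), numpy.std(data)
n = len(data)
medians = [ numpy.median(numpy.random.randn(n)*sigma + mu) for i in range(1000) ]
print(numpy.var(medians))   # sampling variability of the median under the fitted model
```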
But what if you don't?
---
## Non-parametric Bootstrapping
The key insight to the non-parametric bootstrap is that you already have an unbiased description of the process that generated your data - the data itself.
The approach for the non-parametric bootstrap is:
* Generate synthetic data sets from the original by resampling;
* Calculate the statistic on these synthetic data sets, and get their distribution.
This is less powerful than parametric bootstrapping **if** the parametric functional form is correctly known; but it is much better if the parametric
functional form is incorrectly known, or not known at all.
Cross-validation is a particular case: CV takes $k$ (sub)samples of the original data set, applies a function (fit the model to part of the data, calculate the error on the remainder),
and calculates the mean.
Bootstrapping can be used far more generally.
---
## Non-parametric Bootstrapping
Example:
```{r engine='python'}
import numpy
import numpy.random
numpy.random.seed(789)
data = numpy.random.randn(100)*2 + 1 # Normal with sigma=2, mu=1
print numpy.mean(data)
means = [ numpy.mean( numpy.random.choice(data,100) ) for i in xrange(1000) ]  # resample with replacement, 1000 times
print numpy.mean(means)
print numpy.var(means) # expect: sigma^2/N = 4/100 = 0.04
```
---
## Hands On: Bootstrapping Forest Fires
In the file `scripts/bootstrap/forestfire.py` is a routine which will load historical data
about forest fires in northeastern Portugal, including the area burned. You can view it
as follows:
```{r engine='python'}
import matplotlib.pylab as plt
import scripts.bootstrap.forestfire as ff
t, h, w, r, area = ff.forestFireData(skipZeros=True)
n, bins, patches = plt.hist( area, 50, facecolor='red', normed=True, log=True )
plt.xlabel('Area burned (Hectares)')
plt.ylabel('Frequency')
plt.show()
```
---
## Hands On: Bootstrapping Forest Fires
![](outputs/bootstrap/area-histogram.png)
--- &twocol
## Hands On: Bootstrapping Forest Fires
*** =left
Using the non-parametric bootstrap, calculate the 5% and 95%
confidence intervals for the median of the area burned in forest
fires large enough to have their areas recorded. E.g., you would
expect that in 90% of similar data sets, the median would fall
within those confidence intervals.
You should get a distribution of medians that looks like the plot
on the right:
*** =right
![](outputs/bootstrap/median-area-histogram.png)
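One possible sketch, resampling the recorded areas with replacement and reading off percentiles of the resulting medians:
```{r, engine='python', eval=FALSE}
import numpy
import numpy.random
import scripts.bootstrap.forestfire as ff

t, h, w, r, area = ff.forestFireData(skipZeros=True)
n = len(area)
medians = [ numpy.median(numpy.random.choice(area, n)) for i in range(1000) ]
print(numpy.percentile(medians, 5))    # lower confidence bound
print(numpy.percentile(medians, 95))   # upper confidence bound
```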
---
## Regression as Smoothing
Regression is just a form of data smoothing - smoothing the data onto a particular functional form.
For instance, consider OLS linear regression on variables we've "centred" - subtracted off the mean.
The OLS regression function is
$$
\hat{y}(x) = \frac{\sum_i x_i y_i}{\sum_j x_j^2} x
$$
We can rewrite this as
$$
\hat{y}(x) = \sum_i y_i \left ( \frac{ x_i x}{\sum_j x_j^2} \right ) = \sum_i y_i w(x, x_i)
$$
That is, a weighted average of *all* of the initial $y_i$s.
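A quick numerical check of that identity, with made-up centred data (the numbers here are purely illustrative):
```{r, engine='python', eval=FALSE}
import numpy
import numpy.random

x = numpy.random.randn(50)
y = 2.*x + 0.5*numpy.random.randn(50)
x = x - numpy.mean(x); y = y - numpy.mean(y)    # centre both variables
xnew = 0.7
slope = numpy.sum(x*y)/numpy.sum(x**2)          # OLS slope through the origin
weights = x*xnew/numpy.sum(x**2)                # w(xnew, x_i) from above
print(slope*xnew)                               # OLS prediction at xnew...
print(numpy.sum(weights*y))                     # ...equals the weighted average of the y_i
```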
---
## Regression with Nonparametric Smoothing
To get good prediction results, we don't *need* to choose a particular, parameterized, global functional form to smooth onto.
As with parametric/nonparametric bootstrap,
* Parametric version is more powerful *if* the functional form is correctly known.
* Nonparametric version has to "waste" some information to reconstruct the overall shape.
Many nonparametric approaches are quite local, allowing adaptation to local features.
--- &twocol
## Nonparametric Regression - Kernel
*** =left
* The simplest approach is to smooth the value of each point over some local area with some kernel basis function.
* The regression function is then calculated by summing the input data points convolved with the basis function.
* Commonly used kernel basis functions:
* Gaussian (downside - never falls off to zero)
* Tricubic
* Kernel basis function characterized by some bandwidth.
* How to choose best bandwidth?
*** =right
![](outputs/nonparametric/kernel-demo.png)
![](outputs/nonparametric/kernel-fit.png)
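An illustrative sketch of this idea (the normalized Nadaraya-Watson form with a Gaussian kernel; not necessarily the exact scheme used to make the figures):
```{r, engine='python', eval=FALSE}
import numpy

def kernelSmooth(xgrid, x, y, bandwidth=0.25):
    """Gaussian-kernel estimate of y at each point of xgrid, given data (x, y)."""
    yhat = numpy.zeros(len(xgrid))
    for i, x0 in enumerate(xgrid):
        w = numpy.exp(-(x - x0)**2/(2.*bandwidth**2))   # weight data points near x0
        yhat[i] = numpy.sum(w*y)/numpy.sum(w)           # normalized weighted average
    return yhat
```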
--- &twocol
## Nonparametric Regression - LOWESS
LOESS and LOWESS are two very related methods which use a similar-but-different approach to kernel regression.
*** =left
At each point, a **local** low-order polynomial regression is performed, fit to some fixed **number** of neighbouring points, rather than smoothed over a particular area.
Number of neighbours has to be chosen:
* Too few: too noisy.
* Too many: smooth too much.
How is the number of neighbours chosen?
*** =right
![](outputs/nonparametric/lowess-fit.png)
---
## `statsmodels`
Statsmodels ( http://statsmodels.sourceforge.net ) is a very useful python package which includes:
* Wide variety of regression tools
* Ties in with pandas and other packages, to give very nice environment for exploring these sorts of questions
* Many tools for hypothesis testing
```{r, engine='python', eval=FALSE}
import statsmodels.api as sm

def lowessFit(x,y,**kwargs):
"""Use statsmodels to do a lowess fit to the given data.
Returns the x and fitted y."""
fit = sm.nonparametric.lowess(y,x,**kwargs)
xlowess = fit[:,0]
ylowess = fit[:,1]
return xlowess, ylowess
```
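For example (an illustrative call with made-up data; `frac` is the fraction of the data used in each local fit):
```{r, engine='python', eval=FALSE}
import numpy
import numpy.random
import statsmodels.api as sm

x = numpy.sort(numpy.random.uniform(-3., 3., 100))
y = numpy.sin(x) + 0.3*numpy.random.randn(100)
fit = sm.nonparametric.lowess(y, x, frac=0.3)   # returns an (n, 2) array: x, fitted y
xlowess, ylowess = fit[:,0], fit[:,1]
```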
---
## Nonparametric Regression
* You will sometimes hear nonparametric techniques referred to as "model-free".
* These people are either
* dishonest, or
* too dumb to understand the techniques they're using.
* Either way, not to be trusted.
* They do have very specific underlying models, and thus limits to their applicability:
* Tend to require constant noise ("homoskedastic")
* Kernel methods work best when data has roughly constant density
* However, the underlying models are both:
* normally weaker than specifying a particular functional form
* fairly transparent - if these methods go wrong, they don't go subtly wrong.
---
## Parametric vs Nonparametric Regression
* If you know the right answer, you should certainly exploit that
* Will get more information out of your data.
* Nonparametric techniques will always have higher variance and bias than the right parametric method, if available
* But it's hard for them to go subtly badly wrong.
* Even if you think you know the right answer and do a parametric regression, you should do a non-parametric regression as well.
* Costs almost nothing.
* If your parametric regression is correct, non-parametric version should look like a noisier version of same.
* If they differ substantially in some region, now you've learned something potentially quite interesting.
---
## Final Notes on Regression ... For Now
* Always consider nonparametric regression alongside any parametric regression you run.
* It should go without saying (grandmother's knee again) that, whatever regression you run, you should investigate the residuals:
* Unbiased?
* Randomly (normally) distributed?
* Constant amplitude?
* Otherwise, eternal torment, gnashing of teeth, etc.
--- .segue .dark .nobackground
## Classification
---
## Classification
Classification is a broad set of problems that superficially look a lot like regression:
* Get some data in, with known correct answers
* Learn algorithm that produces new correct answers for new inputs.
But it is greatly complicated by the fact that the labels are discrete:
* Item should either be in category A or category B.
* You don't get partial points for being close; there's no category A$\frac{1}{2}$.
So classification algorithms spend a great deal of time looking for ways to distinguish
the various categories as cleanly as possible.
In some cases, it may be useful to have not only a "best guess" classification, but a quantitative
measure of how much evidence there is in that assignment.
---
## Classification Problems
Some classic classification problems:
* Bioinformatics - classify proteins according to their function
* Image processing - what objects exist in the image
* Text categorization:
* Spam filtering
* Sentiment analysis: is this tweet about my company positive or negative?
* Fraud detection
* Market segmentation
Input variables are commonly both continuous and categorical.
---
## Binary vs Multiclass Classification
Classification algorithms can be broken broadly into two categories:
* Binary classification: answers yes/no questions about whether an item is in a particular class;
* Multiclass classification: of $m$ classes, which one does this item belong to?
One approach to multiclass classification is to decompose into $m$ binary classification problems, and
for each item, choose the category to which the item is most securely assigned. (This turning of a discrete variable
into a series of binary variables is called "binarization", and arises in other contexts.)
But inherently multiclass methods also exist.
---
## Outline
For the next couple of hours, we'll look at the following:
* Decision Trees
* Evaluating binary classifiers
* Confusion matrix
* Precision, Recall
* ROC curves
* Nearest Neighbours
* Logistic Regression
* Naive Bayes
--- &twocol
## Decision Trees
*** =left
A Decision Tree is a structure (and the algorithm which generates
it) that classifies an input based on a number of binary decisions.
"Splits" the data set based on one of the $p$ features of the items.
Can split based on continuous data ("if height < 1.5 m"), or ordinal/discrete
("if category == 'A'").
*** =right
![](outputs/classification/basic.png)
---
## Batman Decision Tree
Consider this example (from Rob Schapire, Princeton):
|Name | cape | ears | male | mask | smokes | tie | Good/Bad
|--------|-------|-------|-------|-------|--------|------|---------
|Batman | True | True | True | True | False |False | Good
|Robin | True | False | True | True | False |False | Good
|Alfred | False | False | True |False | False | True | Good
|Penguin | False | False | True |False | True | True | Bad
|Catwoman| False | True | False | True | False |False | Bad
|Joker | False | False | True |False | False |False | Bad
"Learn" a decision tree to classify new characters into Good/Bad.
How does your tree do on this test set?
|Name | cape | ears | male | mask | smokes | tie | Good/Bad
|--------|-------|-------|-------|-------|--------|------|---------
|Batgirl | True | True | False | True | False |False | ??
|Riddler | False | False | True | True | False |False | ??
---
## Splitting Algorithms
Consider the following two possible first splits:
* Split based on cape.
* Cape == True: get Batman, Robin (Good)
* Cape == False: get Alfred (Good), Penguin, Catwoman, Joker (Bad)
* or Split based on tie.
* Tie == True: get Alfred (Good), Penguin (Bad)
* Tie == False: get Batman, Robin (Good), Catwoman, Joker (Bad).
There's a sense in which cape is clearly the better split. It leads to two groups, one of which is purely good, and the
other which is almost purely bad.
The other choice gives you two groups just as heterogeneous as the input data.
--- &twocol
## Splitting Algorithms
Typically we rank possible splits based on the increase in "purity" of the two subgroups they generate.
Information theory: this is clearly a more informative decision; it leads to two groups of much lower entropy.
*** =left
Consider the probability $p$ that a member of one of the subgroups is in a given category. Two common measures
for the "impurity" of the groups generated (higher is worse):
* Gini Index: $p (1 - p)$
* Entropy: $-p \ln p - (1-p) \ln (1 - p)$
*** =right
![](outputs/classification/impurity-plots.png)
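A small sketch of these impurity measures (the probability plugged in below comes from the cape split on the previous slide, where the cape==False group is 1 Good out of 4):
```{r, engine='python', eval=FALSE}
import numpy

def gini(p):
    # Gini index for a two-class group with class probability p
    return p*(1. - p)

def entropy(p):
    p = numpy.clip(p, 1.e-10, 1. - 1.e-10)     # guard against log(0)
    return -p*numpy.log(p) - (1. - p)*numpy.log(1. - p)

print(gini(0.25))      # impurity of a group that is 1/4 one class, 3/4 the other
print(entropy(0.25))
```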
---
## Splitting Algorithms
So splitting algorithms look something like this:
* Until every element is in a pure subtree,
* For each variable, consider a split.
* For categorical, consider all levels; split and measure impurity of resulting groups.
* For continuous, use line optimization to choose best point at which to split, keep track of impurity at that point.
* Choose the split which maximizes the decrease in impurity
Actually implementing this leads to a large number of decisions
which affect both the quality of results and computational performance.
Common, classic decision tree algorithms you will encounter include
CART and C4.5. Both (and many others) are fine, but have their
particular strengths and weaknesses.
--- &twocol
## Scikit-learn
*** =left
Scikit-learn is a python package which contains a suite of machine