{
"metadata": {
"name": "",
"signature": "sha256:cfdf2e2f87626e61a47356de271a32958896c29bce6ec4af4456a82dc25b3c9b"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Chapter 9 - Learning from Examples"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*-- A Python course for the Humanities*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What are the rules for distinguishing a real e-mail message from spam? You might come up with rules like \"if a message has XXX in the subject\" it is probably spam. Or if the message contains many mentions of large amounts of money, it's probably spam. Spammers change their behavior all the time and their spamming techniques have evolved substantially over time. You will soon realize that hand-crafting a set of rules for spam is practically infeasable. In his famous essay '[A plan for Spam](http://www.paulgraham.com/spam.html)', Paul Graham thought exactly the same thing. Graham proposed to employ techniques from Machine Learning to automatically *learn* from a number of examples which features distinguish best between spam and non-spam. The basic idea of his spam filter is that we keep track of the frequencies with which words occur in spam messages and in non-spam messages and compare the frequencies of the words in a new, previously unseen e-mail to those statistics to decide whether we're dealing with spam or not.\n",
"\n",
"Graham's spam filter is an example of a *supervised learning system*. We call it supervised because, on the basis of a set of training examples for which the label is known (e.g. spam or non-spam), we attempt to automatically find a way to distinguish between the two categories. More formally, supervised learning systems estimate functions $f : X \\rightarrow Y$ from a multi-dimensional input space $X$ to a set of discrete classes $Y$. In the example of the spam filter, $f$ would be the filter we try to learn, $X$ could be the word frequencies of the e-mail messages (both spam and non-spam) and $Y$ would be the two classes spam and non-spam ($Y = \\{ \\textrm{spam},\\textrm{non-spam}\\}$). \n",
"\n",
"Have a look at the following example. Say we have a training set of four e-mail messages, two spams and two non-spams. In those four e-mails we find a total of 5 unique words. For each training e-mail $x \\in X$, we write down how often each of those words occurs in $x$:\n",
"\n",
"\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>sunny</th>\n",
" <th>friend</th>\n",
" <th>viagra</th>\n",
" <th>dollar</th>\n",
" <th>million</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>non-spam</th>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> 0</td>\n",
" <td> 1</td>\n",
" <td> 0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>non-spam</th>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>ham</th>\n",
" <td> 0</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>ham</th>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"\n",
"Now we get a new e-mail in our inbox that has the following features:\n",
"\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>sunny</th>\n",
" <th>friend</th>\n",
" <th>viagra</th>\n",
" <th>dollar</th>\n",
" <th>million</th>\n",
" </tr>\n",
" </thead>\n",
" <tr>\n",
" <th>?</th>\n",
" <td> 1</td>\n",
" <td> 0</td>\n",
" <td> 1</td>\n",
" <td> 1</td>\n",
" <td> 0</td>\n",
" </tr>\n",
"</table>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What do you think? Is it spam or non-spam? Why? There you have it, you have made a prediction on the basis of training examples. This is essentially what Machine Learning attempts to do, only it tries to do it automatically using a computer...\n",
"\n",
"In this chapter we will introduce you to an important supervised machine learning algorithm: naive bayes. There are many excellent introductions and tutorials to machine learning in which this algorithm is discussed. I will not repeat those here, but try to focus on the applicability of the algorithm to a problem relevant in humanities research. This chapter contains another practical part in which we will develop an Authorship Attribution System. "
]
},
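{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the formal notation above a little more concrete, the sketch below (just an illustration, not yet part of the learner we will build) stores the toy training table as a list of count vectors `X` with corresponding labels `y`, which is exactly the kind of input a supervised learner expects:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# vocabulary: sunny, friend, viagra, dollar, million\n",
"X = [[1, 1, 0, 1, 0],   # non-spam\n",
"     [1, 1, 0, 0, 0],   # non-spam\n",
"     [0, 1, 1, 0, 0],   # spam\n",
"     [0, 0, 0, 1, 1]]   # spam\n",
"y = ['non-spam', 'non-spam', 'spam', 'spam']\n",
"new_email = [1, 0, 1, 1, 0]  # the unseen message from the table above\n",
"print(len(X), len(y))"
],
"language": "python",
"metadata": {},
"outputs": []
},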
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Naive Bayes Learner"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You might say that the attribution of authors to pieces of art, such as literature, music and paintings has been a source of fierce debate ever since art came into existence. Nowadays, the field of stylometry and authorship attribution draws heavily on the aid of computational systems, such as machine learning systems and artificial intelligence systems to automatically learn characteristic features for particular authors, composers or painters. In this section we will develop a simple computational attribution system. At the core of this system will be a general naive bayes classifier that can be put to use for all kinds of similar problems, such as text classification and classification in general.\n",
"\n",
"Our Authorship Attribution System can be thought of as a subclass of a naive bayes learner which in its turn can be thought of as a subclass of a more general learner. This general learner doesn't actually do anything but provides a framework for the more complex learners. Let's start with an implementation of the abstract learner:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"class Learner:\n",
" \"\"\"Abstract Learner. A Learner can be trained\n",
" with a dataset X and the corresponding labels y. After training\n",
" the Learner can be asked to predict the outcome of an example.\"\"\"\n",
"\n",
" def fit(self, X, y):\n",
" \"\"\"Fit or train the learner.\"\"\"\n",
" self.X = X\n",
" self.y = y\n",
"\n",
" def predict(self, x):\n",
" \"\"\"Predict the outcome of an example x.\"\"\"\n",
" raise NotImplementedError"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A `Learner` must specify two methods: `fit` and `predict`. `fit` takes as argument some multi-dimensional input space $X$ where for each example $x \\in X$ we have observed a class label $y_x$ where $y \\in Y$."
]
},
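{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick illustration of what \"abstract\" means here: the base `Learner` can be fitted, but asking it for a prediction raises `NotImplementedError`, because every concrete subclass is expected to supply its own `predict`:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"learner = Learner()\n",
"learner.fit([('a', 'b'), ('b', 'b')], [0, 1])\n",
"try:\n",
"    learner.predict(('a', 'a'))\n",
"except NotImplementedError:\n",
"    print(\"predict is not implemented in the abstract Learner\")"
],
"language": "python",
"metadata": {},
"outputs": []
},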
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"Quiz!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Have a look at the following rather dumb learner that simply predicts that the class label for each new unseen example is the label that occurs most often in the training data. Complete the `fit` method in the code below:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"class MajorityLearner(Learner):\n",
" \"\"\"Dumb Learner that always predicts that the class of\n",
" a new, unseen example is the label that occurs most often\n",
" in the training data.\"\"\"\n",
" \n",
" def fit(self, X, y):\n",
" \"Find that label in the training data that occurs most often.\"\n",
" # insert your code here\n",
" \n",
" def predict(self, x):\n",
" \"Always predict that `x`'s label is equal to the most popular one.\"\n",
" return self.most_popular\n",
" \n",
"# these tests should return True if your code is correct\n",
"learner = MajorityLearner()\n",
"X = [('a', 'a', 'b'), ('a', 'a', 'a'), ('b', 'b', 'b')]\n",
"y = [0, 0, 1]\n",
"learner.fit(X, y)\n",
"print(learner.predict(('c', 'c', 'c')) == 0)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The naive bayes classifier is a probabilistic classifier that, given a set of features, tries to find the class with the highest probability. It is based on applying Bayes' theorem and is called naive because of its strong independence assumption between features. This means that the absence or presence of each feature is assumed to be independent of each other. We compute the posterior probability of a class as the product of the prior probability of a class and the joint probability of all features given that class:\n",
"\n",
"$$ P(y|x_1,\\ldots,x_n) \\propto P(y) \\prod^n_{i=1} P(x_i|y)$$\n",
"\n",
"Classification is based on the *maximum a posteriori* or MAP descision rule which simply picks the class (or author in our case) that is most probable:\n",
"\n",
"$$ predict(x_1, \\ldots, x_n) = \\arg\\max_y P(y) \\prod^n_{i=1} P(x_i|y) $$\n",
"\n",
"If you're unfamiliar with reading formulas, this might all seem quite daunting. To better understand what is going on, let's work out a small example. Say we have a small corpus of five very short books of which the author of the fifth book is unknown. The total vocabulary $V$ is 10 words long. For each book we store how often each word $w_i \\in V$ occurs:\n",
"\n",
"\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>the</th>\n",
" <th>poetry</th>\n",
" <th>society</th>\n",
" <th>america</th>\n",
" <th>realism</th>\n",
" <th>a</th>\n",
" <th>harry</th>\n",
" <th>magic</th>\n",
" <th>health</th>\n",
" <th>system</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>David Foster Wallace</th>\n",
" <td> 6</td>\n",
" <td> 4</td>\n",
" <td> 3</td>\n",
" <td> 3</td>\n",
" <td> 0</td>\n",
" <td> 10</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 1</td>\n",
" <td> 4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>David Foster Wallace</th>\n",
" <td> 8</td>\n",
" <td> 1</td>\n",
" <td> 2</td>\n",
" <td> 4</td>\n",
" <td> 0</td>\n",
" <td> 7</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 10</td>\n",
" <td> 3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Walt Whitman</th>\n",
" <td> 12</td>\n",
" <td> 10</td>\n",
" <td> 1</td>\n",
" <td> 8</td>\n",
" <td> 0</td>\n",
" <td> 4</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>J.K. Rowling</th>\n",
" <td> 8</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 15</td>\n",
" <td> 12</td>\n",
" <td> 10</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>?</th>\n",
" <td> 7</td>\n",
" <td> 4</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 0</td>\n",
" <td> 12</td>\n",
" <td> 6</td>\n",
" <td> 8</td>\n",
" <td> 3</td>\n",
" <td> 0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"\n",
"What is the probability of $P(y=\\textrm{Walt Whitman}|x = [7, 4, 0, 0, 0, 12, 6, 8, 3, 0])$? And what is the probability of $P(y=\\textrm{J.K. Rowling}|x = [7, 4, 0, 0, 0, 12, 6, 8, 3, 0])$?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The probability of a word like *the* given some author is computed by dividing the number of occurences of that word by the total number of words for that author. In the case of Walt Whitman, the probability of the word *poetry* is:\n",
"\n",
"$$\n",
"\\begin{array}{lll}\n",
"P(x_i=\\textrm{poetry}|y=\\textrm{Walt Whitman}) & = & \\frac{10}{12 + 10 + 1 + 8 + 0 + 4 + 0 + 0 + 0 + 4}\\\\\n",
" & = & \\frac{10}{39} \\\\\n",
" & = & 0.256 \\\\\n",
"\\end{array}\n",
"$$"
]
},
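{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can verify this small computation in code. The dictionary below simply restates Walt Whitman's row from the table above:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"whitman = {'the': 12, 'poetry': 10, 'society': 1, 'america': 8, 'realism': 0,\n",
"           'a': 4, 'harry': 0, 'magic': 0, 'health': 0, 'system': 4}\n",
"p_poetry = whitman['poetry'] / sum(whitman.values())\n",
"print(round(p_poetry, 3))  # should print 0.256"
],
"language": "python",
"metadata": {},
"outputs": []
},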
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"Quiz!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Compute the probability of the word *magic* given J.K. Rowling and the word *society* given David Foster Wallace."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Double click this cell and write down your answer.*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The posterior probability of a class computes the joint probability of all features given that class. This means that for each nonzero word $w_1, w_2, \\ldots, w_n$ in our unknown book, we compute the probability of that word given a particular author $y$: $P(w_i|y)$. We then take the product (joint probability) of these individual words, which multiplied by the prior probability of the author, provides us with the posterior probability. Let's find out how that works. "
]
},
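{
"cell_type": "markdown",
"metadata": {},
"source": [
"In code, this boils down to multiplying a prior with a sequence of word probabilities. The numbers below are made up purely for illustration; the quizzes that follow ask you to carry out the same computation step by step with the real counts from the table:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# hypothetical word probabilities P(w|author) for the words in some document\n",
"word_probabilities = [0.2, 0.05, 0.1, 0.3]\n",
"prior = 0.25  # hypothetical prior probability of the author\n",
"\n",
"posterior = prior\n",
"for p in word_probabilities:\n",
"    posterior *= p\n",
"print(posterior)"
],
"language": "python",
"metadata": {},
"outputs": []
},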
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"Quiz!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**a)** Compute the joint probability of all features in our unknown book given $y$=J.K. Rowling:\n",
"\n",
"$$\\begin{array}{llll}\n",
"P(y|X_?) & \\propto & (P(\\textit{the}|y) \\times 7) & \\times \\\\\n",
" & & (P(\\textit{poetry}|y) \\times 4) & \\times \\\\\n",
" & & (P(\\textit{a}|y) \\times 12) & \\times \\\\\n",
" & & (P(\\textit{harry}|y) \\times 6) & \\times \\\\\n",
" & & (P(\\textit{magic}|y) \\times 8) & \\times \\\\\n",
" & & (P(\\textit{health}|y) \\times 3) & \\\\ \n",
" & = & ?\n",
"\\end{array}$$"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# insert your code here"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**b**) A common strategy to surpass this rather unpleasant outcome is to add pseudocounts to the observed counts, normally 1. The pseudocounts need to be incorporated in both the numerator and the denominator. This is called smoothing the distribution using [Laplacian Smoothing](https://en.wikipedia.org/wiki/Laplacian_smoothing). Recompute the joint probability using our smoothing technique."
]
},
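{
"cell_type": "markdown",
"metadata": {},
"source": [
"In its most common form, the smoothed estimate adds 1 to the count of a word in the numerator and the size of the vocabulary $V$ to the total count in the denominator:\n",
"\n",
"$$P(x_i|y) = \\frac{\\textrm{count}(x_i, y) + 1}{\\sum_{x_j \\in V} \\textrm{count}(x_j, y) + |V|}$$"
]
},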
{
"cell_type": "code",
"collapsed": false,
"input": [
"# insert your code here"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**c)** To compute the final posterior probability we must multiply the joint probability of the words given J.K. Rowling with the prior probability that a book was written by J.K. Rowling in our corpus. Do that in the cell below:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# insert your code here"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**d)** Now that we know how to compute the posterior probability, it is time to implement our naive bayes learner. We will start with implementing the `fit` method. The `fit` method has two core jobs:\n",
"\n",
"1. extract all the counts of each feature given each class;\n",
"2. count how often each class label occurs in the training data.\n",
"\n",
"The `NaiveBayesLearner` below provides the skeleton of our class. Implement the `fit` method."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from collections import defaultdict, Counter\n",
"\n",
"\n",
"class NaiveBayesLearner(Learner):\n",
" \"\"\"Naive Bayes Learner. This learner can be\n",
" initialized using nb = NaiveBayesLearner()\n",
" to construct a classifier that incorporates the prior\n",
" probabilities of each possible class label.\"\"\"\n",
"\n",
" def fit(self, X, y):\n",
" \"\"\"Fit or train the naive bayes classifier. X must be an\n",
" iterable of examples where each example is an iterable as\n",
" well.\"\"\"\n",
" self.C = # insert your code here (class counts)\n",
" self.N = # insert your code here (feature counts per class)\n",
" # add the feature counts per class here\n",
" \n",
" def predict(self, x):\n",
" \"\"\"Predict the outcome for example x. Choose the most\n",
" likely outcome out of all possible outcomes.\"\"\"\n",
" pass\n",
" \n",
"# these tests should return True if your code is correct\n",
"nb = NaiveBayesLearner()\n",
"X, y = [('a', 'a', 'a', 'b'), ('a', 'b', 'b')], [1, 0]\n",
"nb.fit(X, y)\n",
"print(nb.N[1]['b'] == 1 and nb.C[1] == 1)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have specified the `fit` method of our learner, we can continue with the `predict` method. `predict` takes as argument a single example and returns the class with the maximum posterior probability. This means that within `predict` we have to compute the posterior probability of all unique class labels in our training data. We could do all of the computation in the single method `predict`, but our code will remain a little more clean if we split it into a number of helper functions."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"Quiz!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**a)** We will start with a method that computes the probability of a feature $x_i$ given a class label $y$. The method `probability` in the code below takes as argument a feature $x_i$ and and class label $y$ and should return the probability $P(x_i|y)$."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"class NaiveBayesLearner(Learner):\n",
" \"\"\"Naive Bayes Learner. This learner can be\n",
" initialized using nb = NaiveBayesLearner()\n",
" to construct a classifier that incorporates the prior\n",
" probabilities of each possible class label.\"\"\"\n",
"\n",
" def fit(self, X, y):\n",
" \"\"\"Fit or train the naive bayes classifier. X must be an\n",
" iterable of examples where each example is an iterable as\n",
" well.\"\"\"\n",
" self.C = Counter(y)\n",
" self.N = defaultdict(Counter)\n",
" for x, y_x in zip(X, y):\n",
" self.N[y_x] += Counter(x)\n",
"\n",
" def prior(self, y):\n",
" \"\"\"Return the prior probability of class y.\"\"\"\n",
" pass\n",
" \n",
" def probability(self, x, y):\n",
" \"\"\"Apply Laplace Smoothing to give a probability\n",
" estimate of feature x given y.\"\"\"\n",
" # insert your code here\n",
"\n",
"# these tests should return True if your code is correct\n",
"nb = NaiveBayesLearner()\n",
"X, y = [('a', 'a', 'a', 'b'), ('a', 'b', 'b'), ('b', 'b', 'b')], [1, 0, 0]\n",
"nb.fit(X, y)\n",
"print(abs(nb.probability('a', 1) - 0.66666) < 0.00001)\n",
"print(abs(nb.probability('c', 0) - 0.125) < 0.00001)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**b)** Next, complete the method `prior` in the class definition above (remove the `pass` statement). It takes as argument a class label and should return the prior probability of that label."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# these tests should return True if your code is correct\n",
"nb = NaiveBayesLearner()\n",
"X, y = [('a', 'a', 'a', 'b'), ('a', 'b', 'b'), ('b', 'b', 'b')], [1, 0, 0]\n",
"nb.fit(X, y)\n",
"print(round(nb.prior(1), 2) == 0.33)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With the two methods `prior` and `probability` in place, we are ready to implement the final `predict` method. Before we do that, however, there is a small technical issue that must be discussed. The formula above states that we have to compute the joint probability of all features given a label as the product over the individual probabilities. However, each probability is a number between 0 and 1 and when we multiply those numbers, the resulting number will be smaller than both individual probabilities. Now imagine we have a thousand features or -- and in natural language that is still quite small -- 100k features. If we now multiply the probabilities of all these features we will get a very small number, possibly too small to be adequately represented by Python. Consider the code below:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"x = 1e-10\n",
"for i in range(10):\n",
" x = x * x\n",
" print(x)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After less than 10 multiplications, the values are too small for Python to distinguish them from each other. Even worse, the values default to zero and as we know multiplying by zero will return zero, and therefore the result of computing the joint probability could be zero! We therefore take the log of the individual feature probabilities and sum them to obtain our final score. This is equivalent to taking the product of the probabilties."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from math import log\n",
"\n",
"x = log(1e-10)\n",
"for i in range(10):\n",
" x = x + x\n",
" print(x)"
],
"language": "python",
"metadata": {},
"outputs": []
},
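{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sanity check that working in log space does not change which class wins. The two candidate sets of \"joint probabilities\" below are invented just for this illustration:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from math import log\n",
"\n",
"probs_author_a = [0.2, 0.01, 0.3]\n",
"probs_author_b = [0.1, 0.05, 0.2]\n",
"\n",
"product_a = 0.2 * 0.01 * 0.3\n",
"product_b = 0.1 * 0.05 * 0.2\n",
"log_sum_a = sum(log(p) for p in probs_author_a)\n",
"log_sum_b = sum(log(p) for p in probs_author_b)\n",
"\n",
"# the ordering of the two classes is the same in both representations\n",
"print((product_a > product_b) == (log_sum_a > log_sum_b))"
],
"language": "python",
"metadata": {},
"outputs": []
},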
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"Quiz!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Implement the method `predict`. It takes a argument a list or some other iterable of values and should return the label that maximizes the posterior probability."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"class NaiveBayesLearner(Learner):\n",
" \"\"\"Naive Bayes Learner. This learner can be\n",
" initialized using nb = NaiveBayesLearner()\n",
" to construct a classifier that incorporates the prior\n",
" probabilities of each possible class label.\"\"\"\n",
"\n",
" def fit(self, X, y):\n",
" \"\"\"Fit or train the naive bayes classifier. X must be an\n",
" iterable of examples where each example is an iterable as\n",
" well.\"\"\"\n",
" self.C = Counter(y)\n",
" self.N = defaultdict(Counter)\n",
" for x, y_x in zip(X, y):\n",
" self.N[y_x] += Counter(x)\n",
"\n",
" def prior(self, y):\n",
" \"\"\"Return the prior probability of class y.\"\"\"\n",
" return self.C[y] / sum(self.C.values())\n",
"\n",
" def probability(self, x, y):\n",
" \"\"\"Apply Laplace Smoothing to give a probability\n",
" estimate of feature x given y.\"\"\"\n",
" return (self.N[y][x] + 1.0) / (sum(self.N[y].values()) + len(self.N))\n",
"\n",
" def predict(self, x):\n",
" \"\"\"Predict the outcome for example x. Choose the most\n",
" likely outcome out of all possible outcomes.\"\"\"\n",
" # insert your code here\n",
"\n",
"# these tests should return True if your code is correct\n",
"nb = NaiveBayesLearner()\n",
"X, y = [('a', 'a', 'a', 'b'), ('a', 'b', 'b'), ('b', 'b', 'b')], [1, 0, 0]\n",
"nb.fit(X, y)\n",
"print(nb.predict(('a', 'a')) == 1 and \n",
" nb.predict(('a', 'b')) == 0 and\n",
" nb.predict(('b', 'b')) == 0)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"Authorship Attribution Learner"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Authorship attribution is a more specific instance of a more general problem and can therefore be thought of as a subclass of a `NaiveBayesLearner`. In textual authorship attribution our input space will be represented by textual features and the class labels are the authors we want attribute to a written document. Before we implement the `Learner`, we must first define a number of functions to extract features from running text.\n",
"\n",
"What kind of features can we use for authorship attribution? Words, defined as everything surrounded by spaces, are generally conceived to be good features. The same holds for bigrams of words and character $n$-grams. However, [there ain't no such thing as a free lunch](https://en.wikipedia.org/wiki/No_free_lunch_theorem). So, let's not restrict ourselves to a single feature representation but experiment with a number of different representations and see what works best.\n",
"\n",
"In the folder `data/british-novels` you will find 26 famous British novels downloaded from [Project Gutenberg](http://www.gutenberg.org/wiki/Main_Page). This is a small toy dataset that we will use in our experiments. First, we will create a simple representation of a document. I choose to represent each document as a tuple of an author, a title and the actual text. Instead of ordinary tuples we will use the `namedtuple` from the [collections](https://docs.python.org/3.4/library/collections.html#collections.namedtuple) module in Python's standard library. A `namedtuple` can be constructed as follows:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from collections import namedtuple\n",
"\n",
"Point = namedtuple(\"Point\", [\"x\", \"y\"])\n",
"p = Point(1, y=2)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"where the first argument represents the typename of the tuple and the second argument specifies the fieldnames. These fieldnames can be conveniently accessed using:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print(p.x) # by name\n",
"print(p[1]) # by index"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Namedtuples follow the exact same behavior as regular tuples: they are hashable and can be unpacked:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"x, y = p\n",
"print(x)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We define a namedtuple for our documents as follows:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"Document = namedtuple(\"Document\", [\"author\", \"title\", \"text\"])"
],
"language": "python",
"metadata": {},
"outputs": []
},
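{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, a (completely made-up) document could be constructed and inspected as follows:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"doc = Document(author=\"Some Author\", title=\"Some Title\",\n",
"               text=[\"the\", \"project\", \"gutenberg\", \"ebook\", \"of\"])\n",
"print(doc.author)\n",
"print(doc.text[:3])"
],
"language": "python",
"metadata": {},
"outputs": []
},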
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"Quiz!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**a)** Next we will write some functions to read a file and extract the contents as either an iterable of word $n$-grams or character $n$-grams. We start with a function to extract all $n$-grams (either character or word) from an iterable. The function should take as argument a fixed order iterable, and a tuple representing the $n$-gram range. When a tuple of (1, 2) is given, the function should return all unigrams and bigrams in the iterable. When passed (1, 1) as $n$-gram range, it should only return unigrams. Implement the function `ngrams`:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def ngrams(iterable, ngram_range=(1, 1)):\n",
" \"Return the iterable as a sequence of n-grams.\"\n",
" min_n, max_n = ngram_range\n",
" if isinstance(iterable, (list, set)):\n",
" iterable = tuple(iterable)\n",
" ngrams = []\n",
" # insert your code here\n",
" return ngrams\n",
"\n",
"# these tests should return True if your code is correct\n",
"print(ngrams(\"humanities\", ngram_range=(2, 3)) == ['hu', 'um', 'ma', 'an', 'ni', 'it', \n",
" 'ti', 'ie', 'es', 'hum', 'uma', 'man', \n",
" 'ani', 'nit', 'iti', 'tie', 'ies'])\n",
"print(ngrams(\"teach me Python\".split(), ngram_range=(1, 2)) == [\n",
" ('teach',), ('me',), ('Python',), ('teach', 'me'), ('me', 'Python')])"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**b)** Can you think of any reason why we convert our iterable into a tuple?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Double click this cell and write down your answer.*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will write a function `make_document` that takes as argument a filename and returns an instance of our named tuple `Document`. Each filename in `data/british-novels` consist of the author and the title separated by an underscore. This allows us to use the filenames to easily extract the title and the author. The function `make_document` takes as argument a filename, an $n$-gram range, an argument that states whether to lowercase the text the type of $n$-grams (either word or char) and how large the sample of each text should be:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from os.path import basename\n",
"from itertools import islice\n",
"import re\n",
"\n",
"def tokenize(text, lowercase=True):\n",
" \"Tokenize a string on whitespace.\"\n",
" text = text.lower() if lowercase else text\n",
" for match in re.finditer(r\"\\w+(\\.?\\w+)*\", text):\n",
" yield match.group()\n",
"\n",
"def make_document(filename, ngram_range=(1, 1), lowercase=True, ngram_type='word', sample=5000):\n",
" with open(filename) as infile:\n",
" text = list(islice(tokenize(infile.read(), lowercase=lowercase), sample))\n",
" if ngram_type == 'char':\n",
" text = ' '.join(text)\n",
" author, title = basename(filename).replace('.txt', '').split('_')\n",
" return Document(author, title, ngrams(text, ngram_range=ngram_range))"
],
"language": "python",
"metadata": {},
"outputs": []
},
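{
"cell_type": "markdown",
"metadata": {},
"source": [
"Assuming the `ngrams` function from the quiz above has been completed, a document can be loaded like this (we simply pick the first file in the folder rather than hard-coding a particular novel):"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from glob import glob\n",
"\n",
"filenames = sorted(glob('data/british-novels/*.txt'))\n",
"document = make_document(filenames[0], ngram_range=(1, 1), ngram_type='word', sample=5000)\n",
"print(document.author, '-', document.title)\n",
"print(document.text[:10])"
],
"language": "python",
"metadata": {},
"outputs": []
},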
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The sample size is important since we want to control for the fact that some documents are longer than others which might influence our distributions.\n",
"\n",
"OK, so we have our feature extraction in place as well as our naive bayes learner. Let's now focus on implementing the authorship attribution learner. As said, we will implement the learner as a subclass of the `NaiveBayesLearner`. The prediction method will remain the same, but we have to make some small changes to the `fit` method that specifically target the authorship attribution problem. \n",
"\n",
"A fundamental insight from stylometry and computational authorship attribution is that function words (e.g. *the*, *of*, *in* etc) are particularly suited for the problem of authorship attribution because they are writer invariant. Content words (nouns, verbs and adjectives) draw too much attention to the actual contents of a text which makes it hard to find commonalities between two texts written by the same author when they deal with different topics. Therefore researchers tend to extract only the $n$ most frequent words from a text (which dominantly consist of function words). "
]
},
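{
"cell_type": "markdown",
"metadata": {},
"source": [
"A building block you may find useful in the quiz below is the `most_common` method of `Counter`, which returns the $n$ most frequent items together with their counts:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from collections import Counter\n",
"\n",
"counts = Counter(['the', 'of', 'the', 'a', 'the', 'of', 'whale'])\n",
"print(counts.most_common(2))  # [('the', 3), ('of', 2)]"
],
"language": "python",
"metadata": {},
"outputs": []
},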
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "heading",
"level": 4,
"metadata": {},
"source": [
"Quiz!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Implement the `fit` method in the `AuthorshipLearner` class below. Contrary to our previous implementation, here you should only store the $n$ most frequent features."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"class AuthorshipLearner(NaiveBayesLearner):\n",
" \"\"\"Subclass of NaiveBayesLearner that targets the authorship\n",
" attribution problem. The learner can be initialized using\n",
" >>> al = AuthorshipLearner(n_most_frequent=100)\n",
" where the argument `n_most_frequent` is used to specify\n",
" the n most frequent features. If set to None, all features\n",
" will be used.\"\"\"\n",
" \n",
" def __init__(self, n_most_frequent=100):\n",
" self.n_most_frequent = n_most_frequent\n",
" \n",
" def fit(self, X, y):\n",
" \"\"\"Fit or train the Authorship Learner. X must be an\n",
" iterable of examples where each example is an iterable as\n",
" well.\"\"\"\n",
" self.C = Counter(y)\n",
" self.N = defaultdict(Counter)\n",
" # insert your code here\n",
" \n",
"# these tests should return True if your code is correct\n",
"X = [('a', 'a', 'a', 'b', 'b', 'c'), ('a', 'c', 'b', 'b'), ('b', 'b', 'b', 'c', 'c', 'a')]\n",
"y = [1, 0, 0]\n",
"al = AuthorshipLearner(n_most_frequent=2)\n",
"al.fit(X, y)\n",
"print('c' not in al.N[1] and set(['b', 'c']) == set(al.N[0].keys()))"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---"
]
},