---
output:
  pdf_document:
    toc: yes
    toc_depth: '4'
  html_document:
    toc: yes
    toc_float: yes
    toc_depth: 4
---
# Frequentists vs Bayesians
## Symbol Interpretation
$X$: a specific data set
$\theta$ : parameter of the model
$P(X|\theta)$: this expression can be read in two ways, depending on whether $X$ or $\theta$ is treated as the variable.
1. **Likelihood function**: $X$ is given and $\theta$ varies. The function is then called the likelihood function; it describes how probable the observed sample $X$ is under different values of the model parameter.
2. **Probability function**: $\theta$ is known and $X$ varies. The function is then called the probability function; it describes the probability of observing different sample points $X$.
## Frequentists
Frequentists regard the parameter $\theta$ to be inferred as a fixed but unknown constant that has nothing to do with $X$. The sample $X$ is random, and all probability calculations refer to the distribution of $X$ over the sample space.
$\theta$: fixed, unknown constant
$X$: random variable
### Maximum Likelihood Estimation (MLE) - commonly used by frequentists
#### Explanation of MLE
Choose the value of $\theta$ that maximizes the likelihood function given the observed data. That is, the parameter $\theta$ is estimated from the observed data (the sample) alone.
$$ \theta_{MLE}=\underset{\theta}{\operatorname{argmax}}\log P(X|\theta) $$
#### Solving steps
- Write down the likelihood function $P(X|\theta)$
- Take the corresponding log-likelihood $L=\log P(X|\theta)$ (easier to differentiate; the argmax is unchanged since $\log$ is monotonically increasing; and it avoids numerical problems from multiplying many small numbers)
- Compute the argmax by setting the first derivative equal to zero and solving for $\theta_{MLE}$
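As a quick numerical illustration of these steps (a minimal R sketch with an assumed exponential model, not part of the original derivation), the log-likelihood can be maximized with a generic optimizer and compared against the closed-form MLE:

```r
# Minimal sketch: MLE of an exponential rate parameter via numerical optimization.
# Assumed setup: x_i ~ Exp(rate = theta), so log L(theta) = N*log(theta) - theta*sum(x).
set.seed(1)
x <- rexp(1000, rate = 2)                      # simulated sample (true theta = 2)

neg_log_lik <- function(theta) -sum(dexp(x, rate = theta, log = TRUE))

fit <- optimize(neg_log_lik, interval = c(1e-6, 100))
fit$minimum        # numerical theta_MLE
1 / mean(x)        # closed-form theta_MLE for the exponential; should agree
```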
## Bayesians
The parameter $\theta$ is regarded as a random variable and the sample $X$ as fixed. The focus is on the parameter space: the distribution of $\theta$ is emphasized, and the posterior distribution of the parameter is obtained by combining its prior distribution with the sample information.
$\theta$: random variable
$X$: fixed
$P(\theta)$: a prejudgment of $\theta$ in the absence of any observed data
$\theta \sim p(\theta)$: $\theta$ satisfies a preset prior distribution
Then, the corresponding posterior probability is
$$
P(\theta|X)=\frac {P(X|\theta)P(\theta)}{P(X)}
$$
which is proportional to $P(X|\theta)P(\theta)$
### Maximum a posteriori (MAP) - commonly used by Bayesians
#### Explanation of MAP
When estimating the model parameter $\theta$, we rely not only on the observed data but also on a prior, which can be understood as expert experience. MAP finds $\theta$ by maximizing the posterior probability.
Since $P(X)$ does not depend on $\theta$,
$$
\theta_{MAP}=\underset{\theta}{\operatorname{argmax}} P(\theta|X)=\underset{\theta}{\operatorname{argmax}}P(X|\theta)P(\theta)
$$
#### Solving steps
- Determine the prior distribution and likelihood function of parameters
- Find the log posterior distribution function of the parameters,
$L=\sum_{x_{i} \in \mathbf{X}}\left\{\log \operatorname{P}\left(x_{i} \mid \theta\right)\right\}+\log \operatorname{P}(\theta)$, which is almost the same as the MLE estimate except that we now have an additional term resulting from the prior
- Find $\theta_{MAP}$ by setting the first derivative of $L$ equal to zero.
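A minimal R sketch of MAP estimation, assuming a Bernoulli likelihood with a Beta prior (both chosen here purely for illustration); it maximizes the log posterior numerically and compares it with the MLE and the closed-form posterior mode:

```r
# Minimal sketch: MAP estimate of a Bernoulli parameter theta with a Beta(a, b) prior.
# log posterior (up to a constant) = sum(log P(x_i | theta)) + log P(theta).
set.seed(2)
x <- rbinom(20, size = 1, prob = 0.7)          # small sample of 0/1 outcomes
a <- 2; b <- 2                                  # assumed prior hyperparameters

log_post <- function(theta) {
  sum(dbinom(x, size = 1, prob = theta, log = TRUE)) + dbeta(theta, a, b, log = TRUE)
}

theta_mle <- mean(x)                                              # MLE: sample proportion
theta_map <- optimize(log_post, c(1e-6, 1 - 1e-6), maximum = TRUE)$maximum
c(MLE = theta_mle,
  MAP = theta_map,
  closed_form_MAP = (sum(x) + a - 1) / (length(x) + a + b - 2))   # posterior mode
```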
### Bayesian estimation
Bayesian estimation is an extension of maximum a posteriori estimation. Both MLE and MAP produce a point estimate of $\theta$, whereas Bayesian estimation computes the full posterior distribution $P(\theta|X)$; in Bayesian estimation, therefore, the evidence $P(X)$ in the denominator can no longer be ignored.
$$
P(\theta|X)=\frac {P(X|\theta)P(\theta)}{P(X)}=\frac{p(X \mid \theta) \cdot p(\theta)}{\int_{\theta} p(X \mid \theta) \cdot p(\theta) d \theta}
$$
### Bayesian forecasting
Bayesian prediction is commonly used to estimate the probability of a new observation. For a new data point $\hat{x}$,
$$
p\left(\hat{x} \mid X\right)=\int_{\theta} p\left(\hat{x} \mid \theta\right) \cdot p(\theta \mid X) d \theta
$$
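For the conjugate Beta-Bernoulli case (an assumption made only for illustration), this predictive integral has a closed form, which a short sketch can evaluate directly:

```r
# Minimal sketch (Beta-Bernoulli assumption, chosen for illustration): posterior predictive.
# With prior theta ~ Beta(a, b) and N Bernoulli observations with s successes, the posterior
# is Beta(a + s, b + N - s), and the predictive integral evaluates to (a + s) / (a + b + N).
x <- c(1, 1, 0, 1, 0, 1, 1, 1)     # assumed 0/1 observations
a <- 2; b <- 2                      # assumed prior hyperparameters
s <- sum(x); N <- length(x)
(a + s) / (a + b + N)               # p(xhat = 1 | X)
```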
## Remark
There are essential differences between frequentists and Bayesians in how they view the world. Frequentists believe the world is fixed: there is an underlying truth whose value does not change, and the goal is to find that value or a range containing it. Bayesians believe the world is uncertain: people hold a prior belief about the world and adjust it with observed data, and the goal is to find the probability distribution that best describes the world.
## Gaussian Distribution-MLE
### Overview
$$ \text{Data: } X = (x_1,x_2,\dots,x_N)^T, \quad x_i \text{ is a } p\text{-dimensional vector} $$
Each row is one observation; each column holds the values of the observations on one attribute.
The $x_i$ are independently and identically distributed:
$x_i \sim N(\mu, \Sigma)$, where $\mu$ is the mean and $\Sigma$ is the covariance matrix.
<u>Independent and identically distributed</u>: the random variables are independent of each other and follow the same probability distribution.
Parameter $\theta = (\mu, \Sigma)$
**Use MLE to find $\theta$**: suppose a data set follows a Gaussian distribution; the goal is to find the mean and covariance of that Gaussian so that the probability of observing this data set is maximized:
$$\theta_{MLE} = \underset{\theta}{\operatorname{argmax}}\, P(X|\theta)$$
### One-dimension MLE
First, we consider the one-dimensional case:
let $p=1$, $\theta=(\mu, \sigma^2)$.
Since the data are <u>independently and identically distributed</u>, the likelihood can be written as a product:
$$
\log P(X|\theta)=\log\prod_{i=1}^N P(x_i|\theta)=\sum_{i=1}^N\log P(x_i|\theta)
$$
The probability density function PDF of gaussian distribution is
$$
p(x|\mu,\Sigma)=\frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}e^{-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)}
$$
**pdf under one dimension**
$$
\log p(X|\theta)=\sum_ {i=1}^{N}\log p(x_i|\theta)=\sum_{i=1}^{N}\log\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(x_i-\mu)^{2}}{2\sigma^{2}}\right)=\sum_{i=1}^{N}\left[\log\frac{1}{\sqrt{2\pi}}+\log\frac{1}{\sigma}-\frac{(x_i-\mu)^2}{2\sigma^2}\right]
$$
Then we can calculate $\mu_{MLE} \text{ and }\sigma^2_{MLE}$
**Find the items related to $\mu$**
$\mu_{MLE}=\underset{\mu}{\operatorname{argmax}}\log P(X|\theta)=\underset{\mu}{\operatorname{argmin}}\sum_{i=1}^{N}(x_i-\mu)^2$
**Calculate the partial derivative of $\mu$**
$$ \sum_{i=1}^{N}(x_i-\mu)=0 \\\mu_{MLE}=\frac{1}{N}\sum_{i=1}^{N}x_i \quad (\text{unbiased estimate: } E[\mu_{MLE}]=\mu) $$
**Find the terms related to $\sigma^2$**
$\sigma^2_{MLE}=\underset{\sigma}{\operatorname{argmax}}\sum_{i=1}^{N}\left(\log\frac{1}{\sigma}-\frac{(x_i-\mu)^2}{2\sigma^2}\right)$
**Calculate the partial derivative of** $\sigma$
$$\sum_{i=1}^N \left[-\sigma^2+(x_i-\mu)^2\right]=0 \\\sigma^2_{MLE}=\frac{1}{N}\sum_{i=1}^N (x_i-\mu_{MLE})^2 \quad (\text{biased estimate: } E[\sigma^2_{MLE}]= \frac{N-1}{N} \sigma^2)$$
#### Unbiased vs biased
To judge whether an estimator is biased or unbiased, check whether the following equations hold:
$$
E[\hat{\mu}]=\mu \quad E[\hat{\sigma}^2]=\sigma^2
$$
\begin{aligned}
E\left[\mu_{MLE}\right] &=E\left[\frac{1}{N} \sum_{i=1}^{N} x_{i}\right]=\frac{1}{N} \sum_{i=1}^{N} E\left[x_{i}\right]=\frac{1}{N} \cdot N \mu=\mu \\E\left[\sigma_{MLE}^{2}\right] &=E\left[\frac{1}{N}
\sum_{i=1}^{N}\left(x_{i}-\mu_{MLE}\right)^{2}\right] \\&=E\left[\frac{1}{N} \sum_{i=1}^{N}\left(x_{i}^{2}-2 x_{i} \mu_{MLE}+\mu_{MLE}^{2}\right)\right] \\&=E\left[\frac{1}{N} \sum_{i=1}^{N} x_{i}^{2}-2 \mu_{MLE}^{2}+\mu_{MLE}^{2}\right] \\&=E\left[\frac{1}{N} \sum_{i=1}^{N} x_{i}^{2}-\mu^{2}+\mu^{2}-\mu_{MLE}^{2}\right] \\&=E\left[\frac{1}{N} \sum_{i=1}^{N} x_{i}^{2}-\mu^{2}\right]-E\left[\mu_{MLE}^{2}-\mu^{2}\right] \\&=\sigma^{2}-\left(E\left[\mu_{MLE}^{2}\right]-\mu^{2}\right)\\&=\sigma^{2}-\left(E\left[\mu_{MLE}^{2}\right]-E^{2}\left[\mu_{MLE}\right]\right)\\&=\sigma^{2}-\operatorname{Var}\left(\mu_{MLE}\right) \\&=\sigma^{2}-\operatorname{Var}\left(\frac{1}{N} \sum_{i=1}^{N} x_{i}\right) \\&=\sigma^{2}-\frac{1}{N^{2}} \sum_{i=1}^{N} \operatorname{Var}\left(x_{i}\right)=\frac{N-1}{N} \sigma^{2}
\end{aligned}
**So $\mu_{MLE}$ is an unbiased estimator and $\sigma^2_{MLE}$ is a biased estimator**
##### Remark
Maximum likelihood (point) estimators can be biased; for the Gaussian distribution, $\sigma^2_{MLE}$ systematically underestimates the true variance.
Note that the expectation of the sample mean equals the population mean; this does not mean that the sample mean itself equals the population mean.
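A small simulation (assumed sample size and parameters) can confirm both results: the sample mean is unbiased, while the MLE variance concentrates around $\frac{N-1}{N}\sigma^2$:

```r
# Minimal sketch: simulate many samples to check E[mu_MLE] = mu and
# E[sigma2_MLE] = (N - 1)/N * sigma^2 for a Gaussian.
set.seed(3)
N <- 10; mu <- 1; sigma2 <- 4; R <- 50000

mu_hat     <- numeric(R)
sigma2_hat <- numeric(R)
for (r in 1:R) {
  x <- rnorm(N, mean = mu, sd = sqrt(sigma2))
  mu_hat[r]     <- mean(x)
  sigma2_hat[r] <- mean((x - mean(x))^2)   # MLE of the variance (divides by N)
}
mean(mu_hat)                      # close to mu = 1 (unbiased)
mean(sigma2_hat)                  # close to (N-1)/N * sigma2 = 3.6 (biased low)
```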
### Multidimensional Gaussian distribution
**pdf of the multidimensional Gaussian distribution**
$$
p(x|\mu,\Sigma)=\frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}}e^{-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)}
$$
**Mahalanobis distance**
$$
x=\left(\begin{array}{c}x_{1} \\ x_{2} \\ \vdots \\ x_{p}\end{array}\right) \quad \mu=\left(\begin{array}{c}\mu_{1} \\ \mu_{2} \\ \vdots \\ \mu_{p}\end{array}\right) \\(x-\mu)^{T}\Sigma^{-1}(x-\mu): \text{Mahalanobis distance (between $x$ and $\mu$)}
\\ \text{If $\Sigma=I$, the Mahalanobis distance reduces to the Euclidean distance}
$$
**Covariance**
Covariance measures the joint variability of two variables; variance is the special case of covariance when the two variables are the same.
$$
\operatorname{cov}\left(X_{1}, X_{2}\right)=E\left(X_{1}-\mu_{1}\right)\left(X_{2}-\mu_{2}\right)=E\left(X_{1} X_{2}\right)-\mu_{1} \mu_{2}
$$
If $X_1$ and $X_2$ are two independent random variables, their covariance is zero; the converse is not true.
**Covariance matrix(positive definite and symmetric):**
$$
\Sigma=\left[\begin{array}{cccc}\operatorname{cov}\left(x_{1}, x_{1}\right) & \operatorname{cov}\left(x_{1}, x_{2}\right) & \ldots & \operatorname{cov}\left(x_{1}, x_{n}\right) \\\operatorname{cov}\left(x_{2}, x_{1}\right) & \operatorname{cov}\left(x_{2}, x_{2}\right) & \ldots & \operatorname{cov}\left(x_{2}, x_{n}\right) \\\vdots & \vdots & \ddots & \vdots \\\operatorname{cov}\left(x_{n}, x_{1}\right) & \operatorname{cov}\left(x_{n}, x_{2}\right) & \ldots & \operatorname{cov}\left(x_{n}, x_{n}\right)\end{array}\right]
$$
The element in the ith row and the jth column represents the covariance of the ith and jth random variable
**Positive definite**
An $n\times n$ real symmetric matrix $M$ is positive definite if and only if $z^TMz>0$ for every non-zero real vector $z$.
**Symmetric**
Since the covariance between the $i$-th and the $j$-th random variables is the same as the covariance between the $j$-th and the $i$-th random variables, the matrix is symmetric.
**For symmetric covariance matrix, eigenvalue decomposition can be performed**
so $$ \Sigma=U \Lambda U^{T}=\left(u_{1}, u_{2}, \cdots, u_{p}\right) \operatorname{diag}\left(\lambda_{i}\right)\left(u_{1}, u_{2}, \cdots, u_{p}\right)^{T}=\sum_{i=1}^{p} u_{i} \lambda_{i} u_{i}^{T} $$
Let $$ \\\Sigma^{-1}=\sum_{i=1}^{p} u_{i} \frac{1}{\lambda_{i}} u_{i}^{T} \\\Delta=(x-\mu)^{T} \Sigma^{-1}(x-\mu)=\sum_{i=1}^{p}(x-\mu)^{T} u_{i} \frac{1}{\lambda_{i}} u_{i}^{T}(x-\mu)=\sum_{i=1}^{p} \frac{y_{i}^{2}}{\lambda_{i}}$$
$y_i=(x-\mu)^Tu_i$ is the projection length of $x-\mu$ on the eigenvector $u_i$
For the two-dimensional Gaussian distribution, fixing $\Delta$ in the expression above gives concentric ellipses, one for each value of $\Delta$:
$$
\frac{y_1^2}{\lambda_1}+\frac{y_2^2}{\lambda_2}=\Delta
$$
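A minimal R sketch (with an assumed $2\times 2$ covariance matrix) showing that the quadratic form, the built-in `mahalanobis()` helper, and the eigendecomposition route $\sum_i y_i^2/\lambda_i$ all give the same value:

```r
# Minimal sketch: Mahalanobis distance and the eigendecomposition of a covariance matrix.
Sigma <- matrix(c(2.0, 0.8,
                  0.8, 1.0), nrow = 2, byrow = TRUE)   # assumed 2x2 covariance
mu <- c(0, 0)
x  <- c(1, 2)

# (x - mu)^T Sigma^{-1} (x - mu), by hand and with the built-in helper
drop(t(x - mu) %*% solve(Sigma) %*% (x - mu))
mahalanobis(x, center = mu, cov = Sigma)

# Sigma = U Lambda U^T; the y_i in the text are projections of (x - mu) on the u_i
e <- eigen(Sigma)
y <- t(e$vectors) %*% (x - mu)
sum(y^2 / e$values)              # same value, computed via sum of y_i^2 / lambda_i
```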
## Limitation of Gaussian distribution
There are two problems when applying the multidimensional Gaussian distribution model
1. **The covariance matrix has too many parameters**
Since the covariance matrix is a <u>symmetric matrix</u>, we can count
<u>the number of free parameters in the matrix</u>:
$\frac{p^2-p}{2}+p=\frac{p^2+p}{2}=O(p^2)$,
which makes the computation expensive.
So we can simplify the covariance matrix.
Assuming the covariance matrix is a <u>diagonal matrix</u> (non-zero entries only on the diagonal), we obtain the simplified covariance matrix
$$
\left(\begin{array}{ccccc}
\lambda_{1} & & & & \\
& \lambda_{2} & & & \\
& & \lambda_{3} & & \\
& & & \ddots & \\
& & & & \lambda_{p}
\end{array}\right)
$$
In this case there is no need to factor the covariance matrix: each eigenvector $u_i$ coincides with a coordinate axis, so the direction of $y_i$ is the same as the direction of $x_i$. Therefore, in two dimensions, the contour becomes an axis-aligned ellipse.
Moreover, if $\lambda_1=\lambda_2=\dots=\lambda_p$, the contour becomes a circle; this case is called <u>isotropic</u>.
2. **A single Gaussian distribution may fail to describe the data well** when the data are complex.
## Marginal and conditional probability
Divide the p-dimensional vector $x$ into two groups
we have
$$
x=\left(\begin{array}{l}
x_{a} \\
x _{b} \\
\end{array}\right)
$$
$x_a$ is an $m$-dimensional vector, $x_b$ is an $n$-dimensional vector, and $m+n=p$
$$
\mu=\left(\begin{array}{l}
\mu_{a} \\
\mu _{b} \\
\end{array}\right)
$$
$$
\Sigma =\left(\begin{array}{cc}
\Sigma_{aa} &\Sigma_{ab} \\
\Sigma_{ba} &\Sigma_{bb} \\
\end{array}\right)
$$
First, we introduce a theorem:
$$
X\sim N(\mu,\Sigma)
\\y=Ax+B
\\y\sim N(A\mu+B, A\Sigma A^T)
$$
\begin{aligned}
\text{proof:}\quad &E[y]=E[Ax+B]=AE[x]+B=A\mu+B
\\&Var[y]=Var[Ax+B]=Var[Ax]+Var[B]
=A\,Var[x]\,A^T+0\;(B\text{ is a constant})
=A\Sigma A^T
\end{aligned}
Then, we can derive the marginal distribution of $x_a$:
$${x_a\sim N(\mu_a,\Sigma_{aa})}$$
$$x_a=\begin{pmatrix}\mathbb{I}_{m\times m}&\mathbb{O}_{m\times n}\end{pmatrix}\begin{pmatrix}x_a \\ x_b\end{pmatrix}$$
$$ \mathbb{E}[x_a]=\begin{pmatrix}\mathbb{I}&\mathbb{O}\end{pmatrix}\begin{pmatrix}\mu_a\\\mu_b\end{pmatrix}=\mu_a$$
$$Var[x_a]=\begin{pmatrix}\mathbb{I}&\mathbb{O}\end{pmatrix}\begin{pmatrix}\Sigma_{aa}&\Sigma_{ab}\\\Sigma_{ba}&\Sigma_{bb}\end{pmatrix}\begin{pmatrix}\mathbb{I}\\\mathbb{O}\end{pmatrix}=\Sigma_{aa}$$
so $x_a\sim N(\mu_a,\Sigma_{aa})$
$${E\left[x_{b \cdot a}\right]=\mu_{b \cdot a}}$$
$${\operatorname{Var}\left[x_{b \cdot a}\right]=\Sigma_{b b \cdot a}}$$
We define three auxiliary quantities (obtained after some calculation):
$$ x_{b\cdot a}=x_b-\Sigma_{ba}\Sigma_{aa}^{-1}x_a \\
\mu_{b\cdot a}=\mu_b-\Sigma_{ba}\Sigma_{aa}^{-1}\mu_a\\
\Sigma_{bb\cdot a}=\Sigma_{bb}-\Sigma_{ba}\Sigma_{aa}^{-1}\Sigma_{ab} \quad (\text{the Schur complement})
$$
$$
x_{b \cdot a}=\left(-\Sigma_{b a} \Sigma_{a a}^{-1} \quad I\right)\left(\begin{array}{l}x_{a} \\ x_{b}\end{array}\right)
$$
$$
E\left[x_{b \cdot a}\right]=\left(\begin{array}{ll}-\Sigma_{b a} \Sigma_{a a}^{-1} & I\end{array}\right)\left(\begin{array}{l}\mu_{a} \\ \mu_{b}\end{array}\right)=\mu_{b}-\Sigma_{b a} \Sigma_{a a}^{-1} \mu_{a}=\mu_{b \cdot a}
$$
\begin{align}
\operatorname{Var}\left[x_{b \cdot a}\right]&=\left(\begin{array}{ll}-\Sigma_{b a} \Sigma_{a a}^{-1} & I\end{array}\right)\left(\begin{array}{cc}\Sigma_{a a} & \Sigma_{a b} \\ \Sigma_{b a} & \Sigma_{b b}\end{array}\right) \left(\begin{array}{c}-\Sigma_{a a}^{-1} \Sigma_{b a}^{\top} \\ I\end{array}\right)\\
&=\left(\begin{array}{cc}0 & \Sigma_{b b}-\Sigma_{b a} \Sigma_{a a}^{-1} \Sigma_{a b}\end{array}\right)\left(\begin{array}{c}-\Sigma_{a a}^{-1} \Sigma_{b a}^{\top} \\ I\end{array}\right)\\
&=\Sigma_{b b}-\Sigma_{b a} \Sigma_{a a}^{-1} \Sigma_{a b}\\
&=\Sigma_{b b \cdot a}
\end{align}
Finally, the conditional distribution: $\pmb{x_{b} \mid x_{a} \sim N\left(\mu_{b \cdot a}+\Sigma_{b a} \Sigma_{a a}^{-1} x_{a},\, \Sigma_{b b \cdot a }\right)}$
First, we show how to prove that two (jointly Gaussian) variables are independent:
If $x\sim N(\mu, \Sigma)$, then $M x \perp N x \Leftrightarrow M \Sigma N^{\top}=0$.
$\begin{aligned} &\because x \sim N(\mu, \Sigma) \\ &\therefore M x \sim N\left(M \mu, M \Sigma M^{\top}\right), \quad N x \sim N\left(N \mu, N \Sigma N^{\top}\right) \\ \operatorname{Cov}(M x, N x) &= E\left[(M x-M \mu)(N x-N \mu)^{\top}\right] \\ &= E\left[M(x-\mu)(x-\mu)^{\top} N^{\top}\right] \\ &= M \cdot E\left[(x-\mu)(x-\mu)^{\top}\right] \cdot N^{\top} \\ &= M \Sigma N^{\top} \end{aligned}$
Then, we prove that $x_{b\cdot a}$ and $x_a$ are independent:
\begin{aligned}
&\Sigma=\left(\begin{array}{ll}
\Sigma_{a a} & \Sigma_{a b} \\
\Sigma_{b a} & \Sigma_{b b}
\end{array}\right)\\
&x_{b \cdot a}=x_{b}-\Sigma_{b a} \Sigma_{a a}^{-1} x_{a}
=\left(\begin{array}{ll}-\Sigma_{b a} \Sigma_{a a}^{-1} & I\end{array}\right)\left(\begin{array}{l}
x_{a} \\
x_{b}
\end{array}\right)=M x\\
&x_{a}=\left(\begin{array}{ll}
I & 0
\end{array}\right)\left(\begin{array}{l}
x_{a} \\
x_{b}
\end{array}\right)=N x\\
&M \Sigma N^{\top}=\left(\begin{array}{ll}-\Sigma_{b a} \Sigma_{a a}^{-1} & I\end{array}\right)\left(\begin{array}{ll}
\Sigma_{a a} & \Sigma_{a b} \\
\Sigma_{b a} & \Sigma_{b b}
\end{array}\right)\left(\begin{array}{c}
I \\
0
\end{array}\right)=0
\end{aligned}
so $x_{b\cdot a}$ and $x_a$ are independent.
Since $x_{b}=x_{b \cdot a}+\Sigma_{b a} \Sigma_{a a}^{-1} x_{a}$ and $x_{b\cdot a}\perp x_a$,
$$
\mathbb{E}[x_b|x_a]=\mu_{b\cdot a}+\Sigma_{ba}\Sigma_{aa}^{-1}x_a
$$
$\operatorname{Var}\left[x_{b} \mid x_{a}\right]=\Sigma_{b b \cdot a}$
Therefore,
$$x_{b} \mid x_{a} \sim N\left(\mu_{b \cdot a}+\Sigma_{b a} \Sigma_{a a}^{-1} x_{a},\, \Sigma_{b b \cdot a }\right)$$
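A minimal simulation sketch (scalar $x_a$, $x_b$ and assumed numbers) that checks the conditional mean and variance formulas against draws from the joint distribution:

```r
# Minimal sketch: check the conditional-Gaussian formulas by simulation (2-D case,
# x_a and x_b both scalar; all numbers below are assumed for illustration).
set.seed(4)
mu    <- c(1, 2)                                   # (mu_a, mu_b)
Sigma <- matrix(c(1.0, 0.6,
                  0.6, 2.0), 2, 2)                 # blocks: aa, ab / ba, bb

# closed-form conditional moments of x_b given x_a = 1.5
xa <- 1.5
cond_mean <- mu[2] + Sigma[2, 1] / Sigma[1, 1] * (xa - mu[1])
cond_var  <- Sigma[2, 2] - Sigma[2, 1] * Sigma[1, 2] / Sigma[1, 1]

# Monte Carlo check: sample the joint, keep draws with x_a near 1.5
library(MASS)                                      # for mvrnorm()
X    <- mvrnorm(2e5, mu, Sigma)
keep <- abs(X[, 1] - xa) < 0.02
c(cond_mean, mean(X[keep, 2]))                     # should be close
c(cond_var,  var(X[keep, 2]))                      # should be close
```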
## Joint probability
$\text { We have } p(x)=\mathcal{N}\left(\mu, \Lambda^{-1}\right), p(y \mid x)=\mathcal{N}\left(A x+b, L^{-1}\right) $
$\Lambda$ is the precision matrix, i.e. the inverse of the covariance matrix, so $\Lambda^{-1}$ is the covariance of $x$.
Assume $y=A x+b+\epsilon, \epsilon \sim \mathcal{N}\left(0, L^{-1}\right)$,
$$
\mathbb{E}[y]=\mathbb{E}[A x+b+\epsilon]=A \mu+b, \quad \operatorname{Var}[y]=A \Lambda^{-1} A^{T}+L^{-1}
$$
Therefore
$$
y\sim\mathcal{N}\left(A \mu+b, L^{-1}+A \Lambda^{-1} A^{T}\right)
$$
Let $z=\left(\begin{array}{l}x \\ y\end{array}\right)$, we can get
$$
\operatorname{Cov}[x, y]=\mathbb{E}\left[(x-\mathbb{E}[x])(y-\mathbb{E}[y])^{T}\right]
$$
$$
\operatorname{Cov}(x, y)=\mathbb{E}\left[(x-\mu)(A x-A \mu+\epsilon)^{T}\right]=\mathbb{E}\left[(x-\mu)(x-\mu)^{T} A^{T}\right]=\operatorname{Var}[x] A^{T}=\Lambda^{-1} A^T
$$
Substituting what we have calculated into the conditional-Gaussian formula, we obtain
$$
\begin{gathered}
\mathbb{E}[x \mid y]=\mu+\Lambda^{-1} A^{T}\left(L^{-1}+A \Lambda^{-1} A^{T}\right)^{-1}(y-A \mu-b) \\
\operatorname{Var}[x \mid y]=\Lambda^{-1}-\Lambda^{-1} A^{T}\left(L^{-1}+A \Lambda^{-1} A^{T}\right)^{-1} A \Lambda^{-1}
\end{gathered}
$$
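A one-dimensional sketch (all numbers assumed) that checks these posterior formulas against the equivalent precision-form update of a Gaussian posterior:

```r
# Minimal 1-D sketch (all numbers assumed): prior x ~ N(mu, 1/Lambda),
# likelihood y | x ~ N(A*x + b, 1/L); compare the formulas above with the
# standard precision-form Gaussian posterior update.
mu <- 0; Lambda <- 1        # prior mean and precision
A  <- 2; b <- 1; L <- 4     # observation model and noise precision
y  <- 3                     # one observed value

Sx <- 1 / Lambda                                    # prior variance of x
post_mean <- mu + Sx * A / (1 / L + A * Sx * A) * (y - A * mu - b)
post_var  <- Sx - Sx * A / (1 / L + A * Sx * A) * A * Sx

prec  <- Lambda + A * L * A                         # posterior precision
mean2 <- (Lambda * mu + A * L * (y - b)) / prec     # posterior mean, precision form
c(post_mean, mean2)                                 # should agree
c(post_var, 1 / prec)                               # should agree
```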
# Linear Regression
## Least Squares Method
$$
f(x)=w^Tx+b
$$
### **Matrix form**
Assume the data set is
$$D= \{(x_1,y_1),(x_2,y_2),\cdots,(x_n,y_n)\}$$
Then we let
$$
X=(x_1,x_2,x_3,\cdots,x_n)^T, \quad Y=(y_1,y_2,\cdots,y_n)^T
$$
We assume that we can fit the data points into a function
$$f(w)=w^Tx$$
The loss function is
$$L(w)=\sum_{i=1}^n||(w^Tx_i-y_i)||^2$$
Expand the above equation to obtain
\begin{align}
L(w)&=\left(w^Tx_1-y_1,\;\cdots,\;w^Tx_n-y_n\right)\left(w^Tx_1-y_1,\;\cdots,\;w^Tx_n-y_n\right)^T\\
&=(w^TX^T-Y^T)(Xw-Y)\\
&=w^TX^TXw-2w^TX^TY+Y^TY\\
\end{align}
Minimizing over $w$:
$$\hat w=\underset{w}{\operatorname{argmin}}\,L(w)$$
$\frac{\partial}{\partial w}L(w)=0$
$$2X^TXw-2X^TY=0$$
$$w=(X^TX)^{-1}X^TY$$
We call $(X^TX)^{-1}X^T$ the pseudo-inverse.
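A minimal R sketch on simulated data (assumed coefficients and noise level), solving the normal equations directly and comparing with `lm()`:

```r
# Minimal sketch: least squares via the normal equations, compared with lm().
# Simulated data; an intercept column is included in X so w also absorbs b.
set.seed(5)
n <- 100
x <- runif(n)
y <- 1.5 + 2 * x + rnorm(n, sd = 0.3)

X <- cbind(1, x)                               # n x p design matrix
w_hat <- solve(t(X) %*% X, t(X) %*% y)         # w = (X^T X)^{-1} X^T Y
w_hat
coef(lm(y ~ x))                                # should match (intercept, slope)
```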
### In a geometric sense
One interpretation is to view the model as a straight line and minimize the sum of squared residuals (the vertical distances) from each data point to the line.
Assuming that our features span a $p$-dimensional subspace (the full-rank case) and the model can be written as a linear combination within that subspace, the least squares method makes the distance between $Y$ and the model as small as possible, so the residual must be perpendicular to the spanned subspace.
### Regularization
Overfitting occurs when each sample carries many features (high dimension) while the number of samples is small.
In matrix terms, $X_{n\times p}$, where $n$ is the number of samples and $x_i \in R^p$.
When $n$ is not much larger than $p$ (in particular when $n < p$), this results in overfitting; in the mathematical sense, $X^TX$ has no inverse.
To prevent overfitting, we can add data, reduce the dimension, or use regularization.
Regularization framework**(frequentist)**:
$$
\underset{w}{\operatorname{argmin}}[L(w)+ \lambda P(w)]
$$
L1 (Lasso): $P(w)=\|w\|_1$
L2 (Ridge regression): $P(w)=\|w\|^2=w^Tw$
For ridge regression,
$$
\hat{w}= \underset{w}{\operatorname{argmin}}[L(w)+\lambda w^Tw]\\=\underset{w}{\operatorname{argmin}}[w^T(X^TX+\lambda I)w-2w^TX^TY+Y^TY]
$$
$$
\rightarrow \frac{\partial}{\partial w}L(w)+2\lambda w=0\\\rightarrow 2X^TX\hat w-2X^TY+2\lambda\hat w=0\\\rightarrow\hat w=(X^TX+\lambda I)^{-1}X^TY
$$
Regularization with the 2-norm not only shrinks the parameters $w$ but also avoids the problem of $X^TX$ being non-invertible.
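A minimal ridge-regression sketch on simulated data (the value of $\lambda$ and the data are assumed), using the closed form derived above and comparing with the ordinary least squares solution:

```r
# Minimal sketch: ridge regression via the closed form (X^T X + lambda I)^{-1} X^T Y.
# lambda is an assumed tuning value; note it also makes X^T X + lambda I invertible.
set.seed(6)
n <- 50; p <- 5
X <- matrix(rnorm(n * p), n, p)
w_true <- c(2, -1, 0, 0, 1)
y <- X %*% w_true + rnorm(n, sd = 0.5)

lambda  <- 1
w_ols   <- solve(t(X) %*% X, t(X) %*% y)
w_ridge <- solve(t(X) %*% X + lambda * diag(p), t(X) %*% y)
cbind(w_ols, w_ridge)        # ridge coefficients are shrunk towards zero
```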
From the Bayesian point of view, assume Gaussian noise $y=w^Tx+\epsilon$ with $\epsilon\sim N(0,\sigma^2)$, and a Gaussian prior
$w\sim N(0,\sigma_0^2)$
$$
P(w|y)=\frac {P(y|w)P(w)}{P(y)}
$$
MAP:
$$
\hat w =\underset{w}{\operatorname{argmax}}P(w|y)\\=\underset{w}{\operatorname{argmax}}P(y|w)P(w)\\=\underset{w}{\operatorname{argmax}}\log[P(y|w)P(w)]
$$
$$
\hat w =\underset{w}{\operatorname{argmin}}\sum^N_{i=1}(y_i-w^Tx_i)^2+\frac {\sigma^2}{\sigma^2_0}\|w\|^2_2
$$
### Remark
For linear models that do not fit the data samples well, the following helps:
add a nonlinear transformation after the linear function, i.e. introduce a nonlinear activation function; linear classification models such as the perceptron are typical examples.
### Summary
LSE (least squares estimation) $\Leftrightarrow$ MLE (maximum likelihood estimation) with Gaussian noise
Regularized LSE $\Leftrightarrow$ MAP (maximum a posteriori) with Gaussian noise and a Gaussian prior
# Linear Classification
The linear regression model cannot complete the classification task, but we can add a layer of nonlinear activation function after the function of the linear model, and the inverse function of the activation function is called link function.
We have two linear classification modes: hard classification and soft classification.
## Hard Classification
### Perceptron
Main idea: error driven
Model: $f(x)=\operatorname{sign}(W^Tx),\ x\in R^P,\ W\in R^P$
$$\operatorname{sign}(a)=\begin{cases}+1 & \text{if } a \geq 0,\\-1 & \text{if } a < 0.\end{cases}$$
Define $D$: the set of samples that have been misclassified.
Define the loss function:
$$L(w)=\sum_{i=1}^nI\left\{ W^Tx_iy_i\lt0\right\}$$
Since the above function is discontinuous and non-differentiable, we convert the loss function to the following function
$$L(w)=\underset{x_i\in D}\sum-W^Tx_iy_i$$
The gradient with respect to $w$ (for a single misclassified sample) is:
$$\nabla_wL=-y_ix_i$$
Use *SGD* (stochastic gradient descent):
$w^{t+1}\leftarrow w^t-\lambda\nabla_wL$
By a first-order (Taylor) argument, updating $w$ in the negative gradient direction decreases the loss. Using a single observation per update also introduces some randomness, which reduces the chance of getting stuck in a local minimum.
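A minimal perceptron sketch on simulated, linearly separable 2-D data (the data, labels, and learning rate are assumptions made for illustration):

```r
# Minimal sketch of the perceptron update on linearly separable 2-D data;
# labels are +1 / -1, and the first column of X is a bias term.
set.seed(7)
n <- 100
X <- cbind(1, matrix(rnorm(2 * n), n, 2))
y <- ifelse(X[, 2] + X[, 3] > 0, 1, -1)            # separable labels

w <- rep(0, 3); lambda <- 0.1                      # assumed learning rate
for (epoch in 1:100) {
  for (i in sample(n)) {                           # SGD: one sample at a time
    if (y[i] * sum(w * X[i, ]) <= 0) {             # misclassified point
      w <- w + lambda * y[i] * X[i, ]              # w <- w - lambda * grad, grad = -y_i x_i
    }
  }
}
mean(sign(X %*% w) == y)                           # training accuracy, should reach 1
```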
### Fisher Discriminant Analysis(LDA Linear Discriminant Analysis)
LDA requires the projection of the samples to satisfy the following two conditions:
1. Samples within the same class are close to each other after projection.
2. The distance between different classes is large after projection.
Projection:
Let's say the original data is x, so the projection along w is a scalar
$$
Z_i=W^Tx_i
$$
$$
\overline{Z}=\frac{1}N\sum_{i=1}^NZ_i=\frac{1}N\sum_{i=1}^NW^Tx_i
$$
Denote the numbers of samples in the two classes by $N_1$ and $N_2$. The scatter of the projections is used to represent the spread within a class, and $S$ denotes the covariance of the projected data:
$$
S_Z=\frac{1}N\sum_{i=1}^N(Z_i-\overline Z)(Z_i-\overline Z)^T\\=\frac{1}N\sum_{i=1}^N(W^Tx_i-\overline Z)(W^Tx_i-\overline Z)^T
$$
$$
\overline Z_1=\frac{1}{N_1}\sum_{i=1}^{N_1}W^Tx_i
$$
$$
\overline Z_2=\frac{1}{N_2}\sum_{i=1}^{N_2}W^Tx_i
$$
$$
S_1=\frac{1}{N_1}\sum_{i=1}^{N_1}(W^Tx_i-\overline Z_1)(W^Tx_i-\overline Z_1)^T
$$
$$
S_2=\frac{1}{N_2}\sum_{i=1}^{N_2}(W^Tx_i-\overline Z_2)(W^Tx_i-\overline Z_2)^T
$$
+ Between-class distance: $(\overline Z_1 -\overline Z_2)^2$
+ Within-class scatter: $S_1+S_2$
Using the above two quantities, we take their ratio as the objective function and maximize it:
$$
J(w)=\frac{(\overline Z_1 -\overline Z_2)^2}{S_1+S_2}
$$
$$
\hat{w}= \underset{w}{\operatorname{argmax}}J(w)\\=\underset{w}{\operatorname{argmax}}\frac{(\overline Z_1 -\overline Z_2)^2}{S_1+S_2}\\=\underset{w}{\operatorname{argmax}}\frac{W^TS_bW}{W^TS_wW}
$$
$S_b$ stands for the between-class scatter matrix
$S_w$ stands for the within-class scatter matrix
Now take the partial derivative of this objective; since we only need the direction of the projection, solving one equation for the direction is enough:
$$
\frac{\partial}{\partial w}J(w)=2S_bW(W^TS_wW)^{-1}-2W^TS_bW(W^TS_wW)^{-2}S_wW=0\Rightarrow
$$
$$
\Rightarrow S_bW(W^TS_wW)=(W^TS_bW)S_wW
$$
$$
\Rightarrow w\propto S_w^{-1}S_bw=S_w^{-1}(\overline x_{c1}-\overline x_{c2})(\overline x_{c1}-\overline x_{c2})^Tw\propto S_w^{-1}(\overline x_{c1}-\overline x_{c2})
$$
Then $S_w^{-1}(\overline x_{c1}-\overline x_{c2})$ is the desired projection direction.
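A minimal R sketch computing the Fisher direction $S_w^{-1}(\overline x_{c1}-\overline x_{c2})$ on simulated two-class data (all numbers assumed):

```r
# Minimal sketch: Fisher discriminant direction w proportional to S_w^{-1}(xbar1 - xbar2)
# on simulated two-class data.
set.seed(8)
n1 <- 60; n2 <- 60
X1 <- matrix(rnorm(2 * n1), n1, 2) + matrix(c(2, 0), n1, 2, byrow = TRUE)
X2 <- matrix(rnorm(2 * n2), n2, 2) + matrix(c(-2, 0), n2, 2, byrow = TRUE)

S1 <- cov(X1) * (n1 - 1) / n1                  # within-class scatter (MLE form)
S2 <- cov(X2) * (n2 - 1) / n2
Sw <- S1 + S2
w  <- solve(Sw, colMeans(X1) - colMeans(X2))   # projection direction
w / sqrt(sum(w^2))                             # normalized direction
```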
## Soft Classification
### Probabilistic discriminative model - Logistic Regression
In order to output values in the interval $[0,1]$, we use logistic regression.
Data:$\left\{ x_i,y_i\right\}_{i=1}^N ,x_i \in R^P,y_i\in\left\{0,1\right\}$
$\sigma(z)=\frac{1}{1+e^{-z}}$
The formula above is called the logistic sigmoid function. Its argument can be interpreted as the log-odds, i.e. the logarithm of the ratio of the two class probabilities; in the discriminative model we do not model this quantity explicitly but simply assume it is a linear function of $x$.
$z\rightarrow \infty$: $\sigma(z)\rightarrow 1$
$z=0$: $\sigma(z)=0.5$
$z\rightarrow -\infty$: $\sigma(z)\rightarrow 0$
$\sigma:\mathbb{R}\rightarrow(0,1)$
The logistic regression model therefore assumes:
$p_1=p(y=1|x)=\sigma(w^Tx), \quad p_0=1-p_1$
Thus the optimal model under this hypothesis is obtained by finding the optimal value of $w$. Probabilistic discriminative models usually use MLE to determine the parameters.
$$
p(y|x)=p_1^yp_0^{1-y}
$$
The MLE over $N$ independent and identically distributed observations is:
$$
\hat{w}= \underset{w}{\operatorname{argmax}}J(w)=\underset{w}{\operatorname{argmax}}\sum_{i=1}^N(y_i\log p_1+(1-y_i)\log p_0)
$$
Taking the derivative of this function, we notice that:
$$
p_1'=\left(\frac{1}{1+e^{-z}}\right)'=p_1(1-p_1)
$$
$$
J'(w)=\sum_{i=1}^N\left[y_i(1-p_1)x_i-p_1x_i+y_ix_ip_1\right]=\sum_{i=1}^N(y_i-p_1)x_i
$$
Because the probabilities depend nonlinearly on $w$, setting this sum to zero has no closed-form solution.
As with the perceptron, the stochastic gradient descent method is used to find the optimum.
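A minimal sketch fitting logistic regression by batch gradient ascent on simulated data (step size, iteration count, and data are assumed; a swap of batch updates for the SGD mentioned above), compared with `glm()`:

```r
# Minimal sketch: logistic regression fitted by batch gradient ascent on the
# log-likelihood, using grad = sum_i (y_i - p_1) x_i derived above.
set.seed(9)
n <- 500
X <- cbind(1, rnorm(n))
w_true <- c(-1, 2)
p <- 1 / (1 + exp(-X %*% w_true))
y <- rbinom(n, 1, p)

w <- c(0, 0); step <- 0.1                      # assumed step size
for (iter in 1:5000) {
  p1   <- 1 / (1 + exp(-X %*% w))
  grad <- t(X) %*% (y - p1)                    # gradient of the log-likelihood
  w    <- w + step * grad / n                  # ascent step on the averaged gradient
}
w
coef(glm(y ~ X[, 2], family = binomial))       # should be close
```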
### Probabilistic generative model - Gaussian Discriminant Analysis (GDA)
For binary classification, we make the following assumptions (here $\xi$ denotes the shared covariance matrix):
$y\sim Bernoulli(\phi)$
$x|y=1 \sim N(\mu_1,\xi)$
$x|y=0 \sim N(\mu_2,\xi)$
Log-likelihood: $L(\theta)=\log\prod_{i=1}^Np(x_i,y_i)$
$$
L(\theta)=\log\prod_{i=1}^Np(x_i,y_i)\\=\sum _{i=1}^N[\log N(\mu_1,\xi)^{y_i}+\log N(\mu_2,\xi)^{1-y_i}+\log\phi^{y_i}(1-\phi)^{1-y_i}]\\=\sum_{i=1}^Ny_i\log N(\mu_1,\xi)+\sum_{i=1}^N(1-y_i)\log N(\mu_2,\xi)+\sum_{i=1}^N\log\phi^{y_i}(1-\phi)^{1-y_i}
$$
$$
\theta=(\mu_1,\mu_2,\xi,\phi)
$$
we denote $\sum_{i=1}^Ny_i\log N(\mu_1,\xi) \rightarrow 1^*$ and
$$
\sum_{i=1}^N(1-y_i)\log N(\mu_2,\xi)\rightarrow 2^*
$$
We want to find $\xi$:
$$
\xi=\underset{\xi}{\operatorname{argmax}}(1^*+2^*)
$$
$$
1^*+2^*=\sum_{i=1}^Ny_i\log N(\mu_1,\xi)+\sum_{i=1}^N(1-y_i)\log N(\mu_2,\xi)
$$
To evaluate this, we first expand the generic term
$$
\sum_{i=1}^N\log N(\mu,\xi)\\=\sum_{i=1}^N\log\frac{1}{(2\pi)^{\frac{p}{2}}|\xi|^{\frac{1}{2}}}\exp\left[-\frac{1}{2}(x_i-\mu)^T\xi^{-1}(x_i-\mu)\right]
$$
To simplify, we write $\sum_{i=1}^N\log\frac{1}{(2\pi)^{\frac{p}{2}}}$ as $C$:
$$
=C-\frac{N}{2}\log|\xi|-\frac{1}{2}\sum_{i=1}^N(x_i-\mu)^T\xi^{-1}(x_i-\mu)
$$
Define $S=\frac{1}{N}\sum_{i=1}^N(x_i-\mu)(x_i-\mu)^T$; then
$$
=C-\frac{N}{2}\log|\xi|-\frac{N}{2}\operatorname{tr}(S\xi^{-1})
$$
We have the following properties
$$
\frac{\partial}{\partial A}(|A|)=|A|A^{-1}\\\frac{\partial}{\partial A}Trace(AB)=B^T
$$
$$
\frac{\partial}{\partial \xi}\left[\sum_{i=1}^Ny_i\log N(\mu_1,\xi)+\sum_{i=1}^N(1-y_i)\log N(\mu_2,\xi)\right]\\=
\frac{\partial}{\partial \xi}\left\{-\frac{1}{2}\left[N\log|\xi|+N_1\operatorname{tr}(S_1\xi^{-1})+N_2\operatorname{tr}(S_2\xi^{-1})\right]+C\right\}=0
$$
$$
N\xi^{-1}-N_1S_1^T\xi^{-2}-N_2S_2^T\xi^{-2}=0
$$
$$
\rightarrow \xi=\frac{N_1S_1+N_2S_2}{N}
$$
We use maximum likelihood to find all the parameters in the model's assumptions; from the fitted model we obtain the joint distribution and then the conditional distribution $p(y|x)$ used for inference, classifying by the largest posterior.
##### Remark:
In Gaussian discriminant analysis, a Gaussian assumption is made on the class-conditional distribution of the data, and a Bernoulli distribution is introduced as the class prior; the parameters of these assumptions are estimated from the data, and prediction then picks the class with the maximum posterior probability.
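A minimal sketch estimating the GDA parameters $(\phi, \mu_1, \mu_2, \xi)$ by the formulas above on simulated two-class data (all numbers assumed):

```r
# Minimal sketch: MLE of the GDA parameters (class prior, class means, and the
# shared covariance) on simulated two-class data.
set.seed(10)
n <- 400
y <- rbinom(n, 1, 0.4)
X <- matrix(rnorm(2 * n), n, 2) + outer(y, c(2, 1))   # class-dependent mean shift

phi <- mean(y)
mu1 <- colMeans(X[y == 1, , drop = FALSE])
mu0 <- colMeans(X[y == 0, , drop = FALSE])
S1  <- crossprod(sweep(X[y == 1, ], 2, mu1)) / sum(y == 1)   # per-class scatter (MLE)
S0  <- crossprod(sweep(X[y == 0, ], 2, mu0)) / sum(y == 0)
Sigma <- (sum(y == 1) * S1 + sum(y == 0) * S0) / n            # shared covariance, as derived
list(phi = phi, mu1 = mu1, mu0 = mu0, Sigma = Sigma)
```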
### Naive Bayes Classification
Naive Bayes makes assumptions about the relationships between the attributes of the data; one of the simplest is the conditional independence assumption that defines the naive Bayes model.
$$
p(x|y)=\prod p(x_i|y)
$$
Or
$$
x_i\perp x_j \mid y \quad (i\neq j)
$$
**The purpose of the hypothesis: to simplify operations**
$$
\hat{y}= \underset{y}{\operatorname{argmax}}\,p(y|x)\\=\underset{y\in\left\{0,1\right\}}{\operatorname{argmax}}\frac{p(x,y)}{p(x)}\\=\underset{y}{\operatorname{argmax}}\,p(y)p(x|y)
$$
We make the following further assumption
- For binary labels (0/1): $y\sim$ Bernoulli
- For multi-class classification: $y\sim$ Categorical
- If $x_i$ is continuous: $p(x_i|y)=N(\mu_i,\sigma_i^2)$
- If $x_i$ is discrete: $x_i\sim$ Categorical
**Remark:**
The conditional independence assumption greatly reduces the amount of data required to estimate the model.
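A minimal Gaussian naive Bayes sketch for continuous features (simulated data; each $p(x_i|y)$ is modelled as an independent univariate Gaussian, as assumed above):

```r
# Minimal sketch: Gaussian naive Bayes for continuous features -- each p(x_j | y)
# is modelled as an independent univariate Gaussian.
set.seed(11)
n <- 300
y <- rbinom(n, 1, 0.5)
X <- matrix(rnorm(2 * n), n, 2) + outer(y, c(1.5, -1))

fit_class <- function(Xc) list(mu = colMeans(Xc), sd = apply(Xc, 2, sd))
pars  <- list(`0` = fit_class(X[y == 0, ]), `1` = fit_class(X[y == 1, ]))
prior <- c(`0` = mean(y == 0), `1` = mean(y == 1))

log_post <- function(xnew, k) {                 # log p(y=k) + sum_j log p(x_j | y=k)
  p <- pars[[k]]
  log(prior[[k]]) + sum(dnorm(xnew, p$mu, p$sd, log = TRUE))
}
xnew <- c(1, -0.5)
c(class0 = log_post(xnew, "0"), class1 = log_post(xnew, "1"))  # predict the larger one
```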