2PPDandZTNB.Rmd

---
title: "The finite two parameter Poisson Dirichlet is just the Negative Binomial"
author: "Timothy Daley"
date: "June 25, 2015"
output: html_document
---

The two parameter Poisson Dirichlet with a finite population (F2PD) is a stochastic sampling process defined by the following rules: given existing species counts $x_{1} (N), \ldots, x_{D} (N)$ after sampling $N$ individuals, the probability the $(N + 1)$th individual will belong to the $i$th class for $i = 1, \ldots, D$ is 
$$
\Pr (X_{N + 1} = i) = p_{i}(N) = \frac{x_{j}(N) + \kappa}{N + \kappa S}
$$
The probability the $(N + 1)$ individual belongs to a new class is 
$$
q (N) = \frac{\kappa (S - D)}{N + \kappa S}.
$$
Here, $S$ is the size of the underlying population and $\kappa$ is positive parameter.  See [Hansen and Pitman, *Statistics and Probability Letters*, 2000](http://digitalassets.lib.berkeley.edu/sdtr/ucb/text/520.pdf) for details.  

A remark in the above paper really stood out to me: 
> "In the case above, the distribution of the exchangeable sequence $(X_{n})$ is identical to that generated by sampling from $F := \sum_{i = 1}^{S} p_{i} 1( \hat{X}_{i} \in \cdot)$, where $(p_{1}, \ldots, p_{S})$ has a symmetric Dirichlet distribution with parameter $\kappa$."  
Note that if $Y_{1}, \ldots, Y_{S}$ are iid Gamma$(\kappa)$ distributed, then the normalized vector $(Y_{1}/ \sum_{i = 1}^{S} Y_{i}, \ldots , Y_{S} / \sum_{i = 1}^{S} Y_{i})$ is Dirichlet distributed.  Therefore the species counts generated by the rules above will approximately follow a Negative Binomial distribution and the observed counts will approximately follow a zero-truncated Negative Binomial distribution (ZTNB).

Let's test this out.  I wrote the following Python script to sample according to the F2PD.
```{r engine='sh'}
tail -n 74 FinitePoissonDirichletSampler.py
```

I sample 100 samples of 10,000 individuals from a F2PD with parameter $\kappa = 1.0$ and population size $S = $10,000.  This should result in a ZTNB with mean $\mu = 1$ and size $k = 1$.   
```{r engine='sh', eval=FALSE}
for i in $(seq 1 100); do python FinitePoissonDirichletSampler.py -k 1.0 -m 10000 -n 10000 -o F2PDsample_${i}.txt; done
```
I use the preseqR function preseqR.ztnb.em to fit the negative binomial distribution to the observed counts.
```{r results='hide', message=FALSE, cache=TRUE}
install.packages("preseqR", repos = "http://cran.us.r-project.org")
library(preseqR)
```
```{r cache=TRUE}
F2PDsample_mu = vector(length = 100)
F2PDsample_size = vector(length = 100)
F2PDsample_pop_size = vector(length = 100)
for(i in 1:100){
  F2PDsample = read.table(file = paste("F2PDsample_", i, ".txt", sep=""))
  F2PDsample_hist = hist(F2PDsample[,1], breaks = 0:max(F2PDsample[,1]), plot=FALSE)
  F2PDsample_hist = cbind(F2PDsample_hist$breaks[2:length(F2PDsample_hist$breaks)], F2PDsample_hist$counts)
  F2PDsample_ztnb_fit = preseqR.ztnb.em(F2PDsample_hist)
  F2PDsample_mu[i] = F2PDsample_ztnb_fit$mu
  F2PDsample_size[i] = F2PDsample_ztnb_fit$size
  F2PDsample_pop_size[i] = length(F2PDsample[,1])/(1 - dnbinom(0, size = F2PDsample_ztnb_fit$size, mu = F2PDsample_ztnb_fit$mu))
}
```

The parameters are distributed as follows.
```{r}
boxplot(F2PDsample_mu)
boxplot(F2PDsample_size)
```
The population size is distributed as
```{r}
boxplot(F2PDsample_pop_size)
```

Other simulations have had similar results. So it turns out the Finite two parameter Poisson Dirichlet process is useless to simulate from.  The infinite two parameter Poisson Dirichlet process is used quite a bit and I always wondered why the finite version wasn't.  Now I know.