Skip to content

Latest commit

 

History

History
59 lines (51 loc) · 3.2 KB

README.md

File metadata and controls

59 lines (51 loc) · 3.2 KB

##Show how the Rsamtools pileup function could be improve to include details about indels

library(Rsamtools)
bam_file <- "test.bam"
sbp <- ScanBamParam(which = GRanges("chr17", IRanges(7578425, 7578425)))
p_param <- PileupParam(max_depth = 30000, min_base_quality = 0, include_insertions = TRUE)
res <- pileup(bam_file, scanBamParam = sbp, pileupParam = p_param)
res
##    seqnames     pos strand nucleotide count           which_label
## 1     chr17 7578425      +          +     7 chr17:7578425-7578425
## 2     chr17 7578425      -          +    14 chr17:7578425-7578425
## 3     chr17 7578425      +          -    61 chr17:7578425-7578425
## 4     chr17 7578425      -          -     2 chr17:7578425-7578425
## 5     chr17 7578425      +          A    11 chr17:7578425-7578425
## 6     chr17 7578425      -          A     5 chr17:7578425-7578425
## 7     chr17 7578425      +          C    21 chr17:7578425-7578425
## 8     chr17 7578425      -          C    23 chr17:7578425-7578425
## 9     chr17 7578425      +          G     7 chr17:7578425-7578425
## 10    chr17 7578425      -          G     5 chr17:7578425-7578425
## 11    chr17 7578425      +          T  4866 chr17:7578425-7578425
## 12    chr17 7578425      -          T  6273 chr17:7578425-7578425

Note that first four lines don't give details about the actual insertions and deletions. Samtools pilepup actually gives this information and it would be good to have them (optionally) here too. For example here:

  • The 7 insertions on the + strand consist of 6 insertions of a G and 1 insertion of TCA.
  • The 14 insertions on the - strand consist of 11 deletions of a G, 2 of a A and one of a C.
  • The 61 deletions on the + strand consist of 41 insertions of TGT, 18 of a T and 2 of TG.
  • The 2 deletions on the - strand consist of 2 deletions of a T.

So one possible way would be to output a data frame like this containing the full data:

##    seqnames     pos strand nucleotide count           which_label
## 1     chr17 7578425      +         +G     6 chr17:7578425-7578425
## 2     chr17 7578425      +       +TCA     1 chr17:7578425-7578425
## 3     chr17 7578425      -         +G    11 chr17:7578425-7578425
## 4     chr17 7578425      -         +A     2 chr17:7578425-7578425
## 5     chr17 7578425      -         +C     1 chr17:7578425-7578425
## 6     chr17 7578425      +       -TGT    41 chr17:7578425-7578425
## 7     chr17 7578425      +         -T    18 chr17:7578425-7578425
## 8     chr17 7578425      +        -TG     2 chr17:7578425-7578425
## 9     chr17 7578425      -         -T     2 chr17:7578425-7578425
## 51    chr17 7578425      +          A    11 chr17:7578425-7578425
## 61    chr17 7578425      -          A     5 chr17:7578425-7578425
## 71    chr17 7578425      +          C    21 chr17:7578425-7578425
## 81    chr17 7578425      -          C    23 chr17:7578425-7578425
## 91    chr17 7578425      +          G     7 chr17:7578425-7578425
## 10    chr17 7578425      -          G     5 chr17:7578425-7578425
## 11    chr17 7578425      +          T  4866 chr17:7578425-7578425
## 12    chr17 7578425      -          T  6273 chr17:7578425-7578425

Note that the last 8 lines are just the same as above, only the indels lines now show more details.