-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathseparating_soft_clips_start_and_end.Rmd
233 lines (189 loc) · 7.41 KB
/
separating_soft_clips_start_and_end.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
---
title: "separating soft clips at the start and at the end"
output: html_document
date: "2023-04-24"
editor_options:
chunk_output_type: console
---
This script is about separating 3' and 5' end with and without soft clips. It starts by taking in the output file of parsing_cigar script as the input.
loading libraries
```{r warning=FALSE, message=FALSE}
suppressPackageStartupMessages({
library(tidyverse)
library(GenomicFeatures)
library(rtracklayer)
library(dplyr)
library(Rsamtools)
library(glmnet)
library(ggpubr)
library(visreg)
library(DESeq2)
library(apeglm)
})
```
Finding soft clips only at the 3' end
using subset_final (that has come from parsing_cigar script)
counting number of S in the cigar- it will be either 1 or 2. If 1 then soft clip is present either at the start or at the end.
if 2 then soft clip is present at both the ends.
Making a new column s_count for the count of S in CIGAR.
```{r warning=FALSE, message=FALSE}
subset_final$s_count=str_count(subset_final$Cigar, "S")
```
Finding the soft clips at the start and at the end
```{r warning=FALSE, message=FALSE}
soft_clip_both <- subset_final %>%
filter(s_count=="2")
```
applying matcher function- no change in the function, its the same as was used for parsing cigar
```{r warning=FALSE, message=FALSE}
matcher <- function(pattern, x) {
ind = gregexpr(pattern, x)[[1]]
start = as.numeric(ind)
end = start + attr(ind, "match.length") -2
apply(cbind(start,end), 1, function(y) substr(x, start=y[1], stop=y[2]));
}
```
Doone_start function- this function is finding S at the start (5'end)
```{r warning=FALSE, message=FALSE}
doone_start <- function(c, Cigar) {
pat <- paste("\\d+", c , sep="")
as.numeric(matcher(pat, Cigar))[1]
}
```
Cigarsums is finding S
```{r warning=FALSE, message=FALSE}
cigarsums <- function(Cigar, chars=c("S")) {
sapply (chars, doone_start, Cigar)
}
```
Soft clip at the start of the CIGAR string
```{r warning=FALSE, message=FALSE}
addcig_start <- soft_clip_both %>%
rowwise %>% mutate(SoftClipStart = cigarsums(Cigar))
```
This function finding S at the end of the CIGAR string
```{r warning=FALSE, message=FALSE}
doone_end <- function(c, Cigar) {
pat <- paste("\\d+", c , sep="")
as.numeric(matcher(pat, Cigar))[2]
}
```
Cigarsums finding S
```{r warning=FALSE, message=FALSE}
cigarsums <- function(Cigar, chars=c("S")) {
sapply (chars, doone_end, Cigar)
}
```
Adding column that has soft clip end at the end (3' end)
```{r warning=FALSE, message=FALSE}
addcig_end <- soft_clip_both %>%
rowwise %>% mutate(SoftClipEnd = cigarsums(Cigar))
```
Merging both the start and end soft clip dataframe
```{r warning=FALSE, message=FALSE}
c <- merge(x=addcig_start, y=addcig_end, all=T)
```
De-selecting the s_count and sequence column as they are not required further.
```{r warning=FALSE, message=FALSE}
merge_soft_clip_both <- c %>%
dplyr::select(-s_count, -Sequence)
```
Soft clip that are either at the start or at the end (s_count=1), we don't know whether its at the start or at the end
```{r warning=FALSE, message=FALSE}
soft_clip_one <- subset_final %>%
filter(s_count == "1")
```
To check if soft clip is at the start we are making another column of called logic which gives output as True or False.
To filter out the soft clips at either end we have to convert the logic column as an integer. (True = 1 and False = 0)
```{r warning=FALSE, message=FALSE}
soft_clip_one$logic <- grepl("S$", soft_clip_one$Cigar)
soft_clip_one$logic <- as.integer(as.logical(soft_clip_one$logic))
```
Filtering out the soft clips that are present only at the end.
```{r warning=FALSE, message=FALSE}
soft_clip_one_end <- soft_clip_one %>%
filter(logic=="1")
```
calculating the length of the soft clip at the end using original doone function
```{r warning=FALSE, message=FALSE}
doone <- function(c, Cigar) {
pat <- paste("\\d+", c , sep="")
sum(as.numeric(matcher(pat, Cigar)), na.rm = T)
}
```
Finding S in the CIGAR string
```{r warning=FALSE, message=FALSE}
cigarsums <- function(Cigar, chars=c("S")) {
sapply (chars, doone, Cigar)
}
```
Making a separate column for Soft clip at the end and deselecting the s_count, logic and sequence.
```{r warning=FALSE, message=FALSE}
addcig_one_end <- soft_clip_one_end %>%
rowwise %>% mutate(SoftClipEnd = cigarsums(Cigar)) %>%
dplyr::select(-s_count, -logic, -Sequence)
```
Filtering out soft clips present at the start
```{r warning=FALSE, message=FALSE}
soft_clip_one_start <- soft_clip_one %>%
filter(logic=="0")
```
calculating the length of the soft clip at the start by applying the original doone function
```{r warning=FALSE, message=FALSE}
doone <- function(c, Cigar) {
pat <- paste("\\d+", c , sep="")
sum(as.numeric(matcher(pat, Cigar)), na.rm = T)
}
```
Cigarsums is fincing S in the Cigar string
```{r warning=FALSE, message=FALSE}
cigarsums <- function(Cigar, chars=c("S")) {
sapply (chars, doone, Cigar)
}
```
Making a separate column for the length of soft clip at the start and deselecting the sequence, s_count and logic columns
```{r warning=FALSE, message=FALSE}
addcig_one_start <- soft_clip_one_start %>%
rowwise %>% mutate(SoftClipStart = cigarsums(Cigar)) %>%
dplyr::select(-Sequence, -s_count, -logic)
```
After separating the Soft clips of both the ends, there are few reads that have no soft clips
Reads that do not have soft clip
```{r warning=FALSE, message=FALSE}
soft_clip_zero <- subset_final %>%
filter(s_count=="0")
zero <- soft_clip_zero %>%
dplyr::select(-s_count, -Sequence)
```
Merging the heavily degraded dataset with all the columns of calculations combined.
```{r warning=FALSE, message=FALSE}
heavily_degraded_1 <- merge(x=addcig_one_end, y=addcig_one_start, all=T)
b <- merge(x=heavily_degraded_1, y=zero, all=T)
heavily_degraded_rep_2 <- merge(merge_soft_clip_both, y=b, all=T)
```
Merging undegraded dataset with all the columns of calculations combined.
```{r warning=FALSE, message=FALSE}
undegraded_1 <- merge(x=addcig_one_end, y=addcig_one_start, all=T)
b <- merge(x=undegraded_1, y=zero, all=T)
undegraded_rep_2 <- merge(x=merge_soft_clip_both, y=b, all=T)
```
Merging mildly degraded dataset with all the columns of the calculations combined.
```{r warning= FALSE, message=FALSE}
mildly_degraded_1 <- merge(x=addcig_one_end, y=addcig_one_start, all=T)
b <- merge(x=mildly_degraded_1, y=zero, all=T)
mildly_degraded_rep_2 <- merge(x=merge_soft_clip_both, y=b, all=T)
```
Summation of 3' with no soft clip with soft clip end to get 3' end with soft clip column for all three conditions.
```{r warning=FALSE, message=FALSE}
heavily_degraded_rep_2$end_with_sc <- heavily_degraded_rep_2$end_without_sc + heavily_degraded_rep_2$SoftClipEnd
undegraded_rep_2$end_with_sc <- undegraded_rep_2$end_without_sc + undegraded_rep_2$SoftClipEnd
mildly_degraded_rep_2$end_with_sc <- mildly_degraded_rep_2$end_without_sc + mildly_degraded_rep_2$SoftClipEnd
```
StartCoord column has 5' end with soft clip.
Removing soft clip from the start coord to get start without sc for all three conditions.
```{r warning=FALSE, message=FALSE}
heavily_degraded_rep_2$start_without_sc <- heavily_degraded_rep_2$StartCoord - heavily_degraded_rep_2$SoftClipStart
undegraded_rep_2$start_without_sc <- undegraded_rep_2$StartCoord - undegraded_rep_2$SoftClipStart
mildly_degraded_rep_2$start_without_sc <- mildly_degraded_rep_2$StartCoord - mildly_degraded_rep_2$SoftClipStart
```
The output of the script is all the columns added to the data with and without soft clips at both the ends.