getting mostly NaN values for affinity calculations #3

chrisclarkson · 2019-01-10T10:36:47Z

Hi,
I am interested in using your package to calculate the affinity of predicted binding sites and, subsequently, the significance of the affinity values.

My pipeline (using your instructions from https://rdrr.io/github/matthuska/tRap/man/) is as follows:

pfm
   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
A  2  3  5  1  0 13  1  3 18  0  6  1  0  1  1  6  1  2  6  2
C  8  3  2 17 19  0 12 10  1  0  0  1  0  1 18  0 12  7  6  5
G  2 10 12  1  0  5  7  1  1 20 14 14 20 18  0 14  5  3  7  7
T  8  4  2  1  0  2  0  6  0  0  0  5  0  1  1  0  1  8  1  6

pwm <- toPWM(pfm)
pwm=PWMatrix(ID="Unknown", name=tf, matrixClass="Unknown", strand="+", bg=c(A=0.25, C=0.25, G=0.25, T=0.25), tags=list(), profileMatrix=pwm, pseudocounts=numeric())
peaks = searchSeq(pwm, seq, min.score = "80%",mc.cores=10L)
peaks_bed = as(peaks, "GRanges")

head(as.data.frame(peaks_bed$siteSeqs)$x)
[1] "AGCCCACTAGGGTGCAGTCC" "ATACCAGAAGAAGGCATCAG" "ACACCAGAAGAGGGCGTCAG"
[4] "ATGCCACGAGGTGGAGATAA" "GACTCACTAGAGGGCACAGG" "TCTACAGCAGGTGGCAACAC"

af=affinity(normalize.pwm(pwm@profileMatrix), as.data.frame(peaks_bed$siteSeqs)$x)

However, this results in many NaN values:

sum(af=='NaN')/length(af)
[1] 0.4785195
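
The same fraction can be computed a little more directly with is.nan, which avoids the implicit coercion to character in af == 'NaN'. A minimal sketch, using a made-up vector (af_example) in place of the real affinity values:

```r
# Hypothetical affinity values standing in for `af`; not real tRap output.
af_example <- c(1.2, NaN, 3.4, NaN)

# is.nan() returns TRUE for NaN entries; the mean of a logical vector
# is the fraction of TRUE values.
nan_fraction <- mean(is.nan(af_example))
print(nan_fraction)
```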

I was advised that tRap is meant for long sequences rather than short ones, so I extended the sequences:

start(peaks_bed) <- start(peaks_bed) - 30
end(peaks_bed) <- end(peaks_bed) + 30
extended_seqs <- getSeq(Mmusculus, peaks_bed)
af_ext=affinity(normalize.pwm(pwm@profileMatrix),as.data.frame(extended_seqs)$x)

However this results in 100% NaNs....

I'm wondering if I am doing something wrong?

matthiasheinig · 2019-01-10T10:54:32Z

Hi Chris,
do you have zeros in your normalized PWM? We recommend adding small pseudo counts to your frequency matrix.
Best, Matthias

Helmholtz Zentrum Muenchen Deutsches Forschungszentrum fuer Gesundheit und Umwelt (GmbH) Ingolstaedter Landstr. 1 85764 Neuherberg www.helmholtz-muenchen.de Aufsichtsratsvorsitzende: MinDirig'in Petra Steiner-Hoffmann Stellv.Aufsichtsratsvorsitzender: MinDirig. Dr. Manfred Wolter Geschaeftsfuehrer: Prof. Dr. med. Dr. h.c. Matthias Tschoep, Heinrich Bassler Registergericht: Amtsgericht Muenchen HRB 6466 USt-IdNr: DE 129521671

chrisclarkson · 2019-01-10T11:50:33Z

Hi Matthias,
Thank you for your quick reply:
There are not many zeros in the normalised matrix:

normalize.pwm(pwm@profileMatrix)
           1          2           3          4          5          6
A  1.0626965  0.9503723  0.04448988  0.4578354  0.3879632 -0.2923233
C -0.5626965  0.9503723  0.85949647 -0.3735062 -0.1638895  1.0223918
G  1.0626965 -1.3188119 -0.76348283  0.4578354  0.3879632  0.0000000
T -0.5626965  0.4180672  0.85949647  0.4578354  0.3879632  0.2699315
            7          8          9         10          11          12
A  0.41349136  0.4404797 -0.2536981  0.3870730 -0.03296475  0.71519479
C -0.24047408 -0.6112445  0.2969491  0.3870730  0.61061995  0.71519479
G -0.09176563  1.3303426  0.2969491 -0.1612191 -0.18827516 -0.45258182
T  0.91874835 -0.1595778  0.6597998  0.3870730  0.61061995  0.02219224
          13         14         15          16          17         18
A  0.3870730  0.4538871  0.2969491 -0.03296475  0.75263253  1.5229892
C  0.3870730  0.4538871 -0.2536981  0.61061995 -0.47909620 -0.5761614
G -0.1612191 -0.3616612  0.6597998 -0.18827516 -0.02616885  0.8595932
T  0.3870730  0.4538871  0.2969491  0.61061995  0.75263253 -0.8064209
          19         20
A -0.2228909  2.3968502
C -0.2228909  0.0000000
G -0.4123795 -0.9067515
T  1.8581614 -0.4900988

When I try the following, I still get NaNs:

affinity(normalize.pwm(pwm@profileMatrix)+0.0000000000000001, as.data.frame(tmp)$x)
[1] NaN
Warning messages:
1: In log(maxCG/p[3]) : NaNs produced
2: In log(maxAT/p) : NaNs produced

matthiasheinig · 2019-01-10T14:41:47Z

The PWM you put into the affinity function should be a probability matrix with values in the range [0, 1].
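
The normalized matrix shown above contains negative entries, and the logarithm of a negative ratio is NaN, which is consistent with the warnings from log(). Below is a minimal base R sketch of what a probability matrix looks like, assuming it means column-wise relative frequencies computed after adding a pseudocount. The helper to_probability_matrix is hypothetical and not part of tRap, and the example uses only the first three positions of the PFM from the top of this thread.

```r
# Hypothetical helper (not part of tRap): convert a count matrix (PFM) into
# a column-wise probability matrix after adding a pseudocount.
to_probability_matrix <- function(pfm, pseudocount = 0.25) {
  counts <- pfm + pseudocount
  # Divide each column by its column sum so every column sums to 1.
  sweep(counts, 2, colSums(counts), "/")
}

# First three positions of the PFM quoted at the top of this thread.
# matrix() fills column by column, in the order A, C, G, T.
pfm_example <- matrix(
  c(2, 8, 2, 8,
    3, 3, 10, 4,
    5, 2, 12, 2),
  nrow = 4,
  dimnames = list(c("A", "C", "G", "T"), NULL)
)

prob <- to_probability_matrix(pfm_example)
print(range(prob))     # all entries lie strictly between 0 and 1
print(colSums(prob))   # each column sums to 1

# The logarithm of a negative number is NaN (with a warning), which is
# why a matrix containing negative scores cannot be used in its place.
print(suppressWarnings(log(-0.5)))
```

Adding a tiny constant such as 1e-16 to a matrix that already has negative entries does not change their sign, so the NaNs remain.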

chrisclarkson · 2019-01-10T22:26:16Z

Hi, sorry for the delay.
Hmm, very strange. I just got the PFM from the JASPAR 2018 database and then ran the TFBSTools::toPWM command, which results in a matrix like the one shown above. Can you recommend a package or command that performs the conversion (PFM to PWM) in the way your package needs?

matthiasheinig · 2019-01-11T07:45:01Z

Hi Chris,
I would recommend using normalize.pwm from the tRap package directly on your PFM (and adding a pseudo count of 0.25):

pwm.for.trap <- normalize.pwm(pfm + 0.25)
affinity(pwm.for.trap)

Best, Matthias

chrisclarkson · 2019-01-11T10:55:53Z

Hi again Matthias,
Thank you for this fantastic help.
It works now.
Just two last questions:
1. I would also like to apply this analysis to more than one transcription factor. For example, if I download a list of PFMs from JASPAR:

ARNT
  [,1] [,2] [,3] [,4] [,5] [,6]
A    4   19    0    0    0    0
C   16    0   20    0    0    0
G    0    1    0   20    0   20
T    0    0    0    0   20    0

AHR
  [,1] [,2] [,3] [,4] [,5] [,6]
A    3    0    0    0    0    0
C    8    0   23    0    0    0
G    2   23    0   23    0   24
T   11    1    1    1   24    0

Ddit3::Cebpa
  [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
A   14   11   18    0    0    4   38   36    0    14     4     0
C    7    7    3    1    0   33    1    2    6    17    23    26
G   12   14   15    0   38    0    0    1    0     5     9     6
T    6    7    3   38    1    2    0    0   33     3     3     7
...

Can I take it as a suitable strategy to add 0.25 to all of these PFMs and then implement your analysis on them?
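
If pseudocounts are added by hand, the same operation can be applied to every matrix in a list with lapply. A minimal sketch in base R, using the ARNT and AHR counts quoted above; the list name pfm_list is hypothetical, and whether a manual pseudocount is needed at all depends on how the affinity function is called.

```r
# Two of the JASPAR count matrices quoted above, stored in a named list.
# matrix() fills column by column, in the order A, C, G, T.
nucleotides <- c("A", "C", "G", "T")
pfm_list <- list(
  ARNT = matrix(c(4, 16, 0, 0,
                  19, 0, 1, 0,
                  0, 20, 0, 0,
                  0, 0, 20, 0,
                  0, 0, 0, 20,
                  0, 0, 20, 0),
                nrow = 4, dimnames = list(nucleotides, NULL)),
  AHR = matrix(c(3, 8, 2, 11,
                 0, 0, 23, 1,
                 0, 23, 0, 1,
                 0, 0, 23, 1,
                 0, 0, 0, 24,
                 0, 0, 24, 0),
               nrow = 4, dimnames = list(nucleotides, NULL))
)

# Add the same pseudocount to every matrix; lapply keeps the list names.
pfm_pseudo <- lapply(pfm_list, function(m) m + 0.25)

# After the pseudocount, the smallest cell in each matrix is no longer zero.
print(sapply(pfm_pseudo, min))
```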

2. As for the command local.paffinity, I tried it on the calculated affinity values:

seqs=as.data.frame(extended_seqs)$x
head(seqs)
  A DNAStringSet instance of length 6
    width seq
[1]    79 TACGTAAGTACACTGTAGCTGTCTTCAGACACAC...TCAGATCTCATTATGGGTAGTTGTGAGCTACCA
[2]    79 TTTTACTTTCTCTCTCCCTCTTATTGCTAGATGC...ATAAACAGCTTGCTTCTGCCATGTTCTGCAGAA
[3]    79 GACATCTGAGTACCTTCCCTGTAAGAGAGCTTGC...CTGAGCACTGAAACTCAGAGGAGAGAATCTGTC

head(af_ext)
[1] 10.586463 12.458601 10.153033  7.571788  9.838501 10.966423

af_ext=affinity(normalize.pwm(pwm.for.trap@profileMatrix),seqs)

for (i in seq_along(af_ext)) {
  print(local.paffinity(af_ext[i], pwm.for.trap, seqs[i]))
}

[1] 0.01612903
[1] 0.01612903
[1] 0.01612903
[1] 0.01612903
[1] 0.01612903
........

The p-value is 0.01612903 in every case.
Can I take these values as correct, or am I not applying this function correctly?

matthiasheinig · 2019-01-11T11:13:28Z

Hi Chris,
re 1: I looked at the code again - should have done this first! Actually, you can also pass the unnormalized PWM (counts >= 0) to the affinity function. The pseudo count is an argument of the affinity function, so there is no need to call normalize.pwm and add pseudo counts.
re 2: the local.paffinity function takes these arguments:
- the actual affinity computed from the actual sequences
- a long background sequence from which to compute the background affinities
- the window size, which should be set to the size of the sequence used to compute the actual affinity
Best, Matthias
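
The description of local.paffinity suggests an empirical p-value: the observed affinity is compared with affinities computed from windows of a long background sequence. The sketch below illustrates that general idea in base R. It is not tRap's implementation; the helper empirical_p and the +1 correction are assumptions made for illustration. Under this estimator, 61 background values that all lie below the observed one give 1/62, which is about 0.01612903, the value reported above. That would be consistent with very few background windows being available, for example if a short test sequence is passed in place of a long background sequence.

```r
# Hypothetical helper (not tRap's implementation): empirical p-value of an
# observed score against a vector of background scores, with a +1
# correction so the estimate is never exactly zero.
empirical_p <- function(observed, background) {
  (sum(background >= observed) + 1) / (length(background) + 1)
}

# Made-up background scores for illustration only.
few_windows  <- rep(1, 61)     # only 61 background windows
many_windows <- rep(1, 9999)   # a much longer background

# With n background windows the smallest attainable p-value is 1 / (n + 1),
# so a short background caps the resolution of the p-value.
p_few  <- empirical_p(10, few_windows)
p_many <- empirical_p(10, many_windows)
print(p_few)
print(p_many)
```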

chrisclarkson · 2019-01-11T11:51:09Z

Do you mean that I should use the unnormalised PFMs, and not the PWMs, since those have values < 0 (as shown above)?
