Keep cigartuples #108
Open: heidi-holappa wants to merge 46 commits into master from keep_cigartuples
Introduction
This pull request adds a new feature to IsoQuant for predicting and correcting errors in long reads.
Constants
The constants are collected at the start of the function correct_transcript_splice_sites, where code execution begins. This makes it easy to move them outside the function or reconfigure them later, if needed. One possible use case is to let the user select which strategy to use, or to override the constants with arguments.
Note
The constant "Threshold cases at location" is only used when "More conservative strategy" is True. See the section "Error prediction strategies" for additional information.
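As an illustration of how the constants could be moved outside the function, here is a minimal sketch that groups them into a configuration object. The class name, the field names and every default value are placeholders chosen for this example; they are not taken from the pull request.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SpliceSiteCorrectionConfig:
    """Hypothetical container for the constants; all defaults are placeholders."""
    window_size: int = 8
    min_n_of_aligned_reads: int = 5
    accepted_del_cases: Tuple[int, ...] = (3, 4, 5, 6)
    more_conservative_strategy: bool = False
    # Only consulted when more_conservative_strategy is True.
    threshold_cases_at_location: float = 0.7


# A caller (for example an argument parser) could override single values:
config = SpliceSiteCorrectionConfig(more_conservative_strategy=True)
```

Freezing the dataclass keeps the values constant during a run while still allowing a different configuration per invocation.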
Extracting cases and computing deletions
The assigned_reads list contains ReadAssignment objects. From each read, the start and end locations and the cigartuples are extracted, and deletions are counted for each splice site between the start and end locations. First, the locations that fall within the start and end of the read are extracted from the exons list. Note that a read may start and end in the middle of an exon.
For each matching location, the location is first added to the splice_cite_cases dictionary if it is missing. The deletions are then computed from the cigartuples.
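Selecting the exon boundaries that lie within a read can be sketched as follows. The function name, the `(start, end)` tuple representation of exons and the inclusive comparison are assumptions made for this example, not the actual IsoQuant code.

```python
def splice_sites_within_read(exons, read_start, read_end):
    """Return (location, location_type) pairs for exon boundaries inside the read.

    exons is assumed to be a list of (start, end) tuples in reference coordinates.
    """
    sites = []
    for exon_start, exon_end in exons:
        if read_start <= exon_start <= read_end:
            sites.append((exon_start, "start"))
        if read_start <= exon_end <= read_end:
            sites.append((exon_end, "end"))
    return sites


# The read starts and ends in the middle of an exon, so the outermost
# exon boundaries are not included.
sites = splice_sites_within_read([(100, 200), (300, 400), (500, 600)], 150, 550)
```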
Note
The key 'del_pos_distr' is only needed for the more conservative strategy.
The data structure of information to be extracted is the following:
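A hypothetical sketch of such an entry and of how it could be updated per read is given below. Only the key 'del_pos_distr' is named in this description; the other keys, the helper names and the use of CIGAR code 2 for deletions (as in pysam) are assumptions for illustration.

```python
CIGAR_DEL = 2  # deletion code in pysam-style cigartuples


def new_case(location_type, window_size):
    """Create an empty entry for one splice site location (keys are illustrative)."""
    return {
        "location_type": location_type,        # "start" or "end" of an exon
        "deletions": {},                       # deletion count -> number of reads
        "del_pos_distr": [0] * window_size,    # deletions per window position
        "most_common_del": None,               # filled in during correction
    }


def record_read(cases, location, location_type, window_ops):
    """Add the location if missing, then count the deletions seen in one read."""
    case = cases.setdefault(location, new_case(location_type, len(window_ops)))
    n_del = sum(1 for op in window_ops if op == CIGAR_DEL)
    case["deletions"][n_del] = case["deletions"].get(n_del, 0) + 1
    for i, op in enumerate(window_ops):
        if op == CIGAR_DEL:
            case["del_pos_distr"][i] += 1
    return case


cases = {}
record_read(cases, 1000, "start", [0, 2, 2, 0])
record_read(cases, 1000, "start", [0, 2, 2, 2])
```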
The computation of deletions from cigartuples happens in two steps. First the aligned location is extracted from the cigartuples:
After this, the cigartuples are iterated again and, starting from the aligned location, window_size CIGAR codes are extracted.
Note
This part of the code could be optimized by performing these two operations at once.
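The two steps can be sketched roughly as below, assuming pysam-style cigartuples, that is a list of `(operation, length)` pairs in which operations 0, 2, 3, 7 and 8 consume the reference. The function names are hypothetical, and the sketch only covers a window that extends forward from the location; the actual code may also handle the opposite direction and treat some operations differently.

```python
REF_CONSUMING = {0, 2, 3, 7, 8}  # M, D, N, =, X
CIGAR_DEL = 2


def find_aligned_index(cigartuples, read_start, location):
    """Step 1: find the (tuple index, offset) covering reference position `location`."""
    ref_pos = read_start
    for idx, (op, length) in enumerate(cigartuples):
        if op in REF_CONSUMING:
            if ref_pos + length > location:
                return idx, location - ref_pos
            ref_pos += length
    return None  # the read does not cover the location


def window_ops(cigartuples, start_idx, start_offset, window_size):
    """Step 2: collect `window_size` reference-consuming CIGAR codes from the aligned position."""
    ops = []
    offset = start_offset
    for op, length in cigartuples[start_idx:]:
        if op in REF_CONSUMING:
            take = min(length - offset, window_size - len(ops))
            ops.extend([op] * take)
            if len(ops) == window_size:
                break
        offset = 0  # the offset only applies to the first tuple
    return ops


cigar = [(0, 5), (2, 2), (0, 3), (3, 100), (0, 10)]
idx, offset = find_aligned_index(cigar, read_start=100, location=104)
ops = window_ops(cigar, idx, offset, window_size=4)
n_deletions = ops.count(CIGAR_DEL)
```

Merging both loops into one pass, as the note suggests, would amount to starting to collect codes as soon as the running reference position reaches the location.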
Correcting errors
The main function iterates through all extracted cases. If the number of reads aligned to a given location exceeds MIN_N_OF_ALIGNED_READS, the location is checked for errors. If MORE_CONSERVATIVE_STRATEGY is selected, two additional checks are made.
The most common deletion is stored in the dictionary because it is used in error correction if an error is found. It is stored as a signed value that encodes both distance and direction: the value is positive at an exon start location and negative at an exon end location. For this reason, the absolute value is checked against ACCEPTED_DEL_CASES.
For locations with a suitable most common deletion case, the candidate bases for a canonical pair are verified. The strand and the type of the location (start or end of an exon) are taken into consideration.
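A minimal sketch of the sign convention and the absolute-value check follows. The function names, the `deletions` mapping (deletion count to number of reads) and the values in ACCEPTED_DEL_CASES are placeholders for this example.

```python
ACCEPTED_DEL_CASES = (3, 4, 5, 6)  # placeholder values


def most_common_deletion(deletions, location_type):
    """Return the most frequent deletion count, signed by location type."""
    if not deletions:
        return None
    size = max(deletions, key=deletions.get)
    # Positive at an exon start, negative at an exon end.
    return size if location_type == "start" else -size


def is_accepted(signed_deletion):
    """Compare the absolute value, since exon ends carry a negative sign."""
    return signed_deletion is not None and abs(signed_deletion) in ACCEPTED_DEL_CASES


at_end = most_common_deletion({0: 2, 4: 9, 5: 1}, "end")
at_start = most_common_deletion({0: 10, 4: 1}, "start")
```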
Warning
At this time it remains an open question whether the index correction is set correctly for IsoQuant. This needs to be verified.
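The canonical pair check could look roughly like the sketch below. It assumes a 0-based reference string, that `position` is the first exon base for a start and the last exon base for an end, and the usual GT/AG intron boundaries (CT/AC in reference orientation on the minus strand). The function name and the indexing are illustrative and, as the warning states, the correct offsets for IsoQuant still need to be verified.

```python
# Expected dinucleotide in reference orientation, keyed by (strand, location type).
CANONICAL = {
    ("+", "start"): "AG",  # acceptor, immediately before the exon start
    ("+", "end"): "GT",    # donor, immediately after the exon end
    ("-", "start"): "AC",  # reverse complement of GT
    ("-", "end"): "CT",    # reverse complement of AG
}


def has_canonical_pair(sequence, position, strand, location_type):
    """Check the two intronic bases adjacent to a candidate exon boundary."""
    if location_type == "start":
        pair = sequence[max(position - 2, 0):position]
    else:
        pair = sequence[position + 1:position + 3]
    return pair.upper() == CANONICAL[(strand, location_type)]


reference = "CCCAGGGGGGTAAA"  # toy sequence: exon GGGG at indices 5..8
```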
Finally a list of corrected exons is created:
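One way to build such a list is sketched here, under the assumption that corrections are stored as a mapping from the original location to the signed most common deletion and that the signed value is simply added to the boundary. The function name and this representation are hypothetical.

```python
def apply_corrections(exons, corrections):
    """Return a new exon list with corrected boundaries; untouched ones are copied."""
    corrected = []
    for start, end in exons:
        new_start = start + corrections.get(start, 0)
        new_end = end + corrections.get(end, 0)
        corrected.append((new_start, new_end))
    return corrected


# Exon end 200 has a signed correction of -4, exon start 300 a correction of +4.
corrected_exons = apply_corrections([(100, 200), (300, 400)], {200: -4, 300: 4})
```

Returning a new list rather than mutating the input keeps the original exons available for comparison.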
In the more conservative strategy, two additional validations are made. There have to be $n$ adjacent nucleotides whose values are larger than or equal to the values of the nucleotides in all other positions (see the explanation in the next section):
Additionally, there have to be $n$ nucleotides (not necessarily adjacent) for which a preset threshold is exceeded. Because of the first additional constraint, whenever this check returns True, all nucleotides in the sublist of largest values also exceed the threshold.
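The first validation can be sketched as a sliding-window check over the per-position deletion counts. The function name and the plain list input are assumptions for this example.

```python
def has_dominant_adjacent_block(distribution, n):
    """True if some n adjacent positions all have values >= every other position."""
    if n <= 0 or n > len(distribution):
        return False
    for i in range(len(distribution) - n + 1):
        block = distribution[i:i + n]
        rest = distribution[:i] + distribution[i + n:]
        if not rest or min(block) >= max(rest):
            return True
    return False


clustered = has_dominant_adjacent_block([0, 1, 7, 8, 6, 0, 1, 0], 3)
scattered = has_dominant_adjacent_block([5, 0, 5, 0, 5, 0], 3)
```

Note that a window of all-zero counts passes this check on its own, which is one reason a separate threshold condition is useful.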
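The second validation reduces to counting positions above the threshold. In this sketch the threshold is treated as an absolute count and "exceeded" as a strict comparison; both are assumptions, and the function name is hypothetical.

```python
def threshold_exceeded(distribution, n, threshold):
    """True if at least n positions (adjacent or not) have a value above threshold."""
    return sum(1 for value in distribution if value > threshold) >= n


enough = threshold_exceeded([0, 1, 7, 8, 6, 0, 1, 0], n=3, threshold=5)
too_few = threshold_exceeded([0, 1, 7, 8, 5, 0, 1, 0], n=3, threshold=5)
```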
Error prediction strategies
Two strategies for error prediction are available:
Conservative:
Very conservative:
Elaboration for condition 5:
Let $S$ be the list of elements in the window and let $A = \{k_1, \ldots, k_n\}$ be a set of $n$ adjacent indices of $S$. Let $B = \{h_1, \ldots, h_m\}$ be the set of the remaining (possibly non-adjacent) indices of $S$, so that $\forall h_i \in B\; h_i \notin A$, $\forall k_j \in A\; k_j \notin B$ and $|A| + |B| = |S|$.
Now for condition 3 to apply it holds that
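Reading the requirement off the description of the more conservative strategy (adjacent nucleotides with values larger than or equal to those in all other positions), the condition can presumably be written as follows, where $S[i]$ is notation assumed here for the value at index $i$:

$$\forall k_j \in A,\ \forall h_i \in B:\quad S[k_j] \geq S[h_i]$$

Equivalently, $\min_{k_j \in A} S[k_j] \geq \max_{h_i \in B} S[h_i]$.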
Note: because $S$ is a list, it may contain multiple elements with equal values.