Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Keep cigartuples #108

Open
wants to merge 46 commits into
base: master
Choose a base branch
from
Open

Keep cigartuples #108

wants to merge 46 commits into from

Conversation

heidi-holappa
Copy link
Collaborator

Introduction

This pull request adds a new feature for predicting and correcting errors in long reads to IsoQuant.

Constants

The constants are collected in the start of the function correct_transcript_splice_sites from which the code execution starts. This way they can be conveniently moved outside of the function or re-configured in the future, if needed. One possible use case would be to give the user the option to select a strategy to use, or alter the constants with arguments.

def correct_transcript_splice_sites(arguments):
  # ...
  
  ACCEPTED_DEL_CASES = [3, 4, 5, 6]
  SUPPORTED_STRANDS = ['+', '-']
  THRESHOLD_CASES_AT_LOCATION = 0.7
  MIN_N_OF_ALIGNED_READS = 5
  WINDOW_SIZE = 8
  MORE_CONSERVATIVE_STRATEGY = False
  
  # ...

Note

Constant "Threshold cases at location" is only used when "More conservative strategy" is True. See section "error prediction strategies" for additional information.

Extracting cases and computing deletions

The assigned_reads list contains ReadAssignment objects. From each read start and end locations and cigartuples are extracted and for each splice site between the start and end location deletions are counted. First the locations within start and end of read are extracted from the exons-list. It is important to note that a read my start and end in the middle of an exon.

for read_assignment in assigned_reads:
    read_start = read_assignment.corrected_exons[0][0]
    read_end = read_assignment.corrected_exons[-1][1]
    cigartuples = read_assignment.cigartuples
    if not cigartuples:
        continue
    count_deletions_for_splice_site_locations(arguments)

For each matching location the location is first added to the splice_cite_cases dictionary, if missing. After this the deletions are computed from the cigartuples.

Note

The key-value 'del_pos_distr' is only needed for the more conservative strategy.

def count_deletions_for_splice_site_locations(arguments):
    
    matching_locations = extract_splice_site_locations_within_aligned_read(read_start, read_end, exons)
    
    for location in matching_locations:
        if location not in splice_site_cases:
            splice_site_cases[location] = {
                'location_is_end': bool,  
                'deletions': {},
                'del_pos_distr': [0 for _ in range(WINDOW_SIZE)],
                'most_common_del': -1,
                'canonical_bases_found': False
            }
        
        aligned_location = extract_location_from_cigar_string(arguments)
        
        count_deletions_from_cigar_codes_in_given_window(arguments)

The data structure of information to be extracted is the following:

{
    'location': 
        {
            'location_is_end': bool,  
            'deletions': dict,
            'del_pos_distr': list,
            'most_common_del': int,
            'canonical_bases_found': bool
        }
}
  • location_is_end: A boolean indicating whether the location is the end of an exon
  • deletions: A dictionary with number of deletions as key and count of reads as value
  • del_pos_distr: A list of integers with length of the predefined window. Each index contains the count of deletions in the respective position in the window.
  • most_common_del: Integer presenting the most common case of deletion. If no distinct case is found, the value is $-1$. The value also indicates direction. If the location is the start of an exon, the value is positive. Otherwise the value is negative.
  • canonical_bases_found:}** A boolean stating whether there exists candidate bases for a canonical pair at the distance of the most_common_del from the current location.

The computation of deletions from cigartuples happens in two steps. First the aligned location is extracted from the cigartuple:

def extract_location_from_cigar_string(arguments):
    relative_position = splice_site_location - read_start
    alignment_position = 0
    ref_position = 0

    for cigar_code in cigartuples:

        if cigar_code[0] in [0, 2, 3, 7, 8]:
            ref_position += cigar_code[1]
        if ref_position <= relative_position and not \
                read_start + ref_position == read_end:
            alignment_position += cigar_code[1]
        else:
            return alignment_position + (cigar_code[1] - (ref_position - relative_position))

    return -1

After this, the cigartuple is iterated again and starting from the aligned location a length of the predefined window_size cigarcodes are extracted.

Note

This part of the code could be optimized by performing these two operations at once.

def count_deletions_from_cigar_codes_in_given_window(arguments):
    count_of_deletions = 0
    
    cigar_code_list = []
    location = 0

    if location_is_end:
        aligned_location = aligned_location - window_size + 1

    for cigar_code in cigartuples:
        if window_size == len(cigar_code_list):
            break
        if location + cigar_code[1] > aligned_location:
            overlap = location + \
                cigar_code[1] - (aligned_location + len(cigar_code_list))
            cigar_code_list.extend(
                [cigar_code[0] for _ in range(min(window_size -
                                                len(cigar_code_list), overlap))])
        location += cigar_code[1]

    for i in range(window_size):
        if i >= len(cigar_code_list):
            break
        if cigar_code_list[i] == 2:
            count_of_deletions += 1
            splice_site_data["del_pos_distr"][i] += 1
    
    if count_of_deletions not in splice_site_data["deletions"]:
        splice_site_data["deletions"][count_of_deletions] = 0
    
    splice_site_data["deletions"][count_of_deletions] += 1

Correcting errors

The main function iterates through all extracted cases. If the reads aligned to the given location exceed MIN_N_OF_ALIGNED_READS, the location is verified for errors. If MORE_CONSERVATIVE_STRATEGY is selected, two additional verifications are made.

def correct_splice_site_errors(arguments):
    locations_with_errors = []
    for case in splice_cite_locations:
        
        reads = sum of reads at current location
        if reads < MIN_N_OF_ALIGNED_READS:
            continue
        
        compute_most_common_del_and_verify_nucleotides(arguments)
        
        if MORE_CONSERVATIVE_STRATEGY:
            if not sublist_largest_values_exists(arguments):
                continue
            if not threshold_for_del_cases_exceeded(argument):
                continue

        if canonical pair is found:
            locations_with_errors.append(location of case)
    
    return locations_with_errors

The most common deletion is stored to the dictionary as it is used in error correction if an error is found. It is stored containing the distance and direction. In exon start location the value is positive and in exon end location the value is negative. For this reason an absolute value is checked against ACCEPTED_DEL_CASES.

def compute_most_common_del_and_verify_nucleotides(
        arguments):
    
    # Compute most common case of deletions
    splice_site_data["most_common_del"] = compute_most_common_case_of_deletions(
        arguments)
    
    # Extract nucleotides from most common deletion location if it is an accepted case
    if abs(splice_site_data["most_common_del"]) in ACCEPTED_DEL_CASES:
        extract_nucleotides_from_most_common_del_location(
            arguments)

For locations with a suitable most common deletion case a candidate bases for a canonical pair are verified. Strand and the location of the case (start or end of exon) is taken into consideration. \

Warning

At the time it remains an open question whether the index correction is correctly set for IsoQuant. This needs to be verified.

def extract_nucleotides_from_most_common_del_location(
        arguments):
    idx_correction = 0
    extraction_start = location + most_common_del + idx_correction
    extraction_end = location + most_common_del + 2 + idx_correction
    try:
        extracted_canonicals =  chr_record[extraction_start:extraction_end]
    except KeyError:
        extracted_canonicals = 'XX'
    
    canonical_pairs = {
        '+': {
            'start': ['AG', 'AC'],
            'end': ['GT', 'GC', 'AT']
        },
        '-': {
            'start': ['AC', 'GC', 'AC'],
            'end': ['CT', 'GT']
        }
    }
    
    if location is end:
        possible_canonicals = canonical_pairs[strand]['end']
    else:
        possible_canonicals = canonical_pairs[strand]['start']
    if extracted_canonicals in possible_canonicals:
        splice_site_data["canonical_bases_found"] = True

Finally a list of corrected exons is created:

def generate_updated_exon_list(arguments):
    updated_exons = []
    for exon in exons:
            updated_exon = exon
            if exon[0] in locations_with_errors:
                corrected_location = exon[0] + splice_site_cases[exon[0]]["most_common_del"]
                updated_exon = (corrected_location, exon[1])
            if exon[1] in locations_with_errors:
                corrected_location = exon[1] + splice_site_cases[exon[1]]["most_common_del"]
                updated_exon = (exon[0], corrected_location)
            updated_exons.append(updated_exon)
    return updated_exons

In more conservative strategy two additional validations are made. There has to be $n$ adjacent nucleotides that have larger or equal values to nucleotides in other positions (see explanation in next section):

def sublist_largest_values_exists(lst, n):
    largest_values = set(sorted(lst, reverse=True)[:n])
    count = 0

    for num in lst:
        if num in largest_values:
            count += 1
            if count >= n:
                return True
        else:
            count = 0

    return False

Additionally there has to be $n$ (not necessarily adjacent nucleotides) for which a preset threshold is exceeded. Note that because of the first additional constraint, we can be certain that in the event of return value being True, clearly all nucleotides in the sublist of largest values also exceed this constraint.

def threshold_for_del_cases_exceeded(arguments):
    total_cases = sum of deletions
    nucleotides_exceeding_treshold = 0
    for value in del_pos_distr:
        if value  > total_cases * THRESHOLD_CASES_AT_LOCATION:
            nucleotides_exceeding_treshold += 1
    return bool(nucleotides_exceeding_treshold >= abs(most_common_del))

Error prediction strategies

Two strategies for error prediction are available:

Conservative:
\begin{enumerate}

  1. There has to be a distinct most common case of deletions and it is one of the accepted deletion cases (constant ACCEPTED_DEL_CASES).
  2. There has to be a canonical pair at the distance of the most common case of deletions from the splice site (constant WINDOW_SIZE)
  3. The number of aligned reads at the given location must exceed a preset threshold (constant MIN_N_OF_ALIGNED_READS)

\textbfVery conservative:

  1. There has to be a distinct most common case of deletions and it is one of the accepted deletion cases (constant ACCEPTED_DEL_CASES).
  2. There has to be a canonical pair at the distance of the most common case of deletions from the splice site (constant WINDOW_SIZE)
  3. The number of aligned reads at the given location must exceed a preset threshold (constant MIN_N_OF_ALIGNED_READS)
  4. There has to be atleast $n$ indeces ($n$ is the distinct most common case of deletions), in which a threshold for deletions has to be exceeded (constant THRESHOLD_CASES_AT_LOCATION)
  5. There has to be $n$ adjacent nucleotides that have larger or equal values to nucleotides in other positions (see explanation below)

Elaboration for condition 5:

Let $S$ be the list of elements in window and $A = {k_1, \ldots, k_n }$ be $n$ adjacent indices that is a sublist of $S$. Let $B={h_1,\ldots,h_m}$ be the sublist of the remaining (possibly non-adjacent) indices in $S$, so that $\forall h_i\in B\;h_i\notin A$, $\forall k_j\in A\;k_j\notin B$ and $|A| + |B| = |S|$.

Now for condition 3 to apply it holds that

$$\forall S[k_j]\;\not\exists S[h_i]\;\text{s.t.}\;S[k_j] < S[h_i].$$

Note: as this is a list of elements, it may have multiple elements with equal value.

andrewprzh and others added 30 commits August 8, 2023 17:58
@andrewprzh andrewprzh self-requested a review August 31, 2023 15:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants