Added option to NuQCJob to annotate filtered fastq. #155
Added option to NuQCJob that annotates filtered fastq files w/the optional descriptions that may be present in an original fastq file. These optional descriptions are ordinarily filtered out during filtering by minimap2.
Requesting Lucas's review as well. |
Pull Request Test Coverage Report for Build 11076356137
💛 - Coveralls |
This assumes same order between the two inputs and also for fwd/rev; however, I'm not sure this is accurate. The 2 classic discrepancies come from (a) multithreading finishing in different order, and/or (b) one of the mates not passing QC for example only writing the forward. @charles-cowart and @lucaspatel, do you know if these scenarios can happen here?
@antgonza When the original file is read in, lines like '@FS10001773:68:BTR67708-1611:1:1116:5200:3280/1 BX:Z:AAACATGGTCCCGGAATG' are broken up into key/value pairs and stored in a dictionary where '@FS10001773:68:BTR67708-1611:1:1116:5200:3280/1' is the key and the 'BX:Z:AAACATGGTCCCGGAATG' is the value. When the filtered file is later read line by line, whenever the line '@FS10001773:68:BTR67708-1611:1:1116:5200:3280/1' is found, it will be replaced by itself plus a space plus the value of '@FS10001773:68:BTR67708-1611:1:1116:5200:3280/1' in the dictionary. It won't matter what order the sequence appears in the file. Since the sequence-identifiers are also unique, this particular key won't be encountered more than once. If a key is never encountered in the filtered file, then we can safely assume it was filtered out. In general it should work so long as the caller is responsible for making sure the filtered version of a fastq file is matched to the correct original fastq file. In this case, we're using the original muxed file to generate the dictionary for the filtered but still mixed file so R1 and R2 shouldn't be an issue. |
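For illustration, the lookup described above could be sketched roughly as follows. This is not the code from the PR: the function names `build_description_map` and `annotate_filtered` are hypothetical, and the sketch assumes strict four-line FASTQ records so that headers are found by position rather than by a leading '@' (quality lines may also begin with '@').

```python
import io


def build_description_map(original_fastq):
    """Map each sequence identifier to its optional description.

    Assumes strict four-line FASTQ records (hypothetical helper).
    """
    mapping = {}
    for i, line in enumerate(original_fastq):
        # Only the first line of each four-line record is a header.
        if i % 4 != 0:
            continue
        parts = line.rstrip("\n").split(" ", 1)
        if len(parts) == 2:
            # parts[0] is the identifier, parts[1] the description.
            mapping[parts[0]] = parts[1]
    return mapping


def annotate_filtered(filtered_fastq, mapping):
    """Yield filtered lines, re-attaching descriptions to known headers."""
    for i, line in enumerate(filtered_fastq):
        line = line.rstrip("\n")
        if i % 4 == 0 and line in mapping:
            line = f"{line} {mapping[line]}"
        yield line


# Toy data: the first record's quality line deliberately starts with '@'.
original = io.StringIO(
    "@r1/1 BX:Z:AAAC\nACGT\n+\n@III\n"
    "@r2/1 BX:Z:GGGT\nTTTT\n+\nIIII\n"
)
filtered = io.StringIO("@r2/1\nTTTT\n+\nIIII\n")

mapping = build_description_map(original)
annotated = list(annotate_filtered(filtered, mapping))
```

Because the lookup is keyed on the identifier, the order of records in the filtered file does not matter, and identifiers that were filtered out are simply never looked up.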
Clearly functional and well tested, but I still wonder if simply sanitizing the read IDs at the start, perhaps by concatenating the read ID and metadata then splitting them up again at the end would be more efficient/simpler.
Nevertheless, looks good to me.
That’s a great idea!
On Sep 24, 2024, at 17:51, Charles Cowart ***@***.***> wrote:
@charles-cowart commented on this pull request.
In sequence_processing_pipeline/Commands.py:
+ # It is straightforward to process even large fastq files line by line
+ # sequence identifier lines are unique and easily identified as they are
+ # the only lines in a FASTQ file beginning with '@'. It is thus straight-
+ # forward to process even large files to build a mapping between
+ # sequence identifiers and optional metadata fields.
+ with open(original_path, 'r') as f:
+ for line in f:
+ if line.startswith('@'):
+ line = line.strip().split()
+ if len(line) != 2:
+ raise ValueError(f"'{original_path}' does not appear to "
+ "contain sequence identifiers with "
+ "optional metadata")
+ # where sequence identifier = line[0] and
+ # the optional description = line[1].
+ mapping[line[0]] = line[1]
Also, there is Lucas's proposed solution to essentially replace all whitespace with '#' or another character and see if minimap2 accepts it as a name and then we remove it afterward. I'm not certain minimap2 is using a fixed length to extract the string so I'm trying it out now.
|
Likely sensitive to reorientation. Suggest ensuring pairing information is retained. '#' may be a valid reorientation separator too.
On Sep 24, 2024, at 18:13, Charles Cowart ***@***.***> wrote:
Also, there is Lucas's proposed solution to essentially replace all whitespace with '#' or another character and see if minimap2 accepts it as a name and then we remove it afterward. I'm not certain minimap2 is using a fixed length to extract the string so I'm trying it out now.
Sadly this will not work. I took my list of ten sequences and ran them through as is, and four additional times, each time replacing the space in between the sequence id and the description with '_', '-', '#', and '###'. The output for each of the hacked sequences was blank.
|
@wasade Sorry could you clarify what you mean by, "Likely sensitive to reorientation. Suggest ensuring pairing information is retained." Happy to give this a try. |
Paired end aware aligners are sensitive to the use of
What I recommend is improving the control of the ID encoding. Perhaps something like:
SPP_COMMENT_TOKEN = '_SPPCOMMENTTOKEN_'
SPP_ID_TOKEN = '_SPPIDTOKEN_'
# NOTE: the parser used may automatically parse the comment
# NOTE: this _assumes_ the ID is actually parsed so the "@" has been removed
original_id, original_comment = id_line.split(' ', 1)
# NOTE: i don't think other characters need to be replaced but I may be wrong
clean_comment = original_comment.replace(' ', SPP_COMMENT_TOKEN)
new_id = f"{clean_comment}{SPP_ID_TOKEN}{original_id}" |
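A minimal round-trip sketch of this encoding idea might look like the following. The token names come from the snippet above; the helpers `encode_id` and `decode_id` are hypothetical, and, as in the snippet, the ID is assumed to have already had its leading "@" removed by the parser.

```python
SPP_COMMENT_TOKEN = '_SPPCOMMENTTOKEN_'
SPP_ID_TOKEN = '_SPPIDTOKEN_'


def encode_id(id_line):
    """Fold the comment into the ID so no whitespace remains (hypothetical)."""
    if ' ' not in id_line:
        # No comment present; nothing to encode.
        return id_line
    original_id, original_comment = id_line.split(' ', 1)
    clean_comment = original_comment.replace(' ', SPP_COMMENT_TOKEN)
    # The original ID is placed last, as in the snippet above.
    return f"{clean_comment}{SPP_ID_TOKEN}{original_id}"


def decode_id(new_id):
    """Reverse encode_id, restoring 'id comment' (hypothetical)."""
    if SPP_ID_TOKEN not in new_id:
        return new_id
    clean_comment, original_id = new_id.split(SPP_ID_TOKEN, 1)
    original_comment = clean_comment.replace(SPP_COMMENT_TOKEN, ' ')
    return f"{original_id} {original_comment}"


encoded = encode_id("FS1:68/1 BX:Z:AAAC extra")
decoded = decode_id(encoded)
```

This assumes neither token ever occurs in a real ID or comment; whether the aligner accepts the resulting names would still need to be tested.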
Got it, thanks! I'll give it a try. |
Hi @wasade and @charles-cowart, I did some digging in the minimap2 documentation, and I found this flag which may solve the problem directly: Charlie, could you give this a shot? According to this GitHub issue, inclusion of the flag should punt the FASTQ comment to a SAM flag so that the read ID is compatible with the SAM format while retaining the extra information. |
nice! |
Thanks Lucas! Here's what I've confirmed so far:
|
Are the SAM tags "RG, BC or QT" tags? If so, they may be recoverable with the |
Ty, yes I looked over the SAM-formatted results and tried -t and -T from https://samtools.github.io/hts-specs/SAMv1.pdf. Just got some acceptable output using 'BX', which is the first part of the descriptions themselves (BX:Z:CAGACACGTAGGTGGGAC). It's interesting because it's not position-based, even though it has what looks like a full SAM header. This looks awesome thanks Lucas! The only potential downside I can see is that we need to know the code (in this case 'BX') in advance. Is that something you guys would know or would the SPP have to discover it? |
Just to clarify, the command I'm using specifically is: |
Confirmed minimap2 gracefully handles the case where '-y' is passed to it but there are no additional descriptions - output is as desired. |
Ready for review! This version will rely on the user to supply a list of one or more tags such as 'BX' in order to preserve that metadata through host-filtering. The SPP plugin itself can keep a list of these around in configuration or the user can supply the tags they believe are present. We can alternatively attempt to scour the first few sequence-identifier lines for any present tags, but I'd like a few more samples before putting something together. |
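As a rough illustration of the "scour the first few sequence-identifier lines" alternative, one possible sketch is shown below. It is not part of the PR: `discover_tags` and its `limit` parameter are hypothetical, and it assumes descriptions are written as SAM-style `TAG:TYPE:VALUE` fields such as `BX:Z:CAGACACGTAGGTGGGAC`.

```python
import re
from itertools import islice

# SAM optional fields look like TAG:TYPE:VALUE, where TAG is two
# characters and TYPE is one of A, i, f, Z, H, B.
SAM_TAG = re.compile(r'^([A-Za-z][A-Za-z0-9]):[AifZHB]:')


def discover_tags(header_lines, limit=100):
    """Collect tag names seen in the first `limit` header lines."""
    found = []
    for line in islice(header_lines, limit):
        # Skip the identifier itself; inspect only the description fields.
        for field in line.strip().split()[1:]:
            match = SAM_TAG.match(field)
            if match and match.group(1) not in found:
                found.append(match.group(1))
    return found


headers = ["@r1/1 BX:Z:AAAC", "@r2/1 BX:Z:GGGT RG:Z:x", "@r3/1"]
tags = discover_tags(headers)
```

Descriptions that do not follow the `TAG:TYPE:VALUE` shape are ignored here, which is one reason more real samples would be useful before relying on discovery.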
@charles-cowart, should we just preserve all tags? I might be missing something obvious about when we do not need them. |
Just to clarify, I did run several tests w/samtools to preserve all tags from minimap2's output. Samtools documentation says the following:
However, I found that "-T", "-T *", and "-T '*'" all resulted in samtools displaying error messages and generating no output. The only way I've found so far to preserve the descriptions is to specify them 'by name' as '-T BX'. We can specify '-T BX,AB,CD,EF' and it will ignore non-existent tags. Note the version we currently use on barnacle is samtools 1.12. Hence, my current thinking is that we can develop a list of tags as we encounter and/or need them, specify them in SPP's configuration file, and pass them to NuQCJob. |
Hi @wasade , @AmandaBirmingham, if it's no trouble, would you both mind giving it a final look over? TYVM! |
# add tags for known metadata types that fastq files may have
# been annotated with. Samtools will safely ignore tags that
# are not present.
tags = " -T %s" % ','.join(self.additional_fastq_tags)
tags = " -T %s" % ','.join(self.additional_fastq_tags) | |
tags = "-T %s" % ','.join(self.additional_fastq_tags) |
The odd spacing here is intentional. It prevents an extra space from being present when there are no tags. This makes the expected results for tests uniform.
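The effect of the leading space can be sketched as below. This is a simplified stand-in rather than the NuQCJob code: the function `build_samtools_fastq` is hypothetical and only the samtools portion of the pipeline is shown, reusing the `-@`, `-f 12 -F 256`, and `-T` options that appear in this PR.

```python
def build_samtools_fastq(cores_to_allocate, output, additional_fastq_tags=None):
    """Build the samtools half of the command (hypothetical helper)."""
    tags = ""
    if additional_fastq_tags:
        # The leading space belongs to the fragment itself, so an empty
        # tag list leaves no stray double space in the final command.
        tags = " -T %s" % ','.join(additional_fastq_tags)
    return f"samtools fastq -@ {cores_to_allocate} -f 12 -F 256{tags} > {output}"


without_tags = build_samtools_fastq(4, "out.fastq")
with_tags = build_samtools_fastq(4, "out.fastq", ["BX", "AB"])
```

With the alternative spacing (`"-T ..."` plus `-F 256 {tags} >`), the no-tags case would produce two consecutive spaces before `>`, which is harmless to the shell but changes the exact strings that tests compare against.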
f"{mmi_db_path} {input} -a | samtools fastq -@ " | ||
f"{cores_to_allocate} -f 12 -F 256 > {output}") | ||
f"{cores_to_allocate} -f 12 -F 256{tags} > " |
f"{cores_to_allocate} -f 12 -F 256{tags} > " | |
f"{cores_to_allocate} -f 12 -F 256 {tags} > " |
Thank you everyone for all your input! I'm going to merge this and make it available to my other PRs. |