Added option to NuQCJob to annotate filtered fastq. #155

charles-cowart · 2024-09-24T01:14:22Z

Added option to NuQCJob that annotates filtered fastq files w/the optional descriptions that may be present in an original fastq file. These optional descriptions are ordinarily filtered out during filtering by minimap2.

charles-cowart · 2024-09-24T01:25:20Z

Requesting Lucas's review as well.

coveralls · 2024-09-24T01:25:20Z

Pull Request Test Coverage Report for Build 11076356137

Details

13 of 13 (100.0%) changed or added relevant lines in 3 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage increased (+0.06%) to 81.976%

Totals
Change from base Build 10862698601:	0.06%
Covered Lines:	2098
Relevant Lines:	2385

💛 - Coveralls

antgonza

This assumes same order between the two inputs and also for fwd/rev; however, I'm not sure this is accurate. The 2 classic discrepancies come from (a) multithreading finishing in different order, and/or (b) one of the mates not passing QC for example only writing the forward. @charles-cowart and @lucaspatel, do you know if these scenarios can happen here?

sequence_processing_pipeline/Commands.py

charles-cowart · 2024-09-24T16:51:03Z

@antgonza When the original file is read in, lines like '@FS10001773:68:BTR67708-1611:1:1116:5200:3280/1 BX:Z:AAACATGGTCCCGGAATG' are broken up into key/value pairs and stored in a dictionary where '@FS10001773:68:BTR67708-1611:1:1116:5200:3280/1' is the key and the 'BX:Z:AAACATGGTCCCGGAATG' is the value. When the filtered file is later read line by line, whenever the line '@FS10001773:68:BTR67708-1611:1:1116:5200:3280/1' is found, it will be replaced by itself plus a space plus the value of '@FS10001773:68:BTR67708-1611:1:1116:5200:3280/1' in the dictionary. It won't matter what order the sequence appears in the file. Since the sequence-identifiers are also unique, this particular key won't be encountered more than once. If a key is never encountered in the filtered file, then we can safely assume it was filtered out.

In general it should work so long as the caller is responsible for making sure the filtered version of a fastq file is matched to the correct original fastq file. In this case, we're using the original muxed file to generate the dictionary for the filtered but still mixed file so R1 and R2 shouldn't be an issue.
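The dictionary-based approach described above can be sketched roughly as follows. This is a minimal illustration rather than the code in the PR: the function names `build_mapping` and `annotate` and the sample reads are hypothetical. It also assumes four-line FASTQ records and identifies headers by position, because a quality line may itself begin with '@'.

```python
def build_mapping(original_lines):
    """Map each sequence identifier to its optional description."""
    mapping = {}
    for index, line in enumerate(original_lines):
        # Assumption: strict four-line records, so headers sit at 0, 4, 8...
        if index % 4 != 0:
            continue
        seq_id, _, description = line.strip().partition(' ')
        if description:
            mapping[seq_id] = description
    return mapping


def annotate(filtered_lines, mapping):
    """Re-attach descriptions to the header lines of a filtered file."""
    annotated = []
    for index, line in enumerate(filtered_lines):
        line = line.rstrip('\n')
        # Order does not matter: each header is looked up by its own key.
        if index % 4 == 0 and line in mapping:
            line = f"{line} {mapping[line]}"
        annotated.append(line)
    return annotated


# Hypothetical data: read1 was removed by host filtering.
original = [
    "@read1/1 BX:Z:AAAC", "ACGT", "+", "IIII",
    "@read2/1 BX:Z:GGGT", "TTGA", "+", "@III",
]
filtered = ["@read2/1", "TTGA", "+", "@III"]

mapping = build_mapping(original)
annotated = annotate(filtered, mapping)
```

Identifiers that never appear in the filtered file are simply never looked up, which matches the reasoning that they were filtered out.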

lucaspatel

Clearly functional and well tested, but I still wonder if simply sanitizing the read IDs at the start, perhaps by concatenating the read ID and metadata then splitting them up again at the end would be more efficient/simpler.

Nevertheless, looks good to me.

sequence_processing_pipeline/Commands.py

wasade · 2024-09-25T01:03:22Z

That’s a great idea!

On Sep 24, 2024, at 17:51, Charles Cowart wrote:

@charles-cowart commented on this pull request. In sequence_processing_pipeline/Commands.py:

+ # It is straightforward to process even large fastq files line by line
+ # sequence identifier lines are unique and easily identified as they are
+ # the only lines in a FASTQ file beginning with '@'. It is thus straight-
+ # forward to process even large files to build a mapping between
+ # sequence identifiers and optional metadata fields.
+ with open(original_path, 'r') as f:
+     for line in f:
+         if line.startswith('@'):
+             line = line.strip().split()
+             if len(line) != 2:
+                 raise ValueError(f"'{original_path}' does not appear to "
+                                  "contain sequence identifiers with "
+                                  "optional metadata")
+             # where sequence identifier = line[0] and
+             # the optional description = line[1].
+             mapping[line[0]] = line[1]

Also, there is Lucas's proposed solution to essentially replace all whitespace with '#' or another character and see if minimap2 accepts it as a name and then we remove it afterward. I'm not certain minimap2 is using a fixed length to extract the string so I'm trying it out now.

wasade · 2024-09-25T01:56:15Z

Likely sensitive to reorientation. Suggest ensuring pairing information is retained. # may be a valid reorientation separator too.

On Sep 24, 2024, at 18:13, Charles Cowart wrote:

@charles-cowart commented on this pull request. In sequence_processing_pipeline/Commands.py:

Also, there is Lucas's proposed solution to essentially replace all whitespace with '#' or another character and see if minimap2 accepts it as a name and then we remove it afterward. I'm not certain minimap2 is using a fixed length to extract the string so I'm trying it out now.

Sadly this will not work. I took my list of ten sequences and ran them through as is, and four additional times, each time replacing the space in between the sequence id and the description with '_', '-', '#', and '###'. The output for each of the hacked sequences was blank.

charles-cowart · 2024-09-25T17:00:03Z

@wasade Sorry could you clarify what you mean by, "Likely sensitive to reorientation. Suggest ensuring pairing information is retained." Happy to give this a try.

wasade · 2024-09-25T17:14:42Z

Paired end aware aligners are sensitive to the use of /1, /2 and the exact location they are present in the ID. There is no universal standard for denoting pairing, and if I recall correctly, minimap2 also interprets #1, #2. But, it is necessary these fall at the end of the ID.

What I recommend is improving the control of the ID encoding. Perhaps something like:

SPP_COMMENT_TOKEN = '_SPPCOMMENTTOKEN_'
SPP_ID_TOKEN = '_SPPIDTOKEN_'
# NOTE: the parser used may automatically parse the comment
# NOTE: this _assumes_ the ID is actually parsed so the "@" has been removed
original_id, original_comment = id_line.split(' ', 1)
# NOTE: i don't think other characters need to be replaced but I may be wrong
clean_comment = original_comment.replace(' ', SPP_COMMENT_TOKEN)
new_id = f"{clean_comment}{SPP_ID_TOKEN}{original_id}"
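To make the round trip concrete, here is a small sketch that wraps the suggestion above in an encoder and a matching decoder. The helper names `encode_id` and `decode_id` are hypothetical, and whether minimap2 accepts the encoded IDs is not verified here. The only point illustrated is that the original ID, including its /1 or /2 suffix, stays at the end and that the encoding is reversible.

```python
SPP_COMMENT_TOKEN = '_SPPCOMMENTTOKEN_'
SPP_ID_TOKEN = '_SPPIDTOKEN_'


def encode_id(id_line):
    """Fold the comment into the ID, keeping the original ID last.

    Assumes the leading '@' has already been removed by the parser.
    """
    original_id, sep, original_comment = id_line.partition(' ')
    if not sep:
        # No comment present; nothing to encode.
        return id_line
    clean_comment = original_comment.replace(' ', SPP_COMMENT_TOKEN)
    return f"{clean_comment}{SPP_ID_TOKEN}{original_id}"


def decode_id(new_id):
    """Reverse encode_id, restoring 'ID comment'."""
    clean_comment, sep, original_id = new_id.partition(SPP_ID_TOKEN)
    if not sep:
        return new_id
    comment = clean_comment.replace(SPP_COMMENT_TOKEN, ' ')
    return f"{original_id} {comment}"


id_line = "FS10001773:68:BTR67708-1611:1:1116:5200:3280/1 BX:Z:AAACATGGTCCCGGAATG"
encoded = encode_id(id_line)
decoded = decode_id(encoded)
```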

charles-cowart · 2024-09-25T18:05:36Z

Got it, thanks! I'll give it a try.

lucaspatel · 2024-09-26T17:46:07Z

Hi @wasade and @charles-cowart,

I did some digging in the minimap2 documentation, and I found this flag which may solve the problem directly:
-y Copy input FASTA/Q comments to output.

Charlie, could you give this a shot? According to this GitHub issue, inclusion of the flag should punt the FASTQ comment to a SAM flag so that the read ID is compatible with the SAM format while retaining the extra information.
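As a rough illustration of what carrying the comment into SAM could look like, the sketch below reads the optional fields of a SAM record, which follow the eleven mandatory tab-separated columns and take the form TAG:TYPE:VALUE. The record is made up for the example, and it is an assumption that a comment such as 'BX:Z:...' is carried through as an optional field of this shape; `optional_tags` is a hypothetical helper.

```python
def optional_tags(sam_line):
    """Return {TAG: VALUE} for the optional fields of one SAM record."""
    fields = sam_line.rstrip('\n').split('\t')
    tags = {}
    # Columns 1-11 are mandatory; anything after them is optional.
    for field in fields[11:]:
        parts = field.split(':', 2)
        if len(parts) == 3:
            tag, _type, value = parts
            tags[tag] = value
    return tags


# Hypothetical unmapped read whose FASTQ comment was 'BX:Z:AAACATGGTCCCGGAATG'.
record = '\t'.join([
    'read1', '77', '*', '0', '0', '*', '*', '0', '0',
    'ACGT', 'IIII', 'BX:Z:AAACATGGTCCCGGAATG',
])
tags = optional_tags(record)
```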

wasade · 2024-09-26T17:52:35Z

nice!

charles-cowart · 2024-09-26T17:55:45Z

That’s awesome thanks! I’ll confirm the switch is in the version we use and if not I’ll see if we can upgrade.

charles-cowart · 2024-09-26T19:15:36Z

Thanks Lucas! Here's what I've confirmed so far:

the minimap2 version we are using appears to be the latest one.
-y (little y not big Y) is not mentioned in the user guide but is mentioned in the ‘man’ page.
-y does not appear in minimap2’s help screen and the man page is not installed (but searchable online)
looking at the code in main.c, even though -y isn’t listed in the help, it is mapped by ketopt() to MM_F_COPY_COMMENT. Suggests it was implemented.
description does appear in minimap2's SAM format output! '-y' switch does work.
description does not appear in samtools' FASTQ output. :( Looking into that now.

lucaspatel · 2024-09-26T19:40:46Z

Are the SAM tags "RG, BC or QT" tags? If so, they may be recoverable with the -t flag to samtools fastq. See the documentation for more details.

charles-cowart · 2024-09-26T19:49:11Z

Ty, yes I looked over the SAM-formatted results and tried -t and -T from https://samtools.github.io/hts-specs/SAMv1.pdf. Just got some acceptable output using 'BX', which is the first part of the descriptions themselves (BX:Z:CAGACACGTAGGTGGGAC). It's interesting because it's not position-based, even though it has what looks like a full SAM header. This looks awesome thanks Lucas! The only potential downside I can see is that we need to know the code (in this case 'BX') in advance. Is that something you guys would know or would the SPP have to discover it?

charles-cowart · 2024-09-26T19:50:31Z

Just to clarify, the command I'm using specifically is:
'cat intermediate_result.sam | samtools fastq -@ 8 -f 12 -F 256 -T BX'
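For readers less familiar with the flag options in that command, the sketch below mimics the selection in plain Python: -f 12 keeps only records with both the "read unmapped" (0x4) and "mate unmapped" (0x8) bits set, and -F 256 drops records with the "secondary alignment" (0x100) bit set. The function name `passes_filter` and the sample flag values are illustrative only.

```python
READ_UNMAPPED = 0x4    # this read did not align to the host reference
MATE_UNMAPPED = 0x8    # its mate did not align either
SECONDARY = 0x100      # secondary alignment record


def passes_filter(flag, require=12, exclude=256):
    """Mimic 'samtools fastq -f 12 -F 256' for a single SAM flag value."""
    has_all_required = (flag & require) == require
    has_no_excluded = (flag & exclude) == 0
    return has_all_required and has_no_excluded


# 77 and 141 are the usual flags for a pair in which neither mate mapped.
kept = [flag for flag in (77, 141, 73, 99, 77 | SECONDARY)
        if passes_filter(flag)]
```

Only pairs in which neither mate aligned to the host survive, which is the behavior wanted for host filtering.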

charles-cowart · 2024-09-26T22:37:50Z

Confirmed minimap2 gracefully handles the case where '-y' is passed to it but there are no additional descriptions - output is as desired.
Confirmed samtools behaves gracefully when given nonexistent options w/-T along with BX (-T BX,AB) - if BX is present in the input it will appear in the output regardless of the other requested values.

charles-cowart · 2024-09-27T00:31:24Z

Ready for review! This version will rely on the user to supply a list of one or more tags such as 'BX' in order to preserve that metadata through host-filtering. The SPP plugin itself can keep a list of these around in configuration or the user can supply the tags they believe are present. We can alternatively attempt to scour the first few sequence-identifier lines for any present tags, but I'd like a few more samples before putting something together.
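The "scour the first few sequence-identifier lines" idea mentioned above is not part of this PR, but one possible shape for it is sketched here. Everything in it is an assumption: the helper name `discover_tags`, the `max_records` limit, four-line FASTQ records, and the premise that descriptions are written as SAM-style TAG:TYPE:VALUE tokens.

```python
import re

# Two-character SAM tag, a one-letter type code, then a value.
TAG_PATTERN = re.compile(r'^([A-Za-z][A-Za-z0-9]):[AifZHB]:')


def discover_tags(lines, max_records=100):
    """Collect tag names seen on the first max_records header lines."""
    found = []
    for index, line in enumerate(lines):
        if index // 4 >= max_records:
            break
        if index % 4 != 0:
            continue  # only header lines carry descriptions
        # Skip the identifier itself; inspect each description token.
        for token in line.strip().split()[1:]:
            match = TAG_PATTERN.match(token)
            if match and match.group(1) not in found:
                found.append(match.group(1))
    return found


sample = [
    "@r1/1 BX:Z:AAAC", "ACGT", "+", "IIII",
    "@r2/1 BX:Z:GGGT RG:Z:grp1", "TTGA", "+", "IIII",
    "@r3/1 1:N:0:ATCACG", "CCCC", "+", "IIII",
]
tags = discover_tags(sample)
```

The discovered list could then be joined with commas and handed to the -T option in the same way as a configured list.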

sequence_processing_pipeline/NuQCJob.py

antgonza · 2024-09-27T12:42:52Z

@charles-cowart, should we just preserve all tags? I might be missing something obvious about when we do not need them.

charles-cowart · 2024-09-27T19:42:04Z

Just to clarify, I did run several tests w/samtools to preserve all tags from minimap2's output. Samtools documentation says the following:

-T TAGLIST
Specify a comma-separated list of tags to copy to the FASTQ header line, if they exist. TAGLIST can be blank or * to indicate all tags should be copied to the output. If using *, be careful to quote it to avoid unwanted shell expansion.

However, I found that "-T", "-T *", and "-T '*'" all resulted in samtools displaying error messages and generating no output. The only way I've found so far to preserve the descriptions is to specify them 'by name' as '-T BX'. We can specify '-T BX,AB,CD,EF' and it will ignore non-existent tags. Note the version we currently use on barnacle is samtools 1.12.

Hence, my current thinking is that we can develop a list of tags as we encounter and/or need them, specify them in SPP's configuration file, and pass them to NuQCJob.

charles-cowart · 2024-10-01T20:34:49Z

Hi @wasade , @AmandaBirmingham, if it's no trouble, would you both mind giving it a final look over? TYVM!

wasade · 2024-10-01T20:40:17Z

sequence_processing_pipeline/NuQCJob.py

+            # add tags for known metadata types that fastq files may have
+            # been annotated with. Samtools will safely ignore tags that
+            # are not present.
+            tags = " -T %s" % ','.join(self.additional_fastq_tags)


Suggested change

-            tags = " -T %s" % ','.join(self.additional_fastq_tags)
+            tags = "-T %s" % ','.join(self.additional_fastq_tags)

The odd spacing here is intentional. It prevents an extra space from being present when there are no tags. This makes the expected results for tests uniform.
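The effect of the leading space can be shown with a small stand-alone sketch. The function names below are hypothetical and only the samtools half of the pipeline is built; the flags and the format string mirror the diff above.

```python
def build_tags_option(additional_fastq_tags):
    """Return ' -T a,b' (with leading space) or '' when no tags are given."""
    if not additional_fastq_tags:
        return ""
    return " -T %s" % ','.join(additional_fastq_tags)


def build_samtools_command(cores_to_allocate, additional_fastq_tags, output):
    tags = build_tags_option(additional_fastq_tags)
    # No space before {tags}: the option string supplies its own.
    return (f"samtools fastq -@ {cores_to_allocate} -f 12 -F 256{tags} > "
            f"{output}")


with_tags = build_samtools_command(8, ['BX', 'AB'], 'out.fastq')
without_tags = build_samtools_command(8, [], 'out.fastq')
```

With the space inside the option string, the no-tags command is identical to the one generated before the option existed, so existing expected strings in tests would not need to change.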

wasade · 2024-10-01T20:40:36Z

sequence_processing_pipeline/NuQCJob.py

                        f"{mmi_db_path} {input} -a | samtools fastq -@ "
-                        f"{cores_to_allocate} -f 12 -F 256 > {output}")
+                        f"{cores_to_allocate} -f 12 -F 256{tags} > "


Suggested change

-                        f"{cores_to_allocate} -f 12 -F 256{tags} > "
+                        f"{cores_to_allocate} -f 12 -F 256 {tags} > "

charles-cowart · 2024-10-04T01:16:37Z

Thank you everyone for all your input! I'm going to merge this and make it available to my other PRs.

Added option to NuQCJob to annotate filtered fastq.

4d310d6

charles-cowart requested a review from antgonza September 24, 2024 01:24

antgonza requested changes Sep 24, 2024 (sequence_processing_pipeline/Commands.py)

Removed timing information

33dd253

charles-cowart requested a review from lucaspatel September 24, 2024 20:38

lucaspatel approved these changes Sep 24, 2024 (sequence_processing_pipeline/Commands.py)

wasade reviewed Sep 24, 2024 (sequence_processing_pipeline/Commands.py)

wasade reviewed Sep 24, 2024 (sequence_processing_pipeline/Commands.py)

Updated to use minimap2 and samtools functionality.

fc15d1a

antgonza reviewed Sep 27, 2024 (sequence_processing_pipeline/NuQCJob.py)

Updates based on feedback

e28d96e

antgonza approved these changes Sep 30, 2024

wasade reviewed Oct 1, 2024

charles-cowart merged commit 9d5b7d4 into biocore:master Oct 4, 2024
2 checks passed
