Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added option to NuQCJob to annotate filtered fastq. #155

Merged
merged 4 commits into from
Oct 4, 2024

Conversation

charles-cowart
Copy link
Collaborator

Added option to NuQCJob that annotates filtered fastq files w/the optional descriptions that may be present in an original fastq file. These optional descriptions are ordinarily filtered out during filtering by minimap2.

Added option to NuQCJob that annotates filtered fastq files w/the
optional descriptions that may be present in an original fastq file.
These optional descriptions are ordinarily filtered out during filtering
by minimap2.
@charles-cowart
Copy link
Collaborator Author

Requesting Lucas's review as well.

@coveralls
Copy link

coveralls commented Sep 24, 2024

Pull Request Test Coverage Report for Build 11076356137

Details

  • 13 of 13 (100.0%) changed or added relevant lines in 3 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.06%) to 81.976%

Totals Coverage Status
Change from base Build 10862698601: 0.06%
Covered Lines: 2098
Relevant Lines: 2385

💛 - Coveralls

Copy link
Contributor

@antgonza antgonza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assumes same order between the two inputs and also for fwd/rev; however, I'm not sure this is accurate. The 2 classic discrepancies come from (a) multithreading finishing in different order, and/or (b) one of the mates not passing QC for example only writing the forward. @charles-cowart and @lucaspatel, do you know if these scenarios can happen here?

sequence_processing_pipeline/Commands.py Outdated Show resolved Hide resolved
@charles-cowart
Copy link
Collaborator Author

This assumes same order between the two inputs and also for fwd/rev; however, I'm not sure this is accurate. The 2 classic discrepancies come from (a) multithreading finishing in different order, and/or (b) one of the mates not passing QC for example only writing the forward. @charles-cowart and @lucaspatel, do you know if these scenarios can happen here?

@antgonza When the original file is read in, lines like '@FS10001773:68:BTR67708-1611:1:1116:5200:3280/1 BX:Z:AAACATGGTCCCGGAATG' are broken up into key/value pairs and stored in a dictionary where '@FS10001773:68:BTR67708-1611:1:1116:5200:3280/1' is the key and the 'BX:Z:AAACATGGTCCCGGAATG' is the value. When the filtered file is later read line by line, whenever the line '@FS10001773:68:BTR67708-1611:1:1116:5200:3280/1' is found, it will be replaced by itself plus a space plus the value of '@FS10001773:68:BTR67708-1611:1:1116:5200:3280/1' in the dictionary. It won't matter what order the sequence appears in the file. Since the sequence-identifiers are also unique, this particular key won't be encountered more than once. If a key is never encountered in the filtered file, then we can safely assume it was filtered out.

In general it should work so long as the caller is responsible for making sure the filtered version of a fastq file is matched to the correct original fastq file. In this case, we're using the original muxed file to generate the dictionary for the filtered but still mixed file so R1 and R2 shouldn't be an issue.

Copy link
Collaborator

@lucaspatel lucaspatel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clearly functional and well tested, but I still wonder if simply sanitizing the read IDs at the start, perhaps by concatenating the read ID and metadata then splitting them up again at the end would be more efficient/simpler.

Nevertheless, looks good to me.

sequence_processing_pipeline/Commands.py Outdated Show resolved Hide resolved
@wasade
Copy link
Member

wasade commented Sep 25, 2024 via email

@wasade
Copy link
Member

wasade commented Sep 25, 2024 via email

@charles-cowart
Copy link
Collaborator Author

charles-cowart commented Sep 25, 2024

@wasade Sorry could you clarify what you mean by, "Likely sensitive to reorientation. Suggest ensuring pairing information is retained." Happy to give this a try.

@wasade
Copy link
Member

wasade commented Sep 25, 2024

Paired end aware aligners are sensitive to the use of /1, /2 and the exact location they are present in the ID. There is no universal standard for denoting pairing, and if I recall correctly, minimap2 also interprets #1, #2. But, it is necessary these fall at the end of the ID.

What I recommend is improving the control of the ID encoding. Perhaps something like:

SPP_COMMENT_TOKEN = '_SPPCOMMENTTOKEN_'
SPP_ID_TOKEN = '_SPPIDTOKEN_'
# NOTE: the parser used may automatically parse the comment
# NOTE: this _assumes_ the ID is actually parsed so the "@" has been removed
original_id, original_comment = id_line.split(' ', 1)
# NOTE: i don't think other characters need to be replaced but I may be wrong
clean_comment = original_comment.replace(' ', SPP_COMMENT_TOKEN)
new_id = f"{clean_comment}{SPP_ID_TOKEN}{original_id}"

@charles-cowart
Copy link
Collaborator Author

Paired end aware aligners are sensitive to the use of /1, /2 and the exact location they are present in the ID. There is no universal standard for denoting pairing, and if I recall correctly, minimap2 also interprets #1, #2. But, it is necessary these fall at the end of the ID.

What I recommend is improving the control of the ID encoding. Perhaps something like:

SPP_COMMENT_TOKEN = '_SPPCOMMENTTOKEN_'
SPP_ID_TOKEN = '_SPPIDTOKEN_'
# NOTE: the parser used may automatically parse the comment
# NOTE: this _assumes_ the ID is actually parsed so the "@" has been removed
original_id, original_comment = id_line.split(' ', 1)
# NOTE: i don't think other characters need to be replaced but I may be wrong
clean_comment = original_comment.replace(' ', SPP_COMMENT_TOKEN)
new_id = f"{clean_comment}{SPP_ID_TOKEN}{original_id}"

Got it, thanks! I'll give it a try.

@lucaspatel
Copy link
Collaborator

lucaspatel commented Sep 26, 2024

Hi @wasade and @charles-cowart,

I did some digging in the minimap2 documentation, and I found this flag which may solve the problem directly:
-y Copy input FASTA/Q comments to output.

Charlie, could you give this a shot? According to this GitHub issue, inclusion of the flag should punt the FASTQ comment to a SAM flag so that the read ID is compatible with the SAM format while retaining the extra information.

@wasade
Copy link
Member

wasade commented Sep 26, 2024

nice!

@charles-cowart
Copy link
Collaborator Author

charles-cowart commented Sep 26, 2024 via email

@charles-cowart
Copy link
Collaborator Author

Thanks Lucas! Here's what I've confirmed so far:

  • the minimap2 version we are using appears to be the latest one.
  • -y (little y not big Y) is not mentioned in the user guide but is mentioned in the ‘man’ page.
  • -y does not appear in minimap2’s help screen and the man page is not installed (but searchable online)
  • looking at the code in main.c, even though -y isn’t listed in the help, it is mapped by ketopt() to MM_F_COPY_COMMENT. Suggests it was implemented.
  • description does appear in minimap2's SAM format output! '-y' switch does work.
  • description does not appear in samtool's FASTQ output. :( Looking into that now.

@lucaspatel
Copy link
Collaborator

Thanks Lucas! Here's what I've confirmed so far:

  • the minimap2 version we are using appears to be the latest one.
  • -y (little y not big Y) is not mentioned in the user guide but is mentioned in the ‘man’ page.
  • -y does not appear in minimap2’s help screen and the man page is not installed (but searchable online)
  • looking at the code in main.c, even though -y isn’t listed in the help, it is mapped by ketopt() to MM_F_COPY_COMMENT. Suggests it was implemented.
  • description does appear in minimap2's SAM format output! '-y' switch does work.
  • description does not appear in samtool's FASTQ output. :( Looking into that now.

Are the SAM tags "RG, BC or QT" tags? If so, they may be recoverable with the -t flag to samtools fastq. See the documentation for more details.

@charles-cowart
Copy link
Collaborator Author

Ty, yes I looked over the SAM-formatted results and tried -t an -T from https://samtools.github.io/hts-specs/SAMv1.pdf. Just got some acceptable output using 'BX', which is the first part of the descriptions themselves (BX:Z:CAGACACGTAGGTGGGAC). It's interesting because it's not position-based, even though it has what looks like a full SAM header. This looks awesome thanks Lucas! The only potential downside I can see is that we need to know the code (in this case 'BX') in advance. Is that something you guys would know or would the SPP have to discover it?

@charles-cowart
Copy link
Collaborator Author

Just to clarify, the command I'm using specifically is:
'cat intermediate_result.sam | samtools fastq -@ 8 -f 12 -F 256 -T BX'

@charles-cowart
Copy link
Collaborator Author

Confirmed minimap2 gracefully handles the case where '-y' is passed to it but there is no additional descriptions - output is as desired.
Confirmed samtools behaves gracefully when given nonexistent options w/-T along with BX (-T BX,AB) - if BX is present in the input it will appear in the output regardless of the other requested values.

@charles-cowart
Copy link
Collaborator Author

Ready for review! This version will rely on the user to supply a list of one or more tags such as 'BX' in order to preserve that metadata through host-filtering. The SPP plugin itself can keep a list of these around in configuration or the user can supply the tags they believe are present. We can alternatively attempt to scour the first few sequence-identifier lines for any present tags, but I'd like a few more samples before putting something together.

@antgonza
Copy link
Contributor

@charles-cowart, should we just preserve all tags? I might be missing something obvious about when we do not need them.

@charles-cowart
Copy link
Collaborator Author

charles-cowart commented Sep 27, 2024

@charles-cowart, should we just preserve all tags? I might be missing something obvious about when we do not need them.

Just to clarify, I did run several tests w/samtools to preserve all tags from minimap2's output. Samtools documentation says the following:

-T TAGLIST
Specify a comma-separated list of tags to copy to the FASTQ header line, if they exist. TAGLIST can be blank or * to indicate all >tags should be copied to the output. If using *, be careful to quote it to avoid unwanted shell expansion.

However, I found that "-T", "-T *", and "-T '*'" all resulted samtools displaying error messages and no generated output. The only way I've found so far to preserve the descriptions is to specify them 'by name' as '-T BX'. We can specify '-T BX,AB,CD,EF' and it will ignore non-existent tags. Note the version we currently use on barnacle is samtools 1.12.

Hence, my current thinking is that we can develop a list of tags as we encounter and/or need them, specify them in SPP's configuration file, and pass them to NuQCJob.

@charles-cowart
Copy link
Collaborator Author

Hi @wasade , @AmandaBirmingham, if it's no trouble, would you both mind giving it a final look over? TYVM!

# add tags for known metadata types that fastq files may have
# been annotated with. Samtools will safely ignore tags that
# are not present.
tags = " -T %s" % ','.join(self.additional_fastq_tags)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
tags = " -T %s" % ','.join(self.additional_fastq_tags)
tags = "-T %s" % ','.join(self.additional_fastq_tags)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The odd spacing here is intentional. It prevents an extra space from being present when there are no tags. This makes the expected results for tests uniform.

f"{mmi_db_path} {input} -a | samtools fastq -@ "
f"{cores_to_allocate} -f 12 -F 256 > {output}")
f"{cores_to_allocate} -f 12 -F 256{tags} > "
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
f"{cores_to_allocate} -f 12 -F 256{tags} > "
f"{cores_to_allocate} -f 12 -F 256 {tags} > "

@charles-cowart
Copy link
Collaborator Author

Thank you everyone for all your input! I'm going to merge this and make it available to my other PRs.

@charles-cowart charles-cowart merged commit 9d5b7d4 into biocore:master Oct 4, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants