Fix reading UTF-8 encoded sample names when char is signed #1237

Merged (1 commit) on Feb 17, 2021

Conversation

daviesrob
Member

The trick used in bcf_hdr_parse_sample_line() to rapidly find tabs and newlines could be defeated by UTF-8 characters outside the Basic Latin range on platforms where "char" is signed (like x86). Every byte of such a character is 0x80 or above, so it becomes a negative value when read as a signed "char". It's currently not clear whether VCF intends to allow such characters in sample names, but the 4.3 specification does allow UTF-8 and it's easy enough to support. Fix by casting to unsigned when making the comparison.
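To illustrate the failure mode, here is a minimal sketch. It is not the htslib code: the helper names `field_len_signed_char` and `field_len_unsigned_char` are hypothetical, and the loop condition `> '\n'` is an assumed form of the fast delimiter test (it stops on NUL, tab and newline in one comparison). The first helper reads bytes as `signed char` to reproduce the problem on any platform; the second applies the unsigned cast.

```c
#include <stddef.h>

/* Hypothetical helper: length of the field starting at s, scanning with a
 * single "greater than newline" test on SIGNED bytes. Bytes >= 0x80 read as
 * negative values, so the scan stops early inside a multi-byte UTF-8
 * character. */
static size_t field_len_signed_char(const char *s)
{
    const signed char *q = (const signed char *) s;
    while (*q > '\n')
        q++;
    return (size_t) ((const char *) q - s);
}

/* Same scan, but comparing UNSIGNED bytes. Bytes >= 0x80 now compare
 * greater than '\n', so only NUL, tab, newline (and other control bytes
 * below 0x0B) end the field. */
static size_t field_len_unsigned_char(const char *s)
{
    const unsigned char *q = (const unsigned char *) s;
    while (*q > '\n')
        q++;
    return (size_t) ((const char *) q - s);
}
```

For the input `"caf\xc3\xa9\tS2\n"` (the sample name "café" encoded as UTF-8, followed by a second sample), `field_len_signed_char` returns 3 because it stops at the 0xC3 byte, while `field_len_unsigned_char` returns 5, the full length of the first name in bytes.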

Modifies formatcols.vcf to include a UTF-8 character for a round-trip test.

Fixes samtools/bcftools#1408

@valeriuo valeriuo merged commit 8127bfc into samtools:develop Feb 17, 2021
@daviesrob daviesrob deleted the utf8-samples branch February 17, 2021 16:01
Successfully merging this pull request may close these issues.

Possible bug in htslib/bcftools 1.1: [E::bcf_hdr_add_sample_len] Duplicated sample name