Clarification of empty vector representation #617

Open
wants to merge 1 commit into
base: master
1 change: 1 addition & 0 deletions VCFv4.3.tex
@@ -1800,6 +1800,7 @@ \subsubsection{Type encoding}
For example, one sample could have CN0:0,CN1:10 and another CN0:0,CN1:10,CN2:10.
When a genotype field contains vector values of different lengths, these are represented in BCF2 by a vector of the maximum length per sample, with all values in each vector aligned to the left, and END\_OF\_VECTOR values assigned to all values not present in the original vector.
The BCF2 encoder / decoder must automatically add and remove these END\_OF\_VECTOR values from the vectors. Note that the use of END\_OF\_VECTOR means that it is legal to encode a vector VCF field with MISSING values.
Empty vectors (i.e. vectors with no data available) are represented by one MISSING value followed by as many END\_OF\_VECTOR values as are required to pad the vector to the appropriate length.

For example, suppose I have two samples, each with a FORMAT field X.
Sample A has values [1], while sample B has [2,3].
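
As a rough illustration of the padding rule in this hunk, here is a minimal Python sketch. It assumes int8 storage, where the BCF2 sentinels are 0x80 for MISSING and 0x81 for END\_OF\_VECTOR; the helper names `pad_vectors` and `strip_padding` are hypothetical and not part of the spec or of any library. It also assumes that a decoded vector consisting of a single MISSING value should be read back as an empty vector, which is one possible reading of the added sentence.

```python
# int8 sentinel values (assumption: int8 storage; wider types use other values)
MISSING = -128        # 0x80
END_OF_VECTOR = -127  # 0x81


def pad_vectors(per_sample):
    """Left-align each sample's values and pad to the longest vector.

    An empty vector is written as one MISSING value, then padded with
    END_OF_VECTOR like any other short vector.
    """
    # An empty vector still occupies one slot (its MISSING value).
    width = max(1, max((len(v) for v in per_sample), default=1))
    padded = []
    for values in per_sample:
        body = list(values) if values else [MISSING]
        padded.append(body + [END_OF_VECTOR] * (width - len(body)))
    return padded


def strip_padding(vector):
    """Drop trailing END_OF_VECTOR padding added by pad_vectors."""
    if END_OF_VECTOR in vector:
        vector = vector[:vector.index(END_OF_VECTOR)]
    # Assumption: a lone MISSING value decodes back to an empty vector.
    return [] if list(vector) == [MISSING] else list(vector)


# Samples A and B from the text, plus a hypothetical sample C with no data.
encoded = pad_vectors([[1], [2, 3], []])
decoded = [strip_padding(v) for v in encoded]
```

With these inputs, sample A is padded with one END\_OF\_VECTOR, sample B is stored unchanged, and the empty sample C becomes MISSING followed by one END\_OF\_VECTOR.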