Replaced $log_{10}$ with $\log_{10}$ for improved formatting in VCF.
jkbonfield committed Nov 11, 2021
1 parent 4564330 commit 31feca3
Showing 3 changed files with 16 additions and 16 deletions.
8 changes: 4 additions & 4 deletions VCFv4.2.tex
@@ -181,7 +181,7 @@ \subsubsection{Fixed fields}
\item ID - identifier: Semicolon-separated list of unique identifiers where available. If this is a dbSNP variant it is encouraged to use the rs number(s). No identifier should be present in more than one data record. If there is no identifier available, then the missing value should be used. (String, no whitespace or semicolons permitted)
\item REF - reference base(s): Each base must be one of A,C,G,T,N (case insensitive). Multiple bases are permitted. The value in the POS field refers to the position of the first base in the String. For simple insertions and deletions in which either the REF or one of the ALT alleles would otherwise be null/empty, the REF and ALT Strings must include the base before the event (which must be reflected in the POS field), unless the event occurs at position 1 on the contig in which case it must include the base after the event; this padding base is not required (although it is permitted) for e.g.\ complex substitutions or other events where all alleles have at least one base represented in their Strings. If any of the ALT alleles is a symbolic allele (an angle-bracketed ID String ``$<$ID$>$'') then the padding base is required and POS denotes the coordinate of the base preceding the polymorphism. Tools processing VCF files are not required to preserve case in the allele Strings. (String, Required).
\item ALT - alternate base(s): Comma separated list of alternate non-reference alleles. These alleles do not have to be called in any of the samples. Options are base Strings made up of the bases A,C,G,T,N,*, (case insensitive) or an angle-bracketed ID String (``$<$ID$>$'') or a breakend replacement string as described in the section on breakends. The `*' allele is reserved to indicate that the allele is missing due to a upstream deletion. If there are no alternative alleles, then the missing value should be used. Tools processing VCF files are not required to preserve case in the allele String, except for IDs, which are case sensitive. (String; no whitespace, commas, or angle-brackets are permitted in the ID String itself)
-\item QUAL - quality: Phred-scaled quality score for the assertion made in ALT. i.e.\ $-10log_{10}$ prob(call in ALT is wrong). If ALT is `.' (no variant) then this is $-10log_{10}$ prob(variant), and if ALT is not `.' this is $-10log_{10}$ prob(no variant). If unknown, the missing value should be specified. (Numeric)
+\item QUAL - quality: Phred-scaled quality score for the assertion made in ALT. i.e.\ $-10\log_{10}$ prob(call in ALT is wrong). If ALT is `.' (no variant) then this is $-10\log_{10}$ prob(variant), and if ALT is not `.' this is $-10\log_{10}$ prob(no variant). If unknown, the missing value should be specified. (Numeric)
\item FILTER - filter status: PASS if this position has passed all filters, i.e., a call is made at this position. Otherwise, if the site has not passed all filters, a semicolon-separated list of codes for filters that fail. e.g.\ ``q10;s50'' might indicate that at this site the quality is below 10 and the number of samples with data is below 50\% of the total number of samples. `0' is reserved and should not be used as a filter String. If filters have not been applied, then this field should be set to the missing value. (String, no whitespace or semicolons permitted)
\item INFO - additional information: (String, no whitespace, semicolons, or equals-signs permitted; commas are permitted only as delimiters for lists of values) INFO fields are encoded as a semicolon-separated series of short keys with optional values in the format: $<$key$>$=$<$data$>$[,data]. If no keys are present, the missing value must be used. Arbitrary keys are permitted, although the following sub-fields are reserved (albeit optional):
\begin{itemize}
@@ -221,11 +221,11 @@ \subsubsection{Genotype fields}
\end{itemize}
\item DP : read depth at this position for this sample (Integer)
\item FT : sample genotype filter indicating if this genotype was ``called'' (similar in concept to the FILTER field). Again, use PASS to indicate that all filters have been passed, a semicolon-separated list of codes for filters that fail, or `.' to indicate that filters have not been applied. These values should be described in the meta-information in the same way as FILTERs (String, no whitespace or semicolons permitted)
-\item GL : genotype likelihoods comprised of comma separated floating point $log_{10}$-scaled likelihoods for all possible genotypes given the set of alleles defined in the REF and ALT fields. In presence of the GT field the same ploidy is expected and the canonical order is used; without GT field, diploidy is assumed. If A is the allele in REF and B,C,... are the alleles as ordered in ALT, the ordering of genotypes for the likelihoods is given by: F(j/k) = (k*(k+1)/2)+j. In other words, for biallelic sites the ordering is: AA,AB,BB; for triallelic sites the ordering is: AA,AB,BB,AC,BC,CC, etc. For example: GT:GL 0/1:-323.03,-99.29,-802.53 (Floats)
+\item GL : genotype likelihoods comprised of comma separated floating point $\log_{10}$-scaled likelihoods for all possible genotypes given the set of alleles defined in the REF and ALT fields. In presence of the GT field the same ploidy is expected and the canonical order is used; without GT field, diploidy is assumed. If A is the allele in REF and B,C,... are the alleles as ordered in ALT, the ordering of genotypes for the likelihoods is given by: F(j/k) = (k*(k+1)/2)+j. In other words, for biallelic sites the ordering is: AA,AB,BB; for triallelic sites the ordering is: AA,AB,BB,AC,BC,CC, etc. For example: GT:GL 0/1:-323.03,-99.29,-802.53 (Floats)
\item GLE : genotype likelihoods of heterogeneous ploidy, used in presence of uncertain copy number. For example: GLE=0:-75.22,1:-223.42,0/0:-323.03,1/0:-99.29,1/1:-802.53 (String)
\item PL : the $-10 \log_{10}$ scaled genotype likelihoods rounded to the closest integer, and otherwise defined in the same way as the GL field (Integers).
\item GP : the phred-scaled genotype posterior probabilities (and otherwise defined precisely as the GL field); intended to store imputed genotype probabilities (Floats)
-\item GQ : conditional genotype quality, encoded as a phred quality $-10log_{10}$ p(genotype call is wrong, conditioned on the site's being variant) (Integer)
+\item GQ : conditional genotype quality, encoded as a phred quality $-10\log_{10}$ p(genotype call is wrong, conditioned on the site's being variant) (Integer)
\item HQ : haplotype qualities, two comma separated phred qualities (Integers)
\item PS : phase set. A phase set is defined as a set of phased genotypes to which this genotype belongs. Phased genotypes for an individual that are on the same chromosome and have the same PS value are in the same phased set. A phase set specifies multi-marker haplotypes for the phased genotypes in the set. All phased genotypes that do not contain a PS subfield are assumed to belong to the same phased set. If the genotype in the GT field is unphased, the corresponding PS field is ignored. The recommended convention is to use the position of the first variant in the set as the PS identifier (although this is not required). (Non-negative 32-bit Integer)
\item PQ : phasing quality, the phred-scaled probability that alleles are ordered incorrectly in a heterozygote (against all other members in the phase set). We note that we have not yet included the specific measure for precisely defining ``phasing quality''; our intention for now is simply to reserve the PQ tag for future use as a measure of phasing quality. (Integer)
@@ -311,7 +311,7 @@ \section{FORMAT keys used for structural variants}
##FORMAT=<ID=AHAP,Number=1,Type=Integer,Description="Unique identifier of ancestral haplotype">
\end{verbatim}
\normalsize
-These keys are analogous to GT/GQ/GL and are provided for genotyping imprecise events by copy number (either because there is an unknown number of alternate alleles or because the haplotypes cannot be determined). CN specifies the integer copy number of the variant in this sample. CNQ is encoded as a phred quality $-10log_{10}$ p(copy number genotype call is wrong). CNL specifies a list of $log_{10}$ likelihoods for each potential copy number, starting from zero. When possible, GT/GQ/GL should be used instead of (or in addition to) these keys.
+These keys are analogous to GT/GQ/GL and are provided for genotyping imprecise events by copy number (either because there is an unknown number of alternate alleles or because the haplotypes cannot be determined). CN specifies the integer copy number of the variant in this sample. CNQ is encoded as a phred quality $-10\log_{10}$ p(copy number genotype call is wrong). CNL specifies a list of $\log_{10}$ likelihoods for each potential copy number, starting from zero. When possible, GT/GQ/GL should be used instead of (or in addition to) these keys.

\section{Representing variation in VCF records}
\subsection{Creating VCF entries for SNPs and small indels}
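
Note on the GL text in the VCFv4.2.tex hunk above: the genotype ordering F(j/k) = (k*(k+1)/2)+j can be checked with a short Python sketch. This is not part of the commit or of the specification text; the function names genotype_index and genotype_order are hypothetical, and only the diploid case described in that hunk is covered.

```python
from typing import List, Optional


def genotype_index(j: int, k: int) -> int:
    """Position of the diploid genotype j/k in the GL list.

    j and k are allele indices (0 = REF, 1 = first ALT, ...).
    Implements F(j/k) = (k*(k+1)/2)+j, which assumes j <= k,
    so the pair is swapped first if it arrives as k/j.
    """
    if j < 0 or k < 0:
        raise ValueError("allele indices must be non-negative")
    if j > k:
        j, k = k, j
    return k * (k + 1) // 2 + j


def genotype_order(alleles: List[str]) -> List[str]:
    """List all diploid genotypes for the given alleles in GL order."""
    n = len(alleles)
    order: List[Optional[str]] = [None] * (n * (n + 1) // 2)
    for k in range(n):
        for j in range(k + 1):
            order[genotype_index(j, k)] = alleles[j] + alleles[k]
    return [g for g in order if g is not None]


if __name__ == "__main__":
    print(genotype_order(["A", "B"]))
    print(genotype_order(["A", "B", "C"]))
```

For alleles A, B, C this reproduces the triallelic ordering quoted in the hunk (AA, AB, BB, AC, BC, CC), so the index of a genotype can be used directly to pick its value out of a GL list.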
12 changes: 6 additions & 6 deletions VCFv4.3.tex
@@ -322,8 +322,8 @@ \subsubsection{Fixed fields}
In other words, the ALT field must be a symbolic allele, or a breakend replacement string, or match the regular expression \texttt{\^{}([ACGTNacgtn]+|\string\*|\string\.)\$}.
Tools processing VCF files are not required to preserve case in the allele String, except for IDs, which are case sensitive.
(String; no whitespace, commas, or angle-brackets are permitted in the ID String itself)
-\item QUAL --- quality: Phred-scaled quality score for the assertion made in ALT. i.e.\ $-10log_{10}$ prob(call in ALT is wrong).
-If ALT is `.' (no variant) then this is $-10log_{10}$ prob(variant), and if ALT is not `.' this is $-10log_{10}$ prob(no variant).
+\item QUAL --- quality: Phred-scaled quality score for the assertion made in ALT. i.e.\ $-10\log_{10}$ prob(call in ALT is wrong).
+If ALT is `.' (no variant) then this is $-10\log_{10}$ prob(variant), and if ALT is not `.' this is $-10\log_{10}$ prob(no variant).
If unknown, the MISSING value must be specified. (Float)
\item FILTER --- filter status: PASS if this position has passed all filters, i.e., a call is made at this position.
Otherwise, if the site has not passed all filters, a semicolon-separated list of codes for filters that fail. e.g.\ ``q10;s50'' might indicate that at this site the quality is below 10 and the number of samples with data is below 50\% of the total number of samples.
@@ -443,7 +443,7 @@ \subsubsection{Genotype fields}
Again, use PASS to indicate that all filters have been passed, a semicolon-separated list of codes for filters that fail, or `.' to indicate that filters have not been applied.
These values should be described in the meta-information in the same way as FILTERs.
No whitespace or semicolons permitted.
-\item GQ (Integer): Conditional genotype quality, encoded as a phred quality $-10log_{10}$ p(genotype call is wrong, conditioned on the site's being variant).
+\item GQ (Integer): Conditional genotype quality, encoded as a phred quality $-10\log_{10}$ p(genotype call is wrong, conditioned on the site's being variant).
\item GP (Float): Genotype posterior probabilities in the range 0 to 1 using the same ordering as the GL field; one use can be to store imputed genotype probabilities.
\item GT (String): Genotype, encoded as allele values separated by either of $/$ or $\mid$.
The allele values are 0 for the reference allele (what is in the REF field), 1 for the first allele listed in ALT, 2 for the second allele list in ALT and so on.
@@ -457,7 +457,7 @@ \subsubsection{Genotype fields}
\item $\mid$ : genotype phased
\end{itemize}
-\item GL (Float): Genotype likelihoods comprised of comma separated floating point $log_{10}$-scaled likelihoods for all possible genotypes given the set of alleles defined in the REF and ALT fields.
+\item GL (Float): Genotype likelihoods comprised of comma separated floating point $\log_{10}$-scaled likelihoods for all possible genotypes given the set of alleles defined in the REF and ALT fields.
In presence of the GT field the same ploidy is expected; without GT field, diploidy is assumed.
\textsc{Genotype Ordering.} \label{genotype-fields:genotype-ordering}
@@ -641,8 +641,8 @@ \section{FORMAT keys used for structural variants}
\normalsize
These keys are analogous to GT/GQ/GL/GP and are provided for genotyping imprecise events by copy number (either because there is an unknown number of alternate alleles or because the haplotypes cannot be determined).
CN specifies the integer copy number of the variant in this sample.
-CNQ is encoded as a phred quality $-10log_{10}$ p(copy number genotype call is wrong).
-CNL specifies a list of $log_{10}$ likelihoods for each potential copy number, starting from zero.
+CNQ is encoded as a phred quality $-10\log_{10}$ p(copy number genotype call is wrong).
+CNL specifies a list of $\log_{10}$ likelihoods for each potential copy number, starting from zero.
CNP is 0 to 1-scaled copy number posterior probabilities (and otherwise defined precisely as the CNL field), intended to store imputed genotype probabilities.
When possible, GT/GQ/GL/GP should be used instead of (or in addition to) these keys.
12 changes: 6 additions & 6 deletions VCFv4.4.draft.tex
@@ -327,8 +327,8 @@ \subsubsection{Fixed fields}
In other words, the ALT field must be a symbolic allele, or a breakend replacement string, or match the regular expression \texttt{\^{}([ACGTNacgtn]+|\string\*|\string\.)\$}.
Tools processing VCF files are not required to preserve case in the allele String, except for IDs, which are case sensitive.
(String; no whitespace, commas, or angle-brackets are permitted in the ID String itself)
-\item QUAL --- quality: Phred-scaled quality score for the assertion made in ALT. i.e.\ $-10log_{10}$ prob(call in ALT is wrong).
-If ALT is `.' (no variant) then this is $-10log_{10}$ prob(variant), and if ALT is not `.' this is $-10log_{10}$ prob(no variant).
+\item QUAL --- quality: Phred-scaled quality score for the assertion made in ALT. i.e.\ $-10\log_{10}$ prob(call in ALT is wrong).
+If ALT is `.' (no variant) then this is $-10\log_{10}$ prob(variant), and if ALT is not `.' this is $-10\log_{10}$ prob(no variant).
If unknown, the MISSING value must be specified. (Float)
\item FILTER --- filter status: PASS if this position has passed all filters, i.e., a call is made at this position.
Otherwise, if the site has not passed all filters, a semicolon-separated list of codes for filters that fail. e.g.\ ``q10;s50'' might indicate that at this site the quality is below 10 and the number of samples with data is below 50\% of the total number of samples.
@@ -448,7 +448,7 @@ \subsubsection{Genotype fields}
Again, use PASS to indicate that all filters have been passed, a semicolon-separated list of codes for filters that fail, or `.' to indicate that filters have not been applied.
These values should be described in the meta-information in the same way as FILTERs.
No whitespace or semicolons permitted.
-\item GQ (Integer): Conditional genotype quality, encoded as a phred quality $-10log_{10}$ p(genotype call is wrong, conditioned on the site's being variant).
+\item GQ (Integer): Conditional genotype quality, encoded as a phred quality $-10\log_{10}$ p(genotype call is wrong, conditioned on the site's being variant).
\item GP (Float): Genotype posterior probabilities in the range 0 to 1 using the same ordering as the GL field; one use can be to store imputed genotype probabilities.
\item GT (String): Genotype, encoded as allele values separated by either of $/$ or $\mid$.
The allele values are 0 for the reference allele (what is in the REF field), 1 for the first allele listed in ALT, 2 for the second allele list in ALT and so on.
@@ -462,7 +462,7 @@ \subsubsection{Genotype fields}
\item $\mid$ : genotype phased
\end{itemize}
-\item GL (Float): Genotype likelihoods comprised of comma separated floating point $log_{10}$-scaled likelihoods for all possible genotypes given the set of alleles defined in the REF and ALT fields.
+\item GL (Float): Genotype likelihoods comprised of comma separated floating point $\log_{10}$-scaled likelihoods for all possible genotypes given the set of alleles defined in the REF and ALT fields.
In presence of the GT field the same ploidy is expected; without GT field, diploidy is assumed.
\textsc{Genotype Ordering.} \label{genotype-fields:genotype-ordering}
@@ -774,8 +774,8 @@ \section{FORMAT keys used for structural variants}
\normalsize
These keys are analogous to GT/GQ/GL/GP and are provided for genotyping imprecise events by copy number (either because there is an unknown number of alternate alleles or because the haplotypes cannot be determined).
CN specifies the integer copy number of the variant in this sample.
-CNQ is encoded as a phred quality $-10log_{10}$ p(copy number genotype call is wrong).
-CNL specifies a list of $log_{10}$ likelihoods for each potential copy number, starting from zero.
+CNQ is encoded as a phred quality $-10\log_{10}$ p(copy number genotype call is wrong).
+CNL specifies a list of $\log_{10}$ likelihoods for each potential copy number, starting from zero.
CNP is 0 to 1-scaled copy number posterior probabilities (and otherwise defined precisely as the CNL field), intended to store imputed genotype probabilities.
When possible, GT/GQ/GL/GP should be used instead of (or in addition to) these keys.
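
Note on the Phred expressions whose typesetting this commit changes (QUAL, GQ, CNQ): the quantity $-10\log_{10}$ p can be illustrated with a small Python sketch. It is not part of the commit or the specification; the function names are hypothetical, and pl_from_gl follows only the PL wording in the VCFv4.2.tex hunk ($-10 \log_{10}$ scaled likelihoods rounded to the closest integer), with no further normalization assumed.

```python
import math
from typing import List


def phred_from_prob(p_error: float) -> float:
    """Phred-scaled quality: -10 * log10(probability the call is wrong)."""
    if not 0.0 < p_error <= 1.0:
        raise ValueError("p_error must be in (0, 1]")
    return -10.0 * math.log10(p_error)


def prob_from_phred(q: float) -> float:
    """Inverse mapping: error probability for a Phred-scaled quality q."""
    return 10.0 ** (-q / 10.0)


def pl_from_gl(gls: List[float]) -> List[int]:
    """Convert log10-scaled likelihoods (GL) to -10*log10 scaled integers.

    GL values are already log10 likelihoods, so the conversion is a
    multiplication by -10 followed by rounding. Python's round() sends
    exact .5 ties to the nearest even integer, which is an
    implementation detail of this sketch.
    """
    return [round(-10.0 * g) for g in gls]


if __name__ == "__main__":
    print(phred_from_prob(0.001))
    print(prob_from_phred(20))
    print(pl_from_gl([-323.03, -99.29, -802.53]))
```

An error probability of 0.001 corresponds to a quality of about 30, and a quality of 20 to an error probability of about 0.01. Applied to the GL example from the VCFv4.2.tex hunk (-323.03, -99.29, -802.53), the scaled integers are 3230, 993 and 8025.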
