Merge branch '2.x' into 2.0.0-ballot.2024-08

ga4gh · Sep 4, 2024 · commit 878fc80
2 parents ae211f4 + d98a0f3
Showing 53 changed files with 385 additions and 416 deletions.
diff --git a/docs/source/appendices/ga4gh_identifiers.rst b/docs/source/appendices/ga4gh_identifiers.rst
@@ -5,7 +5,7 @@ GA4GH Computed Identifier Alignment
 
 This appendix describes alignment on standard practices for
 serializing data, computing digests on serialized data, and
-constructing CURIE identifiers from the digests.  Essentially, it is a
+constructing CURIE identifiers from the digests. Essentially, it is a
 generalization of the :ref:`computed-identifiers` section.
 
 This mechanism for generating identifiers has been in place
@@ -18,23 +18,23 @@ The GA4GH mission entails structuring, connecting, and sharing data
 reliably. A key component of this effort is to be able to *identify*
 entities, that is, to associate identifiers with entities. Ideally,
 there will be exactly one identifier for each entity, and one entity
-for each identifier.  Traditionally, identifiers are assigned to
+for each identifier. Traditionally, identifiers are assigned to
 entities, which means that disconnected groups must coordinate on
 identifier assignment.
 
-The computed identifier scheme used in VRS computes identifiers 
-from the data itself.  Because identifers depend on the data, groups 
-that independently generate the same variation will generate the same 
-computed identifier for that entity, thereby obviating centralized 
-identifier systems and enabling identifiers to be used in isolated 
-settings such as clinical labs. 
+The computed identifier scheme used in VRS computes identifiers
+from the data itself. Because identifiers depend on the data, groups
+that independently generate the same variation will generate the same
+computed identifier for that entity, thereby obviating centralized
+identifier systems and enabling identifiers to be used in isolated
+settings such as clinical labs.
 
 The computed identifier mechanism is broadly applicable and useful to
-the entire GA4GH ecosystem.  Adopting a common identifier scheme will
+the entire GA4GH ecosystem. Adopting a common identifier scheme will
 make interoperability of GA4GH entities more obvious to consumers,
 will enable the entire organization to share common entity definitions
 (such as sequence identifiers), and will enable all GA4GH products to
-share tooling that manipulate identified data.  In short, it provides
+share tooling that manipulate identified data. In short, it provides
 an important consistency within the GA4GH ecosystem.
 
 Here we detail alignment between VRS and other GA4GH products to work
@@ -70,7 +70,7 @@ reference:
 GA4GH Digest Keys
 #################
 When creating computed identifiers from objects, VRS uses a custom schema
-attribute, ``ga4ghDigest``, that contains the keys used for filtering out 
+attribute, *ga4ghDigest*, that contains the keys used for filtering out
 properties. For example, the Allele JSON Schema:
 
 .. parsed-literal::
@@ -95,8 +95,8 @@ properties. For example, the Allele JSON Schema:
 
 .. note::
 
-  The `ga4ghDigest` property names are currently being aligned with the Sequence 
-  Collections effort (see `SeqCol#84 <https://github.com/ga4gh/refget/issues/84>`_) 
+  The `ga4ghDigest` property names are currently being aligned with the Sequence
+  Collections effort (see `SeqCol#84 <https://github.com/ga4gh/refget/issues/84>`_)
   and may potentially change.
 
 GA4GH Type Prefixes
@@ -114,9 +114,9 @@ We use the following guidelines for type prefixes:
 
 * Prefixes SHOULD be short, approximately 2-4 characters.
 * Prefixes SHOULD be used only for concrete classes, not abstract parent classes.
-* Prefixes SHOULD be used only for stand-alone classes (e.g. :ref:`Variation`, :ref:`Location`), 
+* Prefixes SHOULD be used only for stand-alone classes (e.g. :ref:`Variation`, :ref:`Location`),
   not classes that require additional context to be meaningful (e.g. :ref:`Range`, :ref:`SequenceExpression`)
-  or are primarily used for adding descriptive context to external data types (e.g. :ref:`SequenceReference`) 
+  or are primarily used for adding descriptive context to external data types (e.g. :ref:`SequenceReference`)
 * A prefix MUST map 1:1 with a schema.
 
 Administration
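The serialize, digest, and CURIE steps named in the hunks above can be sketched as follows. This is a minimal illustration, not the VRS reference implementation. The helper names `sha512t24u` and `computed_identifier` are defined here for the example, the `json.dumps` call is an assumed stand-in for the canonical serialization rules, and the sample object with its list of digest keys is made up. `VA` is used only as an example type prefix.

```python
import base64
import hashlib
import json


def sha512t24u(blob: bytes) -> str:
    """SHA-512 digest, truncated to 24 bytes, then base64url-encoded."""
    digest = hashlib.sha512(blob).digest()
    # 24 bytes is a multiple of 3, so the encoded string carries no '=' padding.
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


def computed_identifier(obj: dict, digest_keys: list, prefix: str) -> str:
    """Build a CURIE-style identifier from the digest of selected properties."""
    # Keep only the properties that take part in the digest.
    filtered = {k: obj[k] for k in digest_keys if k in obj}
    # Assumption: sorted keys and compact separators approximate the
    # canonical serialization; the real rules may differ in detail.
    blob = json.dumps(filtered, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return f"ga4gh:{prefix}.{sha512t24u(blob)}"


# Hypothetical object: "label" is descriptive and excluded from the digest.
example = {"type": "Allele", "label": "free text", "state": "T"}
print(computed_identifier(example, ["state", "type"], "VA"))
```

Because only the listed keys are serialized, two groups that attach different labels to the same object still compute the same identifier.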

diff --git a/docs/source/appendices/glossary.rst b/docs/source/appendices/glossary.rst
@@ -12,10 +12,10 @@ Glossary
       data.
 
    digest, ga4gh_digest
-      A digest is a digital fingerprint of a block of binary data.  A
+      A digest is a digital fingerprint of a block of binary data. A
       digest is always the same size, regardless of the size of the
-      input data.  It is statistically extremely unlikely for two
-      fingerprints to match when the underlying data are distinct. 
+      input data. It is statistically extremely unlikely for two
+      fingerprints to match when the underlying data are distinct.
 
    identifiable object
       An identifiable object in VRS is any data structure for

diff --git a/docs/source/appendices/truncated_digest_collision_analysis.rst b/docs/source/appendices/truncated_digest_collision_analysis.rst
@@ -12,7 +12,7 @@ of truncation length.
   <https://github.com/biocommons/biocommons.seqrepo/blob/master/docs/Truncated%20Digest%20Collision%20Analysis.ipynb>`__
   in `Python SeqRepo library
   <https://github.com/biocommons/biocommons.seqrepo>`__ for code and
-  updates.  A fuller explanation is given in [Hart2020]_.
+  updates. A fuller explanation is given in [Hart2020]_.
 
 
 Conclusions
@@ -30,11 +30,11 @@ Conclusions
     import hashlib
     import math
     import timeit
-    
+
     from IPython.display import display, Markdown
-    
+
     from ga4gh.vrs.extras.utils import _format_time
-    
+
     algorithms = {'sha512', 'sha1', 'sha256', 'md5', 'sha224', 'sha384'}
 
 
@@ -49,16 +49,16 @@ basis for the Truncated Digest.
     def blob(l):
         """return binary blob of length l (POSIX only)"""
         return open("/dev/urandom", "rb").read(l)
-    
+
     def digest(alg, blob):
         md = hashlib.new(alg)
         md.update(blob)
         return md.digest()
-    
+
     def magic_run1(alg, blob):
         t = %timeit -o digest(alg, blob)
         return t
-    
+
     def magic_tfmt(t):
         """format TimeitResult for table"""
         return "{a} ± {s} ([{b}, {w}])".format(
@@ -159,15 +159,15 @@ in a corpus is difficult. Instead, we first seek to solve for
 the digests are unique). Because there are only two outcomes,
 :math:`P + P' = 1` or, equivalently, :math:`P = 1 - P'`.
 
-For a corpus of size :math:`m=1`, the probabability that the digests of
+For a corpus of size :math:`m=1`, the probability that the digests of
 all :math:`m=1` messages are unique is (trivially) 1:
 
 .. math:: P' = s/s = 1
 
 because there are :math:`s` ways to choose the first digest from among
 :math:`s` possible values without a collision.
 
-For a corpus of size :math:`m=2`, the probabability that the digests of
+For a corpus of size :math:`m=2`, the probability that the digests of
 all :math:`m=2` messages are unique is:
 
 .. math:: P' = 1 \times (\frac{s-1}{s})
@@ -211,7 +211,7 @@ The Taylor series expansion of the exponential function is
 .. math:: e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + ...
 
 For :math:`|x| \ll 1`, the expansion is dominated by the first terms and
-therecore :math:`e^x \approx 1 + x`.
+therefore :math:`e^x \approx 1 + x`.
 
 In the above expression for :math:`P'`, note that the product term
 :math:`(s-i)/s` is equivalent to :math:`1-i/s`. Combining this with the
@@ -270,13 +270,13 @@ collisions.
      - Assumptions
      - Source/Comparison
    * - exact
-     - :math:`\prod_\nolimits{i=0}^{m-1} \frac{(s-i)}{s}`     
+     - :math:`\prod_\nolimits{i=0}^{m-1} \frac{(s-i)}{s}`
      - :math:`1-P'`
      - :math:`1 \le m\le s`
      - [1]
    * - Taylor approximation on #1
      - :math:`e^{-m(m-1)/2s}`
-     - :math:`1-P'` 
+     - :math:`1-P'`
      - :math:`m \ll s`
      - [1]
    * - Taylor approximation on #2
@@ -286,7 +286,7 @@ collisions.
      - [1]
    * - Large square approximation
      - :math:`1 - \frac{m^2}{2s}`
-     - :math:`\frac{m^2}{2s}` 
+     - :math:`\frac{m^2}{2s}`
      - (same)
      - [2] (where :math:`s=2^n`)
 
@@ -347,20 +347,20 @@ This equation is not used further in this analysis.
 
     def b2B3(b):
         """Convert bits b to Bytes, rounded up modulo 3
-    
+
         We report modulo 3 because the intent will be to use Base64 encoding, which is
         most efficient when inputs have a byte length modulo 3. (Otherwise, the resulting
         string is padded with characters that provide no information.)
-        
+
         """
         return math.ceil(b/8/3) * 3
-        
+
     def B(P, m):
         """return the number of bits needed to achieve a collision probability
         P for m messages
-    
+
         Assumes m << 2^b.
-        
+
         """
         b = math.log2(m**2 / P) - 1
         if b < 5 + math.log2(m):
@@ -417,4 +417,3 @@ digest length (bytes) required for expected collision probability :math:`P` over
 | 1e+ | 39  | 39  | 36  | 36  | 33  | 33  | 30  | 30  | 30  | 27  | 27  |
 | 30  |     |     |     |     |     |     |     |     |     |     |     |
 +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
-
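The collision-probability expressions tabulated in this file can be compared numerically with a short sketch. The function names below are hypothetical and chosen for this example; the formulas are the exact product, the first Taylor approximation, and the large-square approximation from the table, with :math:`s` the number of possible digest values and :math:`m` the corpus size.

```python
import math


def p_collision_exact(m: int, s: int) -> float:
    """1 - P', where P' is the product of (s - i)/s for i = 0 .. m-1."""
    p_unique = 1.0
    for i in range(m):
        p_unique *= (s - i) / s
    return 1.0 - p_unique


def p_collision_taylor(m: int, s: int) -> float:
    """1 - exp(-m(m-1)/2s); intended for m much smaller than s."""
    # expm1 keeps precision when the exponent is close to zero.
    return -math.expm1(-m * (m - 1) / (2 * s))


def p_collision_square(m: int, s: int) -> float:
    """Large-square approximation m^2 / 2s."""
    return m * m / (2 * s)


# Compare the three estimates for 1000 messages and a 32-bit digest space.
m, s = 1000, 2**32
for fn in (p_collision_exact, p_collision_taylor, p_collision_square):
    print(f"{fn.__name__}: {fn(m, s):.6e}")
```

The exact product loops over every message, so it is only practical for small corpora; the approximations are what make the large-corpus table computable.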
diff --git a/docs/source/concepts/LocationAndReference/SequenceLocation.rst b/docs/source/concepts/LocationAndReference/SequenceLocation.rst
@@ -5,7 +5,7 @@ Sequence Location
 
 The sequence location class is a fundamental concept in VRS. Sequence locations are used to describe every form of :ref:`Variation`,
 and they have stand-alone utility for describing sequence locations in other (non-variation) contexts.
-This class is used to represent a location on a specified :ref:`SequenceReference`. The sequence reference is typically a 
+This class is used to represent a location on a specified :ref:`SequenceReference`. The sequence reference is typically a
 chromosome, transcript, or protein sequence.
 
 Definition and Information Model
@@ -19,25 +19,25 @@ Implementation Guidance
 Start, End, and Ranges
 ######################
 
-At least one of the ``start`` and ``end`` properties MUST be specified in any ``SequenceLocation`` instance.
+At least one of the *start* and *end* properties MUST be specified in any ``SequenceLocation`` instance.
 When only one of these properties is specified, this represents an open interval beginning at the specified
-coordinate and extending left (when ``start`` is ``null``) or right (when ``end`` is ``null``).
+coordinate and extending left (when *start* is ``null``) or right (when *end* is ``null``).
 
-When there is ambiguity at a coordinate (e.g., when using a SequenceLocation to describe the confidence boundary 
+When there is ambiguity at a coordinate (e.g., when using a ``SequenceLocation`` to describe the confidence boundary
 of a copy number segment), this is specified using the :ref:`Range` class for that coordinate.
 
 .. admonition:: New in v2
 
-    In VRS v1, the ``SequenceLocation`` class had an ``interval`` property which contained ``start`` and ``end``
-    attributes. This intermediate object layer has been removed in v2.0, making ``start`` and ``end``
+    In VRS v1, the ``SequenceLocation`` class had an *interval* property which contained *start* and *end*
+    attributes. This intermediate object layer has been removed in v2.0, making *start* and *end*
     top-level properties of the ``SequenceLocation``.
 
 The "Ref" Allele
 ################
 
 In some variant representation formats (e.g. HGVS, VCF) sequence variants are described by both their "reference"
 (ref) and "alternate" (alt) alleles. When representing an Allele with VRS v2, it is also possible to describe the
-ref sequence (derived from the :ref:SequenceReference at the location) using the `sequence` property.
+ref sequence (derived from the :ref:SequenceReference at the location) using the *sequence* property.
 
 .. admonition:: New in v2
 
@@ -49,21 +49,21 @@ Linear and Circular Sequence Coordinates
 
 When representing a linear sequence, it is expected that for a :ref:`Sequence` of length *n*, ``0 ≤ start ≤ end ≤ n``
 
-For a circular sequence, ``0 ≤ end ≤ start ≤ n`` is also allowed. In cases where ``end < start``, this represents 
+For a circular sequence, ``0 ≤ end ≤ start ≤ n`` is also allowed. In cases where ``end < start``, this represents
 a location that spans the circular sequence origin coordinate.
 
 .. admonition:: New in v2
 
-    The v2 ``SequenceLocation`` now also supports circular sequences. The optional ``circular`` property of the 
+    The v2 ``SequenceLocation`` now also supports circular sequences. The optional *circular* property of the
     :ref:`SequenceReference` class may be set to ``True`` or ``False`` to explicitly indicate if a reference is
     circular, and therefore if ``0 ≤ end ≤ start ≤ n`` is also allowed.
 
 Implied Sequence Coordinates
 ############################
 
-The *Sequence Location* class refers to coordinates on a :ref:`SequenceReference`; if that sequence 
+The ``Sequence Location`` class refers to coordinates on a :ref:`SequenceReference`; if that sequence
 represents a coding transcript, then the coordinates refer to the coding transcript, and not a
-chromosome sequence to which it aligns. VRS intentionally does not allow for `start` or `end` values
+chromosome sequence to which it aligns. VRS intentionally does not allow for *start* or *end* values
 that use an offset system to represent sequence not found on the :ref:`SequenceReference`.
 
 .. TODO:: Describe and add a ref to an intronic variant profile
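The coordinate rules described in this file (at least one of *start* and *end* present, ``0 ≤ start ≤ end ≤ n`` for linear sequences, and ``0 ≤ end ≤ start ≤ n`` additionally allowed for circular ones) can be expressed as a small check. This is a sketch under stated assumptions: the function name `coordinates_valid` is hypothetical, coordinates are plain integers or `None`, and :ref:`Range` values are not handled.

```python
from typing import Optional


def coordinates_valid(
    start: Optional[int],
    end: Optional[int],
    n: int,
    circular: bool = False,
) -> bool:
    """Check integer start/end against a sequence of length n."""
    # At least one of start and end must be specified.
    if start is None and end is None:
        return False
    # A single coordinate describes an open interval; only bounds-check it.
    if start is None:
        return 0 <= end <= n
    if end is None:
        return 0 <= start <= n
    # Linear ordering is always acceptable.
    if 0 <= start <= end <= n:
        return True
    # end < start is only meaningful when the reference is circular,
    # where the location spans the origin coordinate.
    return circular and 0 <= end <= start <= n


print(coordinates_valid(10, 20, 100))                 # linear interval
print(coordinates_valid(90, 5, 100, circular=True))   # spans the origin
```

Passing `circular` explicitly mirrors the optional *circular* property on the sequence reference; a caller that does not know whether a reference is circular would have to decide how strict to be.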
diff --git a/docs/source/concepts/MolecularVariation/Adjacency.rst b/docs/source/concepts/MolecularVariation/Adjacency.rst
@@ -7,9 +7,9 @@ Adjacency
 
    The Adjacency class was added in v2 to describe structural variation.
 
-The adjacency class is a core concept for structural variation, representing the junction point of 
+The adjacency class is a core concept for structural variation, representing the junction point of
 two adjoined molecules. This class can be used on its own (e.g. for junctions of chimeric transcript fusions)
-or in higher order structures such as :ref:`DerivativMolecule` to represent molecules derived from multiple
+or in higher order structures such as :ref:`DerivativeMolecule` to represent molecules derived from multiple
 adjacencies (e.g. for translocations).
 
 Definition and Information Model
@@ -28,7 +28,7 @@ of the provided :ref:`SequenceReference`. These types of adjacencies are common
 can be found, for example, on either end of a chromosomal inversion.
 
 To represent this, the :ref:`SequenceLocation` used by each partner of the adjacency is defined using
-only one of the `start` or `end` attributes. Defining the location by `start` means that the sequence content 
+only one of the `start` or `end` attributes. Defining the location by `start` means that the sequence content
 extends right (increases) on the :ref:`SequenceReference`, and defining the location by `end` means that the
 sequence content extends left (decreases) on the :ref:`SequenceReference`.
 
@@ -41,18 +41,18 @@ sequence content extends left (decreases) on the :ref:`SequenceReference`.
 .. figure:: ../../images/ex_revcomp_breakpoint.png
 
    **An example Adjacency with a reverse complement partner.** The chromosome 1 sequence extends left from
-   position 1:87337011 and so is defined by the location `start`. The chromosome 10 sequence *also* extends left 
+   position 1:87337011 and so is defined by the location `start`. The chromosome 10 sequence *also* extends left
    from position 10:36119127 and so is *also* defined by the location `start`. Reading left-to-right along this
    adjacency one would expect reference sequence up to the adjacency and reverse complement sequence following.
 
 Normalization
 #############
 
-Conventions for ordering sequences and handling ambiguous sequence Adjacencies are described in 
+Conventions for ordering sequences and handling ambiguous sequence Adjacencies are described in
 :ref:`adjacency-normalization`.
 
 Linker Sequences
 ################
 
-Intervening sequences between adjoined sequences in an adjacency are called *linker sequences* and may be specified 
+Intervening sequences between adjoined sequences in an adjacency are called *linker sequences* and may be specified
 with a :ref:`SequenceExpression`.
diff --git a/docs/source/concepts/MolecularVariation/Allele.rst b/docs/source/concepts/MolecularVariation/Allele.rst
@@ -3,8 +3,8 @@
 Allele
 !!!!!!
 
-The allele class is used for representing contiguous changes on a reference sequence. This class covers the most 
-commonly described forms of variation, including all "small" variants such as SNVs and indels that are also representable 
+The allele class is used for representing contiguous changes on a reference sequence. This class covers the most
+commonly described forms of variation, including all "small" variants such as SNVs and indels that are also representable
 in other contemporary genomic variant formats, such as SPDI, HGVS, and VCF.
 
 Definition and Information Model
@@ -18,24 +18,24 @@ Implementation Guidance
 Sequence Location Coordinates
 #############################
 
-The ``location`` property of the allele will almost always have ``start`` and ``end`` coordinates that are specified using
+The *location* property of the allele will almost always have *start* and *end* coordinates that are specified using
 integers (not :ref:`Range`). There are some situations, such as the detection of deleted sequence by microarray, where it may
 be appropriate to represent the variant as an Allele; however, other classes for representing such findings should also be
 considered (e.g. :ref:`CopyNumberCount`).
 
 Normalization
 #############
 
-The ``Allele`` also includes conventions for variant normalization (see :ref:`allele-normalization`) that allows for compact and 
+The ``Allele`` also includes conventions for variant normalization (see :ref:`allele-normalization`) that allows for compact and
 uniform representation of variants.
 
 .. admonition:: New in v2
 
     In VRS v1.x, normalization included methods for full justification of variants, as derived from the NCBI `VOCA`_ algorithm.
-    In v2, this has been extended to include reference length encoding (see :ref:`ReferenceLengthExpression`), to 
+    In v2, this has been extended to include reference length encoding (see :ref:`ReferenceLengthExpression`), to
     accommodate compressed representation of variants that occur in large repetitive regions.
 
-    For alleles in small repeating regions, it may be convenient to also use the ``ReferenceLengthExpression.sequence`` attribute
+    For alleles in small repeating regions, it may be convenient to also use the *ReferenceLengthExpression.sequence* attribute
     to represent the sequence state explicitly alongside the reference encoding.
 
 .. _VOCA: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7523648/
@@ -46,4 +46,4 @@ Expressions
 .. admonition:: New in v2
 
     The v2 :ref:`variation` classes now support :ref:`expressions`. This is a convenient mechanism for annotating Alleles using
-    string syntaxes following the conventions other variant standards (e.g. HGVS, SPDI) and resources (e.g. ClinVar, gnomAD).
+    string syntaxes following the conventions other variant standards (e.g. HGVS, SPDI) and resources (e.g. ClinVar, gnomAD).