cli-assemble-help.txt

usage: MCMC haplotype assembly [-h] [--region REGION] [--region-id REGION_ID]
                               [--targets TARGETS] [--variants VARIANTS]
                               [--bam BAM [BAM ...]] [--ploidy PLOIDY]
                               [--inbreeding INBREEDING]
                               [--sample-pool SAMPLE_POOL]
                               [--reference REFERENCE]
                               [--base-error-rate BASE_ERROR_RATE]
                               [--use-base-phred-scores]
                               [--mapping-quality MAPPING_QUALITY]
                               [--keep-duplicate-reads] [--keep-qcfail-reads]
                               [--keep-supplementary-reads]
                               [--read-group-field READ_GROUP_FIELD]
                               [--report [REPORT ...]] [--cores CORES]
                               [--mcmc-chains MCMC_CHAINS]
                               [--mcmc-steps MCMC_STEPS]
                               [--mcmc-burn MCMC_BURN] [--mcmc-seed MCMC_SEED]
                               [--mcmc-chain-incongruence-threshold MCMC_CHAIN_INCONGRUENCE_THRESHOLD]
                               [--mcmc-fix-homozygous MCMC_FIX_HOMOZYGOUS]
                               [--mcmc-llk-cache-threshold MCMC_LLK_CACHE_THRESHOLD]
                               [--mcmc-recombination-step-probability MCMC_RECOMBINATION_STEP_PROBABILITY]
                               [--mcmc-dosage-step-probability MCMC_DOSAGE_STEP_PROBABILITY]
                               [--mcmc-partial-dosage-step-probability MCMC_PARTIAL_DOSAGE_STEP_PROBABILITY]
                               [--mcmc-temperatures [MCMC_TEMPERATURES ...]]
                               [--haplotype-posterior-threshold HAPLOTYPE_POSTERIOR_THRESHOLD]

options:
  -h, --help            show this help message and exit
  --region REGION       Specify a single target region with the format
                        contig:start-stop. This region will be a single
                        variant in the output VCF. This argument can not be
                        combined with the --targets argument.
  --region-id REGION_ID
                        Specify an identifier for the locus specified with the
                        --region argument. This id will be reported in the
                        output VCF.
  --targets TARGETS     Bed file containing multiple genomic intervals for
                        haplotype assembly. First three columns (contig,
                        start, stop) are mandatory. If present, the fourth
                        column (id) will be used as the variant id in the
                        output VCF.This argument can not be combined with the
                        --region argument.
  --variants VARIANTS   Tabix indexed VCF file containing SNP variants to be
                        used in assembly. Assembled haplotypes will only
                        contain the reference and alternate alleles specified
                        within this file.
  --bam BAM [BAM ...]   Bam file(s) to use in analysis. This may be (1) a list
                        of one or more bam filepaths, (2) a plain-text file
                        containing a single bam filepath on each line, (3) a
                        plain-text file containing a sample identifier and its
                        corresponding bam filepath on each line separated by a
                        tab. If options (1) or (2) are used then all samples
                        within each bam will be used within the analysis. If
                        option (3) is used then only the specified sample will
                        be extracted from each bam file and An error will be
                        raised if a sample is not found within its specified
                        bam file.
  --ploidy PLOIDY       Specify sample ploidy (default = 2).This may be (1) a
                        single integer used to specify the ploidy of all
                        samples or (2) a file containing a list of all samples
                        and their ploidy. If option (2) is used then each line
                        of the plaintext file must contain a single sample
                        identifier and the ploidy of that sample separated by
                        a tab.
  --inbreeding INBREEDING
                        Specify expected sample inbreeding coefficient
                        (default = 0.0).This may be (1) a single floating
                        point value in the interval [0, 1] used to specify the
                        inbreeding coefficient of all samples or (2) a file
                        containing a list of all samples and their inbreeding
                        coefficient. If option (2) is used then each line of
                        the plaintext file must contain a single sample
                        identifier and the inbreeding coefficient of that
                        sample separated by a tab.
  --sample-pool SAMPLE_POOL
                        WARNING: this is an experimental feature!!! Pool
                        samples together into a single genotype. This may be
                        (1) the name of a single pool for all samples or (2) a
                        file containing a list of all samples and their
                        assigned pool. If option (2) is used then each line of
                        the plaintext file must contain a single sample
                        identifier and the name of a pool separated by a
                        tab.Samples may be assigned to multiple pools by using
                        the same sample name on multiple lines.Each pool will
                        treated as a single genotype by combining all reads
                        from its constituent samples. Note that the pool names
                        should be used in place of the samples names when
                        assigning other per-sample parameters such as ploidy
                        or inbreeding coefficients.
  --reference REFERENCE
                        Indexed fasta file containing the reference genome.
  --base-error-rate BASE_ERROR_RATE
                        Expected base error rate of read sequences (default =
                        0.0024). The default value comes from Pfeiffer et al
                        2018 and is a general estimate for Illumina short
                        reads.
  --use-base-phred-scores
                        Flag: use base phred-scores as a source of base error
                        rate. This will use the phred-encoded per base scores
                        in addition to the general error rate specified by the
                        --base-error-rate argument. Using this option can slow
                        down assembly speed.
  --mapping-quality MAPPING_QUALITY
                        Minimum mapping quality of reads used in assembly
                        (default = 20).
  --keep-duplicate-reads
                        Flag: Use reads marked as duplicates in the assembly
                        (these are skipped by default).
  --keep-qcfail-reads   Flag: Use reads marked as qcfail in the assembly
                        (these are skipped by default).
  --keep-supplementary-reads
                        Flag: Use reads marked as supplementary in the
                        assembly (these are skipped by default).
  --read-group-field READ_GROUP_FIELD
                        Read group field to use as sample id (default = "SM").
                        The chosen field determines tha sample ids required in
                        other input files e.g. the --sample-list argument.
  --report [REPORT ...]
                        Extra fields to report within the output VCF. The
                        INFO/FORMAT prefix may be omitted to return both
                        variations of the named field. Options include:
                        INFO/AFPRIOR = Prior allele frequencies; INFO/ACP =
                        Posterior allele counts; INFO/AFP = Posterior mean
                        allele frequencies; INFO/AOP = Posterior probability
                        of allele occurring across all samples; INFO/AOPSUM =
                        Posterior estimate of the number of samples containing
                        an allele; INFO/SNVDP = Read depth at each SNV
                        position; FORMAT/ACP: Posterior allele counts;
                        FORMAT/AFP: Posterior mean allele frequencies;
                        FORMAT/AOP: Posterior probability of allele occurring;
                        FORMAT/GP: Genotype posterior probabilities;
                        FORMAT/GL: Genotype likelihoods; FORMAT/SNVDP: Read
                        depth at each SNV position
  --cores CORES         Number of cpu cores to use (default = 1).
  --mcmc-chains MCMC_CHAINS
                        Number of independent MCMC chains per assembly
                        (default = 2).
  --mcmc-steps MCMC_STEPS
                        Number of steps to simulate in each MCMC chain
                        (default = 2000).
  --mcmc-burn MCMC_BURN
                        Number of initial steps to discard from each MCMC
                        chain (default = 1000).
  --mcmc-seed MCMC_SEED
                        Random seed for MCMC (default = 42).
  --mcmc-chain-incongruence-threshold MCMC_CHAIN_INCONGRUENCE_THRESHOLD
                        Posterior probability threshold for identification of
                        incongruent posterior modes (default = 0.60).
  --mcmc-fix-homozygous MCMC_FIX_HOMOZYGOUS
                        Fix alleles that are homozygous with a probability
                        greater than or equal to the specified value (default
                        = 0.999). The probability of that a variant is
                        homozygous in a sample is assessed independently for
                        each variant prior to MCMC simulation. If an allele is
                        "fixed" it is not allowed vary within the MCMC thereby
                        reducing computational complexity.
  --mcmc-llk-cache-threshold MCMC_LLK_CACHE_THRESHOLD
                        Threshold for determining whether to cache log-
                        likelihoods during MCMC to improve performance. This
                        value is computed as ploidy * variants * unique-reads
                        (default = 100). If set to 0 then log-likelihoods will
                        be cached for all samples including those with few
                        observed reads which is inefficient and can slow the
                        MCMC. If set to -1 then log-likelihood caching will be
                        disabled for all samples.
  --mcmc-recombination-step-probability MCMC_RECOMBINATION_STEP_PROBABILITY
                        Probability of performing a recombination sub-step
                        during each step of the MCMC. (default = 0.5).
  --mcmc-dosage-step-probability MCMC_DOSAGE_STEP_PROBABILITY
                        Probability of performing a dosage sub-step during
                        each step of the MCMC. (default = 1.0).
  --mcmc-partial-dosage-step-probability MCMC_PARTIAL_DOSAGE_STEP_PROBABILITY
                        Probability of performing a within-interval dosage
                        sub-step during each step of the MCMC. (default =
                        0.5).
  --mcmc-temperatures [MCMC_TEMPERATURES ...]
                        Specify inverse-temperatures to use for parallel
                        tempered chains (default = 1.0 i.e., no tempering).
                        This may be either (1) a list of floating point values
                        or (2) a file containing a list of samples with mcmc
                        inverse-temperatures. If option (2) is used then the
                        file must contain a single sample per line followed by
                        a list of tab separated inverse temperatures. The
                        number of inverse-temperatures may differ between
                        samples and any samples not included in the list will
                        default to not using tempering.
  --haplotype-posterior-threshold HAPLOTYPE_POSTERIOR_THRESHOLD
                        Posterior probability required for a haplotype to be
                        included in the output VCF as an alternative allele.
                        The posterior probability of each haplotype is
                        assessed per individual and calculated as the
                        probability of that haplotype being present with one
                        or more copies in that individual.A haplotype is
                        included as an alternate allele if it meets this
                        posterior probability threshold in at least one
                        individual. This parameter is the main mechanism to
                        control the number of alternate alleles in ech VCF
                        record and hence the number of genotypes assessed when
                        recalculating likelihoods and posterior distributions
                        (default = 0.20).