To get started, modify the example file below:
See sections below for explainations of each field.
The path to a file specifying the input data sets. See :doc:`input` for instructions for creating the input file.
input: "data/input.yml"
Directory containing the results.
output_dir: "output/"
(Optional.) The name of the genome assembly. This is an optional parameter.
We use this parameter to automatically determine the values of genome
, annotation
and motif_file
.
If your genome assembly is not supported by the program, you will need to mannually
set the parameters for genome
, annotation
and motif_file
.
We currently support the following assemblies: "GRCh38", "hg38", "GRCm38" or "mm10".
assembly: "GRCh38"
(Optional.) Complete genome in a SINGLE ungzipped FASTA file.
If not specified, this will be downloaded according to assembly
.
genome: "/home/kai/genome/GRCh38/genome.fa"
(Optional.) Genome annotation in GTF format. For human and mouse, Gencode annotations are available at http://www.gencodegenes.org/. Very important: chromosome names in the annotations GTF file have to match chromosome names in the FASTA genome sequence file. For example, one can use ENSEMBL FASTA files with ENSEMBL GTF files, and UCSC FASTA files with UCSC FASTA files. However, since UCSC uses chr1, chr2,... naming convention, and ENSEMBL uses 1, 2, ... naming, the ENSEMBL and UCSC FASTA and GTF files cannot be mixed together.
If not specified, this will be downloaded according to assembly
.
annotation: "/home/kai/genome/GRCh38/gencode.v25.annotation.gtf"
(Optional.)
A plain text file containing PWMs in MEME format.
The naming convention for motifs is TF_NAME+OTHER_STRING
, where
the TF_NAME
should match the gene names in your annotation file.
You may include multipe PWMs for same TFs and use OTHER_STRING
to distinguish
them. For example, SP1+ID1
, SP1+ID2
.
Taiji will combine sites found from different PWMs.
We provide human and mouse motifs here (downloaded from cisBP database):
:download:`Human <data/motifs/cisBP_human.meme>`
and :download:`mouse <data/motifs/cisBP_mouse.meme>` motif files.
When motif_file
is not specified, these will be used according to
the value of the assembly
field.
motif_file: "/home/kai/motif_databases/cisBP_human.meme"
(Optional.)
This is the FILE containing customized genome sequence index.
The program will detect whether the file exists.
If the index is not present, it will generate the index at the specified location.
If you leave this parameter unspecified,
the program will generate the index in the output_dir
.
To avoid re-generate the index for every project, we recommend you to set this parameter mannually.
genome_index: "/home/kai/genome/GRCh38/GRCh38.index".
(Optional.)
BWA genome index prefix.
The program will detect whether the directory contain proper BWA indices.
If the indices are not present, it will generate the indices at the specified
location. If you leave this parameter unspecified,
the program will generate the indices in the output_dir
.
To avoid re-generate the indices for every project, we recommend you to set this parameter mannually.
bwa_index: "/home/kai/genome/GRCh38/BWAIndex/genome.fa"
The "-k" parameter in BWA -- minimum seed length.
bwa_seed_length: 32
(Optional.)
This is the DIRECTORY containing STAR INDICES.
The program will detect whether the directory contain proper STAR indices.
If the indices are not present, it will generate the indices within the specified
directory. If you leave this parameter unspecified,
the program will generate the indices in the output_dir
.
To avoid re-generate the indices for every project, we recommend you to set this parameter mannually.
star_index: "/home/kai/genome/GRCh38/STAR_index/"
(Optional.)
RSEM genome index prefix.
The program will detect whether the directory contain proper RSEM indices.
If the indices are not present, it will generate the indices at the specified
location. If you leave this parameter unspecified,
the program will generate the indices in the output_dir
.
To avoid re-generate the indices for every project, we recommend you to set this parameter mannually.
rsem_index: "/home/kai/genome/GRCh38/RSEM_index/genome"
(Optional.) FDR threshold for peak calling in MACS2.
callpeak_fdr: 0.01
(Optional.)
The effective genome size used for MACS2's "-g/--gsize" parameter.
This will be automatically determined based on the assembly or genome file.
For human or mouse assembly, we set this parameter to "hs" or "mm".
For other genome, we set this parameter to 0.9 * GENOME_SIZE
.
The value of this parameter usually doesn't make big difference.
callpeak_genome_size: "2.7e9"
(Optional.) External network file to be used in PageRank analysis.
external_network: "pathway.tsv"
(Optional.) The directory for storing temporary files.
tmp_dir: "/tmp"
Parameters related to scATAC-seq analysis. All parameters in this section should be specified under scatac_options. For example:
scatac_options: cluster_optimizer: CPM cluster_by_window: false cluster_resolution_list: [0.001, 0.002, 0.02, 0.05, 0.1, 0.25, 1] cluster_resolution: 0.0075 tss_enrichment_cutoff: 7 do_subclustering: true cluster_exclude: ["C3.3", "C25.3"] subcluster_resolution: "C1": 0.01 "C2": 0.03 "C4": 0.05 "C5": 0.06 "C6": 0.05 "C7": 0.03 "C8": 0.01 "C10": 0.001 "C11": 0.001
(Optional.) TSS enrichment cutoff for filtering cell in single cell ATAC-seq analysis.
tss_enrichment_cutoff: 7
(Optional.) Used to remove cells that do not have enough fragments/reads.
fragment_cutoff: 1000
- ::
- doublet_score_cutoff: 0.5
(Optional.)
- ::
- cell_barcode_length: 10
(Optional.) Quality function used in graph clustering. Available options are RBConfiguration and CPM. RBConfiguration optimizes modularity and has resolution limit while CPM is resolution-limit free.
cluster_optimizer: CPM
Mannually specify the clustering resolution. Otherwise it will be automatically determined.
cluster_resolution: 1
cluster_resolution_list: [0.005, 0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 0.8, 1]
Whether to use bin/window based clustering. Default is to use peak based clustering.
- ::
- cluster_by_window: False
Window/bin size.
- ::
- window_size: 5000
Whether to perform iterative clustering to identify subclusters.
- ::
- do_subclustering: False
Clusters to be excluded from the final result.
The length of the cell barcode used in demultiplexing.
scrna_cell_barcode_length: 12
The length of the UMI used in demultiplexing.
scrna_umi_length: 8
(Optional.) Cutoff for doublet detection, a value between 0 and 1 reflecting how likely a "cell" is a doublet. (default is 0.5)
scrna_doublet_score_cutoff: 0.5
The following settings are used in the cloud computing mode.
The command for submitting jobs.
submit_command: "qsub"
The template for specifying the number of cpu cores in the job submission script.
This is system/environment dependent.
For example, if your system uses -l nodes=1:ppn=XX
to allocate CPU resource,
you should put -l nodes=1:ppn=%d
here.
The %d
will be replaced by the actual numbers when submitting the jobs.
The CPU cores for individual steps can be modified in the :ref:`job-resource` section below.
submit_cpu_format: "-l nodes=1:ppn=%d"
The template for specifying the amount of memory in the job submission script.
This is system/environment dependent.
For example, if your system uses -l mem=XXG
to allocate memory resource,
you should put -l mem=%dG
here.
The %d
will be replaced by the actual numbers when submitting the jobs.
The memory for individual steps can be modified in the :ref:`job-resource` section below.
submit_memory_format: "-l mem=%dG"
Additional parameters to be included in the submission command.
submit_params: "-q glean"
(Optional.) Specify the computational resources for each step.
resource: SCATAC_Remove_Duplicates: parameter: "-q home -l walltime=24:00:00" SCATAC_Merged_Reduce_Dims: parameter: "-q home -l walltime=24:00:00" cpu: 4 memory: 80