Intra-scaffold gap-filling is defined as filling the gaps (regions of 'N's) in scaffolds sequences of a draft genome assembly (reference genome).
Below is an example of the procedure used to perform intra-scaffold gap-filling with MTG-Link.
An example of the main outputs you should get is located in the outputs/
directory.
Input file:
- A sequences file in FASTA format
- the FASTA file contains scaffolds sequences with 'N's regions (where 'N's regions will be treated as gaps / target sequences) (reference genome).
Scripts: To generate a GFA file from a FASTA file containing 'N's regions, you have to use the scripts fasta2bed.py
and bed2gfa.py
(located in utils/
).
- The
fasta2bed.py
script takes as input a FASTA file and converts it to a BED file containing the positions of 'N's for each scaffold - The
bed2gfa.py
script converts a BED file containing the positions of 'N's for each scaffold to a GFA file (GFA 2.0)- it is possible to filter the 'N's regions you want to treat as gaps/targets (e.g. to select the gaps to assemble) by:
- their size (e.g. gap/target lengths):
-min
and-max
options - the flanking contigs' sizes (for example, select only the 'N's regions whose flanking contigs' sizes are long enough to get enough barcodes):
-contigs
option
- their size (e.g. gap/target lengths):
- it is possible to filter the 'N's regions you want to treat as gaps/targets (e.g. to select the gaps to assemble) by:
# Get a BED file containing the coordinates of the 'Ns' regions
../../utils/fasta2bed.py -fa referenceGenome.fasta -out targetSequenceCoordinates.bed
# Create a GFA file with the gaps/targets of length 1000 bp and whose the flanking contigs' sizes are at least 10000 bp
../../utils/bed2gfa.py -bed targetSequenceCoordinates.bed -fa referenceGenome.fasta -out targetSequenceCoordinates.gfa -min 1000 -max 1000 -contigs 10000
referenceGenome.fasta
: FASTA file containing the sequences of the scaffolds (reference genome)targetSequenceCoordinates.bed
: BED file obtained fromfasta2bed.py
- Outputs:
- GFA file:
targetSequenceCoordinates.gfa
- FASTA files containing the left flanking region of the gap/target:
referenceGenome_[scaffoldID]_[coordLeftFlankingSeq_start]-[coordLeftFlankingSeq_end].g1000.left.fasta
- FASTA files containing the right flanking region of the gap/target:
referenceGenome_[scaffoldID]_[coordRightFlankingSeq_start]-[coordRightFlankingSeq_end].g1000.right.fasta
- GFA file:
# Build the LRez barcode index (if not already done)
LRez index fastq -f readsFile.fastq.gz -o barcodeIndex.bci -g
# Run MTG-Link
../../mtglink.py DBG -gfa targetSequenceCoordinates.gfa -bam bamFile.bam -fastq readsFile.fastq.gz -index barcodeIndex.bci -k 61 51 41 31 21
readsFile.fastq.gz
: Linked-reads file (must be gzipped), with the barcode sequence in the header (BX:Z tag)targetSequenceCoordinates.gfa
: GFA file obtained frombed2gfa.py
bamFile.bam
: input BAM filebarcodeIndex.bci
: LRez barcode index of the FASTQ file, obtained withLRez index fastq
- Outputs: the main outputs of MTG-Link are the following:
- Output GFA file:
targetSequenceCoordinates_mtglink.gfa
- it is an assembly graph file in GFA format, that complements the input GFA file with the obtained assembled target sequences.
- Output FASTA file:
targetSequenceCoordinates.assembled_sequences.fasta
- it is a sequence file in FASTA format, that contains the set of assembled target sequences.
- Output GFA file:
NB: The outputs of MTG-Link are detailed here
The GFA file output by MTG-Link is an assembly graph that represents the relationships between the input sequences and the target sequences (assemblies or gaps). The input FASTA file can be updated with the assembled sequences using this GFA file. The gaps of the GFA file are returned as 'N's regions in the output FASTA file.
# Update the input FASTA file with assembled sequences
../../utils/gfa2tofasta.py -in ./MTG-Link_results/targetSequenceCoordinates_mtglink.gfa
targetSequenceCoordinates_mtglink.gfa
: GFA file output by MTG-Link- Output:
- FASTA file:
targetSequenceCoordinates_mtglink_assembly.fasta
- FASTA file: