Wang Yunfei edited this page Feb 9, 2017 · 1 revision

date: 2013-10-28 20:36:44 name: SeqUtils

Sequence Utilities

  • wFormatFasta.py
  • wGCContent.py
  • wGetSeqByName.py
  • wGetSeqByCoordinates.py

Things you should know before starting

  1. In a Fasta file, wrap every sequence line to the same fixed length, and keep that length under 120 characters. (We use 100 bp per line for fast calculation.)
  2. Sequence names containing spaces or tabs are not recommended, because most programs drop everything after the first space or tab. (Replace spaces and tabs with "_" or "|" where possible.)
  3. Spaces and tabs are not allowed inside sequences.
  4. Coordinates in a Fasta file are 1-based.
  5. Input can be read from standard input, for example through a pipe (set the input file name to "stdin"); by default, output goes to standard output ("stdout").
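The naming and coordinate conventions above can be illustrated with a minimal Python sketch. The helper names `sanitize_name` and `slice_1based` are hypothetical and are not part of these scripts or of ngslib; the sketch also assumes that the end coordinate is inclusive, which this page does not state.

```python
import re


def sanitize_name(name):
    """Replace each run of spaces or tabs in a sequence name with '_'."""
    return re.sub(r"[ \t]+", "_", name.strip())


def slice_1based(seq, start, end):
    """Return the bases from start to end.

    Both coordinates are 1-based; treating `end` as inclusive is an
    assumption of this sketch.
    """
    # Python slices are 0-based and end-exclusive, so only start shifts.
    return seq[start - 1:end]


print(sanitize_name("chr1 homo sapiens\tv2"))  # chr1_homo_sapiens_v2
print(slice_1based("ACGTACGT", 2, 4))          # CGT
```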

wFormatFasta.py: Format a Fasta file so that every sequence line has a fixed length.

Example: python wFormatFasta.py -i test.fa -l 100 -o test_formated.fa

usage: wFormatFasta.py [-h] -i input.fa [-l length] [-o output.fa]

Format sequences in Fasta file to fixed length.

Options:
  -h, --help            show this help message and exit
  -i input.fa, --input input.fa
                        Fasta file name. Can be "stdin".
  -l length, --length length
                        Length of each line. Default is 100.
  -o output.fa, --output output.fa
                        Output file name. Default is stdout.

Dependency: ngslib
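For readers who want to see what fixed-length formatting amounts to, here is a minimal sketch in plain Python. It is not the script's actual implementation (which relies on ngslib); `wrap_sequence` and `format_record` are hypothetical names, and the default width of 100 mirrors the documented default of `-l`.

```python
def wrap_sequence(seq, width=100):
    """Split a sequence into lines of at most `width` characters."""
    return [seq[i:i + width] for i in range(0, len(seq), width)]


def format_record(name, seq, width=100):
    """Return one Fasta record as text, with wrapped sequence lines."""
    return ">" + name + "\n" + "\n".join(wrap_sequence(seq, width)) + "\n"


print(format_record("seq1", "ACGTACGTAC", width=4), end="")
# >seq1
# ACGT
# ACGT
# AC
```

Only the last line of a record may be shorter than the chosen width.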

wGCContent.py: GC content of each sequence in a Fasta file.

Example: python wGCContent.py -i test.fa -o test.gc

usage: wGCContent.py [-h] -i input.fa [-o output.gc]

Calculate GC content of Fasta file.

Options:
  -h, --help            show this help message and exit
  -i input.fa, --input input.fa
                        Fasta file name. Can be "stdin".
  -o output.gc, --output output.gc
                        GC content file name. Default = stdout.

Dependency: ngslib
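As a rough illustration of the calculation, the sketch below computes GC content as the fraction of G and C among the A, C, G and T bases of a sequence. Ignoring other symbols such as N is an assumption of this sketch, as is the function name `gc_content`; the script itself may count bases differently or report the result in another format.

```python
def gc_content(seq):
    """Return the fraction of G/C among A, C, G, T bases (0.0 if none)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    # Guard against sequences with no A/C/G/T at all (e.g. all N).
    return gc / total if total else 0.0


print(gc_content("ATGC"))    # 0.5
print(gc_content("acgtNN"))  # 0.5
```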

wGetSeqByName.py: Get sequences from a Fasta file by a list of names.

Example: python wGetSeqByName.py -i test.fa -n names.lst -o test_with_names.fa

usage: wGetSeqByName.py [-h] -i input.fa -n names.lst [-o output.fa]

Get sequences by a list of names.

Options:
  -h, --help            show this help message and exit
  -i input.fa, --input input.fa
                        Fasta file name.
  -n names.lst, --names names.lst
                        A file with sequence names. Can be "stdin".
  -o output.fa, --output output.fa
                        Output file name. Default is "stdout".

Dependency: ngslib
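Selecting records by name can be sketched with the standard library alone. The functions `read_fasta` and `select_by_name` are hypothetical and do not reflect how the script or ngslib is implemented. The sketch assumes that a record's name is the header text up to the first space or tab, and it returns matches in file order.

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs from an open Fasta file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Assumption: the name ends at the first space or tab.
            fields = line[1:].split()
            name = fields[0] if fields else ""
            chunks = []
        elif name is not None and line:
            chunks.append(line.strip())
    if name is not None:
        yield name, "".join(chunks)


def select_by_name(handle, wanted):
    """Return the records whose names are in `wanted`, in file order."""
    wanted = set(wanted)
    return [(n, s) for n, s in read_fasta(handle) if n in wanted]


fasta = io.StringIO(">a some description\nACGT\nAC\n>b\nGGGG\n>c\nTT\n")
print(select_by_name(fasta, ["c", "a"]))
# [('a', 'ACGTAC'), ('c', 'TT')]
```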

wGetSeqByCoordinates.py: Get sequence by genome coordinates (1-based)

Example: python wGetSeqByCoordinates.py -i test.fa -r 'chr1:-:-'

usage: wGetSeqByCoordinates.py [-h] -i input.fa [-r chr1:100-200:+] [-c chrom]
                               [-s start] [-e end] [-t strand] [-o output.fa]

Get a fragment from a Fasta file.

Options:
  -h, --help            show this help message and exit
  -i input.fa, --input input.fa
                        Fasta file name.
  -r chr1:100-200:+, --region chr1:100-200:+
                        Chromosome region. Leave it empty if not applicable,
                        e.g. "chr1:100-:-".
  -c chrom, --chrom chrom
                        chromosome name.
  -s start, --start start
                        start coordinate. Default: beginning of the chromosome.
  -e end, --end end     end coordinate. Default: end of the chromosome.
  -t strand, --strand strand
                        strand. Default: "+"
  -o output.fa, --output output.fa
                        Output file name. Default: stdout

Dependency: ngslib
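The defaults documented for -s, -e and -t can be sketched as follows. This is an illustration, not the script's code: `get_fragment` and `reverse_complement` are hypothetical names, the end coordinate is assumed to be inclusive, and returning the reverse complement for the "-" strand is an assumed convention that this page does not spell out. The sketch does not attempt to parse the -r region string.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def get_fragment(seq, start=None, end=None, strand="+"):
    """Extract a fragment using 1-based coordinates.

    start=None means the beginning of the chromosome and end=None means
    its end, matching the documented defaults. Treating `end` as
    inclusive is an assumption of this sketch.
    """
    start = 1 if start is None else start
    end = len(seq) if end is None else end
    if start < 1 or end > len(seq) or start > end:
        raise ValueError("region out of range")
    fragment = seq[start - 1:end]
    # Assumption: the minus strand is reported as the reverse complement.
    return reverse_complement(fragment) if strand == "-" else fragment


print(get_fragment("AACCGGTT", 3, 6))          # CCGG
print(get_fragment("AAACCC", 1, 4, "-"))       # GTTT
```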