faformat manual

faformat is a tool for formatting FASTA files. It can:

  1. wrap sequences at a fixed line width, or flatten each sequence onto a single line
  2. sort sequences by length or by sequence id
  3. get, remove, or sample sequences by sequence id or regular expression
  4. convert a FASTA file to a sto (Stockholm) file

An example FASTA file (3000 sequences; the start of the first record is shown):

>XM_005600090.2 100054601|LOC100054601|protein_coding
ACAGGACCCCTGCTTGCTGCTATCTGTGTGAGCCTCCCTCTGTGCTCTTTGATATAGTTC
TTGGGGATGTGATTTGCTGTGTATGATTGCTTGCCTTTTCCTTGATTATGTAAATCAGGG
ACTTTTGCCAGGAACATTCCATTCCCGGAAACTAAGCTGGTTCCCTCCCATCGGCTGACT
TGGTTTCATTTTACTGAAGTTTTTCAAGTGGCAGAGCAGTAATAACTGTCTGTGCCTCTT
GGCAGTGTGATACCTGGAGTTCAGAACCCTAAACGGTGACAATGACAGCAGATGAGTTGG
TTTTCTTTGTCAATGGCAAAAAGGTGGTGGAGAAAAATGCAGATCCAGAAACAACCCTTT
TGGCCTACCTGAGAAGAAAGTTGGGGCTGAGCGGGACCAAGCTGGGCTGTGGCGAAGGGG
GCTGCGGGGCTTGCACTGTGATGTTTTCCAAGTATGATCGTCTCCAGAACAAGATCGTCC
ACTTTTCTGCCAATGCCTGCCTGGCTCCCATCTGTTCCTTGCACCATGTTGCTGTGACGA

1. Wrap or flatten a fasta file

1.1 wrap sequences at 100 nucleotides per line

faformat -in sample_data.fa -out out.fa -num 100
>XM_005600090.2 100054601|LOC100054601|protein_coding
ACAGGACCCCTGCTTGCTGCTATCTGTGTGAGCCTCCCTCTGTGCTCTTTGATATAGTTCTTGGGGATGTGATTTGCTGTGTATGATTGCTTGCCTTTTC
CTTGATTATGTAAATCAGGGACTTTTGCCAGGAACATTCCATTCCCGGAAACTAAGCTGGTTCCCTCCCATCGGCTGACTTGGTTTCATTTTACTGAAGT
TTTTCAAGTGGCAGAGCAGTAATAACTGTCTGTGCCTCTTGGCAGTGTGATACCTGGAGTTCAGAACCCTAAACGGTGACAATGACAGCAGATGAGTTGG
TTTTCTTTGTCAATGGCAAAAAGGTGGTGGAGAAAAATGCAGATCCAGAAACAACCCTTTTGGCCTACCTGAGAAGAAAGTTGGGGCTGAGCGGGACCAA
GCTGGGCTGTGGCGAAGGGGGCTGCGGGGCTTGCACTGTGATGTTTTCCAAGTATGATCGTCTCCAGAACAAGATCGTCCACTTTTCTGCCAATGCCTGC
CTGGCTCCCATCTGTTCCTTGCACCATGTTGCTGTGACGACTGTGGAAGGAATAGGAAGCACCAAGACAAGGCTGCATCCTGTGCAGGAGAGAATTGCTA
AAAGCCACGGGTCCCAGTGTGGGTTCTGCACCCCCGGCATCGTCATGAGCATGTACACGCTGCTCCGGAACCAGCCCGAGCCCACCGTGGAGGAGATCGA
GGATGCCTTCCAAGGGAACTTGTGCCGCTGCACAGGCTACAGACCCATCCTCCAGGGCTTCCGGACCTTCGCCAGGGATGGTGGATGCTGTGGAGGAAAG
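The wrapping shown above can be sketched in a few lines of Python. This is only an illustration of what a fixed line width means, not faformat's own implementation; the function name `wrap_sequence` is hypothetical, and the default of 60 mirrors the default that the help text lists for -num.

```python
def wrap_sequence(seq, width=60):
    """Split a sequence string into lines of at most `width` characters."""
    if width <= 0:
        raise ValueError("width must be positive")
    return [seq[i:i + width] for i in range(0, len(seq), width)]


# A 120-base sequence wrapped at 100 bases gives one full line and a remainder.
lines = wrap_sequence("ACGT" * 30, width=100)
print([len(line) for line in lines])  # [100, 20]
```

Only the last line of a record may be shorter than the requested width, which matches the output above.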

1.2 flatten each sequence to a single line

faformat -in sample_data.fa -out out.fa -all
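As a rough sketch of what flattening involves, the Python below reads wrapped FASTA lines and joins each record's sequence lines into one string. It is an assumption-laden illustration rather than faformat's code: `read_fasta` is a hypothetical helper, and the input is a small made-up list of lines.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs, joining wrapped sequence lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Made-up input: two records, the first wrapped over two lines.
wrapped = [">seq1 demo", "ACGT", "TTGA", ">seq2", "GG"]
for header, seq in read_fasta(wrapped):
    print(f">{header}\n{seq}")
```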

2. Sort sequences

2.1 sort by length

# ascending order (shortest first)
faformat -in sample_data.fa -out out.fa -sort len
# descending order (longest first)
faformat -in sample_data.fa -out out.fa -sort len -reverse

2.2 sort by sequence name (chr_id)

# ascending order
faformat -in sample_data.fa -out out.fa -sort chr_id
# descending order
faformat -in sample_data.fa -out out.fa -sort chr_id -reverse
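The two sort keys can be illustrated with a short Python sketch. It assumes that the sequence id is the first whitespace-delimited token of the header and that ids compare as plain strings; whether faformat orders ids exactly this way is not stated in this manual. The records and the `seq_id` helper are hypothetical.

```python
# Hypothetical records as (header, sequence) pairs.
records = [
    ("XM_2 gene_b", "ACGTACGT"),
    ("NM_1 gene_a", "ACG"),
    ("XR_3 gene_c", "ACGTA"),
]


def seq_id(header):
    """Assumed id: the first whitespace-delimited token of the header."""
    parts = header.split()
    return parts[0] if parts else ""


by_length = sorted(records, key=lambda rec: len(rec[1]))
by_length_desc = sorted(records, key=lambda rec: len(rec[1]), reverse=True)
by_id = sorted(records, key=lambda rec: seq_id(rec[0]))

print([seq_id(h) for h, _ in by_length])       # ['NM_1', 'XR_3', 'XM_2']
print([seq_id(h) for h, _ in by_length_desc])  # ['XM_2', 'XR_3', 'NM_1']
print([seq_id(h) for h, _ in by_id])           # ['NM_1', 'XM_2', 'XR_3']
```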

3. Get, remove, or sample sequences from fasta

3.1 get sequences from fasta by chr_id

# get XM_014734048.1,XM_005607463.2,XM_014741515.1 from fasta
faformat -in sample_data.fa -out out.fa -fetch "XM_014734048.1,XM_005607463.2,XM_014741515.1"

3.2 remove sequences from fasta by chr_id

# remove XM_014734048.1,XM_005607463.2,XM_014741515.1 from fasta
faformat -in sample_data.fa -out out.fa -remove "XM_014734048.1,XM_005607463.2,XM_014741515.1"
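Fetching and removing by id are complementary filters over the same comma-separated list. The Python below is a minimal sketch of that idea under the assumption that ids are matched exactly against the first token of each header; the helper names and the records are hypothetical, not part of faformat.

```python
def seq_id(header):
    """Assumed id: the first whitespace-delimited token of the header."""
    parts = header.split()
    return parts[0] if parts else ""


def split_ids(id_list):
    """Turn 'a,b,c' into a set of ids, ignoring empty entries."""
    return {tok.strip() for tok in id_list.split(",") if tok.strip()}


def fetch(records, id_list):
    """Keep only records whose id is in the list."""
    wanted = split_ids(id_list)
    return [(h, s) for h, s in records if seq_id(h) in wanted]


def remove(records, id_list):
    """Drop records whose id is in the list."""
    unwanted = split_ids(id_list)
    return [(h, s) for h, s in records if seq_id(h) not in unwanted]


# Hypothetical records as (header, sequence) pairs.
records = [("seqA first", "ACGT"), ("seqB second", "GGCC"), ("seqC third", "TTAA")]
print([seq_id(h) for h, _ in fetch(records, "seqA,seqC")])   # ['seqA', 'seqC']
print([seq_id(h) for h, _ in remove(records, "seqA,seqC")])  # ['seqB']
```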

3.3 sample 10 sequences from fasta

faformat -in sample_data.fa -out out.fa -sample 10
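One plausible reading of sampling is a uniform random draw without replacement. The Python sketch below shows that interpretation only; the manual does not say how faformat draws its sample or whether it can be seeded, so the `seed` parameter and the function name are assumptions made for reproducibility of the example.

```python
import random


def sample_records(records, n, seed=None):
    """Draw up to n distinct records uniformly at random (assumed behaviour)."""
    rng = random.Random(seed)
    n = min(n, len(records))  # never ask for more records than exist
    return rng.sample(records, n)


# Hypothetical records as (header, sequence) pairs.
records = [(f"seq{i}", "ACGT") for i in range(3000)]
picked = sample_records(records, 10, seed=42)
print(len(picked))  # 10
```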

3.4 get/remove sequences from fasta by chr_id regular expression

# get all sequences whose sequence id starts with NM
faformat -in sample_data.fa -out out.fa -fp_chrid "^NM"
# remove all sequences whose sequence id starts with NM
faformat -in sample_data.fa -out out.fa -rp_chrid "^NM"

3.5 get/remove sequences from fasta by sequence annotation regular expression

# get all lncRNA sequences
faformat -in sample_data.fa -out out.fa -fp_anno "lncRNA"
# remove all protein_coding sequences
faformat -in sample_data.fa -out out.fa -rp_anno "protein_coding"
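The difference between matching on the id and matching on the annotation can be sketched in Python. The sketch assumes that the header splits at the first space into an id and an annotation, as in the example record at the top of this manual, and that a pattern matches if it is found anywhere in the chosen field. The function names, the `field` and `keep` parameters, and the records are all hypothetical.

```python
import re


def split_header(header):
    """Split 'id annotation...' at the first space (assumed header layout)."""
    seq_id, _, annotation = header.partition(" ")
    return seq_id, annotation


def filter_records(records, pattern, field="id", keep=True):
    """Keep (keep=True) or drop (keep=False) records whose field matches pattern."""
    regex = re.compile(pattern)
    out = []
    for header, seq in records:
        seq_id, annotation = split_header(header)
        target = seq_id if field == "id" else annotation
        matched = regex.search(target) is not None
        if matched == keep:
            out.append((header, seq))
    return out


# Hypothetical records as (header, sequence) pairs.
records = [
    ("NM_001 1|GENE1|protein_coding", "ACGT"),
    ("XR_002 2|GENE2|lncRNA", "GGCC"),
    ("XM_003 3|GENE3|protein_coding", "TTAA"),
]

ids = lambda recs: [split_header(h)[0] for h, _ in recs]
print(ids(filter_records(records, "^NM", field="id", keep=True)))                  # ['NM_001']
print(ids(filter_records(records, "^NM", field="id", keep=False)))                 # ['XR_002', 'XM_003']
print(ids(filter_records(records, "lncRNA", field="annotation", keep=True)))       # ['XR_002']
print(ids(filter_records(records, "protein_coding", field="annotation", keep=False)))  # ['XR_002']
```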

4. Show help information

[lee@lipan faformat 1.0.0]$ ./faformat -h
faformat - format fasta file(sort, format output, sub-sample)
=============================================================
USAGE:
        faformat -in input_fasta -out output_fasta [ -sort len|chr_id -reverse -num 60 -all -sto -append
                 -remove chr_id_1,chr_id_2... -sample number -fetch chr_id_1,chr_id_2... ]
HELP:
        [order]
        -sort: len|chr_id, output fasta sort by sequence length or chr_id(default: no sort)
        -reverse: sort all sequence in reverse order(default: no)

        [output format]
        -num: base number of each line(default: 60)
        -all: output all base in a line(default: no)
        -append: append the output to the file(default: no)
        -sto: output sto format(default: no)

        [sub-sample]
        -remove: remove some chr from input file(default: no remove)
        -sample: sample some chr from input file(default: no sample)
        -fetch: get some chr from input file(default: no fetch)

        [sub-sample pattern]
        -rp_chrid: <regex> remove some chr from input file whose chr_id match the regex 
        -rp_anno: <regex> get some chr from input file whose chr annotation match the regex
        -fp_chrid: <regex> remove some chr from input file whose chr_id match the regex
        -fp_anno: <regex> get some chr from input file whose chr annotation match the regex

HELP:
VERSION:
        1.101
LIB VERSION:
        Basic 1.0.0(2017-11-xx)
VERSION DATE:
        2017-11-25
COMPILE DATE:
        Jan 19 2018
AUTHOR:
        Li Pan