faformat is a tool for formatting FASTA files: it re-wraps sequence lines, sorts records, and sub-samples them.
An example FASTA file (3000 sequences, available for download) begins:
>XM_005600090.2 100054601|LOC100054601|protein_coding
ACAGGACCCCTGCTTGCTGCTATCTGTGTGAGCCTCCCTCTGTGCTCTTTGATATAGTTC
TTGGGGATGTGATTTGCTGTGTATGATTGCTTGCCTTTTCCTTGATTATGTAAATCAGGG
ACTTTTGCCAGGAACATTCCATTCCCGGAAACTAAGCTGGTTCCCTCCCATCGGCTGACT
TGGTTTCATTTTACTGAAGTTTTTCAAGTGGCAGAGCAGTAATAACTGTCTGTGCCTCTT
GGCAGTGTGATACCTGGAGTTCAGAACCCTAAACGGTGACAATGACAGCAGATGAGTTGG
TTTTCTTTGTCAATGGCAAAAAGGTGGTGGAGAAAAATGCAGATCCAGAAACAACCCTTT
TGGCCTACCTGAGAAGAAAGTTGGGGCTGAGCGGGACCAAGCTGGGCTGTGGCGAAGGGG
GCTGCGGGGCTTGCACTGTGATGTTTTCCAAGTATGATCGTCTCCAGAACAAGATCGTCC
ACTTTTCTGCCAATGCCTGCCTGGCTCCCATCTGTTCCTTGCACCATGTTGCTGTGACGA
# write 100 bases per line (default: 60)
faformat -in sample_data.fa -out out.fa -num 100
# out.fa then begins:
>XM_005600090.2 100054601|LOC100054601|protein_coding
ACAGGACCCCTGCTTGCTGCTATCTGTGTGAGCCTCCCTCTGTGCTCTTTGATATAGTTCTTGGGGATGTGATTTGCTGTGTATGATTGCTTGCCTTTTC
CTTGATTATGTAAATCAGGGACTTTTGCCAGGAACATTCCATTCCCGGAAACTAAGCTGGTTCCCTCCCATCGGCTGACTTGGTTTCATTTTACTGAAGT
TTTTCAAGTGGCAGAGCAGTAATAACTGTCTGTGCCTCTTGGCAGTGTGATACCTGGAGTTCAGAACCCTAAACGGTGACAATGACAGCAGATGAGTTGG
TTTTCTTTGTCAATGGCAAAAAGGTGGTGGAGAAAAATGCAGATCCAGAAACAACCCTTTTGGCCTACCTGAGAAGAAAGTTGGGGCTGAGCGGGACCAA
GCTGGGCTGTGGCGAAGGGGGCTGCGGGGCTTGCACTGTGATGTTTTCCAAGTATGATCGTCTCCAGAACAAGATCGTCCACTTTTCTGCCAATGCCTGC
CTGGCTCCCATCTGTTCCTTGCACCATGTTGCTGTGACGACTGTGGAAGGAATAGGAAGCACCAAGACAAGGCTGCATCCTGTGCAGGAGAGAATTGCTA
AAAGCCACGGGTCCCAGTGTGGGTTCTGCACCCCCGGCATCGTCATGAGCATGTACACGCTGCTCCGGAACCAGCCCGAGCCCACCGTGGAGGAGATCGA
GGATGCCTTCCAAGGGAACTTGTGCCGCTGCACAGGCTACAGACCCATCCTCCAGGGCTTCCGGACCTTCGCCAGGGATGGTGGATGCTGTGGAGGAAAG
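The re-wrapping shown above can be sketched in a few lines of Python. This is only an illustration of the idea (join a record's sequence lines, then cut them at a fixed width), not faformat's actual implementation; `wrap_sequence` and `reformat_record` are hypothetical helper names.

```python
def wrap_sequence(seq: str, width: int = 60) -> list[str]:
    """Split a sequence into chunks of at most `width` bases."""
    if width <= 0:
        raise ValueError("width must be positive")
    return [seq[i:i + width] for i in range(0, len(seq), width)]


def reformat_record(header: str, lines: list[str], width: int = 60) -> list[str]:
    """Join a record's existing sequence lines, then re-wrap them at `width`."""
    seq = "".join(line.strip() for line in lines)
    return [header] + wrap_sequence(seq, width)


# Made-up demo record (not taken from sample_data.fa), re-wrapped at 5 bases per line.
record = reformat_record(">seq1 demo", ["ACGTAC", "GTACGT", "AC"], width=5)
print("\n".join(record))
```

Setting `width` to the full sequence length gives one line per sequence, which is the effect the `-all` option describes.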
# write all bases of each sequence on a single line
faformat -in sample_data.fa -out out.fa -all
# sort by sequence length, increasing
faformat -in sample_data.fa -out out.fa -sort len
# sort by sequence length, decreasing
faformat -in sample_data.fa -out out.fa -sort len -reverse
# sort by sequence ID (chr_id), increasing
faformat -in sample_data.fa -out out.fa -sort chr_id
# sort by sequence ID (chr_id), decreasing
faformat -in sample_data.fa -out out.fa -sort chr_id -reverse
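To make the two sort keys concrete, here is a minimal Python sketch of ordering records by sequence length or by ID. It assumes that the ID is the text between `>` and the first whitespace and that IDs are compared as plain strings; how faformat itself compares IDs or breaks ties is not documented here. The records are made up for the demo.

```python
# (header, sequence) pairs; made-up demo records, not taken from sample_data.fa
records = [
    (">id_b 2|GENE_B|lncRNA", "ACGTACGT"),
    (">id_a 1|GENE_A|protein_coding", "ACG"),
    (">id_c 3|GENE_C|protein_coding", "ACGTA"),
]


def seq_id(header: str) -> str:
    """Return the ID: the text between '>' and the first whitespace."""
    return header[1:].split()[0]


# Analogous to "-sort len" and "-sort len -reverse"
by_len = sorted(records, key=lambda r: len(r[1]))
by_len_desc = sorted(records, key=lambda r: len(r[1]), reverse=True)

# Analogous to "-sort chr_id" (assumption: plain string comparison)
by_id = sorted(records, key=lambda r: seq_id(r[0]))

for header, seq in by_len:
    print(seq_id(header), len(seq))
```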
# get XM_014734048.1,XM_005607463.2,XM_014741515.1 from fasta
faformat -in sample_data.fa -out out.fa -fetch "XM_014734048.1,XM_005607463.2,XM_014741515.1"
# remove XM_014734048.1,XM_005607463.2,XM_014741515.1 from fasta
faformat -in sample_data.fa -out out.fa -remove "XM_014734048.1,XM_005607463.2,XM_014741515.1"
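Fetching and removing by ID are complementary filters over the same comma-separated list. The sketch below shows that relationship in Python; the helper names are hypothetical, the records are made up, and keeping the input order is an assumption rather than documented faformat behaviour.

```python
def seq_id(header: str) -> str:
    """Return the ID: the text between '>' and the first whitespace."""
    return header[1:].split()[0]


def parse_id_list(arg: str) -> set[str]:
    """Turn 'id1,id2,id3' into a set of IDs, ignoring empty entries."""
    return {item.strip() for item in arg.split(",") if item.strip()}


def fetch(records, ids):
    """Keep only records whose ID is listed (input order preserved)."""
    return [r for r in records if seq_id(r[0]) in ids]


def remove(records, ids):
    """Drop records whose ID is listed (input order preserved)."""
    return [r for r in records if seq_id(r[0]) not in ids]


# Made-up demo records
records = [
    (">A.1 first", "ACGT"),
    (">B.1 second", "GGCC"),
    (">C.1 third", "TTAA"),
]
ids = parse_id_list("A.1,C.1")
print([seq_id(h) for h, _ in fetch(records, ids)])
print([seq_id(h) for h, _ in remove(records, ids)])
```

Every record ends up in exactly one of the two results, so the fetched and removed sets together reproduce the input.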
# sub-sample 10 sequences from the input
faformat -in sample_data.fa -out out.fa -sample 10
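One plausible reading of `-sample 10` is "pick 10 sequences at random, without replacement". The sketch below shows that reading in Python; whether faformat samples uniformly, how it seeds its random generator, and whether it preserves input order are all assumptions here, and `sample_records` is a hypothetical helper.

```python
import random


def sample_records(records, k, seed=None):
    """Pick k distinct records uniformly at random (input order not preserved)."""
    if k > len(records):
        raise ValueError("cannot sample more records than the input holds")
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(records, k)


# Made-up demo input: 20 records of increasing length
records = [(f">seq{i}", "ACGT" * (i + 1)) for i in range(20)]
subset = sample_records(records, 10, seed=42)
print(len(subset))  # 10
```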
# get all sequences whose ID starts with NM
faformat -in sample_data.fa -out out.fa -fp_chrid "^NM"
# remove all sequences whose ID starts with NM
faformat -in sample_data.fa -out out.fa -rp_chrid "^NM"
# get all lncRNA sequences (annotation matches "lncRNA")
faformat -in sample_data.fa -out out.fa -fp_anno "lncRNA"
# remove all protein_coding sequences (annotation matches "protein_coding")
faformat -in sample_data.fa -out out.fa -rp_anno "protein_coding"
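The four pattern options differ only in which part of the header is matched (ID or annotation) and whether matches are kept or dropped. A rough Python sketch of that logic follows. It assumes the annotation is everything after the first whitespace in the header and that a regex matches if it is found anywhere in the target (so `^NM` anchors at the start of the ID); faformat's regex dialect may differ. The function names and records are made up.

```python
import re


def split_header(header):
    """Return (id, annotation): ID is the first token after '>', annotation is the rest."""
    parts = header[1:].split(None, 1) or [""]
    annotation = parts[1] if len(parts) > 1 else ""
    return parts[0], annotation


def filter_records(records, pattern, field="id", keep=True):
    """Keep (keep=True) or drop (keep=False) records whose ID or annotation matches."""
    rx = re.compile(pattern)
    out = []
    for header, seq in records:
        seq_id, anno = split_header(header)
        target = seq_id if field == "id" else anno
        matched = rx.search(target) is not None
        if matched == keep:
            out.append((header, seq))
    return out


# Made-up demo records in the same header layout as the example file
records = [
    (">NM_000001.1 1|GENE_A|protein_coding", "ACGT"),
    (">XM_000002.1 2|GENE_B|protein_coding", "ACGTAC"),
    (">XR_000003.1 3|GENE_C|lncRNA", "AC"),
]

fetched_nm = filter_records(records, "^NM")                      # like -fp_chrid
removed_nm = filter_records(records, "^NM", keep=False)          # like -rp_chrid
lnc_only = filter_records(records, "lncRNA", field="anno")       # like -fp_anno
non_coding = filter_records(records, "protein_coding", field="anno", keep=False)  # like -rp_anno
print([split_header(h)[0] for h, _ in fetched_nm])
```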
[lee@lipan faformat 1.0.0]$ ./faformat -h
faformat - format fasta file(sort, format output, sub-sample)
=============================================================
USAGE:
faformat -in input_fasta -out output_fasta [ -sort len|chr_id -reverse -num 60 -all -sto -append
-remove chr_id_1,chr_id_2... -sample number -fetch chr_id_1,chr_id_2... ]
HELP:
[order]
-sort: len|chr_id, output fasta sort by sequence length or chr_id(default: no sort)
-reverse: sort all sequence in reverse order(default: no)
[output format]
-num: base number of each line(default: 60)
-all: output all base in a line(default: no)
-append: append the output to the file(default: no)
-sto: output sto format(default: no)
[sub-sample]
-remove: remove some chr from input file(default: no remove)
-sample: sample some chr from input file(default: no sample)
-fetch: get some chr from input file(default: no fetch)
[sub-sample pattern]
-rp_chrid: <regex> remove some chr from input file whose chr_id match the regex
-rp_anno: <regex> remove some chr from input file whose chr annotation match the regex
-fp_chrid: <regex> get some chr from input file whose chr_id match the regex
-fp_anno: <regex> get some chr from input file whose chr annotation match the regex
HELP:
VERSION:
1.101
LIB VERSION:
Basic 1.0.0(2017-11-xx)
VERSION DATE:
2017-11-25
COMPILE DATE:
Jan 19 2018
AUTHOR:
Li Pan