FgenesB
Bacterial Operon and Gene Prediction.
FgenesB - Suite of Bacterial Operon and Gene Finding Programs
FgenesB is the most accurate ab initio prokaryotic gene prediction engine in our comparison with two other popular gene prediction programs (see Table 1 at the bottom). The FgenesB gene prediction algorithm is based on Markov chain models of coding regions and of translation initiation and termination sites. The program uses genome-specific parameters learned by the FGENESB-train script, which requires only the DNA sequence of the genome of interest as input. (If you need parameters for a new bacterium, please contact Softberry.) FgenesB also includes a simplified prediction of operons based only on the distances between predicted genes.
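The distance-based operon idea mentioned above can be illustrated with a minimal sketch. This is not FgenesB's actual rule: the `Gene` class, the `group_into_operons` function, the 100 bp `max_gap` threshold, and the coordinates are all hypothetical, chosen only to show how neighbouring genes on the same strand might be grouped by intergenic distance.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Gene:
    name: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"


def group_into_operons(genes: List[Gene], max_gap: int = 100) -> List[List[Gene]]:
    """Group neighbouring same-strand genes whose gap is at most max_gap bp.

    max_gap is an assumed, illustrative threshold, not an FgenesB parameter.
    """
    operons: List[List[Gene]] = []
    for gene in sorted(genes, key=lambda g: g.start):
        if operons:
            prev = operons[-1][-1]
            gap = gene.start - prev.end - 1  # negative when the genes overlap
            if gene.strand == prev.strand and gap <= max_gap:
                operons[-1].append(gene)
                continue
        operons.append([gene])  # start a new operon
    return operons


# Hypothetical coordinates, for illustration only.
genes = [
    Gene("a", 100, 900, "+"),
    Gene("b", 950, 1800, "+"),   # 49 bp after a: same operon
    Gene("c", 2500, 3100, "+"),  # 699 bp after b: new operon
    Gene("d", 3120, 3900, "-"),  # strand changes: new operon
    Gene("e", 3890, 4500, "-"),  # overlaps d: same operon
]
operons = group_into_operons(genes)
for number, operon in enumerate(operons, start=1):
    print(number, [g.name for g in operon])
```

With these toy coordinates the genes fall into three groups: a and b, c alone, and d and e. A real predictor would need a threshold tuned to the genome in question.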
For community sequence annotation, the ABsplit program (www.softberry.com/berry.phtml?topic=absplit&group=programs&subgroup=gfindb) can be used to separate archaebacterial and eubacterial sequences.
FgenesB was used in the first published bacterial community annotation project: see Tyson et al. (2004) Nature 428(6978), 37-43.
Example of FgenesB output:
1   1 Op  1  21/0.000   +   CDS     407 -   1747   1311
2   1 Op  2   3/0.019   +   CDS    1926 -   3065   1237
3   2 Op  1   4/0.002   +   CDS    3193 -   3405    278
4   2 Op  2   4/0.002   +   CDS    3418 -   4545    899
5   2 Op  3  16/0.000   +   CDS    4578 -   6506   2148
6   2 Op  4   .         +   CDS    6595 -   9066   2957
7   3 Op  1   .         -   CDS   14175 -  14363    158
8   3 Op  2   .         -   CDS   14353 -  15249    351
9   3 Op  3   .         -   CDS   15170 -  15352     99
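For downstream processing, records like those in the example can be read with a short script. The following is a sketch only, assuming each gene record sits on its own line and has exactly the shape seen in the example. The field names in `GeneRecord` are my reading of the columns (the page does not label them), and `parse_line` is a hypothetical helper, not part of FgenesB.

```python
import re
from collections import defaultdict
from typing import NamedTuple, Optional


class GeneRecord(NamedTuple):
    # Field names are assumptions; the example output has no column headers.
    gene: int                 # running gene number
    operon: int               # operon number
    position: int             # position of the gene within its operon
    pair_info: Optional[str]  # e.g. "21/0.000"; None when the column is "."
    strand: str               # "+" or "-"
    start: int
    end: int
    score: int


LINE_RE = re.compile(
    r"^\s*(\d+)\s+(\d+)\s+Op\s+(\d+)\s+(\S+)\s+([+-])\s+CDS\s+"
    r"(\d+)\s+-\s+(\d+)\s+(\d+)\s*$"
)


def parse_line(line: str) -> GeneRecord:
    """Parse one record shaped like the example; raise ValueError otherwise."""
    match = LINE_RE.match(line)
    if match is None:
        raise ValueError(f"unrecognised line: {line!r}")
    gene, operon, position, pair, strand, start, end, score = match.groups()
    return GeneRecord(int(gene), int(operon), int(position),
                      None if pair == "." else pair,
                      strand, int(start), int(end), int(score))


# Three records copied from the example output.
sample = [
    "1 1 Op 1 21/0.000 + CDS 407 - 1747 1311",
    "6 2 Op 4 . + CDS 6595 - 9066 2957",
    "7 3 Op 1 . - CDS 14175 - 14363 158",
]
records = [parse_line(line) for line in sample]

# Collect gene numbers per operon number.
by_operon = defaultdict(list)
for rec in records:
    by_operon[rec.operon].append(rec.gene)
print(dict(by_operon))
```

Lines that do not match the pattern raise an error rather than being skipped, so unexpected record shapes are noticed early.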
Table 1. Accuracy of prediction estimated on the B. subtilis sequence.
Frequency of genes starting from a start codon other than the first: 19.1%
Borodovsky et al. (see the GeneMark web pages (opal.biology.gatech.edu/GeneMark/genemarks.cgi)) have calculated accuracy for all genes and have constructed three sets of difficult short genes (L ≤ 300 bp) that have protein similarity support. These genes were used to demonstrate that short genes can also be predicted reasonably well. The first set (51set) has 51 genes with at least 10 strong similarities to known proteins, the second (72set) has 72 genes with at least two strong similarities, and the third (123set) has 123 genes with at least one protein homolog.
Here are the prediction results on these three sets for GeneMarkS and Glimmer (calculated in Nucleic Acids Research, 2001, Vol. 29, No. 12, 2607-2618) and for FgenesB (calculated by Softberry using three iterations of the FgenesB-train script):
                  Sn, % (exact      Sn, % (exact + overlapping
                  predictions)      predictions)
123set:
  Glimmer         57.0              91.1
  GeneMarkS       82.9              91.9
  FgenesB         89.3              98.4
72set:
  Glimmer         57.0              91.7
  GeneMarkS       88.9              94.4
  FgenesB         91.5              98.6
51set:
  Glimmer         51.0              88.2
  GeneMarkS       90.2              94.1
  FgenesB         92.0              98.0
All genes of B. subtilis genome (GenBank annotation):
  Glimmer         62.4              98.1
  GeneMarkS       83.2              96.7
  FgenesB         83.8              98.7
Please note that many genes in GenBank were annotated using the GeneMark program, which should result in an overestimation of its accuracy.
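To make the two sensitivity (Sn) columns concrete, here is a minimal sketch of how such percentages could be computed. The matching criteria are assumptions, not necessarily those of the cited comparison: a prediction counts as exact when start, end, and strand all equal the annotation, and as overlapping when it shares at least one base with the annotated gene on the same strand. The gene coordinates are hypothetical.

```python
from typing import List, Tuple

Gene = Tuple[int, int, str]  # (start, end, strand), 1-based inclusive


def overlaps(a: Gene, b: Gene) -> bool:
    """True when a and b are on the same strand and share at least one base."""
    return a[2] == b[2] and a[0] <= b[1] and b[0] <= a[1]


def sensitivity(annotated: List[Gene], predicted: List[Gene]) -> Tuple[float, float]:
    """Return (exact Sn, exact + overlapping Sn) as percentages.

    Assumed definitions: Sn = found annotated genes / all annotated genes.
    """
    if not annotated:
        raise ValueError("no annotated genes")
    predicted_set = set(predicted)
    exact = sum(1 for g in annotated if g in predicted_set)
    # An exact match also overlaps, so this count includes the exact ones.
    found = sum(1 for g in annotated if any(overlaps(g, p) for p in predicted))
    total = len(annotated)
    return 100.0 * exact / total, 100.0 * found / total


# Hypothetical annotation and predictions, for illustration only.
annotated = [(100, 400, "+"), (500, 760, "+"), (900, 1100, "-"), (1500, 1700, "+")]
predicted = [(100, 400, "+"), (530, 760, "+"), (900, 1100, "+")]

sn_exact, sn_overlap = sensitivity(annotated, predicted)
print(f"Sn exact: {sn_exact:.1f}%  Sn exact+overlapping: {sn_overlap:.1f}%")
```

In this toy case one of four annotated genes is predicted exactly and a second is predicted with a shifted start, giving 25.0% and 50.0%. The third prediction is on the wrong strand and matches nothing.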