Program for predicting multiple genes in genomic DNA sequences using an HMM gene model plus homology with a known protein

Fgenesh+ was developed to analyse sequences from human, Drosophila, nematode and plants, as well as related organisms. Use it when you know a protein sequence similar to the protein predicted for a gene in your sequence. First, run any ab initio gene-finding program, such as Fgenes or Fgenesh. Then run a BLASTP database search with each predicted exon. Any correctly predicted exon can lead you to known similar proteins, if such proteins exist in the database. Take the sequence of a homologous protein and run Fgenesh+. The accuracy of gene prediction can reach 100%, depending on how similar the predicted and database proteins are.
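The BLASTP step above leaves you with a list of hits from which to choose a homolog. The following is a minimal sketch of that choice. It assumes the search was run with BLAST+ tabular output (`-outfmt 6`, with its default column order) and that taking the hit with the highest bit score is an acceptable selection rule. Neither assumption comes from the Fgenesh+ documentation, and the hit names and values are made up for illustration.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_homolog(tabular_text):
    """Return the hit with the highest bit score, or None if there are no hits."""
    best = None
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(FIELDS, row))
        hit["pident"] = float(hit["pident"])
        hit["bitscore"] = float(hit["bitscore"])
        if best is None or hit["bitscore"] > best["bitscore"]:
            best = hit
    return best


# Hypothetical hits for two predicted exons (illustrative values only).
example = (
    "exon_1\tprotA\t62.5\t40\t15\t0\t1\t40\t10\t49\t1e-10\t55.1\n"
    "exon_2\tprotB\t98.3\t60\t1\t0\t1\t60\t37\t96\t2e-35\t121.0\n"
    "exon_2\tprotA\t45.0\t60\t33\t0\t1\t60\t50\t109\t3e-05\t38.9\n"
)
hit = best_homolog(example)
print(hit["sseqid"], hit["pident"])
```

The subject identifier of the selected hit tells you which protein sequence to fetch and pass to Fgenesh+. The percent identity reported by BLAST may not be the same measure as the similarity value the program expects, so treat it as a starting point.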

Softberry has significantly improved its programs for gene prediction with protein support. The new Prot_map program can be used to generate a set of genes in a new organism, which can then be used to learn parameters for the gene prediction programs Fgenesh and Fgenesh+. It is also very useful for finding pseudogenes, by selecting the corrupted genes generated when mapping known proteins.

Speed of processing sequences

                                    Fgenesh+   Prot_map   GeneWise
88 sequences of genes < 20 kb       ~1 min     ~1 min     ~90 min
8 sequences of genes > 400000 kb    ~1 min     ~1 min     ~1200 min

Mapping the human protein set of 55946 proteins to chromosome 19 (~59 Mb) with Prot_map takes just 90 min when reporting the best hit for each protein and 148 min when reporting all significant hits for each protein.
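To put these timings in perspective, the short calculation below converts them into throughput. It uses only the figures quoted above; the variable names are illustrative.

```python
# Figures quoted above for mapping the human protein set to chromosome 19.
proteins = 55946
minutes_best_hit = 90     # best hit for each protein
minutes_all_hits = 148    # all significant hits for each protein

rate_best = proteins / minutes_best_hit
rate_all = proteins / minutes_all_hits
slowdown = minutes_all_hits / minutes_best_hit

print(f"best hit only:        {rate_best:.0f} proteins/min")
print(f"all significant hits: {rate_all:.0f} proteins/min")
print(f"all-hits mode takes {slowdown:.2f}x as long")
```

That is roughly 620 proteins per minute in best-hit mode and roughly 380 per minute when all significant hits are reported.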

Accuracy comparison

The accuracy of ab initio gene prediction by Fgenesh was compared with that of prediction with protein support by Fgenesh+, GeneWise and Prot_map (which maps a protein to human DNA) on a large set of human genes, using homologous mouse or Drosophila proteins. Fgenesh+ shows the best performance with mouse proteins. With Drosophila proteins, ab initio prediction by Fgenesh works better than GeneWise for all ranges of similarity, and Fgenesh+ is the best predictor if similarity is higher than 60%.

Gene prediction with mouse protein support:

Similarity level > 90% - 921 sequences

           Sn ex   Sno ex   Sp ex   Sn nuc   Sp nuc   CC       %CG
Fgenesh    86.2    91.7     88.6    93.9     93.4     0.9334   34
GeneWise   93.9    97.6     95.9    99.0     99.6     0.9926   66
Fgenesh+   97.3    98.9     98.0    99.1     99.6     0.9936   81
Prot_map   95.9    98.3     96.9    99.1     99.5     0.9924   73

Gene prediction with Drosophila proteins with similarity ranging from 22% to 98% and coverage in both proteins > 75%:

Similarity level > 80% - 66 sequences.

           Sn ex   Sno ex   Sp ex   Sn nuc   Sp nuc   CC       %CG
Fgenesh    90.5    93.8     95.1    97.9     96.9     0.950    55
GeneWise   79.3    83.9     86.8    97.3     99.5     0.985    23
Fgenesh+   95.1    97.8     97.0    98.9     99.5     0.9914   70
Prot_map   86.4    95.3     88.1    97.6     99.0     0.982    41
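The column headings in the accuracy tables are not defined in this text. The sketch below assumes that "Sn nuc", "Sp nuc" and "CC" follow the definitions conventionally used when evaluating gene finders at the nucleotide level: sensitivity TP/(TP+FN), specificity TP/(TP+FP), and the correlation coefficient between predicted and true coding status. The function name and the counts are hypothetical, and the function does not guard against empty classes (a zero denominator).

```python
import math


def nucleotide_accuracy(tp, fp, tn, fn):
    """Return (Sn, Sp, CC) from nucleotide-level counts.

    tp: coding nucleotides predicted as coding
    fp: non-coding nucleotides predicted as coding
    tn: non-coding nucleotides predicted as non-coding
    fn: coding nucleotides predicted as non-coding
    """
    sn = tp / (tp + fn)   # fraction of true coding nucleotides found
    sp = tp / (tp + fp)   # fraction of predicted coding nucleotides that are true
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    cc = (tp * tn - fn * fp) / denom
    return sn, sp, cc


# Hypothetical counts for a 10000-nt sequence with 1000 coding nucleotides.
sn, sp, cc = nucleotide_accuracy(tp=950, fp=30, tn=8970, fn=50)
print(f"Sn nuc = {100 * sn:.1f}  Sp nuc = {100 * sp:.1f}  CC = {cc:.4f}")
```

Under these definitions a perfect prediction gives Sn = Sp = 100 and CC = 1, which is why CC values close to 1 in the tables indicate near-complete agreement at the nucleotide level.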

Ab initio gene prediction programs usually predict a significant fraction of the exons in a gene correctly, but they often assemble the gene incorrectly: they combine several genes or split one gene into several, skip exons, or include false exons. Using the similarity information provided by one or several correctly predicted exons can significantly improve the accuracy of gene finding.

You should provide the similarity value known from the BLAST or Prot_map search, because it affects the prediction. The program uses this value to estimate how similar the predicted gene product can be to its homolog.

Fgenesh+ output:
G - predicted gene number, counting from the start of the sequence;
Str - DNA strand (+ for direct or - for complementary);
Feature - type of coding sequence: CDSf - first coding segment (starting with the start codon), CDSi - internal exon, CDSl - last coding segment (ending with the stop codon);
TSS - position of transcription start (TATA-box position and score);
Start and End - positions of the feature;
Weight (Score column) - log likelihood*10 score for the feature;
ORF - start/end positions where the first complete codon starts and the last codon ends;
Last three values: length of exon, positions in protein, percent of similarity with target protein.


  FGENESH+ 2.5 Prediction of potential genes in Homo_sapiens genomic DNA
 Time    :   Sun Jan 28 22:28:20 2007
 Seq name: >Adh_and_cact.1 (2919020 bases) 848501 853000 
 Length of sequence: 4500 
 Homology: gi|2313041|gnl|PID|d1022564 (D84316) rab14 [Drosophila melanogaster] 
 Length of homolog: 215 
 Number of predicted genes 1 in +chain 1 in -chain 0
 Number of predicted exons 4 in +chain 4 in -chain 0
 Positions of predicted genes and exons: Variant   1 from   1, Score:1130.648633 
   G Str   Feature   Start        End    Score           ORF           Len

   1 +      TSS       1459               -9.69
   1 +    1 CDSf      2585 -      2690  190.55      2585 -      2689    105     1     35  100
   1 +    2 CDSi      2756 -      2936  334.25      2758 -      2934    177    37     95  100
   1 +    3 CDSi      2991 -      3173  315.47      2992 -      3171    180    97    156  100
   1 +    4 CDSl      3242 -      3419  302.12      3243 -      3419    177   158    214  100
   1 +      PolA      3968                1.13

Predicted protein(s):
>FGENESH:   1   4 exon (s)   2585  -   3419   215 aa, chain +
MTAAPYNYNYIFKYIIIGDMGVGKSCLLHQFTEKKFMANCPHTIGVEFGTRIIEVDDKKI
KLQIWDTAGQERFRAVTRSYYRGAAGALMVYDITRRSTYNHLSSWLTDTRNLTNPSTVIF
LIGNKSDLESTREVTYEEAKEFADENGLMFLEASAMTGQNVEEAFLETARKIYQNIQEGR
LDLNASESGVQHRPSQPSRTSLSSEATGAKDQCSC
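
The exon lines of a report like the one above can be read with a few lines of code. This is a minimal sketch that assumes every coding-segment line has exactly the fifteen whitespace-separated fields seen in this sample. The `Exon` type and `parse_exons` function are hypothetical names, not part of any Softberry tool.

```python
from typing import NamedTuple


class Exon(NamedTuple):
    gene: int
    strand: str
    number: int
    feature: str      # CDSf, CDSi, CDSl
    start: int
    end: int
    score: float
    orf_start: int
    orf_end: int
    length: int       # the "Len" column
    prot_start: int
    prot_end: int
    similarity: int   # percent similarity with the target protein


def parse_exons(report):
    """Collect the coding-segment lines of a report; other lines are skipped."""
    exons = []
    for line in report.splitlines():
        t = line.split()
        # Assumed layout: G Str No Feature Start - End Score ORFstart - ORFend
        # Len ProtStart ProtEnd Similarity (15 tokens, including both dashes).
        if len(t) != 15 or not t[3].startswith("CDS"):
            continue
        exons.append(Exon(
            int(t[0]), t[1], int(t[2]), t[3],
            int(t[4]), int(t[6]), float(t[7]),
            int(t[8]), int(t[10]),
            int(t[11]), int(t[12]), int(t[13]), int(t[14]),
        ))
    return exons


sample = """\
   1 +      TSS       1459               -9.69
   1 +    1 CDSf      2585 -      2690  190.55      2585 -      2689    105     1     35  100
   1 +    2 CDSi      2756 -      2936  334.25      2758 -      2934    177    37     95  100
   1 +    3 CDSi      2991 -      3173  315.47      2992 -      3171    180    97    156  100
   1 +    4 CDSl      3242 -      3419  302.12      3243 -      3419    177   158    214  100
   1 +      PolA      3968                1.13
"""
exons = parse_exons(sample)
coding_nt = sum(e.end - e.start + 1 for e in exons)
print(len(exons), "exons,", coding_nt, "coding nt,", coding_nt // 3, "codons")
```

For the sample report the four exon spans sum to 648 nt, or 216 codons: 215 residues plus the stop codon, in agreement with the 215 aa stated in the predicted protein header.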