Fgenesh+
Program for predicting multiple genes in genomic DNA sequences using an HMM gene model plus homology with a known protein.
Fgenesh+ was developed to analyse sequences from human, Drosophila, nematode and plant genomes, as well as from related organisms. The program can be used if you know a protein sequence similar to the protein predicted for a gene in your sequence. First, run any ab initio gene-finding program, such as Fgenes or Fgenesh. Then run a BLASTP database search with each predicted exon. Any correctly predicted exon can point you to known similar proteins, if such proteins exist in the database. Take the sequence of the homologous protein and run Fgenesh+. The accuracy of gene prediction can reach 100%, depending on how similar the predicted and database proteins are.
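The homolog-selection step described above can be sketched in a few lines of Python. This is only an illustration, not part of Fgenesh+: it assumes the BLASTP results were saved in the default BLAST+ tabular layout (-outfmt 6), and the function names and the example hits (exon1, protA and so on) are hypothetical.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6); assumed here.
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hit_per_query(tabular_text):
    """Return {query_id: hit} keeping the highest-bitscore hit for each query."""
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(BLAST_COLUMNS, row))
        hit["pident"] = float(hit["pident"])
        hit["bitscore"] = float(hit["bitscore"])
        current = best.get(hit["qseqid"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["qseqid"]] = hit
    return best


def choose_homolog(best_hits):
    """Pick the protein with the strongest single hit over all predicted exons.

    Returns (protein_id, percent_identity), or None if there are no hits.
    """
    if not best_hits:
        return None
    top = max(best_hits.values(), key=lambda h: h["bitscore"])
    return top["sseqid"], top["pident"]


# Hypothetical hits for two predicted exons, one query per exon.
example = (
    "exon1\tprotA\t98.0\t35\t1\t0\t1\t35\t1\t35\t1e-15\t70.5\n"
    "exon1\tprotB\t60.0\t30\t12\t0\t1\t30\t5\t34\t1e-03\t30.1\n"
    "exon2\tprotA\t100.0\t59\t0\t0\t1\t59\t37\t95\t2e-30\t120.0\n"
)
print(choose_homolog(best_hit_per_query(example)))
```

The selected protein sequence and its percent identity are the kind of information you would then pass on to Fgenesh+; how they are supplied depends on the interface you use and is not shown here.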
Softberry has significantly improved its programs for gene prediction with protein support. The new Prot_map program can be used to generate a set of genes in a new organism, which can then be used to learn parameters for the gene prediction programs Fgenesh and Fgenesh+. It is also very useful for finding pseudogenes by selecting corrupted genes generated by mapping known proteins.
Test set | Fgenesh+ | Prot_map | GeneWise |
88 sequences of genes < 20 kb | ~1 min | ~1 min | ~90 min |
8 sequences of genes > 400000 kb | ~1 min | ~1 min | ~1200 min |
Prot_map mapping of a human protein set of 55946 proteins onto chromosome 19 (~59 MB) takes just 90 min (best hit for each protein) or 148 min (all significant hits for each protein).
The accuracy of ab initio gene prediction by Fgenesh was compared with that of prediction with protein support by Fgenesh+, GeneWise and Prot_map. Proteins were mapped to human DNA for a large set of human genes, using homologous mouse or Drosophila proteins. Fgenesh+ shows the best performance with mouse proteins. With Drosophila proteins, ab initio prediction by Fgenesh works better than GeneWise for all ranges of similarity, and Fgenesh+ is the best predictor when similarity is higher than 60%.
Gene prediction with mouse protein support:
Similarity level > 90% - 921 sequences
Program | Sn ex | Sno ex | Sp ex | Sn nuc | Sp nuc | CC | %CG |
Fgenesh | 86.2 | 91.7 | 88.6 | 93.9 | 93.4 | 0.9334 | 34 |
GeneWise | 93.9 | 97.6 | 95.9 | 99.0 | 99.6 | 0.9926 | 66 |
Fgenesh+ | 97.3 | 98.9 | 98.0 | 99.1 | 99.6 | 0.9936 | 81 |
Prot_map | 95.9 | 98.3 | 96.9 | 99.1 | 99.5 | 0.9924 | 73 |
Gene prediction with Drosophila protein support, with similarity ranging from 22% to 98% and coverage of both proteins > 75%:
Similarity level > 80% - 66 sequences.
Program | Sn ex | Sno ex | Sp ex | Sn nuc | Sp nuc | CC | %CG |
Fgenesh | 90.5 | 93.8 | 95.1 | 97.9 | 96.9 | 0.950 | 55 |
GeneWise | 79.3 | 83.9 | 86.8 | 97.3 | 99.5 | 0.985 | 23 |
Fgenesh+ | 95.1 | 97.8 | 97.0 | 98.9 | 99.5 | 0.9914 | 70 |
Prot_map | 86.4 | 95.3 | 88.1 | 97.6 | 99.0 | 0.982 | 41 |
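The page does not define the column headers of these tables. Assuming that Sn nuc, Sp nuc and CC follow the definitions commonly used in gene-finding evaluation (nucleotide-level sensitivity TP/(TP+FN), specificity TP/(TP+FP), and the correlation coefficient over coding/non-coding calls), they could be computed as in the sketch below. The function name and the example counts are hypothetical and do not come from the benchmark above.

```python
import math


def nucleotide_accuracy(tp, fp, tn, fn):
    """Return (Sn, Sp, CC) from nucleotide-level counts.

    tp: coding nucleotides predicted as coding
    fp: non-coding nucleotides predicted as coding
    tn: non-coding nucleotides predicted as non-coding
    fn: coding nucleotides predicted as non-coding

    Assumed definitions (common in gene-finding evaluation):
      Sn = TP / (TP + FN)
      Sp = TP / (TP + FP)
      CC = (TP*TN - FP*FN) / sqrt((TP+FN)(TN+FP)(TP+FP)(TN+FN))
    A value is None when its denominator is zero.
    """
    sn = tp / (tp + fn) if (tp + fn) else None
    sp = tp / (tp + fp) if (tp + fp) else None
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    cc = (tp * tn - fp * fn) / denom if denom else None
    return sn, sp, cc


# Hypothetical counts for a 1000-nucleotide test sequence.
sn, sp, cc = nucleotide_accuracy(tp=90, fp=10, tn=890, fn=10)
print(f"Sn={sn:.3f} Sp={sp:.3f} CC={cc:.4f}")
```

Note that "specificity" in this convention is the fraction of predicted coding nucleotides that are truly coding, which is why both Sn and Sp can be high at the same time.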
Ab initio gene prediction programs usually predict a significant fraction of the exons in a gene correctly, but they often assemble the gene incorrectly: they combine several genes or split one gene into several, skip exons, or include false exons. Using similarity information provided by one or several correctly predicted exons can significantly improve the accuracy of gene finding.
You should provide the similarity value known from the BLAST or Prot_map search, because it affects the prediction. The program uses this value to estimate how similar the predicted gene product can be to its homolog.
Fgenesh+ output:
G - predicted gene number, counting from the start of the sequence;
Str - DNA strand (+ for direct or - for complementary);
Feature - type of coding sequence: CDSf - first coding segment (starting with the start codon), CDSi - internal exon, CDSl - last coding segment (ending with the stop codon);
TSS - position of transcription start (TATA-box position and score);
Start and End - positions of the feature;
Weight - log likelihood*10 score for the feature (shown in the Score column of the output);
ORF - start/end positions where the first complete codon starts and the last codon ends.
Last three values: length of exon, positions in protein, percent of similarity with target protein.
FGENESH+ 2.5 Prediction of potential genes in Homo_sapiens genomic DNA
Time : Sun Jan 28 22:28:20 2007
Seq name: >Adh_and_cact.1 (2919020 bases) 848501 853000
Length of sequence: 4500
Homology: gi|2313041|gnl|PID|d1022564 (D84316) rab14 [Drosophila melanogaster]
Length of homolog: 215
Number of predicted genes 1 in +chain 1 in -chain 0
Number of predicted exons 4 in +chain 4 in -chain 0
Positions of predicted genes and exons: Variant 1 from 1, Score:1130.648633
   G Str   Feature   Start        End    Score           ORF           Len
   1 +      TSS       1459               -9.69
   1 +    1 CDSf      2585 -      2690  190.55      2585 -      2689    105    1   35  100
   1 +    2 CDSi      2756 -      2936  334.25      2758 -      2934    177   37   95  100
   1 +    3 CDSi      2991 -      3173  315.47      2992 -      3171    180   97  156  100
   1 +    4 CDSl      3242 -      3419  302.12      3243 -      3419    177  158  214  100
   1 +      PolA      3968                1.13
Predicted protein(s):
>FGENESH: 1 4 exon (s) 2585 - 3419 215 aa, chain +
MTAAPYNYNYIFKYIIIGDMGVGKSCLLHQFTEKKFMANCPHTIGVEFGTRIIEVDDKKI
KLQIWDTAGQERFRAVTRSYYRGAAGALMVYDITRRSTYNHLSSWLTDTRNLTNPSTVIF
LIGNKSDLESTREVTYEEAKEFADENGLMFLEASAMTGQNVEEAFLETARKIYQNIQEGR
LDLNASESGVQHRPSQPSRTSLSSEATGAKDQCSC
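For downstream processing, the exon records of such a report can be pulled out with a small parser. The sketch below is not a Softberry tool: it assumes that each feature is printed on its own line with whitespace-separated fields in the order described above, and the names Exon and parse_exons are hypothetical. The sample lines are the exon records from the report above.

```python
import re
from typing import List, NamedTuple


class Exon(NamedTuple):
    gene: int          # G: predicted gene number
    strand: str        # Str: "+" or "-"
    number: int        # exon number within the gene
    feature: str       # CDSf, CDSi, CDSl, ...
    start: int         # Start of the feature
    end: int           # End of the feature
    score: float       # Score (weight) of the feature
    orf_start: int     # first complete codon starts here
    orf_end: int       # last complete codon ends here
    length: int        # Len column
    prot_start: int    # first aligned position in the protein
    prot_end: int      # last aligned position in the protein
    similarity: float  # percent similarity with the target protein


# One exon record per line; TSS and PolA lines do not match and are skipped.
EXON_RE = re.compile(
    r"^\s*(\d+)\s+([+-])\s+(\d+)\s+(CDS\w)\s+(\d+)\s+-\s+(\d+)"
    r"\s+(-?\d+(?:\.\d+)?)\s+(\d+)\s+-\s+(\d+)"
    r"\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+(?:\.\d+)?)\s*$"
)


def parse_exons(report: str) -> List[Exon]:
    """Extract exon records from the text of a Fgenesh+ report."""
    exons = []
    for line in report.splitlines():
        m = EXON_RE.match(line)
        if not m:
            continue
        g = m.groups()
        exons.append(Exon(
            int(g[0]), g[1], int(g[2]), g[3], int(g[4]), int(g[5]),
            float(g[6]), int(g[7]), int(g[8]), int(g[9]),
            int(g[10]), int(g[11]), float(g[12]),
        ))
    return exons


sample = """\
   1 +      TSS       1459               -9.69
   1 +    1 CDSf      2585 -      2690  190.55      2585 -      2689    105    1   35  100
   1 +    2 CDSi      2756 -      2936  334.25      2758 -      2934    177   37   95  100
   1 +    3 CDSi      2991 -      3173  315.47      2992 -      3171    180   97  156  100
   1 +    4 CDSl      3242 -      3419  302.12      3243 -      3419    177  158  214  100
   1 +      PolA      3968                1.13
"""

for exon in parse_exons(sample):
    print(exon.feature, exon.start, exon.end, exon.similarity)
```

A regular expression is used rather than fixed column offsets because the exact column widths of the report are not documented here; if your output differs (for example, additional feature types), the pattern would need adjusting.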