Fgenesh-2
Fgenesh-2 is a program for predicting multiple genes in genomic DNA sequences. It uses an HMM gene model together with the genomic sequences of two closely related organisms to increase the reliability of exon and gene identification.
The program can be used when DNA sequences of homologous genomic regions from two similar organisms, such as human and mouse, are available.
Ab initio gene prediction programs usually predict a significant fraction of the exons in a gene correctly, but they often assemble the gene incorrectly: they combine several genes into one, split one gene into several, skip exons, or include false exons. Using sequences from two organisms can significantly improve the accuracy of exact gene finding, and the human genome draft sequence and the mouse genomic sequence provide many homologous sequences for this purpose.
The program shows the predicted genes in both sequences as two sequential Fgenesh outputs. The output fields are:
G - predicted gene number, counted from the start of the sequence;
Str - DNA strand (+ for direct, - for complementary);
Feature - type of coding sequence: CDSf - first coding segment (starting with the start codon), CDSi - internal exon, CDSl - last coding segment (ending with the stop codon);
TSS - position of transcription start (TATA-box position and score);
Start and End - positions of the feature;
Weight - log likelihood*10 score for the feature;
ORF - start/end positions where the first complete codon starts and the last codon ends;
Last three values: length of exon, positions in protein, percent of similarity with target protein.
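The field list above maps directly onto one exon row of the output. The following is a minimal Python sketch of reading such a row, assuming the whitespace-separated layout seen in the example output on this page (gene number, strand, exon number, feature, Start - End, score, ORF start - ORF end, length). The names `Exon`, `EXON_LINE` and `parse_exon_line` are hypothetical and are not part of Fgenesh-2.

```python
import re
from typing import NamedTuple, Optional


class Exon(NamedTuple):
    gene: int        # G: predicted gene number
    strand: str      # Str: "+" or "-"
    number: int      # exon number within the gene
    feature: str     # CDSf, CDSi or CDSl
    start: int       # Start of the feature
    end: int         # End of the feature
    score: float     # score column of the row
    orf_start: int   # first complete codon starts here
    orf_end: int     # last codon ends here
    length: int      # Len column


# Assumed row layout, taken from the example output only:
#   G Str No Feature Start - End Score ORFstart - ORFend Len
EXON_LINE = re.compile(
    r"^\s*(\d+)\s+([+-])\s+(\d+)\s+(CDS[fil])\s+"
    r"(\d+)\s+-\s+(\d+)\s+(-?\d+\.\d+)\s+"
    r"(\d+)\s+-\s+(\d+)\s+(\d+)\s*$"
)


def parse_exon_line(line: str) -> Optional[Exon]:
    """Return an Exon for a CDS row, or None for any other line."""
    m = EXON_LINE.match(line)
    if m is None:
        return None
    g, strand, num, feat, start, end, score, o1, o2, length = m.groups()
    return Exon(int(g), strand, int(num), feat, int(start), int(end),
                float(score), int(o1), int(o2), int(length))


if __name__ == "__main__":
    print(parse_exon_line("1 + 2 CDSi 2672 - 2774 38.26 2672 - 2773 102"))
```

Rows that do not describe a coding segment, such as a PolA row, return `None` in this sketch and would need their own pattern.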
EXAMPLE of output for genes predicted in Human and Mouse genomic sequences:
Fgenesh-2 1.C Prediction of potential genes in 1st genomic DNA
Time: Fri Nov 10 02:55:51 2000
Seq name: HSCKIIBE
Length of sequence: 5917 GC content: 53 Zone: 3
Number of predicted genes 1 in +chain 1 in -chain 0
Number of predicted exons 6 in +chain 6 in -chain 0
Positions of predicted genes and exons:
   G Str   Feature   Start        End    Score           ORF           Len
   1 +    1 CDSf      1634 -      1705   18.99      1634 -      1705     72
   1 +    2 CDSi      2672 -      2774   38.26      2672 -      2773    102
   1 +    3 CDSi      3344 -      3459   41.09      3346 -      3459    114
   1 +    4 CDSi      3906 -      3981   25.73      3906 -      3980     75
   1 +    5 CDSi      4128 -      4317   67.44      4130 -      4315    186
   1 +    6 CDSl      4645 -      4735   29.35      4646 -      4735     90
   1 +      PolA      4855                0.92
Predicted protein(s):
>Fgenesh-2 1 6 exon (s) 1634 - 4735 215 aa, chain +
MSSSEEVSWISWFCGLRGNEFFCEVDEDYIQDKFNLTGLNEQVPHYRQALDMILDLEPDE
ELEDNPNQSDLIEQAAEMLYGLIHARYILTNRGIAQMLEKYQQGDFGYCPRVYCENQPML
PIGLSDIPGEAMVKLYCPKCMDVYTPKSSRHHHTDGAYFGTGFPHMLFMVHPEYRPKRPA
NQFVPRLYGFKIHPMAYQLQLQAASNFKSPVKTIR

Fgenesh-2 1.C Prediction of potential genes in 2nd genomic DNA
Time: Fri Nov 10 02:55:51 2000
Seq name: MMGMCK2B
Length of sequence: 7874 GC content: 51 Zone: 2
Number of predicted genes 1 in +chain 1 in -chain 0
Number of predicted exons 6 in +chain 6 in -chain 0
Positions of predicted genes and exons:
   G Str   Feature   Start        End    Score           ORF           Len
   1 +    1 CDSf      2169 -      2240   38.64      2169 -      2240     72
   1 +    2 CDSi      2829 -      2931   28.70      2829 -      2930    102
   1 +    3 CDSi      4112 -      4227   36.45      4114 -      4227    114
   1 +    4 CDSi      4615 -      4690   18.76      4615 -      4689     75
   1 +    5 CDSi      4801 -      4990   56.00      4803 -      4988    186
   1 +    6 CDSl      6262 -      6352   18.70      6263 -      6352     90
   1 +      PolA      6470                0.92
Predicted protein(s):
>Fgenesh-2 1 6 exon (s) 2169 - 6352 215 aa, chain +
MSSSEEVSWISWFCGLRGNEFFCEVDEDYIQDKFNLTGLNEQVPHYRQALDMILDLEPDE
ELEDNPNQSDLIEQAAEMLYGLIHARYILTNRGIAQMLEKYQQGDFGYCPRVYCENQPML
PIGLSDIPGEAMVKLYCPKCMDVYTPKSSRHHHTDGAYFGTGFPHMLFMVHPEYRPKRPA
NQFVPRLYGFKIHPMAYQLQLQAASNFKSPVKTIR
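
As an illustration of how the two predictions in the example can be compared, the Python sketch below takes the exon Start/End coordinates from the human (HSCKIIBE) and mouse (MMGMCK2B) outputs and checks two things: that the exon lengths agree between the two species, and that the total coding length equals three bases per amino acid plus one stop codon. The second check assumes a complete gene whose CDSl segment includes the stop codon. The helper names are hypothetical; this is a post-processing check, not part of how Fgenesh-2 itself works.

```python
# Exon (Start, End) coordinates copied from the example output.
human_exons = [(1634, 1705), (2672, 2774), (3344, 3459),
               (3906, 3981), (4128, 4317), (4645, 4735)]
mouse_exons = [(2169, 2240), (2829, 2931), (4112, 4227),
               (4615, 4690), (4801, 4990), (6262, 6352)]
PROTEIN_LENGTH_AA = 215  # "215 aa" in both predicted protein headers


def exon_lengths(exons):
    """Length of each exon; coordinates are 1-based and inclusive."""
    return [end - start + 1 for start, end in exons]


def coding_length_matches_protein(exons, protein_aa):
    """True if the exons cover protein_aa codons plus one stop codon.

    Assumes a complete gene (CDSf through CDSl) with the stop codon
    included in the last coding segment.
    """
    return sum(exon_lengths(exons)) == 3 * (protein_aa + 1)


if __name__ == "__main__":
    print("human exon lengths:", exon_lengths(human_exons))
    print("mouse exon lengths:", exon_lengths(mouse_exons))
    print("same exon structure:",
          exon_lengths(human_exons) == exon_lengths(mouse_exons))
    print("human coding length consistent:",
          coding_length_matches_protein(human_exons, PROTEIN_LENGTH_AA))
    print("mouse coding length consistent:",
          coding_length_matches_protein(mouse_exons, PROTEIN_LENGTH_AA))
```

In this example all six exon lengths coincide between the two sequences even though the intron lengths differ, which is the kind of conserved exon structure that a two-genome comparison can exploit.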