Fgenesh-2

A program for predicting multiple genes in genomic DNA sequences. It combines an HMM gene model with the genomic sequences of two closely related organisms to make exon and gene identification more reliable.

The program can be used when DNA sequences of homologous genomic regions from two similar organisms, such as human and mouse, are available.

Ab initio gene prediction programs usually predict a significant fraction of the exons in a gene correctly, but they often assemble the gene incorrectly: they merge several genes into one, split one gene into several, skip exons, or include false exons. Using the sequences of two organisms can significantly improve the accuracy of exact gene prediction, and the human genome draft sequence and the mouse genomic sequence provide many homologous regions to compare.

The program reports the predicted genes in both sequences as two sequential Fgenesh outputs. The output fields are:
G - predicted gene number, counted from the start of the sequence;
Str - DNA strand (+ for direct, - for complementary);
Feature - type of coding sequence: CDSf - first coding segment (starting with the start codon), CDSi - internal exon, CDSl - last coding segment (ending with the stop codon);
TSS - position of the transcription start (TATA-box position and score);
Start and End - positions of the feature;
Weight - log-likelihood*10 score for the feature (the Score column in the example output);
ORF - start/end positions where the first complete codon starts and the last complete codon ends;
Last three values: length of exon, positions in protein, percent of similarity with target protein.
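
The field list above can be turned into a small parser for the exon rows of the output. The sketch below is not part of Fgenesh-2. The names Exon, EXON_ROW and parse_exon_row are hypothetical, and the regular expression assumes the column layout shown in the example output on this page (gene, strand, exon number, feature, Start - End, score, ORF start - ORF end, length). It recognizes only the CDSf, CDSi and CDSl features described here and returns None for any other line, such as the PolA line.

```python
import re
from typing import NamedTuple, Optional


class Exon(NamedTuple):
    """One exon row of the output (hypothetical container, not a Fgenesh-2 type)."""
    gene: int
    strand: str
    number: int
    feature: str
    start: int
    end: int
    score: float
    orf_start: int
    orf_end: int
    orf_len: int


# Assumed layout: "G Str N Feature Start - End Score ORFstart - ORFend Len"
EXON_ROW = re.compile(
    r"^\s*(\d+)\s+([+-])\s+(\d+)\s+(CDS[fil])\s+"
    r"(\d+)\s+-\s+(\d+)\s+(-?\d+\.\d+)\s+"
    r"(\d+)\s+-\s+(\d+)\s+(\d+)\s*$"
)


def parse_exon_row(line: str) -> Optional[Exon]:
    """Return an Exon for a CDSf/CDSi/CDSl row, or None for any other line."""
    match = EXON_ROW.match(line)
    if match is None:
        return None
    g, strand, n, feature, start, end, score, o_start, o_end, o_len = match.groups()
    return Exon(int(g), strand, int(n), feature, int(start), int(end),
                float(score), int(o_start), int(o_end), int(o_len))


row = parse_exon_row(
    "  1 +   3 CDSi    3344 -    3459     41.09    3346 -    3459    114"
)
print(row.feature, row.start, row.end, row.orf_len)
```

For this row the ORF starts at 3346 rather than at the exon start 3344, because the first two bases of the exon complete a codon begun in the previous exon. The reported length 114 equals 3459 - 3346 + 1.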

EXAMPLE of output for genes predicted in Human and Mouse genomic sequences:


 Fgenesh-2 1.C Prediction of potential genes in 1st genomic DNA
 Time:   Fri Nov 10 02:55:51 2000
 Seq name: HSCKIIBE
 Length of sequence:  5917  GC content: 53 Zone: 3
 Number of predicted genes 1 in +chain 1 in -chain 0
 Number of predicted exons 6 in +chain 6 in -chain 0
 Positions of predicted genes and exons:
  G Str Feature    Start     End   Score        ORF           Len

  1 +   1 CDSf    1634 -    1705     18.99    1634 -    1705     72
  1 +   2 CDSi    2672 -    2774     38.26    2672 -    2773    102
  1 +   3 CDSi    3344 -    3459     41.09    3346 -    3459    114
  1 +   4 CDSi    3906 -    3981     25.73    3906 -    3980     75
  1 +   5 CDSi    4128 -    4317     67.44    4130 -    4315    186
  1 +   6 CDSl    4645 -    4735     29.35    4646 -    4735     90
  1 +     PolA    4855                0.92

Predicted protein(s):
>Fgenesh-2   1   6 exon (s)   1634  -   4735    215 aa, chain +
MSSSEEVSWISWFCGLRGNEFFCEVDEDYIQDKFNLTGLNEQVPHYRQALDMILDLEPDE
ELEDNPNQSDLIEQAAEMLYGLIHARYILTNRGIAQMLEKYQQGDFGYCPRVYCENQPML
PIGLSDIPGEAMVKLYCPKCMDVYTPKSSRHHHTDGAYFGTGFPHMLFMVHPEYRPKRPA
NQFVPRLYGFKIHPMAYQLQLQAASNFKSPVKTIR
 Fgenesh-2 1.C Prediction of potential genes in 2nd genomic DNA
 Time:   Fri Nov 10 02:55:51 2000
 Seq name: MMGMCK2B
 Length of sequence:  7874  GC content: 51 Zone: 2
 Number of predicted genes 1 in +chain 1 in -chain 0
 Number of predicted exons 6 in +chain 6 in -chain 0
 Positions of predicted genes and exons:
  G Str Feature    Start     End   Score        ORF           Len

  1 +   1 CDSf    2169 -    2240     38.64    2169 -    2240     72
  1 +   2 CDSi    2829 -    2931     28.70    2829 -    2930    102
  1 +   3 CDSi    4112 -    4227     36.45    4114 -    4227    114
  1 +   4 CDSi    4615 -    4690     18.76    4615 -    4689     75
  1 +   5 CDSi    4801 -    4990     56.00    4803 -    4988    186
  1 +   6 CDSl    6262 -    6352     18.70    6263 -    6352     90
  1 +     PolA    6470                0.92

Predicted protein(s):
>Fgenesh-2   1   6 exon (s)   2169  -   6352    215 aa, chain +
MSSSEEVSWISWFCGLRGNEFFCEVDEDYIQDKFNLTGLNEQVPHYRQALDMILDLEPDE
ELEDNPNQSDLIEQAAEMLYGLIHARYILTNRGIAQMLEKYQQGDFGYCPRVYCENQPML
PIGLSDIPGEAMVKLYCPKCMDVYTPKSSRHHHTDGAYFGTGFPHMLFMVHPEYRPKRPA
NQFVPRLYGFKIHPMAYQLQLQAASNFKSPVKTIR
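
The two predictions in the example can be checked for internal consistency with a little arithmetic. The sketch below is an illustration, not part of Fgenesh-2. It copies the Start and End coordinates from the two exon tables by hand, and the helper names exon_lengths, coding_length and codes_for are hypothetical. It relies on the statement in the field description that the last coding segment ends with the stop codon, so the total coding length should equal 3 * (protein length + 1).

```python
# Start/End coordinates copied from the example output above.
human_exons = [(1634, 1705), (2672, 2774), (3344, 3459),
               (3906, 3981), (4128, 4317), (4645, 4735)]
mouse_exons = [(2169, 2240), (2829, 2931), (4112, 4227),
               (4615, 4690), (4801, 4990), (6262, 6352)]


def exon_lengths(exons):
    """Length in bases of each exon; coordinates are inclusive."""
    return [end - start + 1 for start, end in exons]


def coding_length(exons):
    """Total number of coding bases over all exons of one gene."""
    return sum(exon_lengths(exons))


def codes_for(exons, protein_len):
    """True if the exons hold exactly protein_len codons plus one stop codon."""
    return coding_length(exons) == 3 * (protein_len + 1)


print(exon_lengths(human_exons))    # [72, 103, 116, 76, 190, 91]
print(coding_length(human_exons))   # 648
print(codes_for(human_exons, 215))  # True
print(codes_for(mouse_exons, 215))  # True
print(exon_lengths(human_exons) == exon_lengths(mouse_exons))  # True
```

Both genes have six exons of identical lengths and encode 215 amino acids, which is the kind of conserved exon structure the two-sequence comparison is meant to exploit.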