ProtMap
New Fast Tool for Aligning Proteins with Genome and Accurately Reconstructing Exon-intron Gene Structure
The ProtMap program maps a set of protein sequences to a genomic sequence, producing gene structures and the corresponding alignments of coding exons with similar or identical protein queries. ProtMap takes a genomic sequence and a set of protein sequences as input and reconstructs gene structure based on protein identity or homology, in contrast to the set of unordered alignment fragments generated by Blast. The program is very fast, and it produces gene structures similar to those of the GeneWise program, which is hundreds of times slower (see Table 1 for a speed comparison). Accuracy can be significantly improved further by running Fgenesh+ on ProtMap output (see Table 2 for an accuracy comparison).
ProtMap is used as a part of the Softberry automatic genome annotation pipeline, Fgenesh++C. We also use it to generate putative gene models for training gene-finding parameters on new genomes for which few or no known genes are available. ProtMap is also very useful for finding pseudogenes, which appear as corrupted gene structures that map to known protein sequences.
Figure 1. Example of mapping a protein sequence to human chromosome 19.
L:3000000 Sequence Chr19 [cut:1 3000000]
[DD] Sequence: 1( 1), S: 105.56, L:1739 IPI:IPI00170643.1|SWISS-PROT:Q8TEK3-1 Tax_Id=9606 Splice isoform 2 of Q8TEK3
Summ of block lengths: 1284, Alignment bounds:
On first sequence: start 2146727, end 2167197, length 20471
On second sequence: start 263, end 1682, length 1420
Blocks of alignment: 21
1 E: 2146727 70 [ca GT] P: 2146727 263 L: 23, G: 101.574 S:14.75
2 E: 2147573 107 [AG GT] P: 2147575 287 L: 35, G: 103.465, S:18.56
3 E: 2148934 42 [AG GT] P: 2148934 322 L: 14, G: 103.043, S:11.68
4 E: 2150399 111 [AG GT] P: 2150399 336 L: 37, G: 102.130, S:18.82
5 E: 2150620 235 [AG GT] P: 2150620 373 L: 78, G: 101.500, S:27.15
6 E: 2151098 114 [AG GT] P: 2151100 452 L: 37, G: 106.924, S:19.76
7 E: 2151750 92 [AG GT] P: 2151752 490 L: 30, G: 101.424, S:16.82
8 E: 2153538 102 [AG GT] P: 2153538 520 L: 34, G: 100.496, S:17.73
9 E: 2153848 138 [AG GT] P: 2153848 554 L: 46, G: 99.003, S:20.30
10 E: 2154470 126 [AG GT] P: 2154470 600 L: 42, G: 101.283, S:19.87

1 11 2146713 2146723 2146739 2146769
gatcacagaggctgg(..)agtgtctgtgtttca?[GGRIVSSKPFAPLNFRINSRNLSg
---------------(..)evdhqlkerfanmke  GGRIVSSKPFAPLNFRINSRNLS-
248 248 249 259 267 277

2146797 2146806 2147558 2147568 2147581 2147611
]gtaagaaactctcat(..)ctgtggctcctgcag[acIGTIMRVVELSPLKGSVSWTGK
 ---------------(..)--------------- -dIGTIMRVVELSPLKGSVSWTGK
286 286 286 286 289 299

2147641 2147671 2147686 2148919 2148926 2148937
PVSYYLHTIDRTI]gtgagtatctcgctg(..)ctttcttctttttag[LENYFSSLKNP
PVSYYLHTIDRTI ---------------(..)--------------- LENYFSSLKNP
309 319 322 322 322 323

2148967 2148982 2150384 2150391 2150402 2150432
KLR]gtaagtttgtgtgtt(..)ctgctctccttccag[EEQEAARRRQQRESKSNAATP
KLR ---------------(..)--------------- EEQEAARRRQQRESKSNAATP
333 336 336 336 337 347

2150462 2150492 2150513 2150523 2150609 2150619
TKGPEGKVAGPADAPM]gtaaggccccagcct(..)ccttgtgtcctccag[DSGAEEEK
TKGPEGKVAGPADAPM ---------------(..)--------------- DSGAEEEK
357 367 373 373 373 373
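The block lines in Figure 1 are regular enough to be read by a script. The following is a minimal sketch of such a reader, not an official ProtMap parser. The field meanings are assumptions inferred from the example only: `E:` is taken as the genomic start and length of the exon in nucleotides, the bracketed pair as the splice-site dinucleotides flanking the exon, `P:` as the genomic position of the first aligned codon followed by the protein position, `L:` as the number of aligned residues, and `G:` and `S:` as two scores. The names `Block`, `BLOCK_RE` and `parse_blocks` are hypothetical.

```python
import re
from typing import List, NamedTuple


class Block(NamedTuple):
    index: int
    exon_start: int      # assumed: genomic start of the exon (1-based)
    exon_length: int     # assumed: exon length in nucleotides
    left_site: str       # dinucleotide before the exon, e.g. "AG"
    right_site: str      # dinucleotide after the exon, e.g. "GT"
    codon_start: int     # assumed: genomic position of first aligned codon
    protein_pos: int     # assumed: first aligned residue in the protein
    residues: int        # assumed: number of aligned residues
    g_score: float
    s_score: float

    @property
    def exon_end(self) -> int:
        # Inclusive end coordinate derived from start and length.
        return self.exon_start + self.exon_length - 1


# The comma after the G value is optional because block 1 in the
# example omits it.
BLOCK_RE = re.compile(
    r"(\d+)\s+E:\s*(\d+)\s+(\d+)\s+\[(\w+)\s+(\w+)\]\s+"
    r"P:\s*(\d+)\s+(\d+)\s+L:\s*(\d+),\s*G:\s*([\d.]+),?\s*S:\s*([\d.]+)"
)


def parse_blocks(text: str) -> List[Block]:
    """Return every alignment block found in ProtMap-style output."""
    blocks = []
    for m in BLOCK_RE.finditer(text):
        blocks.append(Block(
            int(m.group(1)), int(m.group(2)), int(m.group(3)),
            m.group(4), m.group(5),
            int(m.group(6)), int(m.group(7)), int(m.group(8)),
            float(m.group(9)), float(m.group(10)),
        ))
    return blocks


sample = """\
1 E: 2146727 70 [ca GT] P: 2146727 263 L: 23, G: 101.574 S:14.75
2 E: 2147573 107 [AG GT] P: 2147575 287 L: 35, G: 103.465, S:18.56
"""
blocks = parse_blocks(sample)
for b in blocks:
    print(b.index, b.exon_start, b.exon_end, b.left_site, b.right_site)
```

Under these assumptions the first exon ends at 2146796, which agrees with the alignment in Figure 1, where the following intron begins at 2146797.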
Table 1. Speed of processing sequences by Prot_Map, Fgenesh+ and GeneWise.
Test set | Fgenesh+ | Prot_map | GeneWise |
88 sequences of genes < 20 kb | ~1 min | ~1 min | ~90 min |
8 sequences of genes > 400 kb | ~1 min | ~1 min | ~1200 min |
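The relative speed implied by Table 1 can be worked out directly from the timings. The sketch below simply divides the GeneWise time by the Prot_map time for each test set; the timings are the approximate values from the table, and the row labels and the `speedup` helper are illustrative names only.

```python
# Approximate running times in minutes, copied from Table 1.
timings_min = {
    "88 short genes": {"Fgenesh+": 1, "Prot_map": 1, "GeneWise": 90},
    "8 long genes": {"Fgenesh+": 1, "Prot_map": 1, "GeneWise": 1200},
}


def speedup(row: dict, fast: str = "Prot_map", slow: str = "GeneWise") -> float:
    """How many times longer `slow` takes than `fast` on one test set."""
    return row[slow] / row[fast]


ratios = {name: speedup(row) for name, row in timings_min.items()}
for name, ratio in ratios.items():
    print(f"{name}: GeneWise takes about {ratio:.0f}x as long as Prot_map")
```

Because the table reports rounded times, these ratios are order-of-magnitude estimates rather than precise benchmarks.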
Table 2. Comparison of accuracy of gene identification programs: ab initio Fgenesh and prediction with protein support (Fgenesh+, GeneWise and Prot_Map) on a set of human genes using mouse or Drosophila homologous proteins. Sn ex, sensitivity on exon level (exact exon predictions); Sno ex, sensitivity with exon overlap; Sp ex, specificity, exon level; Sn nuc, sensitivity, nucleotides; Sp nuc, specificity, nucleotides; CC, correlation coefficient; %CG, percent of genes predicted completely correctly (no missing and no extra exons, and all exon boundaries predicted exactly).
Mouse homologs: 60% < similarity level < 80% - 1425 sequences
Program | Sn ex | Sno ex | Sp ex | Sn nuc | Sp nuc | CC | %CG |
Fgenesh | 83.4 | 90.9 | 86.8 | 93.2 | 94.9 | 0.937 | 30 |
Genewise | 88.1 | 96.5 | 90.5 | 97.8 | 99.2 | 0.984 | 43 |
Fgenesh+ | 93.9 | 97.9 | 94.9 | 98.4 | 99.3 | 0.988 | 65 |
Prot_map | 87.0 | 96.5 | 86.6 | 97.0 | 98.5 | 0.976 | 40 |
Drosophila homologs: similarity level > 80% - 66 sequences.
Program | Sn ex | Sno ex | Sp ex | Sn nuc | Sp nuc | CC | %CG |
Fgenesh | 90.5 | 93.8 | 95.1 | 97.9 | 96.9 | 0.950 | 55 |
Genewise | 79.3 | 83.9 | 86.8 | 97.3 | 99.5 | 0.985 | 23 |
Fgenesh+ | 95.1 | 97.8 | 97.0 | 98.9 | 99.5 | 0.9914 | 70 |
Prot_map | 86.4 | 95.3 | 88.1 | 97.6 | 99.0 | 0.982 | 41 |
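The measures reported in Table 2 can be illustrated for a single gene with a short script. This is a sketch under stated assumptions, not the evaluation code used to produce the table: specificity is taken in the usual gene-finding sense of TP / (TP + FP), the correlation coefficient is the standard nucleotide-level formula over TP, TN, FP and FN, and exons are inclusive (start, end) pairs. All function names are hypothetical, and both exon lists are assumed to be non-empty.

```python
from math import sqrt
from typing import Dict, List, Set, Tuple

Exon = Tuple[int, int]  # inclusive (start, end) coordinates


def exon_level(pred: List[Exon], true: List[Exon]) -> Dict[str, float]:
    """Exon-level sensitivity (exact and overlap) and specificity."""
    pred_set, true_set = set(pred), set(true)
    exact = len(pred_set & true_set)
    # A true exon counts as overlapped if any predicted exon shares a base.
    overlapped = sum(
        1 for t in true_set
        if any(p[0] <= t[1] and t[0] <= p[1] for p in pred_set)
    )
    return {
        "Sn ex": exact / len(true_set),
        "Sno ex": overlapped / len(true_set),
        "Sp ex": exact / len(pred_set),
    }


def coding_positions(exons: List[Exon]) -> Set[int]:
    """Set of all nucleotide positions covered by the exons."""
    covered: Set[int] = set()
    for start, end in exons:
        covered.update(range(start, end + 1))
    return covered


def nucleotide_level(pred: List[Exon], true: List[Exon],
                     seq_len: int) -> Dict[str, float]:
    """Nucleotide-level sensitivity, specificity and correlation."""
    p, t = coding_positions(pred), coding_positions(true)
    tp, fp, fn = len(p & t), len(p - t), len(t - p)
    tn = seq_len - tp - fp - fn
    cc = (tp * tn - fp * fn) / sqrt(
        (tp + fn) * (tn + fp) * (tp + fp) * (tn + fn)
    )
    return {"Sn nuc": tp / (tp + fn), "Sp nuc": tp / (tp + fp), "CC": cc}


def gene_correct(pred: List[Exon], true: List[Exon]) -> bool:
    """True if no exon is missing or extra and all boundaries are exact."""
    return set(pred) == set(true)


# Toy gene: three true exons; the prediction gets one exactly right,
# one with a shifted start, and misses the third.
true_exons = [(10, 19), (30, 39), (50, 59)]
pred_exons = [(10, 19), (32, 39)]
ex = exon_level(pred_exons, true_exons)
nuc = nucleotide_level(pred_exons, true_exons, seq_len=100)
print(ex, nuc, gene_correct(pred_exons, true_exons))
```

%CG in Table 2 would then correspond to the percentage of genes in the test set for which `gene_correct` holds.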