EstMap

EstMap is a program for mapping a whole set of mRNAs/ESTs to a chromosome sequence. For example, 11,000 full-length mRNA sequences from the NCBI reference set were mapped to a 52-Mb unmasked fragment of the Y chromosome in about 18-25 min, depending on computer memory size. EstMap takes the statistical features of splice sites into account for more accurate mapping.

EstMap is part of the FGENESH++C genome annotation pipeline, where it maps RefSeq sequences to a query genome at the very early stages of annotation.

Example of EstMap output:


L:4000001    Sequence chr7 [cut:73000000 77000000] vs C:\Documents and Settings\My Documents\MolQuestWorkSpace\example_data\EstMap\seq.fa
[DD] Sequence:       1(      1), S:       36.26, L:      457 AA628013   nq61d05.s1 NCI_CGAP_Co9 Homo sapiens cDNA clone IMAGE:1148361 3', mRN
Summ of block lengths: 457, Alignment bounds:
On first  sequence: start   2214596, end   2215412, length 817
On second sequence: start         1, end       457, length 457
Block of alignment: 4        
    1 E:  2214596    234 [ct CT] P:   2214596         1 L:     234, G:  99.57, W:   2305, S:26.2324
    2 E:  2214966     69 [AC CT] P:   2214966       235 L:      69, G: 100.00, W:    690, S:14.1834
    3 E:  2215144     65 [AC CT] P:   2215144       304 L:      65, G: 100.00, W:    650, S:13.7542
    4 E:  2215324     89 [AC aa] P:   2215324       369 L:      89, G:  97.75, W:    820, S:15.6754
        1 gagccaagattgtgc(..)acgctcaggccacct?[CTGGGCCTCTCTTTATTGAGGGCA
          ...............(..)...............  ||||||||||||||||||||||||
        1 ---------------(..)---------------  CTGGGCCTCTCTTTATTGAGGGCA

  2214620 CTGGGCCCAGGTCTTCCTTCAGGGCCCACAGCGCCCATAAAACCCAAGGGAGAATAGAAG
          ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
       25 CTGGGCCCAGGTCTTCCTTCAGGGCCCACAGCGCCCATAAAACCCAAGGGAGAATAGAAG

  2214680 AGACCCCCTGATACACGCACACTCGAGGGGCGCCTCCCATCCCCTCCCACAACACACAGG
          ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
       85 AGACCCCCTGATACACGCACACTCGAGGGGCGCCTCCCATCCCCTCCCACAACACACAGG

  2214740 ACAGAAGCCCCTCTGGGCCGGCAGGGGAAGGCCCAGCCTCAATCCTTCTTGCTCCCGTGC
          |||||||||||||||||||||||0||||||||||||||||||||||||||||||||||||
      145 ACAGAAGCCCCTCTGGGCCGGCAAGGGAAGGCCCAGCCTCAATCCTTCTTGCTCCCGTGC

  2214800 CGCTGACTGTGAAACTTGTGGTGCACAACC]ctcagggtggtgaag(..)gggaccccgg
          |||||||||||||||||||||||||||||| ...............(..)..........
      205 CGCTGACTGTGAAACTTGTGGTGCACAACC ---------------(..)----------

  2214961 ctcac[CTGCCACTCCTTGCACTGAGGGTCCTGGGCCAGGTTGAACAACGTCAGCGCGTT
          ..... ||||||||||||||||||||||||||||||||||||||||||||||||||||||
      235 ----- CTGCCACTCCTTGCACTGAGGGTCCTGGGCCAGGTTGAACAACGTCAGCGCGTT

  2215020 AAAAAGCTGCCAGAA]ctaagcagggaggag(..)agaggcacgacttac[GTGTCCAAA
          ||||||||||||||| ...............(..)............... |||||||||
      289 AAAAAGCTGCCAGAA ---------------(..)--------------- GTGTCCAAA

  2215153 GAAAAGAAAAGGCAGCAGGAAGGTGAGGCCCCGCCACATCCAGGACTGGAAGCCCT]ctg
          |||||||||||||||||||||||||||||||||||||||||||||||||||||||| ...
      313 GAAAAGAAAAGGCAGCAGGAAGGTGAGGCCCCGCCACATCCAGGACTGGAAGCCCT ---

  2215212 cggggaggaagg(..)ccactcccgactcac[CCACAGTGAGGTCCATGGTGTGCCGCTC
          ............(..)............... ||||||||||||||||||||||||||||
      369 ------------(..)--------------- CCACAGTGAGGTCCATGGTGTGCCGCTC

  2215352 GCCCAGCGCCCGCAGGCGGTAGAGGCAGCCGCTCTGGTAGTAGTACTGGAGAAACTGCAC
          ||||||||||||||||0|0|||||||||||||||||||||||||||||||||||||||||
      397 GCCCAGCGCCCGCAGGGGATAGAGGCAGCCGCTCTGGTAGTAGTACTGGAGAAACTGCAC

  2215412 G]?aagcctgggccgggc(..)tacagcaaaactgga
          |  ...............(..)...............
      457 G  ---------------(..)---------------
	

Where:

The first line is the header:


[DD] Sequence:       1(      1), S:       36.26, L:      457 AA628013   nq61d05.s1 NCI_CGAP_Co9 Homo sapiens cDNA clone IMAGE:1148361 3', mRNA sequence.
[DD] - target sequence in direct chain (D), query sequence in direct chain (D). Variants:
[DR] - target sequence in direct chain (D), query sequence in reverse chain (R).
[RD] - target sequence in reverse chain (R), query sequence in direct chain (D).
[RR] - target sequence in reverse chain (R), query sequence in reverse chain (R).
Sequence: 1( 1) - Ordinal number of the query sequence within the submitted set. The number in parentheses is the ordinal number of this alignment for that sequence (a sequence may produce more than one alignment). For example, 4(      5) is the fifth alignment of the fourth sequence in the set.
S - Score of this alignment.
L - Length of this query sequence.
AA628013 nq61d05.s1 NCI_CGAP_Co9 Homo sapiens cDNA clone IMAGE:1148361 3', mRNA sequence. - Name of this query sequence.
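
The header fields described above can be pulled out of the line with a regular expression. The following is only a minimal sketch: the field layout is inferred from the single sample header shown here, and the names HEADER_RE, Header and parse_header are hypothetical helpers, not part of EstMap.

```python
import re
from typing import NamedTuple

# Assumed layout, inferred from the one sample header in this document:
# [XY] Sequence: <n>( <k>), S: <score>, L: <length> <name ...>
HEADER_RE = re.compile(
    r"^\[(?P<target>[DR])(?P<query>[DR])\]\s+"
    r"Sequence:\s*(?P<seq>\d+)\(\s*(?P<aln>\d+)\),"
    r"\s*S:\s*(?P<score>[-\d.]+),"
    r"\s*L:\s*(?P<length>\d+)\s+(?P<name>.*)$"
)


class Header(NamedTuple):
    target_chain: str       # "D" (direct) or "R" (reverse)
    query_chain: str        # "D" (direct) or "R" (reverse)
    seq_number: int         # ordinal number of the query sequence in the set
    alignment_number: int   # ordinal number of this alignment for that sequence
    score: float            # S: score of the alignment
    query_length: int       # L: length of the query sequence
    name: str               # name of the query sequence


def parse_header(line: str) -> Header:
    """Split one header line into its documented fields."""
    m = HEADER_RE.match(line.strip())
    if m is None:
        raise ValueError(f"not recognized as a header line: {line!r}")
    return Header(
        target_chain=m["target"],
        query_chain=m["query"],
        seq_number=int(m["seq"]),
        alignment_number=int(m["aln"]),
        score=float(m["score"]),
        query_length=int(m["length"]),
        name=m["name"].strip(),
    )


sample = (
    "[DD] Sequence:       1(      1), S:       36.26, L:      457 "
    "AA628013   nq61d05.s1 NCI_CGAP_Co9 Homo sapiens cDNA clone "
    "IMAGE:1148361 3', mRNA sequence."
)
header = parse_header(sample)
print(header.target_chain, header.query_chain, header.seq_number,
      header.alignment_number, header.score, header.query_length)
# prints: D D 1 1 36.26 457
```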

Additional information about alignment:


Summ of block lengths: 457, Alignment bounds:
On first  sequence: start   2214596, end   2215412, length 817
On second sequence: start         1, end       457, length 457
length - The length covered by the alignment in the target (first) and query (second) sequences, respectively.
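
In the sample above, each reported length equals end - start + 1, which suggests inclusive 1-based coordinates. The sketch below parses the two bounds lines and checks that relation; the line layout is assumed from this one example, and parse_bounds and Bounds are hypothetical names used only for illustration.

```python
import re
from typing import NamedTuple

# Assumed layout of a bounds line, taken from the sample output only.
BOUNDS_RE = re.compile(
    r"On\s+(first|second)\s+sequence:\s*start\s+(\d+),\s*end\s+(\d+),\s*length\s+(\d+)"
)


class Bounds(NamedTuple):
    which: str    # "first" (target) or "second" (query)
    start: int
    end: int
    length: int

    def is_consistent(self) -> bool:
        # Holds for both sample lines: inclusive coordinates.
        return self.end - self.start + 1 == self.length


def parse_bounds(line: str) -> Bounds:
    """Read start, end and length from one 'On ... sequence' line."""
    m = BOUNDS_RE.search(line)
    if m is None:
        raise ValueError(f"not recognized as a bounds line: {line!r}")
    return Bounds(m.group(1), int(m.group(2)), int(m.group(3)), int(m.group(4)))


target = parse_bounds("On first  sequence: start   2214596, end   2215412, length 817")
query = parse_bounds("On second sequence: start         1, end       457, length 457")

print(target.is_consistent(), query.is_consistent())
# Target positions covered by the alignment but not by the query:
print(target.length - query.length)
```

For this alignment the target span (817) exceeds the query span (457) by 360 positions, which is the total size of the unaligned target regions lying between the blocks.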

List of alignment blocks:


Block of alignment: 4        
    1 E:  2214596    234 [ct CT] P:   2214596         1 L:     234, G:  99.57, W:   2305, S:26.2324
    2 E:  2214966     69 [AC CT] P:   2214966       235 L:      69, G: 100.00, W:    690, S:14.1834

Block of alignment: 4 - Number of blocks in this alignment.
Each of the following lines describes one block. The fields of such a line are explained below:


   1 E:  2214596    234 [ct CT] P:   2214596         1 L:     234, G:  99.57, W:   2305, S:26.2324
	
1 - Block number.
E: 2214596 234 [ct CT] - Starting point and length of the exon in the first (target) sequence.
[ct CT] - nucleotides flanking the edges of the exon.
Lowercase letters - the edge is defined imprecisely. Capital letters - the edge is defined precisely.
P: 2214596 1 - Start positions of the similarity block in the target and query sequences, respectively.
L: 234 - Length of this similarity block.
G: 99.57 - Homology of this similarity block, in percent (here, 233 matching positions out of 234).
W: 2305 - Weight of this similarity block (the arithmetic sum of symbol similarities calculated from the given similarity matrix).
S:26.2324 - Score of this similarity block.
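
As a rough illustration of how these fields fit together, the sketch below parses the four block lines of the example, sums the block lengths, and computes the gaps between consecutive exons on the target (presumably introns). The regular expression is inferred from the sample lines only, and Block, BLOCK_RE and parse_block are hypothetical names, not part of EstMap.

```python
import re
from dataclasses import dataclass

# Assumed layout of a block line, taken from the sample output only.
BLOCK_RE = re.compile(
    r"^\s*(?P<num>\d+)\s+E:\s*(?P<exon_start>\d+)\s+(?P<exon_len>\d+)\s+"
    r"\[(?P<left>\w{2})\s+(?P<right>\w{2})\]\s+"
    r"P:\s*(?P<t_start>\d+)\s+(?P<q_start>\d+)\s+"
    r"L:\s*(?P<length>\d+),\s*G:\s*(?P<homology>[\d.]+),\s*"
    r"W:\s*(?P<weight>-?\d+),\s*S:\s*(?P<score>[-\d.]+)\s*$"
)


@dataclass
class Block:
    number: int
    exon_start: int     # E: start of the exon on the target
    exon_length: int    # E: length of the exon
    left_edge: str      # lowercase = imprecise edge, uppercase = precise
    right_edge: str
    target_start: int   # P: block start on the target
    query_start: int    # P: block start on the query
    length: int         # L
    homology: float     # G, percent
    weight: int         # W
    score: float        # S

    @property
    def exon_end(self) -> int:
        return self.exon_start + self.exon_length - 1


def parse_block(line: str) -> Block:
    m = BLOCK_RE.match(line)
    if m is None:
        raise ValueError(f"not recognized as a block line: {line!r}")
    return Block(
        int(m["num"]), int(m["exon_start"]), int(m["exon_len"]),
        m["left"], m["right"], int(m["t_start"]), int(m["q_start"]),
        int(m["length"]), float(m["homology"]), int(m["weight"]),
        float(m["score"]),
    )


SAMPLE = """\
    1 E:  2214596    234 [ct CT] P:   2214596         1 L:     234, G:  99.57, W:   2305, S:26.2324
    2 E:  2214966     69 [AC CT] P:   2214966       235 L:      69, G: 100.00, W:    690, S:14.1834
    3 E:  2215144     65 [AC CT] P:   2215144       304 L:      65, G: 100.00, W:    650, S:13.7542
    4 E:  2215324     89 [AC aa] P:   2215324       369 L:      89, G:  97.75, W:    820, S:15.6754
"""

blocks = [parse_block(line) for line in SAMPLE.splitlines()]
total_length = sum(b.length for b in blocks)
# Unaligned target positions between each pair of consecutive exons.
gaps = [nxt.exon_start - cur.exon_end - 1 for cur, nxt in zip(blocks, blocks[1:])]
print(total_length, gaps)
# prints: 457 [136, 109, 115]
```

The total of 457 agrees with the "Summ of block lengths" value in the example, and the last exon ends at 2215412, the reported end of the alignment on the first sequence.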

Alignment:


        1 gagccaagattgtgc(..)acgctcaggccacct?[CTGGGCCTCTCTTTATTGAGGGCA
          ...............(..)...............  ||||||||||||||||||||||||
        1 ---------------(..)---------------  CTGGGCCTCTCTTTATTGAGGGCA


Line 1 - The target sequence. Capital letters correspond to blocks of similarity, lowercase letters to unaligned regions. [] - edges of an exon. ?[ or ]? - uncertain edge of an exon.
Line 2 - Separator line.
Line 3 - The query sequence. Capital letters correspond to blocks of similarity, lowercase letters to unaligned regions.
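
In the example output, the separator line shows | under positions where target and query agree and 0 where they differ. That reading is inferred from the sample, not stated by the program documentation. Under this assumption, the sketch below rebuilds the separator for one fully aligned row and reports the target coordinates of the mismatches; compare_row is a hypothetical helper.

```python
from typing import List, Tuple


def compare_row(target_row: str, query_row: str, target_start: int) -> Tuple[str, List[int]]:
    """Rebuild the separator for one fully aligned row.

    Assumes '|' marks a match and '0' a mismatch, as seen in the sample.
    Returns the separator and the target coordinates of the mismatches.
    """
    if len(target_row) != len(query_row):
        raise ValueError("rows must have the same length")
    separator = "".join(
        "|" if t == q else "0" for t, q in zip(target_row, query_row)
    )
    mismatches = [target_start + i for i, ch in enumerate(separator) if ch == "0"]
    return separator, mismatches


# Row starting at target position 2214740 (query position 145) in the example.
target_row = "ACAGAAGCCCCTCTGGGCCGGCAGGGGAAGGCCCAGCCTCAATCCTTCTTGCTCCCGTGC"
query_row = "ACAGAAGCCCCTCTGGGCCGGCAAGGGAAGGCCCAGCCTCAATCCTTCTTGCTCCCGTGC"

separator, mismatches = compare_row(target_row, query_row, 2214740)
print(separator)
print(mismatches)
```

This row contains the single mismatch of block 1 (G in the target, A in the query), which is consistent with that block's G value of 99.57.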