TSSG

Recognition of human PolII promoter region and start of transcription

TSSG is the most accurate mammalian promoter prediction program. The following table shows the results of promoter searches on genes with known mRNAs by different promoter-finding programs, reproduced with changes from Liu and States (2002), Genome Research 12:462-469. In this comparison TSSG has by far the fewest false positive predictions.

Program        Set1 (133 promoters)           Set2 (120 promoters)
               True           False           True           False
               predictions    predictions     predictions    predictions
PROSCAN1.7     32 (24%)       18 (36%)        30 (25%)       22 (42%)
NNPP2.0        56 (42%)       41 (42%)        26 (22%)       50 (66%)
PromFD1.0      88 (66%)       43 (33%)        69 (58%)       57 (45%)
Promoter2.0     8 (6%)       100 (93%)        14 (12%)       92 (88%)
TSSG           75 (56%)       10 (12%)        62 (52%)       18 (23%)
TSSW           57 (43%)       29 (34%)        58 (48%)       20 (26%)
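The percentages in the table are not defined in the text. As a hedged reading, they appear consistent with true predictions being expressed as a share of all promoters in the set, and false predictions as a share of all predictions made by the program. The short sketch below checks this assumption for two rows of Set1; the function name `rates` is introduced here for illustration only.

```python
def rates(true_pred, false_pred, total_promoters):
    """Return (true %, false %) under the ASSUMED definitions:
    true %  = true predictions / promoters in the set
    false % = false predictions / all predictions (true + false)
    """
    true_pct = 100 * true_pred / total_promoters
    false_pct = 100 * false_pred / (true_pred + false_pred)
    return round(true_pct), round(false_pct)


# Set1 contains 133 promoters (from the table header).
print("TSSG      :", rates(75, 10, 133))
print("PROSCAN1.7:", rates(32, 18, 133))
```

Under this reading, a program can report fewer true predictions than a competitor and still have a much lower false prediction share, which is the property the table highlights for TSSG.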

Method description:

The algorithm predicts potential transcription start positions using a linear discriminant function that combines characteristics describing functional motifs and the oligonucleotide composition of these sites. TSSG uses the promoter.dat file of selected factor binding sites (TFD; Ghosh, 1993), developed by Dan Prestridge, to calculate the density of functional sites as described in J. Mol. Biol., 1995, 249, 923-932.
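To make the idea of a linear discriminant function (LDF) concrete, here is a minimal sketch: a candidate position is described by a few numeric characteristics, each is multiplied by a weight, and the sum is compared with a threshold. The feature names, weights and feature values below are hypothetical and chosen only for illustration; they are not TSSG's actual parameters. The threshold of 4.00 is the one shown in the example output on this page.

```python
# Hypothetical weights; TSSG's real characteristics and coefficients
# are not given in this document.
WEIGHTS = {
    "tata_box_score": 2.0,
    "tf_site_density": 1.5,
    "hexamer_composition": 0.5,
}


def ldf_score(features, weights=WEIGHTS):
    """Weighted sum of the characteristics of one candidate position."""
    return sum(weights[name] * value for name, value in features.items())


def is_promoter(features, threshold=4.0):
    """A candidate is reported when its LDF score reaches the threshold."""
    return ldf_score(features) >= threshold


# Illustrative feature values for a single candidate TSS position.
candidate = {
    "tata_box_score": 1.0,
    "tf_site_density": 2.0,
    "hexamer_composition": 4.0,
}
print(ldf_score(candidate), is_promoter(candidate))
```

Raising the threshold trades sensitivity for fewer false positives, which is the trade-off described in the accuracy figures on this page.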

At a true promoter recognition level of approximately 50-55%, TSSG gives about one false positive prediction per 5000 bp. This accuracy is similar to that obtained on the test sequences by Prestridge's method. For ten test genes where both our algorithm and Prestridge's found the promoter region, we estimate the accuracy of locating the TSS position as follows (the table gives the number of genes in each range of distance between the actual and predicted TSS):

Method/distance    <5 bp    5-50 bp    50-150 bp    Mean of observed distance
Prestridge's       0        3          7            81.2 bp
TSSG               7        3          0            7.3 bp

Another Softberry promoter recognition program, TSSW, is based on a similar approach but uses data from an older release of Biobase's Transfac® database (E. Wingender, J. Biotech., 1994, 35, 273-280).

References:
1. Solovyev V.V., Salamov A.A. (1997)
The Gene-Finder computer tools for analysis of human and model organisms genome sequences.
In Proceedings of the Fifth International Conference on Intelligent Systems for Molecular Biology (eds.Rawling C.,Clark D., Altman R.,Hunter L.,Lengauer T.,Wodak S.), Halkidiki, Greece, AAAI Press,294-302.

2. Solovyev V.V. (2001)
Statistical approaches in Eukaryotic gene prediction.
In Handbook of Statistical genetics (eds. Balding D. et al.), John Wiley & Sons, Ltd., p. 83-127.

3. Solovyev VV, Shahmuradov IA. (2003)
PromH: Promoters identification using orthologous genomic sequences.
Nucleic Acids Res. 31(13):3540-3545.

TSSG output:

First line - name of your sequence;
Second and third lines - length of the presented sequence and the LDF threshold;
Fourth line - number of predicted promoter regions;
Next lines - positions of predicted sites, their 'weights' (LDF values) and the TATA box position (if found).
The position is that of the first nucleotide of the transcript (TSS position).
After that, functional motifs are listed for each predicted region: (+) or (-) indicates the direct or complementary strand; S... is the identifier of a particular motif in the Ghosh database.

For example:


 HSCALCAC     7637 bp    DNA             PRI       14-MAR-1995
 Length of sequence-      7637
 Threshold for LDF-  4.00
     1 promoter(s)  were predicted
 Pos.:   1820 LDF- 16.65 TATA box predicted at   1804
 Transcription factor binding sites:
for promoter at position -    1820
  1764 (-) S00098       AACCAAT
  1608 (-) S01152       AAGTGA
  1741 (+) S01153       AARKGA
  1608 (-) S01153       AARKGA
  1657 (+) S01090       AATGA
  1617 (-) S01027       ACGCCC
  1577 (+) S00534       ACGTCA
  1580 (-) S00534       ACGTCA
  1580 (-) S01257       ACGTCAT
..............................
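For downstream processing it can be handy to read this report into a data structure. The following is a minimal sketch of a parser written against the example above only; the regular expressions and the function name `parse_tssg` are assumptions made for illustration, and real TSSG reports (for instance with several promoters or without a TATA box) may need adjustments.

```python
import re

SAMPLE = """\
 HSCALCAC     7637 bp    DNA             PRI       14-MAR-1995
 Length of sequence-      7637
 Threshold for LDF-  4.00
     1 promoter(s)  were predicted
 Pos.:   1820 LDF- 16.65 TATA box predicted at   1804
 Transcription factor binding sites:
for promoter at position -    1820
  1764 (-) S00098       AACCAAT
  1608 (-) S01152       AAGTGA
  1741 (+) S01153       AARKGA
"""

# Patterns inferred from the example report, not from a format specification.
PROMOTER_RE = re.compile(
    r"Pos\.:\s*(\d+)\s+LDF-\s*([\d.]+)(?:\s+TATA box predicted at\s+(\d+))?"
)
SECTION_RE = re.compile(r"for promoter at position\s*-\s*(\d+)")
SITE_RE = re.compile(r"^\s*(\d+)\s+\(([+-])\)\s+(S\d+)\s+(\S+)\s*$")


def parse_tssg(text):
    """Return a list of dicts, one per predicted promoter."""
    promoters = {}
    current = None
    for line in text.splitlines():
        m = PROMOTER_RE.search(line)
        if m:
            pos = int(m.group(1))
            promoters[pos] = {
                "tss": pos,
                "ldf": float(m.group(2)),
                "tata": int(m.group(3)) if m.group(3) else None,
                "sites": [],
            }
            continue
        m = SECTION_RE.search(line)
        if m:
            # Motif lines that follow belong to this promoter.
            current = promoters.get(int(m.group(1)))
            continue
        m = SITE_RE.match(line)
        if m and current is not None:
            current["sites"].append(
                (int(m.group(1)), m.group(2), m.group(3), m.group(4))
            )
    return list(promoters.values())


for promoter in parse_tssg(SAMPLE):
    print(promoter["tss"], promoter["ldf"], promoter["tata"], len(promoter["sites"]))
```

Each motif is stored as a tuple of position, strand, motif identifier and consensus, in the same order as the columns of the report.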

Lowercase letters denote non-conserved nucleotides in the site consensus.

Letters other than A, T, G and C denote ambiguous positions in a DNA sequence motif, where a single character may represent more than one nucleotide according to the standard IUPAC nucleotide code.

See TABLE at http://www.yeastract.com/help/help_searchbydnamotif.php#Ref1

IUPAC Code   Meaning             Origin of Description
G            G                   Guanine
A            A                   Adenine
T            T                   Thymine
C            C                   Cytosine
R            G or A              puRine
Y            T or C              pYrimidine
M            A or C              aMino
K            G or T              Ketone
S            G or C              Strong interaction
W            A or T              Weak interaction
H            A or C or T         not-G, H follows G in the alphabet
B            G or T or C         not-A, B follows A in the alphabet
V            G or C or A         not-T (not-U), V follows U in the alphabet
D            G or A or T         not-C, D follows C in the alphabet
N            G or A or T or C    aNy
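As an illustration of how the IUPAC codes above expand, the sketch below turns a motif consensus such as AARKGA into a regular expression and reports where it matches a sequence. The helper names and the test sequence are made up for this example. The search covers the direct strand only, and lowercase (non-conserved) letters are simply treated like their uppercase equivalents, which is a simplification.

```python
import re

# Expansion of each IUPAC code into a regex character class (see table above).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "M": "[AC]", "K": "[GT]",
    "S": "[CG]", "W": "[AT]",
    "H": "[ACT]", "B": "[CGT]", "V": "[ACG]", "D": "[AGT]",
    "N": "[ACGT]",
}


def consensus_to_regex(consensus):
    """Translate an IUPAC consensus string into a regex pattern."""
    return "".join(IUPAC[base] for base in consensus.upper())


def find_sites(consensus, sequence):
    """Return 1-based start positions of all (possibly overlapping) matches."""
    # The lookahead allows overlapping occurrences to be reported.
    pattern = re.compile("(?=(" + consensus_to_regex(consensus) + "))")
    return [m.start() + 1 for m in pattern.finditer(sequence.upper())]


print(consensus_to_regex("AARKGA"))
print(find_sites("AARKGA", "CCAAGTGATTAAATGA"))
```

To search the complementary strand as well, the same function could be applied to the reverse complement of the sequence, with positions converted back to direct-strand coordinates.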