TSSG
Recognition of human PolII promoter region and start of transcription
TSSG is the most accurate mammalian promoter prediction program. The following table shows results of promoter search on genes with known mRNAs by different promoter finding programs, reproduced with changes from Liu and States (2002) Genome Research 12:462-469. It shows that TSSG has by far the fewest false positive predictions.
Program | Set1 (133 promoters): true predictions | Set1: false predictions | Set2 (120 promoters): true predictions | Set2: false predictions |
PROSCAN1.7 | 32 (24%) | 18 (36%) | 30 (25%) | 22 (42%) |
NNPP2.0 | 56 (42%) | 41 (42%) | 26 (22%) | 50 (66%) |
PromFD1.0 | 88 (66%) | 43 (33%) | 69 (58%) | 57 (45%) |
Promoter2.0 | 8 (6%) | 100 (93%) | 14 (12%) | 92 (88%) |
TSSG | 75 (56%) | 10 (12%) | 62 (52%) | 18 (23%) |
TSSW | 57 (43%) | 29 (34%) | 58 (48%) | 20 (26%) |
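The percentages in the comparison table are not defined in the text. The short sketch below shows one reading that reproduces the Set1 columns: the true-prediction percentage is taken relative to the number of promoters in the set, and the false-prediction percentage relative to all predictions made by the program. This interpretation is an assumption inferred from the numbers, and the names `SET1_TOTAL`, `SET1_COUNTS` and `rates` are introduced here for illustration only.

```python
# Counts copied from the Set1 columns of the table: (true, false) predictions.
SET1_TOTAL = 133  # number of promoters in Set1

SET1_COUNTS = {
    "PROSCAN1.7": (32, 18),
    "NNPP2.0": (56, 41),
    "PromFD1.0": (88, 43),
    "Promoter2.0": (8, 100),
    "TSSG": (75, 10),
    "TSSW": (57, 29),
}


def rates(true_hits, false_hits, total):
    """Return (true %, false %) under the assumed definitions.

    true %  = true predictions / promoters in the set
    false % = false predictions / all predictions made
    """
    true_pct = 100 * true_hits / total
    false_pct = 100 * false_hits / (true_hits + false_hits)
    return round(true_pct), round(false_pct)


for program, (true_hits, false_hits) in SET1_COUNTS.items():
    true_pct, false_pct = rates(true_hits, false_hits, SET1_TOTAL)
    print(f"{program:12s} true {true_pct:3d}%  false {false_pct:3d}%")
```

Under this reading, the false-prediction percentage is the share of a program's predictions that are wrong, so a program that makes few predictions overall can still score poorly on it.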
The algorithm predicts potential transcription start positions using a linear discriminant function that combines characteristics describing functional motifs and the oligonucleotide composition of these sites. TSSG uses the promoter.dat file of selected factor binding sites (TFD, Ghosh, 1993), developed by Dan Prestridge, to calculate the density of functional sites as in J. Mol. Biol., 1995, 249, 923-932.
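As a rough illustration of what a linear discriminant function (LDF) does, the sketch below scores a candidate position as a weighted sum of feature values and compares the score with a threshold. It is not the TSSG implementation: the feature names, the weights and the feature values are hypothetical placeholders, and only the threshold of 4.00 is taken from the example output shown on this page.

```python
# Hypothetical feature weights; the real TSSG features and coefficients
# are not given in this document.
WEIGHTS = {
    "tata_box_score": 2.0,
    "site_density": 1.5,
    "oligo_composition": 0.5,
}

LDF_THRESHOLD = 4.0  # threshold value shown in the example output


def ldf_score(features, weights=WEIGHTS):
    """Linear discriminant score: sum of weight * feature value."""
    return sum(weights[name] * features[name] for name in weights)


def is_predicted_promoter(features, threshold=LDF_THRESHOLD):
    """A position is reported when its LDF score reaches the threshold."""
    return ldf_score(features) >= threshold


# Two made-up candidate positions, for illustration only.
strong = {"tata_box_score": 3.0, "site_density": 2.0, "oligo_composition": 1.0}
weak = {"tata_box_score": 1.0, "site_density": 0.0, "oligo_composition": 2.0}

print(ldf_score(strong), is_predicted_promoter(strong))
print(ldf_score(weak), is_predicted_promoter(weak))
```

Raising the threshold trades sensitivity for fewer false positives, which is the trade-off the accuracy figures on this page describe.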
At a true promoter recognition level of approximately 50-55%, the TSSG program gives one false positive prediction per about 5000 bp. This accuracy is similar to that obtained in the analysis of test sequences by Prestridge's method. For ten test genes in which both our algorithm and Prestridge's found the promoter region, we estimate the accuracy of locating the TSS position as follows (numbers of genes by distance between the actual and predicted TSS):
Method/distance | <5 bp | 5-50 bp | 50-150 bp | Mean of observed distance |
Prestridge's | 0 | 3 | 7 | 81.2 bp |
TSSG | 7 | 3 | 0 | 7.3 bp |
Another Softberry promoter recognition program, TSSW, is based on a similar approach but uses data from an older release of Biobase's Transfac® database (E. Wingender, J. Biotech., 1994, 35, 273-280).
References:
1. Solovyev V.V., Salamov A.A. (1997)
The Gene-Finder computer tools for analysis of human and model organisms genome sequences.
In Proceedings of the Fifth International Conference on Intelligent Systems for Molecular Biology (eds. Rawling C., Clark D., Altman R., Hunter L., Lengauer T., Wodak S.), Halkidiki, Greece, AAAI Press, 294-302.
2. Solovyev V.V. (2001)
Statistical approaches in eukaryotic gene prediction.
In Handbook of Statistical Genetics (eds. Balding D. et al.), John Wiley & Sons, Ltd., p. 83-127.
3. Solovyev V.V., Shahmuradov I.A. (2003)
PromH: Promoters identification using orthologous genomic sequences.
Nucleic Acids Res. 31(13):3540-3545.
Output format:
First line - name of your sequence;
Second and third lines - length of the presented sequence and the LDF threshold;
Fourth line - number of predicted promoter regions;
Next lines - positions of the predicted sites, their 'weights' (LDF values) and the TATA box position (if found).
The position shows the first nucleotide of the transcript (TSS position).
After that, functional motifs are given for each predicted region:
(+) or (-) indicates the direct or complementary strand; S... is the identifier of a particular motif in the Ghosh database.
HSCALCAC 7637 bp DNA PRI 14-MAR-1995
Length of sequence- 7637
Threshold for LDF- 4.00
1 promoter(s) were predicted
Pos.: 1820 LDF- 16.65 TATA box predicted at 1804
Transcription factor binding sites:
for promoter at position - 1820
1764 (-) S00098 AACCAAT
1608 (-) S01152 AAGTGA
1741 (+) S01153 AARKGA
1608 (-) S01153 AARKGA
1657 (+) S01090 AATGA
1617 (-) S01027 ACGCCC
1577 (+) S00534 ACGTCA
1580 (-) S00534 ACGTCA
1580 (-) S01257 ACGTCAT
..............................
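For downstream processing, output of this kind can be read with a few regular expressions. The following is a minimal sketch that assumes the output looks like the example above; it is not a Softberry tool. The class and function names (`Site`, `Promoter`, `parse_tssg`) are hypothetical, and the patterns are whitespace-tolerant so they do not depend on exact column positions or line breaks.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Site:
    position: int
    strand: str      # "+" (direct) or "-" (complementary)
    motif_id: str    # S... identifier from the Ghosh database
    consensus: str


@dataclass
class Promoter:
    tss: int
    ldf: float
    tata_box: Optional[int]
    sites: List[Site] = field(default_factory=list)


PROMOTER_RE = re.compile(
    r"Pos\.:\s*(\d+)\s+LDF-\s*(-?\d+(?:\.\d+)?)"
    r"(?:\s+TATA box predicted at\s+(\d+))?"
)
BLOCK_RE = re.compile(r"for promoter at position\s*-\s*(\d+)")
SITE_RE = re.compile(r"(\d+)\s+\(([+-])\)\s+(S\d+)\s+([A-Za-z]+)")


def parse_tssg(text):
    """Return the predicted promoters with their binding sites."""
    promoters = {}
    for m in PROMOTER_RE.finditer(text):
        tata = int(m.group(3)) if m.group(3) else None
        promoters[int(m.group(1))] = Promoter(int(m.group(1)), float(m.group(2)), tata)
    # re.split with a capturing group yields [header, pos1, body1, pos2, body2, ...]
    parts = BLOCK_RE.split(text)
    for pos, body in zip(parts[1::2], parts[2::2]):
        promoter = promoters.get(int(pos))
        if promoter is None:
            continue
        for s in SITE_RE.finditer(body):
            promoter.sites.append(Site(int(s.group(1)), s.group(2), s.group(3), s.group(4)))
    return list(promoters.values())


# First lines of the example output from this page.
SAMPLE = """HSCALCAC 7637 bp DNA PRI 14-MAR-1995
Length of sequence- 7637
Threshold for LDF- 4.00
1 promoter(s) were predicted
Pos.: 1820 LDF- 16.65 TATA box predicted at 1804
Transcription factor binding sites:
for promoter at position - 1820
1764 (-) S00098 AACCAAT
1608 (-) S01152 AAGTGA
1741 (+) S01153 AARKGA
"""

for promoter in parse_tssg(SAMPLE):
    print(promoter.tss, promoter.ldf, promoter.tata_box, len(promoter.sites))
```

Binding sites are attached to a promoter by matching the position given in the "for promoter at position" line, so the sketch should also cope with several predicted promoters in one output, provided each has its own site block.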
Lower-case letters denote non-conserved nucleotides in the site consensus.
Letters other than A, T, G and C denote ambiguous positions in a DNA sequence motif, where a single character may represent more than one nucleotide according to the standard IUPAC nucleotide code.
See TABLE at http://www.yeastract.com/help/help_searchbydnamotif.php#Ref1
IUPAC Code | Meaning | Origin of Description |
G | G | Guanine |
A | A | Adenine |
T | T | Thymine |
C | C | Cytosine |
R | G or A | puRine |
Y | T or C | pYrimidine |
M | A or C | aMino |
K | G or T | Ketone |
S | G or C | Strong interaction |
W | A or T | Weak interaction |
H | A or C or T | not-G, H follows G in the alphabet |
B | G or T or C | not-A, B follows A in the alphabet |
V | G or C or A | not-T (not-U), V follows U in the alphabet |
D | G or A or T | not-C, D follows C in the alphabet |
N | G or A or T or C | aNy |
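To show how the ambiguity codes in the table apply to a motif such as AARKGA, the sketch below expands an IUPAC consensus into a regular expression and reports where it matches a sequence. The function names are hypothetical. Two simplifications are assumed: lower-case consensus letters are treated like upper-case ones, and only the direct strand is scanned.

```python
import re

# Mapping taken from the IUPAC table above.
IUPAC = {
    "G": "G", "A": "A", "T": "T", "C": "C",
    "R": "GA", "Y": "TC", "M": "AC", "K": "GT",
    "S": "GC", "W": "AT",
    "H": "ACT", "B": "GTC", "V": "GCA", "D": "GAT",
    "N": "GATC",
}


def iupac_to_regex(motif):
    """Expand an IUPAC consensus into a regular expression string."""
    parts = []
    for code in motif.upper():
        bases = IUPAC[code]  # raises KeyError for non-IUPAC characters
        parts.append(bases if len(bases) == 1 else f"[{bases}]")
    return "".join(parts)


def find_motif(sequence, motif):
    """Return 1-based start positions of all (overlapping) matches."""
    # The lookahead lets matches overlap, e.g. AA in AAAA.
    pattern = re.compile(f"(?=({iupac_to_regex(motif)}))")
    return [m.start() + 1 for m in pattern.finditer(sequence.upper())]


print(iupac_to_regex("AARKGA"))
print(find_motif("TTAAGTGACC", "AARKGA"))
```

Scanning the complementary strand would require matching against the reverse complement of the sequence, which is left out here to keep the example short.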