TSSP
Recognition of human Pol II promoter region and start of transcription
The algorithm predicts potential transcription start positions using a linear discriminant function (LDF) that combines characteristics describing functional motifs and the oligonucleotide composition of these sites. TSSP uses a file of selected factor binding sites from the RegSite DB (Plants), developed by Softberry Inc.
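The idea of a linear discriminant function can be illustrated with a minimal sketch. The feature names and weights below are hypothetical placeholders, not the parameters used by TSSP; the two thresholds (0.02 for TATA+ promoters, 0.04 for TATA-/enhancers) are the values printed in the sample output later on this page, and comparing with `>=` is an assumption.

```python
def ldf_score(features, weights, bias=0.0):
    """Weighted sum of feature values: the general form of a linear discriminant."""
    return bias + sum(weights[name] * value for name, value in features.items())


def is_predicted(score, has_tata, tata_threshold=0.02, tataless_threshold=0.04):
    """Compare a score with the threshold for its class (TATA+ or TATA-).

    Whether the real program uses >= or > is not stated; >= is assumed here.
    """
    threshold = tata_threshold if has_tata else tataless_threshold
    return score >= threshold


# Hypothetical weights and feature values, for illustration only.
toy_weights = {"tata_box_score": 0.5, "oligo_composition": 0.25}
toy_features = {"tata_box_score": 0.2, "oligo_composition": 0.12}

toy_score = ldf_score(toy_features, toy_weights)
print(round(toy_score, 2), is_predicted(toy_score, has_tata=True))
```

A candidate site whose score reaches the threshold for its class would be reported; everything else about the real feature set and training is outside this sketch.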
References:
1. Solovyev V.V., Salamov A.A. (1997)
The Gene-Finder computer tools for analysis of human and model organisms genome sequences.
In Proceedings of the Fifth International Conference on Intelligent Systems for Molecular Biology (eds. Rawling C., Clark D., Altman R., Hunter L., Lengauer T., Wodak S.), Halkidiki, Greece, AAAI Press, 294-302.
2. Solovyev V.V. (2001)
Statistical approaches in Eukaryotic gene prediction.
In Handbook of Statistical Genetics (eds. Balding D. et al.), John Wiley & Sons, Ltd., p. 83-127.
3. Solovyev V.V., Shahmuradov I.A. (2003)
PromH: Promoters identification using orthologous genomic sequences.
Nucleic Acids Res. 31(13):3540-3545.
First line - the name of your sequence.
Second and third lines - the LDF thresholds and the length of the presented sequence.
Fourth line - the number of predicted promoter regions.
Next lines - the positions of the predicted sites, their 'weights' (the LDF values), and the TATA box position (if found).
The position given is that of the first nucleotide of the transcript (the TSS position).
After that, the functional motifs found for each predicted region are listed; (+) or (-) indicates the direct or complementary strand. A field such as "RSP00004 tagaCACGTaga" gives a motif identifier followed by the similar sequence that was found, as recorded in the Softberry RegSite Plant database.
tssp Wed Jul 10 02:52:32 EDT 2002
>gi|1902902|dbj|AB001920.1| Oryza sativa (japonica cultivar-group) gene for phos
Length of sequence- 5871
Thresholds for TATA+ promoters - 0.02, for TATA-/enhancers - 0.04
2 promoter/enhancer(s) are predicted
Promoter Pos: 1522 LDF- 0.13 TATA box at 1488 18.93
Enhancer Pos: 1597 LDF- 0.12
Transcription factor binding sites/RegSite DB:
for promoter at position - 1522
1468 (-) RSP00004 tagaCACGTaga
1459 (+) RSP00010 cACGTG
1456 (+) RSP00011 ctccACGTGgt
1461 (+) RSP00016 caTGCAC
1468 (-) RSP00016 caTGCAC
1256 (-) RSP00026 gcttttgaTGACtTcaaacac
1460 (+) RSP00065 ACGTGgcgc
1460 (+) RSP00066 ACGTGccgc
1459 (+) RSP00069 tACGTG
1341 (+) RSP00071 GACGTC
1346 (-) RSP00071 GACGTC
1452 (-) RSP00096 GGTTT
1432 (+) RSP00129 CACGAC
1281 (+) RSP00148 CGACG
1284 (+) RSP00148 CGACG
1315 (+) RSP00148 CGACG
1335 (+) RSP00148 CGACG
1340 (+) RSP00148 CGACG
1365 (+) RSP00148 CGACG
1434 (+) RSP00148 CGACG
1458 (+) RSP00148 CGACG
1347 (-) RSP00148 CGACG
1474 (+) RSP00162 ACACccGagctaaccacaac
1348 (+) RSP00241 CGGTCA
1387 (+) RSP00339 RTTTTTR
1264 (-) RSP00397 AGTGGCGG
1268 (+) RSP00422 ACCGAC
1459 (+) RSP00423 GACGTG
1464 (-) RSP00424 CACGTC
1369 (-) RSP00431 rdygRCRGTTRs
1278 (-) RSP00432 cVacGGTaGGTgg
1249 (-) RSP00436 TTGACT
1260 (+) RSP00463 atttcatggCCGACctgcttttt
1260 (+) RSP00464 acttgatggCCGACctctttttt
1260 (+) RSP00465 aatatactaCCGACcatgagttct
1265 (+) RSP00466 actaCCGACatgagttccaaaaagc
1440 (+) RSP00469 GNGGTG
1260 (-) RSP00469 GNGGTG
1440 (+) RSP00470 GTGGNG
1263 (-) RSP00470 GTGGNG
1257 (-) RSP00470 GTGGNG
1390 (+) RSP00477 TTTAA
1385 (+) RSP00508 gcaTTTTTatca
1502 (-) RSP00508 gcaTTTTTatca
1469 (+) RSP00518 tccctACACgcGtcacaattc
1465 (+) RSP00519 caattcaggACACgtGccctcttca
1474 (+) RSP00521 ACACccG
1474 (+) RSP00523 ACACgcG
1474 (+) RSP00524 ACACgtG
for promoter at position - 1597
1468 (-) RSP00004 tagaCACGTaga
1459 (+) RSP00010 cACGTG
1456 (+) RSP00011 ctccACGTGgt
1461 (+) RSP00016 caTGCAC
1468 (-) RSP00016 caTGCAC
1460 (+) RSP00065 ACGTGgcgc
1460 (+) RSP00066 ACGTGccgc
1459 (+) RSP00069 tACGTG
1341 (+) RSP00071 GACGTC
1346 (-) RSP00071 GACGTC
1452 (-) RSP00096 GGTTT
1432 (+) RSP00129 CACGAC
1315 (+) RSP00148 CGACG
1335 (+) RSP00148 CGACG
1340 (+) RSP00148 CGACG
1365 (+) RSP00148 CGACG
1434 (+) RSP00148 CGACG
1458 (+) RSP00148 CGACG
1347 (-) RSP00148 CGACG
1474 (+) RSP00162 ACACccGagctaaccacaac
..............................
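A report like the sample above can be read programmatically. The following is a minimal sketch, not a Softberry tool: the function name `parse_tssp` and the regular expressions are assumptions derived only from the sample text on this page, and they match on tokens rather than on line breaks, so the layout of the report does not matter.

```python
import re

# Patterns inferred from the sample report on this page (assumptions).
PREDICTION_RE = re.compile(
    r"(Promoter|Enhancer)\s+Pos:\s*(\d+)\s+LDF-\s*([\d.]+)"
    r"(?:\s+TATA box at\s+(\d+)\s+([\d.]+))?"
)
SECTION_RE = re.compile(r"for promoter at position -\s*(\d+)")
MOTIF_RE = re.compile(r"(\d+)\s+\(([+-])\)\s+(RSP\d+)\s+([A-Za-z]+)")


def parse_tssp(text):
    """Return (predictions, motifs) extracted from a TSSP text report."""
    predictions = []
    for kind, pos, ldf, tata_pos, tata_score in PREDICTION_RE.findall(text):
        predictions.append({
            "kind": kind.lower(),
            "tss": int(pos),
            "ldf": float(ldf),
            # TATA fields are absent for TATA-less predictions.
            "tata_pos": int(tata_pos) if tata_pos else None,
            "tata_score": float(tata_score) if tata_score else None,
        })

    # Splitting on the section header yields
    # [preamble, position1, body1, position2, body2, ...].
    parts = SECTION_RE.split(text)
    motifs = {}
    for pos, body in zip(parts[1::2], parts[2::2]):
        motifs[int(pos)] = [
            (int(site_pos), strand, motif_id, site)
            for site_pos, strand, motif_id, site in MOTIF_RE.findall(body)
        ]
    return predictions, motifs


SAMPLE = (
    "Promoter Pos: 1522 LDF- 0.13 TATA box at 1488 18.93 "
    "Enhancer Pos: 1597 LDF- 0.12 "
    "Transcription factor binding sites/RegSite DB: "
    "for promoter at position - 1522 "
    "1468 (-) RSP00004 tagaCACGTaga 1459 (+) RSP00010 cACGTG "
    "for promoter at position - 1597 "
    "1468 (-) RSP00004 tagaCACGTaga"
)

predictions, motifs = parse_tssp(SAMPLE)
for p in predictions:
    print(p["kind"], p["tss"], p["ldf"], p["tata_pos"])
```

Each motif is returned as a tuple of position, strand, RegSite identifier, and matched sequence, grouped by the position of the predicted region it belongs to.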
Lowercase letters denote non-conserved nucleotides in the site consensus.
Letters other than A, T, G, and C denote ambiguous positions in a DNA sequence motif, where a single character may represent more than one nucleotide according to the standard IUPAC nucleotide code.
See TABLE at http://www.yeastract.com/help/help_searchbydnamotif.php#Ref1
IUPAC Code | Meaning | Origin of Description |
G | G | Guanine |
A | A | Adenine |
T | T | Thymine |
C | C | Cytosine |
R | G or A | puRine |
Y | T or C | pYrimidine |
M | A or C | aMino |
K | G or T | Ketone |
S | G or C | Strong interaction |
W | A or T | Weak interaction |
H | A or C or T | not-G, H follows G in the alphabet |
B | G or T or C | not-A, B follows A in the alphabet |
V | G or C or A | not-T (not-U), V follows U in the alphabet |
D | G or A or T | not-C, D follows C in the alphabet |
N | G or A or T or C | aNy |
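The table above maps directly onto regular-expression character classes. The sketch below is illustrative only: the helper names `motif_to_regex` and `find_motif` are hypothetical, the search covers the direct strand only, and lowercase (non-conserved) letters are treated the same as uppercase ones, which is a simplification rather than the matching rule TSSP itself applies.

```python
import re

# IUPAC nucleotide codes, as listed in the table above.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT", "N": "ACGT",
}


def motif_to_regex(motif):
    """Translate an IUPAC motif into a regular expression string."""
    parts = []
    for ch in motif.upper():
        bases = IUPAC[ch]  # raises KeyError on a non-IUPAC character
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)


def find_motif(motif, sequence):
    """Return 1-based start positions of all (possibly overlapping) matches."""
    # A lookahead lets the regex engine report overlapping occurrences.
    pattern = re.compile("(?=(" + motif_to_regex(motif) + "))", re.IGNORECASE)
    return [m.start() + 1 for m in pattern.finditer(sequence)]


print(motif_to_regex("RTTTTTR"))
print(find_motif("GNGGTG", "aaGAGGTGcc"))
```

Searching the complementary strand would require running the same search on the reverse complement of the sequence and converting the coordinates back.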