Search for consensus patterns with statistical estimation.

Nsite can be used to analyze regulatory regions and the composition of their functional motifs.

Method description:

The method is based on statistical estimation of the expected number of occurrences of a nucleotide consensus pattern in a given sequence [1-2,4]. It uses an Nsite-formatted data file, which can include any set of consensus sequences of functional motifs. In the current version this file consists of a release of Transfac sequences (3.4, 1998, academic release), composite elements [3] and a set of additional functional motifs.

If a pattern is found whose expected number is significantly less than 1, the analyzed sequence can be supposed to possess the function associated with that pattern.
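The idea of an expected number can be illustrated with a minimal sketch. It assumes a simple model in which positions are independent and each base occurs with its overall frequency in the sequence; this is an illustration only, not necessarily the estimate that Nsite itself computes (see [1-2,4] for the method). The function name `expected_number` and its parameters are hypothetical.

```python
# Bases allowed by each single-letter symbol (see Table 1).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "GA", "Y": "TC", "M": "AC", "K": "GT",
    "S": "GC", "W": "AT", "H": "ACT", "B": "GTC",
    "V": "GCA", "D": "GAT", "N": "GATC",
}


def expected_number(consensus, seq_length, freqs):
    """Expected count of exact matches to `consensus` on one strand.

    Assumption: positions are independent and base frequencies are
    given by `freqs`, a dict with keys "A", "C", "G", "T".
    """
    # Probability that a random window matches every consensus symbol.
    p = 1.0
    for symbol in consensus.upper():
        p *= sum(freqs[base] for base in IUPAC[symbol])
    # Number of windows in which the pattern can start.
    positions = max(seq_length - len(consensus) + 1, 0)
    return positions * p


# Illustrative input: uniform base frequencies, 1000 bp sequence.
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(expected_number("TATAWA", 1000, uniform))
```

Under this model a short, degenerate pattern has an expected number close to or above 1 and is uninformative, whereas a long, specific pattern has an expected number far below 1, so finding it is unlikely to be due to chance.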

The Nsite output lists each pattern found, its position in the sequence, and the accession number, ID, motif description and binding factor name from the original database, if available.

Table 1. Summary of single-letter code recommendations

Symbol  Meaning            Origin of designation
G       G                  Guanine
A       A                  Adenine
T       T                  Thymine
C       C                  Cytosine
R       G or A             puRine
Y       T or C             pYrimidine
M       A or C             aMino
K       G or T             Keto
S       G or C             Strong interaction (3 H bonds)
W       A or T             Weak interaction (2 H bonds)
H       A or C or T        not-G, H follows G in the alphabet
B       G or T or C        not-A, B follows A
V       G or C or A        not-T (not-U), V follows U
D       G or A or T        not-C, D follows C
N       G or A or T or C   aNy
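The single-letter codes in Table 1 can be used to scan a sequence for a consensus while allowing a limited number of mismatches. The sketch below is a hypothetical helper (`find_motifs` is not part of Nsite) that slides a window along one strand and, following the convention of the Nsite output, shows mismatching bases in lower case.

```python
# Bases allowed by each single-letter symbol (Table 1).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "GA", "Y": "TC", "M": "AC", "K": "GT",
    "S": "GC", "W": "AT", "H": "ACT", "B": "GTC",
    "V": "GCA", "D": "GAT", "N": "GATC",
}


def find_motifs(consensus, sequence, max_mismatches=0):
    """Yield (start, matched_text, mismatches) for each accepted window.

    `start` is 1-based. A base counts as a mismatch when it is not in
    the set allowed by the consensus symbol at that position.
    """
    consensus = consensus.upper()
    sequence = sequence.upper()
    width = len(consensus)
    for i in range(len(sequence) - width + 1):
        shown = []
        mismatches = 0
        for symbol, base in zip(consensus, sequence[i:i + width]):
            if base in IUPAC[symbol]:
                shown.append(base)
            else:
                shown.append(base.lower())  # mark the mismatch
                mismatches += 1
        if mismatches <= max_mismatches:
            yield i + 1, "".join(shown), mismatches


for hit in find_motifs("GGRY", "ACGGATTGGGC"):
    print(hit)
```

A homology level such as 80% could be mapped to `max_mismatches` by allowing mismatches in at most 20% of the consensus positions; that mapping is an assumption, not a documented Nsite rule.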

Output example:


Program   NSITE  (Softberry Inc.)    | Version 2.2004
Search for motifs of   1500 Regulatory Elements (REs)     | SET of REs: REGSITE DB (Transcription Regulatory Sites from human and animals) [ Last Update: March 10, 2006]
____________________________________________________________
 Search PARAMETRS:
     Expected  Mean  Number                      :  0.0000000
     Statistical Siginicance Level               :  0.0000000
     Level of homology between known RE and motif:   80%
     Variation of Distance between RE Blocks     :   20%
 NOTE: RE - Regulatory Element/Consensus   | AC - Accession No of RE in a given DB
       OS - Organism/Species   | BF - Binding Factor or One of them
       Mism. - Mismatches   | Mean. Exp. Number - Mean Expected Number   | Up.Conf.Int. - Upper Confidence Interval
============================================================
 QUERY: >test_nsite.seq
 Length of Query Sequence:       2319 bp     | Nucleotide Frequencies:  A -  0.33   G -  0.19   T -  0.30   C -  0.18

............................................................
 RE:   620. AC: RSA00620//OS: chicken /GENE: BGP/RE: G-string /BF: erythrocyte-specific protein
 Motifs on "-" Strand: Mean Exp. Number   0.00000     Up.Conf.Int.  1     Found   5
    2216  cGGGGGGGGGGGGGGG     2201 (Mism.= 1)
    2215  GGGGGGGGGGGGGGGG     2200 (Mism.= 0)
    2214  GGGGGGGGGGGGGGGG     2199 (Mism.= 0)
    2213  GGGGGGGGGGGGGGGG     2198 (Mism.= 0)
    2212  GGGGGGGGGGGGGGGt     2197 (Mism.= 1)
............................................................
 Totally       5 motifs of     1 different REs have been found
------------------------------------------------------------
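
For downstream processing, the motif lines of a report like the one above can be extracted with a short script. This is a sketch that assumes the line layout shown in this example (start position, motif text, end position, mismatch count); other Nsite versions may format the report differently, and the name `parse_hits` is hypothetical.

```python
import re

# Matches lines such as: "    2216  cGGGGGGGGGGGGGGG     2201 (Mism.= 1)"
HIT_LINE = re.compile(r"^\s*(\d+)\s+([A-Za-z]+)\s+(\d+)\s+\(Mism\.=\s*(\d+)\)\s*$")


def parse_hits(report_text):
    """Return a list of (start, motif, end, mismatches) tuples."""
    hits = []
    for line in report_text.splitlines():
        match = HIT_LINE.match(line)
        if match:
            start, motif, end, mismatches = match.groups()
            hits.append((int(start), motif, int(end), int(mismatches)))
    return hits


sample = """\
 Motifs on "-" Strand: Mean Exp. Number   0.00000     Up.Conf.Int.  1     Found   5
    2216  cGGGGGGGGGGGGGGG     2201 (Mism.= 1)
    2215  GGGGGGGGGGGGGGGG     2200 (Mism.= 0)
"""
for hit in parse_hits(sample):
    print(hit)
```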

References:

[1] Shahmuradov K.A., Kolchanov N.A., Solovyev V.V., Ratner V.A. (1986)
Enhancer-like structures in middle repetitive sequences of the eukaryotic genomes.
Genetics (Russ), 22, 357-368.

[2] Solovyev V.V., Kolchanov N.A. (1994)
Search for functional sites using consensus. In Computer Analysis of Genetic Macromolecules (eds. Kolchanov N.A., Lim H.A.),
World Scientific, p. 16-21.

[3] Heinemeyer, T., Chen, X., Karas, H., Kel, A. E., Kel, O. V., Liebich, I., Meinhardt, T., Reuter, I., Schacherer, F., Wingender, E. (1999).
Expanding the TRANSFAC database towards an expert system of regulatory molecular

[4] Solovyev V.V. (2002) Structure, Properties and Computer Identification of Eukaryotic Genes. In Bioinformatics from Genomes to Drugs. V.1. Basic Technologies (ed. Lengauer T.), p. 59-111.