Search for regulatory motifs conserved in several sequences.

Regulatory elements (REs) can be taken from different databases or defined by the user (local runs only). The program finds sites that occur in at least one copy in P% or more of the analyzed DNA sequences (in the web version, P is set to 50%). Input sequences should be in FASTA format, for example:

 >test1
 AAAAAAAAA
 GGCCCCCCC
 >test2
 ACCCTTTTTC
 CCCCCCCCCC
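
For local preprocessing, input of this kind can be read with a few lines of Python. This is only a minimal sketch: the function name read_fasta is hypothetical and is not part of Nsite-m, and the code assumes that multi-line sequences are simply concatenated.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) pairs."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            if name is not None:          # close the previous record
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)           # sequence lines are concatenated
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


example = """>test1
AAAAAAAAA
GGCCCCCCC
>test2
ACCCTTTTTC
CCCCCCCCCC
"""
records = read_fasta(example)
print([(name, len(seq)) for name, seq in records])  # [('test1', 18), ('test2', 20)]
```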

Method description

Like Nsite, Nsite-m is based on a search for statistically significant regulatory site consensus motifs; see the NSITE Help for a fuller description.
The main features of the approach are as follows:
(i) An RE may consist of a single box (a continuous DNA segment) or of two boxes separated by a spacer. For such a composite site, only the length of the spacer, not its nucleotide content, matters for function.
(ii) A real RE or its IUPAC consensus contains both variable positions, where any of a certain group of nucleotides is permissible, and strictly conserved positions, where the predicted motif must be identical to the real site/consensus. These positions are not equivalent and must be treated differently: complete homology is required at conserved positions, while mismatches are tolerated at variable positions.
(iii) Homology between an RE and a motif in the query DNA sequence may arise by chance, so estimating its statistical significance is essential. Functional significance can be inferred only if the homology is significantly nonrandom.
(iv) Because a consensus is short, characteristics such as nucleotide frequencies should not be used to describe it. Instead, estimates should be based on the number of specific nucleotides in the consensus.
(v) Although RE databases usually annotate a fixed distance between the two boxes of a composite element, the spacer length typically varies somewhat. The search algorithm for composite REs should therefore allow limited flexibility in spacer length.
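
Features (i), (ii) and (v) can be illustrated with a short Python sketch. It is not the Nsite-m implementation: the function names are hypothetical, only the forward strand is scanned, and no statistical estimate is computed. It assumes the conventions of the user RE format described at the end of this page (capital letters mark conserved positions, "-" separates the two boxes).

```python
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def box_matches(box, segment, max_mismatches):
    """True if segment fits box: uppercase (conserved) positions must agree,
    lowercase (variable) positions may disagree up to max_mismatches times."""
    if len(segment) != len(box):
        return False
    mismatches = 0
    for symbol, base in zip(box, segment.upper()):
        if base not in IUPAC[symbol.upper()]:
            if symbol.isupper():          # conserved position: reject at once
                return False
            mismatches += 1               # variable position: count it
    return mismatches <= max_mismatches


def find_sites(pattern, seq, mism1=0, mism2=0, min_gap=0, max_gap=0):
    """Return (start, end) pairs (0-based, end exclusive) of forward-strand hits."""
    boxes = pattern.split("-")
    first = boxes[0]
    second = boxes[1] if len(boxes) == 2 else None
    hits = []
    for start in range(len(seq) - len(first) + 1):
        if not box_matches(first, seq[start:start + len(first)], mism1):
            continue
        if second is None:                # single-box RE
            hits.append((start, start + len(first)))
            continue
        for gap in range(min_gap, max_gap + 1):   # flexible spacer length
            s2 = start + len(first) + gap
            if box_matches(second, seq[s2:s2 + len(second)], mism2):
                hits.append((start, s2 + len(second)))
    return hits


# One mismatch at a variable (lowercase) position is tolerated.
print(find_sites("agTGGcgAggcg", "TTAGTGGCAAGGCGTT", mism1=2))   # [(2, 14)]
# A mismatch at a conserved (uppercase) position is not.
print(find_sites("agTGGcgAggcg", "TTAGTGACGAGGCGTT", mism1=2))   # []
```

Only the spacer length enters the composite check; the spacer content is never inspected.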

The expected occurrence of each regulatory motif found must be less than a given percentage (default: 5%).
The program currently uses TRANSFAC human/animal and plant datasets (3587 and ~600 real sites/consensuses, respectively). Users can also search for motifs of REs from their own dataset in the format described below.
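
The two selection criteria (an expected-occurrence threshold per motif, and presence in at least P% of the sequences) can be sketched as a simple filter. This is an illustration under assumptions, not the program's code: the function name and the input layout are hypothetical, the expected values are assumed to have been computed elsewhere, and the 5% default is read here as a cutoff of 0.05.

```python
def conserved_res(hits, n_sequences, max_expected=0.05, min_percent=50.0):
    """hits: iterable of (re_id, seq_index, expected_number) tuples.
    Return {re_id: sorted sequence indices} for REs whose accepted motifs
    occur in at least min_percent of the n_sequences analyzed sequences."""
    found = {}
    for re_id, seq_index, expected in hits:
        if expected < max_expected:               # significance threshold
            found.setdefault(re_id, set()).add(seq_index)
    result = {}
    for re_id, seqs in found.items():
        if 100.0 * len(seqs) / n_sequences >= min_percent:
            result[re_id] = sorted(seqs)
    return result


# Made-up hits for two hypothetical REs in a set of 5 sequences.
hits = [
    ("RE_A", 1, 0.004), ("RE_A", 4, 0.010), ("RE_A", 5, 0.020),
    ("RE_B", 2, 0.003), ("RE_B", 3, 0.200),   # second RE_B hit is not significant
]
print(conserved_res(hits, n_sequences=5))     # {'RE_A': [1, 4, 5]}
```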

Nsite-m output

The output file begins with a description of the program's purpose, the search parameters and, if our datasets are used, the abbreviations used. The next two lines give the name and length of the first query sequence. Then the statistical analysis of the search results is presented. Finally, the names of the REs, the statistical estimates, and the sequences of the motifs found are given.



 Program   Nsite-m: Search for Motif Patterns (Softberry Inc.)
____________________________________________________________
 File with QUERY Sequences: H-H.SEQ         
 Search PARAMETERS:
     Expected  Mean  Number                 :  0.0100000
     Print  Query  Sequence                 : No 
     Special numbering of Query Sequence    : No 
     Variation of Distance between RE Blocks: No 
     Create List of Numbered Query Sequences: No 

 NOTE: RE - Regulatory Element/Consensus
       AC - Accession No of RE in TRANSFAC
       OS - Organism/Species
       BF - Binding Factor or One of them
       Mism.             - Mismatches
       Mean. Exp. Number - Mean Expected Number
============================================================
 STATISTICAL ANALYSIS of RESULTS of SEARCH of MOTIFS
       of 3587 REs in    5 SEQUENCES
============================================================
 Motif(s) of  2 REs in  50 %  or more of analyzed sequences


 RE:   429. AC: R00560  OS: human  BF: CACCC-binding 
   ctccacccatggg    
 RE:  1272. AC: R01859  OS: human  BF: CP1  
   gccttgaccaat                  

 FOUND in every of the following    3 ( 60.00 % of all) sequences:
     3    4    5
............................................................
 RE:   738. AC: R01053  OS: mouse  BF: RXR-beta
   tgaggtcaggg                               
 RE:  2751. AC: R03786  OS: empty  BF: PUB1  
   tttatttatgttttcttctgca                                   

 FOUND in every of the following    3 ( 60.00 % of all) sequences:
     1    4    5
____________________________________________________________
SUMMARY: In 2 case(s)  motif(s) of  2 REs found in  50 % or more of analyzed sequences

==================================================
     Motifs of REs found in  50 %  or more of analyzed sequences
............................................................

   1. QUERY: >GB/U01317.1|Human HBB (H-HBB) [60137-->2500 nt]: -2000...+500

 Length of Query Sequence:       2150
 Nucleotide Frequencies:  A -  0.32   G -  0.20   T -  0.30   C -  0.17

............................................................
 RE:   738. AC: R01053  OS: mouse  BF: RXR-beta                                      
           (Found in    3 ( 60.00 %) SEQs)

 Motifs on "-" Strand: Mean Exp. Number   0.00459    Found  1

     783  TGAGGTCAGcG      773 (Mism.= 1)
==============================================================================
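
For downstream processing, the per-motif lines of a report like the one above can be extracted with regular expressions. The sketch below is inferred from this single sample only and is not an official parser: the function name and dictionary keys are hypothetical, and it assumes one-word OS values and the exact line layouts shown.

```python
import re

RE_HEADER = re.compile(r"RE:\s*(\d+)\.\s*AC:\s*(\S+)\s+OS:\s*(\S+)\s+BF:\s*(.*\S)")
STRAND = re.compile(r'Motifs on "([+-])" Strand: Mean Exp\. Number\s+([\d.]+)\s+Found\s+(\d+)')
HIT_LINE = re.compile(r"^\s*(\d+)\s+([A-Za-z]+)\s+(\d+)\s+\(Mism\.=\s*(\d+)\)")


def parse_hits(report):
    """Collect one dict per motif line, tagged with the preceding RE header."""
    hits, current = [], None
    for line in report.splitlines():
        header = RE_HEADER.search(line)
        if header:
            current = {"re": int(header.group(1)), "ac": header.group(2),
                       "os": header.group(3), "bf": header.group(4)}
            continue
        strand = STRAND.search(line)
        if strand and current is not None:
            current["strand"] = strand.group(1)
            current["expected"] = float(strand.group(2))
            continue
        hit = HIT_LINE.match(line)
        if hit and current is not None:
            hits.append({**current,
                         "coord1": int(hit.group(1)), "motif": hit.group(2),
                         "coord2": int(hit.group(3)),
                         "mismatches": int(hit.group(4))})
    return hits


sample = '''
 RE:   738. AC: R01053  OS: mouse  BF: RXR-beta
           (Found in    3 ( 60.00 %) SEQs)

 Motifs on "-" Strand: Mean Exp. Number   0.00459    Found  1

     783  TGAGGTCAGcG      773 (Mism.= 1)
'''
parsed = parse_hits(sample)
print(parsed[0]["ac"], parsed[0]["motif"], parsed[0]["mismatches"])  # R01053 TGAGGTCAGcG 1
```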

RULES for creating USER RE sets:



 1. User sets must include only sequences of actual REs and/or their consensus sequences. 
 2. Every actual RE/consensus is described in three lines:
        LINE 1: Name/description of RE/consensus
        LINE 2: Sequence of RE/consensus
        LINE 3: <par1> <par2> <par3> <par4>

 3. The sequence (LINE 2) may include both standard nucleotides (A/a, T/t, G/g, C/c)
 and their combinations according to IUPAC abbreviations: 
 R - A or G, Y - T or C, K - G or T, M - A or C, S - G or C, 
 W - A or T, B - G or T or C, D - A or G or T, H - A or C or T, 
 V - A or G or C, N - A or G or C or T.

   In the case of composite REs, the two boxes are separated by "-".

 The length of the RE/consensus sequence must not exceed 80 symbols, including "-" in
 the case of composite elements.

 Capital letters indicate conserved nucleotides (positions) in which a mismatch
 is not allowed.


 4. In LINE 3: <par1> - maximal number of mismatches for the first box
               <par2> - maximal number of mismatches for the second box (for
                        composite REs).
                        If the RE contains a single box, then <par2> = 0;
                        if no mismatches are allowed, then <par1> = <par2> = 0.

               <par3> - minimal distance between boxes of a composite RE
               <par4> - maximal distance between boxes of a composite RE
                        (for single-box REs <par3> = <par4> = 0)

 All of <par1>, <par2>, <par3> and <par4> are given as INTEGERS in 4i5 format (four right-justified fields, each five characters wide).


 Example of a USER set of 3 REs:


 RE 1  
 agTGGcgAggcg
     2    0    0    0
 RE2 
 caggccTGc-CCAGctgg 
     1    1    8   10
 RE 3  
 RRTGTGGWWW 
     0    0    0    0
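
A set in this format can be checked before submission with a small script. The sketch below is an assumption-laden illustration rather than the program's own reader: the function names are hypothetical, blank lines are ignored, and LINE 3 is read as whitespace-separated integers, which covers 4i5 lines whose values leave at least one space between fields.

```python
ALLOWED = set("ACGTRYKMSWBDHVN")


def parse_user_set(text):
    """Return a list of (name, sequence, (par1, par2, par3, par4)) tuples."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) % 3 != 0:
        raise ValueError("each RE must be described by exactly three lines")
    entries = []
    for i in range(0, len(lines), 3):
        name, seq, params = lines[i], lines[i + 1], lines[i + 2]
        if len(seq) > 80:
            raise ValueError(f"{name}: sequence longer than 80 symbols")
        if seq.count("-") > 1 or not set(seq.upper().replace("-", "")) <= ALLOWED:
            raise ValueError(f"{name}: unexpected symbol in sequence")
        values = tuple(int(v) for v in params.split())
        if len(values) != 4:
            raise ValueError(f"{name}: LINE 3 must hold four integers")
        entries.append((name, seq, values))
    return entries


def format_params(values):
    """Write four integers as right-justified fields of width 5 (4i5)."""
    return "".join(f"{v:5d}" for v in values)


user_set = """RE 1
agTGGcgAggcg
    2    0    0    0
RE2
caggccTGc-CCAGctgg
    1    1    8   10
RE 3
RRTGTGGWWW
    0    0    0    0
"""
entries = parse_user_set(user_set)
print(entries[1])   # ('RE2', 'caggccTGc-CCAGctgg', (1, 1, 8, 10))
```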

------------------------------------------------------------------------