Nsite-m
Search for regulatory motifs conserved in several sequences.
Regulatory Elements (REs) can be taken from different databases or defined by the user (for local runs only). The program finds sites that occur in at least one copy in P% or more of the analyzed DNA sequences (in the web version, P is set to 50%). Input sequences should be in FASTA format, for example:
>test1
AAAAAAAAA
GGCCCCCCC
>test2
ACCCTTTTTC
CCCCCCCCCC
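The input format and the P% rule can be illustrated with a short Python sketch. This is not Nsite-m code: the function names `read_fasta` and `shared_sites` are hypothetical, and the site label "X" in the usage note is made up for illustration.

```python
def read_fasta(text):
    """Parse FASTA text into {name: sequence}; sequence lines are concatenated."""
    parts = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {n: "".join(chunks) for n, chunks in parts.items()}


def shared_sites(hits_per_site, n_sequences, p=50.0):
    """Keep sites found (in at least one copy) in p% or more of the sequences.

    hits_per_site maps a site label to the set of sequences containing it.
    """
    return {
        site: seqs
        for site, seqs in hits_per_site.items()
        if 100.0 * len(seqs) / n_sequences >= p
    }


example = """>test1
AAAAAAAAA
GGCCCCCCC
>test2
ACCCTTTTTC
CCCCCCCCCC
"""
sequences = read_fasta(example)
```

For instance, a site found in sequences 1, 4 and 5 out of 5 covers 60% of the set, so `shared_sites({"R01053": {1, 4, 5}, "X": {2}}, 5)` keeps only `"R01053"`.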
Method description
Like Nsite, Nsite-m is based on a search for statistically significant regulatory
site consensuses; see the NSITE Help for a fuller description.
The main features of the approach are as follows:
(i) An RE may consist of a single box (a continuous DNA segment) or of two boxes
separated by a spacer sequence. Only the length of the spacer, not its nucleotide
content, matters for the functioning of such a composite site.
(ii) A real RE or its IUPAC consensus contains both variable positions, where
any of a certain group of nucleotides is permissible, and strictly conserved
positions, where strict identity between the real site/consensus and the predicted
motif is required. This nonequivalence of positions should be taken into account:
a complete match is required at conserved positions, while mismatches at
variable positions should be tolerated.
(iii) The similarity between an RE and a motif in the query DNA sequence may arise
by chance, so estimating its statistical significance is very important.
A conclusion about the functional significance of a detected similarity can be
reached only if the similarity is significantly nonrandom.
(iv) Characteristics such as nucleotide frequencies should not be used to
describe a consensus, because a consensus is too short. Instead, one should use
estimates based on the number of specific nucleotides in the consensus.
(v) Although the available RE databases usually annotate a fixed distance between
the two boxes of a composite element, the spacer length usually varies somewhat.
Therefore, a search algorithm for composite REs should allow some limited
flexibility in spacer length.
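Features (i), (ii) and (v) can be made concrete with a small matcher. The following Python sketch is an illustration under stated assumptions, not the Nsite-m implementation: it follows the convention from the user RE set rules that capital letters mark conserved positions, counts mismatches only at lowercase positions, and scans one strand only. The function names are hypothetical.

```python
# IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def count_mismatches(box, window, max_mismatches):
    """Return the number of mismatches, or None if the window is rejected.

    Uppercase positions in `box` are conserved: any mismatch there rejects
    the window. Lowercase positions may mismatch up to `max_mismatches` times.
    """
    mismatches = 0
    for p, q in zip(box, window):
        if q.upper() in IUPAC[p.upper()]:
            continue
        if p.isupper():
            return None
        mismatches += 1
        if mismatches > max_mismatches:
            return None
    return mismatches


def find_box(box, seq, max_mismatches):
    """Return (start, mismatches) for every accepted window (0-based start)."""
    hits = []
    for start in range(len(seq) - len(box) + 1):
        m = count_mismatches(box, seq[start:start + len(box)], max_mismatches)
        if m is not None:
            hits.append((start, m))
    return hits


def find_composite(box1, box2, seq, mm1, mm2, min_gap, max_gap):
    """Return (start, gap, mismatches1, mismatches2) for two-box sites.

    The spacer content is ignored; only its length (min_gap..max_gap) matters.
    """
    hits = []
    for start1, m1 in find_box(box1, seq, mm1):
        end1 = start1 + len(box1)
        for gap in range(min_gap, max_gap + 1):
            window = seq[end1 + gap:end1 + gap + len(box2)]
            if len(window) < len(box2):
                break
            m2 = count_mismatches(box2, window, mm2)
            if m2 is not None:
                hits.append((start1, gap, m1, m2))
    return hits


# A single-box site with one tolerated mismatch.
single = find_box("tgaggtcaggg", "TGAGGTCAGCG", 1)

# A composite site whose spacer may be 8 to 10 nucleotides long.
query = "tt" + "caggccTGc" + "a" * 9 + "CCAGctgg" + "tt"
composite = find_composite("caggccTGc", "CCAGctgg", query, 1, 1, 8, 10)
```

Here `single` contains one hit with one mismatch, and `composite` contains one hit starting at offset 2 with a 9-nucleotide spacer. A mismatch at a capital-letter position, such as the `TGG` core of `agTGGcgAggcg`, rejects the window regardless of the mismatch budget.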
The expected occurrence of each regulatory motif found must be below a given
percentage (default: 5%).
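The program's own significance estimate is described in the NSITE Help and is not reproduced here. Purely to illustrate what an expected number of occurrences means, the Python sketch below uses a textbook null model that is an assumption of this example, not the Nsite-m statistic: positions are independent, the background is given by the nucleotide frequencies of the query, only one strand is counted, and capital letters mark conserved positions. All function names are hypothetical.

```python
# IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def window_match_probability(pattern, max_mismatches, freqs):
    """Probability that one random window is accepted as a match.

    Uppercase positions must match; lowercase positions may mismatch at most
    `max_mismatches` times in total.
    """
    conserved = 1.0
    dist = [1.0]  # dist[j] = P(exactly j mismatches at variable positions so far)
    for ch in pattern:
        p = sum(freqs[base] for base in IUPAC[ch.upper()])
        if ch.isupper():
            conserved *= p
        else:
            new = [0.0] * (len(dist) + 1)
            for j, pr in enumerate(dist):
                new[j] += pr * p            # this position matches
                new[j + 1] += pr * (1 - p)  # this position mismatches
            dist = new
    return conserved * sum(dist[:max_mismatches + 1])


def expected_occurrences(pattern, max_mismatches, seq_len, freqs):
    """Expected number of accepted windows on one strand of a random sequence."""
    windows = max(seq_len - len(pattern) + 1, 0)
    return windows * window_match_probability(pattern, max_mismatches, freqs)


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
e_value = expected_occurrences("ACGT", 0, 259, uniform)
```

With a uniform background, a fully conserved 4-nucleotide pattern matches a window with probability 1/256, so 256 windows give an expected count of 1. A motif would be reported only when such an expected value falls below the chosen threshold.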
The program currently uses the TRANSFAC human/animal and plant datasets (3587 and
~600 real sites/consensuses, respectively). Users can also search for motifs
of REs from their own dataset, in the format described below.
Nsite-m output
The output file begins with a description of the program allocation, the search
parameters and, if our datasets are used, the abbreviations used. The next two
lines give the name and length of the first query sequence. Then the statistical
analysis of the search results is presented. Finally, the names of the REs, the
statistical estimates and the sequences of the motifs found are given.
Program Nsite-m: Search for Motif Patterns (Softberry Inc.)
____________________________________________________________
File with QUERY Sequences: H-H.SEQ
Search PARAMETERS:
  Expected Mean Number : 0.0100000
  Print Query Sequence : No
  Special numbering of Query Sequence : No
  Variation of Distance between RE Blocks: No
  Create List of Numbered Query Sequences: No
NOTE:
  RE - Regulatory Element/Consensus
  AC - Accession No of RE in TRANSFAC
  OS - Organism/Species
  BF - Binding Factor or One of them
  Mism. - Mismatches
  Mean. Exp. Number - Mean Expected Number
============================================================
STATISTICAL ANALYSIS of RESULTS of SEARCH of
MOTIFS of 3587 REs in 5 SEQUENCES
============================================================
Motif(s) of 2 REs in 50 % or more of analyzed sequences
RE: 429. AC: R00560 OS: human BF: CACCC-binding
ctccacccatggg
RE: 1272. AC: R01859 OS: human BF: CP1
gccttgaccaat
FOUND in every of the following 3 ( 60.00 % of all) sequences:
3 4 5
............................................................
RE: 738. AC: R01053 OS: mouse BF: RXR-beta
tgaggtcaggg
RE: 2751. AC: R03786 OS: empty BF: PUB1
tttatttatgttttcttctgca
FOUND in every of the following 3 ( 60.00 % of all) sequences:
1 4 5
____________________________________________________________
SUMMARY: In 2 case(s) motif(s) of 2 REs found in 50 % or more of analyzed sequences
==================================================
Motifs of REs found in 50 % or more of analyzed sequences
............................................................
1. QUERY: >GB/U01317.1|Human HBB (H-HBB) [60137-->2500 nt]: -2000...+500
Length of Query Sequence: 2150
Nucleotide Frequencies: A - 0.32 G - 0.20 T - 0.30 C - 0.17
............................................................
RE: 738. AC: R01053 OS: mouse BF: RXR-beta (Found in 3 ( 60.00 %) SEQs)
Motifs on "-" Strand: Mean Exp. Number 0.00459 Found 1
783 TGAGGTCAGcG 773 (Mism.= 1)
==============================================================================
RULES for creating USER RE sets:
1. User sets must include only sequences of actual REs and/or their consensus
   sequences.
2. Every actual RE/consensus is described in three lines:
     LINE 1: Name/description of RE/consensus
     LINE 2: Sequence of RE/consensus
     LINE 3: <par1> <par2> <par3> <par4>
3. Sequence (LINE 2) may include both standard nucleotides (A/a, T/t, G/g, C/c)
   and their combinations according to IUPAC abbreviations:
     R - A or G, Y - T or C, K - G or T, M - A or C, S - G or C, W - A or T,
     B - G or T or C, D - A or G or T, H - A or C or T, V - A or G or C,
     N - A or G or C or T.
   In the case of composite REs, the two boxes are separated by "-".
   Length of RE/consensus sequence must not exceed 80 symbols, including "-"
   in the case of composite elements.
   Capital letters indicate conserved nucleotides (positions) in which a
   mismatch is not allowed.
4. In LINE 3:
     <par1> - maximal number of mismatches for the first box
     <par2> - maximal number of mismatches for the second box (for composite REs).
              If the RE contains a single box, then <par2> = 0;
              if no mismatch is allowed, then <par1> = <par2> = 0.
     <par3> - minimal distance between boxes of a composite RE
     <par4> - maximal distance between boxes of a composite RE
              (for single-box REs <par3> = <par4> = 0)
   All of <par1>, <par2>, <par3> and <par4> are given as INTEGERS in 4i5 format.

Example of USER's set of 3 REs:
RE 1
agTGGcgAggcg
    2    0    0    0
RE2
caggccTGc-CCAGctgg
    1    1    8   10
RE 3
RRTGTGGWWW
    0    0    0    0
------------------------------------------------------------------------
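A user set written to these rules can be checked before it is submitted. The Python sketch below is a hypothetical helper, not part of Nsite-m: it assumes blank lines may be skipped and reads the four parameters by splitting on whitespace, while `format_params` writes them in 4i5 layout (four integers, each right-justified in a 5-character field).

```python
from dataclasses import dataclass


@dataclass
class UserRE:
    name: str
    boxes: list            # one box, or two boxes for a composite RE
    max_mismatches: tuple  # (<par1>, <par2>)
    spacer: tuple          # (<par3>, <par4>) = (min, max) distance


def parse_user_set(text):
    """Parse three-line RE records: name, sequence, four integer parameters."""
    lines = [ln.rstrip() for ln in text.splitlines() if ln.strip()]
    if len(lines) % 3 != 0:
        raise ValueError("each RE needs exactly three lines")
    records = []
    for i in range(0, len(lines), 3):
        name, seq, params = lines[i].strip(), lines[i + 1].strip(), lines[i + 2]
        if len(seq) > 80:
            raise ValueError(f"{name}: sequence longer than 80 symbols")
        boxes = seq.split("-")
        if len(boxes) > 2:
            raise ValueError(f"{name}: at most two boxes are allowed")
        p1, p2, p3, p4 = (int(x) for x in params.split())
        records.append(UserRE(name, boxes, (p1, p2), (p3, p4)))
    return records


def format_params(p1, p2, p3, p4):
    """Write the four parameters in 4i5 layout."""
    return "".join(f"{v:5d}" for v in (p1, p2, p3, p4))


user_set = """RE 1
agTGGcgAggcg
    2    0    0    0
RE2
caggccTGc-CCAGctgg
    1    1    8   10
RE 3
RRTGTGGWWW
    0    0    0    0
"""
records = parse_user_set(user_set)
```

For the second record, `records[1].boxes` holds the two boxes `caggccTGc` and `CCAGctgg`, with a spacer range of 8 to 10.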