Oligs

The program computes statistics on oligonucleotides (for example, 4-nucleotides) and reports those whose counts differ significantly from the expected mean.

Input data

The input file should be in FASTA format and may contain several sequences. The alphabet is as follows. Allowed symbols: "ACGTUacgtu" and "NnyYrRBbDdHhKkWwSsMmVv". Symbols that are skipped: "0123456789; \n\r\t\0-". All other symbols are not allowed.
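The alphabet rules above can be illustrated with a minimal Python sketch. The function name clean_sequence is hypothetical and not part of the program. The sketch applies to sequence data only, not to FASTA header lines, and it assumes that a disallowed symbol is treated as an error; the manual does not say how the program itself reacts.

```python
# Symbol sets copied from the alphabet description above.
ALLOWED = set("ACGTUacgtu" "NnyYrRBbDdHhKkWwSsMmVv")
SKIPPED = set("0123456789; \n\r\t\0-")


def clean_sequence(raw: str) -> str:
    """Drop skipped symbols; raise ValueError on a symbol that is not allowed.

    Hypothetical helper: raising an error is an assumption, the real
    program's reaction to a bad symbol is not documented here.
    """
    kept = []
    for pos, ch in enumerate(raw):
        if ch in SKIPPED:
            continue
        if ch not in ALLOWED:
            raise ValueError(f"symbol {ch!r} at position {pos} is not allowed")
        kept.append(ch)
    return "".join(kept)


print(clean_sequence("ACGT 12\nacgu-N"))
```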

Input parameters

The program processes all oligonucleotides of length L, where L takes every value in the range from L1 to L2.
Minimal olig length (L1) - Minimal olig length
Maximal olig length (L2) - Maximal olig length
Restrictions for L1, L2: 1<=L1 && L1<=L2 && L2<=13.
The computer must have enough memory installed; the amount required depends on the oligo length.
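To give a rough feel for how memory grows with oligo length, the sketch below estimates the size of a counter table. It assumes one counter for each of the 4^L possible oligos over A, C, G, T and 4 bytes per counter. Both are assumptions made for illustration; the program's real memory layout is not described in this manual.

```python
def counter_table_bytes(length: int, bytes_per_counter: int = 4) -> int:
    """Estimated bytes for one counter per possible oligo of this length.

    Assumes a 4-letter alphabet and a fixed counter size (both hypothetical).
    """
    return (4 ** length) * bytes_per_counter


for length in (4, 8, 13):
    print(length, 4 ** length, counter_table_bytes(length))
```

Under these assumptions the table quadruples in size with every extra nucleotide, which is consistent with the upper limit on L2.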
 
Input file - Input file in FASTA-format.
 
A special mode prints all oligos, ignoring any additional conditions. In this mode a very large output file can be generated.
Print all oligs - Print all oligs, ignore conditions
 
The program can process not only the given sequence but also, in the same run, build and process the reverse sequence.
Scan target sequence in different chain - Scan target sequence in different chain:
In direct chain only (default)
In reverse chain only
In both chains
 
By analogy with the two tails of a normal distribution, the program can output either the most frequent oligos or the rarest ones. The following parameter is used for this:
Frequency - Most frequent or least frequent:
most frequent (default)
least frequent
 
To determine which oligos are output and which are not, a value for the deviation multiplier fence should be defined.
The deviation multiplier is the difference between the observed and the expected number of occurrences of an oligo, expressed in sigma units. For more details see the algorithm description chapter.
Deviation multiplier fence - Use the value 3.0 to output 5% of oligos.
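The selection rule can be sketched as follows. The rule for the rarest oligos (values less than or equal to minus the fence) follows the output description in this manual; the mirrored rule for the most frequent oligos is an assumption, and the name passes_fence is hypothetical.

```python
def passes_fence(deviation_multiplier: float, fence: float, least_frequent: bool) -> bool:
    """Decide whether an oligo is reported for the given fence.

    least_frequent=True keeps multipliers <= -fence (as in the output
    description); the most-frequent branch is an assumed mirror image.
    """
    if least_frequent:
        return deviation_multiplier <= -fence
    return deviation_multiplier >= fence


print(passes_fence(-15.2, 3.0, least_frequent=True))
print(passes_fence(-2.9, 3.0, least_frequent=True))
```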
 
Output file - Output file name.
 
The "shift" parameter sets the offset (in nucleotides) from the start of the sequence to the position at which oligo generation begins. If there are several sequences in a file, the shift value applies to each of them. The default value is 0.
Shift in sequence - Shift in sequence, default value is 0.
 
The "step" parameter sets the distance (in nucleotides) between the start positions of consecutive oligos. To get all oligos, this parameter should be set to 1, which is the default value.
Step in sequence - Step in sequence (default value is 1)
Sometimes it is necessary to check all three reading frames. To do this, run the program three times with the following values for "shift" and "step":
1) step=3 shift=0
2) step=3 shift=1
3) step=3 shift=2
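The effect of "shift" and "step" on the three reading frames can be illustrated with a short Python sketch. The function oligos and the sample sequence are hypothetical; the sketch only mirrors the description above and is not the program's own code.

```python
def oligos(seq: str, length: int, shift: int = 0, step: int = 1):
    """Yield oligos of the given length, starting at `shift`, advancing by `step`."""
    for start in range(shift, len(seq) - length + 1, step):
        yield seq[start:start + length]


seq = "ATGCGTACC"  # made-up sequence for illustration

# One pass per reading frame: step=3 with shift=0, 1, 2.
for shift in (0, 1, 2):
    print(shift, list(oligos(seq, 3, shift=shift, step=3)))
```

With step=1 and shift=0 the same function yields every overlapping oligo, which corresponds to the default settings.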
 
Input sequences may be either in FASTA format or in a specially packed format. Softberry products frequently pack large chromosomes into their own "nucfile" (nf) format. In this case the sequence file has the .nf extension.
If the "Packed file" parameter is not defined, the program treats the input file as FASTA. Otherwise the input file is treated as a "nucfile".
Packed file - Input file is packed file (nucfile, nf).
The FASTA file can be converted to the nucfile one using the cvtseq utility.
For example, to convert the FASTA file chr22.fa to the nucfile chr22.nf, use the following command line:

cvtseq chr22.fa chr22.nf -fi -do -t "chr22" -n5gc

Use the following command to check the information on a packed file:


cvtseq chr22.nf -e

Command output:


filename: chr22.nf
pack_mode: PACK_5

size: 49476972 from: 0 nonstandard: 1
title_size: 5 title: chr22

Algorithm

For each defined L, an array of oligo counters is built. The sequential number of an oligo is used as the index into this array, and the array value at that index is the total number of occurrences of that oligo.
Then, using this array and the defined parameters, the program builds a table of oligos that contains more information (mean, deviation multiplier, etc.). This table is printed to the output file.
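The counting step can be sketched in Python as below. The manual does not say how the sequential number of an oligo is computed, so the base-4 encoding used here (A=0, C=1, G=2, T=3) is an assumption, as is skipping oligos that contain any other symbol. The names CODE, olig_index and count_oligs are hypothetical.

```python
# Assumed nucleotide codes; the program's actual numbering is not documented.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def olig_index(olig: str) -> int:
    """Sequential number of an oligo, read as a base-4 number."""
    index = 0
    for ch in olig:
        index = index * 4 + CODE[ch]
    return index


def count_oligs(seq: str, length: int) -> list[int]:
    """Array of counters indexed by olig_index, one slot per possible oligo."""
    counts = [0] * (4 ** length)
    for start in range(len(seq) - length + 1):
        olig = seq[start:start + length]
        # Assumption: oligos containing symbols other than A, C, G, T are skipped.
        if all(ch in CODE for ch in olig):
            counts[olig_index(olig)] += 1
    return counts


counts = count_oligs("ACGACG", 2)
print(counts[olig_index("AC")], counts[olig_index("CG")], sum(counts))
```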
Total number of all oligos - oligs_sum_count.
Total number of nucleotides - seqs_sum_length.
The oligo's frequency is the product of the frequencies of the nucleotides it consists of.
The expected mean of the counter (that is, the oligo's expected mean) is calculated as follows:
average= oligs_sum_count*frequence;
The deviation is calculated using the formula:
deviation = sqrt( oligs_sum_count*frequence*(1-frequence) );
The oligo's counter - olig_count - is the number of times this oligo occurs in the sequences.
The deviation multiplier is calculated using the formula:
Deviation_multiplier= (olig_count-average)/deviation;
The normalized deviation (norm deviate) of the given oligo is calculated using the formula:
Norm_deviate= olig_count/seqs_sum_length;
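As a worked example, the formulas above can be applied to the oligo TA from the sample output in this manual (counter 2174, oligs=46673, Nucleotides=46705). This is a sketch, not the program's code: the function name olig_stats is hypothetical, and the nucleotide frequencies are taken from the rounded percentages in the output header, so the results are only expected to be close to the printed values, not identical.

```python
import math


def olig_stats(olig, olig_count, oligs_sum_count, seqs_sum_length, nucleotide_freq):
    """Apply the formulas of this chapter to one oligo."""
    # Oligo frequency: product of the frequencies of its nucleotides.
    frequence = math.prod(nucleotide_freq[ch] for ch in olig)
    average = oligs_sum_count * frequence
    deviation = math.sqrt(oligs_sum_count * frequence * (1 - frequence))
    deviation_multiplier = (olig_count - average) / deviation
    norm_deviate = olig_count / seqs_sum_length
    return average, deviation, deviation_multiplier, norm_deviate


# Rounded frequencies from the sample output header (A=25.1% ... T=25.4%).
freq = {"A": 0.251, "C": 0.247, "G": 0.248, "T": 0.254}
average, deviation, multiplier, norm = olig_stats("TA", 2174, 46673, 46705, freq)
print(f"{average:.1f} {deviation:.1f} {multiplier:.1f} {norm:.6f}")
```

The small difference in the expected number compared with the sample table comes from using percentages rounded to one decimal place.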

Output data


Example of program output:

Oligs 1.6  Copyright (c) 2005-2006 Softberry
Num seqs=32 Nucleotides=46705 Average seq length=1459.5
A=25.1% C=24.7% G=24.8% T=25.4% N=0.000000% Other=0.000000%
Output least frequent oligs, direction=direct, seq_shift=0, seq_step=1
deviation multiplier=3.000000
#olig,total olig counter,expected number,deviation,deviation multiplier,unique sequences counter,norm deviate
Length 2  oligs=46673
TA      2174    2976.6     52.8    -15.2     32  0.046547
CG      2461    2858.0     51.8     -7.7     32  0.052692
GT      2609    2939.8     52.5     -6.3     32  0.055861
AC      2579    2893.8     52.1     -6.0     32  0.055219
GG      2662    2868.7     51.9     -4.0     32  0.056996

Length 3  oligs=46641
TAG       412     737.4     26.9    -12.1     32  0.008821
CTA       446     734.7     26.9    -10.7     32  0.009549
GTA       511     737.4     26.9     -8.4     32  0.010941
TAC       509     734.7     26.9     -8.4     31  0.010898
CGT       519     725.6     26.7     -7.7     32  0.011112
GGG       508     710.7     26.5     -7.7     32  0.010877
GTC       539     725.6     26.7     -7.0     32  0.011541
ACG       549     716.9     26.6     -6.3     32  0.011755
GAC       551     716.9     26.6     -6.2     32  0.011797
CCC       545     702.8     26.3     -6.0     32  0.011669
CGG       550     708.1     26.4     -6.0     32  0.011776
TTA       608     755.7     27.3     -5.4     32  0.013018
ATA       607     746.7     27.1     -5.2     31  0.012996
TAT       626     755.7     27.3     -4.8     32  0.013403
ACC       595     714.3     26.5     -4.5     32  0.012740
TAA       627     746.7     27.1     -4.4     32  0.013425
GGT       619     728.3     26.8     -4.1     32  0.013253
TCA       631     734.7     26.9     -3.9     32  0.013510
AGT       640     737.4     26.9     -3.6     32  0.013703
CCG       611     705.4     26.4     -3.6     32  0.013082
ACT       651     734.7     26.9     -3.1     32  0.013939

Length 4  oligs=46609
CTAG        73     182.0     13.5     -8.1     26  0.001563
GGGG        71     176.1     13.2     -7.9     24  0.001520
TAGG        83     182.7     13.5     -7.4     24  0.001777
CCTA        85     181.3     13.4     -7.2     26  0.001820
CGTA        92     182.0     13.5     -6.7     26  0.001970
TAGT       104     187.2     13.7     -6.1     26  0.002227
TTAG       105     187.2     13.7     -6.0     25  0.002248
ACGT       101     182.0     13.5     -6.0     29  0.002163
TACG       104     182.0     13.5     -5.8     22  0.002227
TAGA       108     185.0     13.6     -5.7     27  0.002312
TCTA       111     186.5     13.6     -5.5     27  0.002377
GGTA       110     182.7     13.5     -5.4     24  0.002355
ACTA       112     184.3     13.5     -5.3     29  0.002398
ACCC       106     176.3     13.3     -5.3     26  0.002270
GTCA       111     182.0     13.5     -5.3     26  0.002377
TAAC       113     184.3     13.5     -5.3     29  0.002419
CTAT       115     186.5     13.6     -5.2     29  0.002462
ATAG       115     185.0     13.6     -5.2     26  0.002462
CGGT       111     179.8     13.4     -5.1     30  0.002377
CGTC       111     179.1     13.4     -5.1     29  0.002377
CGGG       109     175.4     13.2     -5.0     29  0.002334
GATA       118     185.0     13.6     -4.9     27  0.002526
TATC       120     186.5     13.6     -4.9     30  0.002569
TACC       116     181.3     13.4     -4.9     26  0.002484
TAGC       117     182.0     13.5     -4.8     27  0.002505
TTAC       121     186.5     13.6     -4.8     28  0.002591
GTAG       119     182.7     13.5     -4.7     28  0.002548
ATAC       123     184.3     13.5     -4.5     26  0.002634
GGGT       121     180.4     13.4     -4.4     26  0.002591
CCCT       120     178.4     13.3     -4.4     29  0.002569
CGCG       117     174.8     13.2     -4.4     26  0.002505
GGTC       122     179.8     13.4     -4.3     29  0.002612
CTAA       126     184.3     13.5     -4.3     31  0.002698
GACC       120     177.0     13.3     -4.3     27  0.002569
TAAG       127     185.0     13.6     -4.3     30  0.002719
GTCT       127     184.2     13.5     -4.2     30  0.002719
CTTA       129     186.5     13.6     -4.2     31  0.002762
GTAA       128     185.0     13.6     -4.2     28  0.002741
ACGG       122     177.6     13.3     -4.2     30  0.002612
GACT       126     182.0     13.5     -4.2     31  0.002698
TCAT       130     186.5     13.6     -4.1     29  0.002783
AGAC       125     179.8     13.4     -4.1     28  0.002676
GTAT       132     187.2     13.7     -4.0     25  0.002826
CCCG       121     174.1     13.2     -4.0     28  0.002591
TACT       132     186.5     13.6     -4.0     29  0.002826
TGAC       129     182.0     13.5     -3.9     30  0.002762
CCGG       123     174.8     13.2     -3.9     27  0.002634
ACCG       125     177.0     13.3     -3.9     29  0.002676
ATTA       136     189.6     13.7     -3.9     29  0.002912
CCCC       123     173.5     13.1     -3.8     25  0.002634
AGTC       132     182.0     13.5     -3.7     26  0.002826
GTAC       132     182.0     13.5     -3.7     26  0.002826
CTAC       132     181.3     13.4     -3.7     31  0.002826
TCAC       132     181.3     13.4     -3.7     30  0.002826
CATA       135     184.3     13.5     -3.6     27  0.002890
AGTA       137     185.0     13.6     -3.5     29  0.002933
GCGT       136     179.8     13.4     -3.3     29  0.002912
GCTA       138     182.0     13.5     -3.3     28  0.002955
TCGT       140     184.2     13.5     -3.3     31  0.002998
GTTA       143     187.2     13.7     -3.2     29  0.003062
GAGT       140     182.7     13.5     -3.2     29  0.002998
TCGG       138     179.8     13.4     -3.1     31  0.002955

Detailed description for output data:

The program name and version are shown in the first line:


Oligs 1.6  Copyright (c) 2005-2006 Softberry

Num seqs=32 Nucleotides=46705 Average seq length=1459.5
A=25.1% C=24.7% G=24.8% T=25.4% N=0.000000% Other=0.000000%

Next comes information on the input file:
Number of fasta-sequences - 32
Number of nucleotides - 46705
Average length of sequence - 1459.5
Percentage of 'A' - 25.1
Percentage of 'C' - 24.7
Percentage of 'G' - 24.8
Percentage of 'T' - 25.4
Percentage of 'N' - 0.0
Percentage of other letters (except A, C, G, T, N) - 0.0


Output least frequent oligs, direction=direct, seq_shift=0, seq_step=1
deviation multiplier=3.000000

Next come the input parameters that were set:
Show the rarest oligos - Output least frequent oligs
Process the direct chain only - direction=direct
The "Shift" parameter - 0
The "Step" parameter - 1
Defined deviation multiplier range - 3.0


#olig,total olig counter,expected number,deviation,deviation multiplier,unique sequences counter,norm deviate

Next comes a header line that names each column of the oligo tables:
Column 1 - the specific oligo (olig)
Column 2 - the counter of this oligo, i.e. how many times this oligo occurs (total olig counter)
Column 3 - the expected mean value of the counter, i.e. the expected number of occurrences of this oligo (expected number)
Column 4 - the deviation of the current oligo (deviation)
Column 5 - the deviation multiplier for the current oligo (deviation multiplier). Note that in this example the deviation multiplier fence was set to 3.0, and since the mode to output the rarest oligos was chosen, the values in column 5 are less than or equal to -3.0.
Column 6 - the number of sequences containing the current oligo (unique sequences counter)
Column 7 - the normalized deviation of the current oligo (norm deviate)
For more details on how the various values are calculated, see the chapter "Algorithm".


Length 3  oligs=46641

Next come the tables of oligos, one for each length.
Example of a table heading for oligos of length 3:
The heading shows the length of the oligos in the table (Length 3) and the total number of oligos of this length (oligs=46641).


TAG       412     737.4     26.9    -12.1     32  0.008821
CTA       446     734.7     26.9    -10.7     32  0.009549
GTA       511     737.4     26.9     -8.4     32  0.010941

Next comes the table itself, with rows sorted by the value in column 5. In this example (least frequent oligos) the most negative values come first, so the strongest deviations head the table.
If the option to output the most frequent oligos is chosen, the sort order of column 5 is reversed.
The meaning of each value is described earlier in the text.
Description of the first row:
Column 1 - The current oligo is 'TAG'
Column 2 - The counter of the current oligo is 412
Column 3 - The expected mean for the current oligo is 737.4
Column 4 - The deviation for the current oligo is 26.9
Column 5 - The deviation multiplier for the current oligo is -12.1
Column 6 - The total number of sequences containing the current oligo is 32
Column 7 - The normalized deviation is 0.008821
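For downstream processing, a table row can be split into the seven columns described above. The following Python sketch assumes that the columns are separated by whitespace, as in the sample output; the names OligRow and parse_row are hypothetical and not part of the program.

```python
from typing import NamedTuple


class OligRow(NamedTuple):
    olig: str                    # column 1
    count: int                   # column 2, total olig counter
    expected: float              # column 3, expected number
    deviation: float             # column 4
    deviation_multiplier: float  # column 5
    unique_sequences: int        # column 6
    norm_deviate: float          # column 7


def parse_row(line: str) -> OligRow:
    """Parse one whitespace-separated table row into its seven columns."""
    fields = line.split()
    if len(fields) != 7:
        raise ValueError(f"expected 7 columns, got {len(fields)}")
    return OligRow(
        fields[0],
        int(fields[1]),
        float(fields[2]),
        float(fields[3]),
        float(fields[4]),
        int(fields[5]),
        float(fields[6]),
    )


row = parse_row("TAG       412     737.4     26.9    -12.1     32  0.008821")
print(row.olig, row.count, row.deviation_multiplier)
```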