Oligs2

Searches for oligos (4-nucleotide oligos) that occur frequently in the 1st sequence file and whose counts differ significantly between the 1st and 2nd sequence files.

Input data

The input file should be in FASTA format and may contain several sequences.
Alphabet:
The allowed symbols: "ACGTUacgtu" and "NnyYrRBbDdHhKkWwSsMmVv".
The symbols to be skipped: "0123456789; \n\r\t\0-".
All other symbols are not allowed.
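The alphabet rules above can be illustrated with a minimal Python sketch. It is not the program's own code: the function name clean_sequence is hypothetical, and it is assumed here that the rules apply to sequence lines only, not to FASTA header lines.

```python
# Symbol sets copied from the alphabet description above.
ALLOWED = set("ACGTUacgtu" "NnyYrRBbDdHhKkWwSsMmVv")
SKIPPED = set("0123456789; \n\r\t\0-")


def clean_sequence(raw: str) -> str:
    """Drop skipped symbols, keep allowed ones, reject everything else."""
    kept = []
    for pos, ch in enumerate(raw):
        if ch in ALLOWED:
            kept.append(ch)
        elif ch in SKIPPED:
            continue  # digits, whitespace, ';', '-' and NUL are ignored
        else:
            raise ValueError(f"symbol {ch!r} at position {pos} is not allowed")
    return "".join(kept)


print(clean_sequence("AC-GT 12\nnn"))
```

A symbol outside both sets, such as "X", makes the sketch raise an error; how the real program reports such symbols is not described here.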

Input parameters

The program processes all oligonucleotides of length L, where L takes every value in the range from L1 to L2.
Minimal olig length (L1) - Minimal olig length
Maximal olig length (L2) - Maximal olig length
Restrictions for L1, L2: 1<=L1 && L1<=L2 && L2<=13.
The computer must have enough memory installed; the amount of memory required depends on the oligo length.
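The restriction on L1 and L2 can be checked as in the sketch below. The function names are hypothetical. The count of 4^L distinct oligos over the four letters A, C, G, T is given only as a plausible reason why memory use grows with oligo length; the program's actual memory layout is not documented here.

```python
def check_lengths(l1: int, l2: int) -> None:
    """Enforce the documented restriction 1<=L1 && L1<=L2 && L2<=13."""
    if not (1 <= l1 <= l2 <= 13):
        raise ValueError(f"invalid range L1={l1}, L2={l2}")


def distinct_oligos(length: int) -> int:
    """Number of different oligos of the given length over A, C, G, T."""
    return 4 ** length


check_lengths(2, 3)
for length in (2, 3, 13):
    print(length, distinct_oligos(length))
```

For L=13 this gives 67108864 possible oligos, compared with 16 for L=2.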
 
Input file 1 - The first input file in FASTA-format.
Input file 2 - The second input file in FASTA-format.
 
Coefficient k defines which of the two files is more important when the found oligos are sorted. It affects only the sorting order of the found oligos. The default value 1.0 means equal importance. If k is greater than 1.0, the first file is more important; if k is less than 1.0, the second file is more important.
Coefficient k - Which of the input files is more important when sorting oligos (default 1.0)
Output file - Output file's name.

Algorithm

For the 1st input file, the oligs program searches for the most frequent oligos with deviation multiplier = 0.0. The result is saved in a temporary file.
For the 2nd input file, the oligs program is run with the "Print all oligs" option to find all oligos. The result is saved in a temporary file.
It is important to find every oligo in the 2nd file: an oligo that is frequent in the 1st file may occur only a few times in the 2nd file, and without its count the two files could not be compared correctly.
For every oligo in the 1st temporary file, the program looks up its counterpart in the 2nd temporary file and calculates the "sorter" value for that oligo.
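The counting and lookup steps can be sketched in Python as follows. This is an illustration, not the oligs program itself: it assumes overlapping windows read on the direct strand with a step of 1, it does not treat ambiguity codes such as N specially, and the function name and toy sequences are made up for the example.

```python
from collections import Counter


def count_oligos(sequences, length):
    """Count every overlapping oligo of the given length in all sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - length + 1):
            counts[seq[i:i + length]] += 1
    return counts


# Toy stand-ins for the sequences of the 1st and 2nd input files.
file1 = ["ACGTACGT", "TTACG"]
file2 = ["ACGAAA"]

counts1 = count_oligos(file1, 3)
counts2 = count_oligos(file2, 3)

# For each oligo of the 1st file, look up its counterpart in the 2nd file.
for olig, n1 in counts1.most_common():
    n2 = counts2.get(olig, 0)  # 0 when the oligo is absent from the 2nd file
    print(olig, n1, n2)
```

Keeping the complete count table for the 2nd file means the lookup returns a count even for oligos that are rare there.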

The "sorter" value is calculated from the following quantities:
div_sum_len - the ratio of nucleotide counts between the files:
div_sum_len = number of nucleotides in the 1st file / number of nucleotides in the 2nd file;
k - Coefficient k (input parameter).
olig1_count - how many times the oligo occurs in the 1st file.
olig2_count - how many times the oligo occurs in the 2nd file.
z = 0.5*olig1_count*(1+k*olig1_count/(olig2_count*div_sum_len));
olig1_derivat_mult - the "deviation multiplier" value of the oligo from the 1st temporary file.
sorter = olig1_derivat_mult*z;
The program prints the header from the 1st temporary file, then the header from the 2nd one, and then all oligos in descending order of "sorter".
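The formulas above can be written as a short Python sketch. The function names are hypothetical. The numbers are taken from the TG row of the sample output in the "Output data" section (899 occurrences in the 1st file, 954 in the 2nd, 12191 and 13702 nucleotides), and it is assumed that the sample was produced with the default k = 1.0. The value of olig1_derivat_mult is not printed in the output, so the sketch only derives the value implied by the printed sorter.

```python
def z_value(olig1_count, olig2_count, div_sum_len, k=1.0):
    """z = 0.5*olig1_count*(1+k*olig1_count/(olig2_count*div_sum_len))"""
    return 0.5 * olig1_count * (1 + k * olig1_count / (olig2_count * div_sum_len))


def sorter_value(olig1_derivat_mult, olig1_count, olig2_count, div_sum_len, k=1.0):
    """sorter = olig1_derivat_mult*z"""
    return olig1_derivat_mult * z_value(olig1_count, olig2_count, div_sum_len, k)


# Nucleotide totals of the two files from the sample output headers.
div_sum_len = 12191 / 13702

z_tg = z_value(899, 954, div_sum_len)
implied_mult = 4627.9 / z_tg  # printed sorter divided by z

print(f"z = {z_tg:.1f}, implied multiplier = {implied_mult:.2f}")
```

Under these assumptions z is about 925.6 for TG, and the printed sorter of 4627.9 corresponds to a multiplier of about 5.0. Passing a k greater than 1.0 increases the weight of the olig1_count/olig2_count ratio term in z.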

Output data

Example of program output:


Oligs2 1.1  Copyright (c) 2005-2006 Softberry
Num seqs=11 Nucleotides=12191 Average seq length=1108.3
A=25.4% C=23.9% G=25.0% T=25.1% N=0.623411% Other=0.000000%
Output most frequent oligs, direction=direct, seq_shift=0, seq_step=1
deviation multiplier=0.000000
Num seqs=17 Nucleotides=13702 Average seq length=806.0
A=28.8% C=21.4% G=21.8% T=28.0% N=0.000000% Other=0.000000%
Output most frequent oligs, direction=direct, seq_shift=0, seq_step=1
all by distant
#olig,total olig counter1,expected number1,unique sequences counter1,total olig counter2,
unique sequences counter2,norm deviate1,norm deviate 2,sorter
Length 2
TG       899     764.6     11       954     17  0.073743  0.069625    4627.9
CA       873     738.4     11       927     17  0.071610  0.067654    4582.5
GC       832     727.2     11       830     17  0.068247  0.060575    3538.7
TT       871     768.9     11      1296     17  0.071446  0.094585    2905.0
AA       875     784.0     11      1414     17  0.071774  0.103197    2522.1
GA       842     772.1     11       759     17  0.069067  0.055393    2459.4
TC       788     731.2     11       744     17  0.064638  0.054299    1898.7
AT       804     776.4     11      1067     17  0.065950  0.077872     742.5
AG       786     772.1     11       755     17  0.064474  0.055101     426.4

Length 3
CTG       260     182.5     11       210     17  0.021327  0.015326    1803.2
TTT       278     193.0     11       482     17  0.022804  0.035177    1420.5
CAG       247     184.3     11       207     17  0.020261  0.015107    1358.9
CCA       237     176.3     11       232     17  0.019441  0.016932    1171.0
TGC       242     182.5     11       261     17  0.019851  0.019048    1087.2
TGG       246     190.9     11       242     17  0.020179  0.017662    1054.1
AAA       268     198.7     11       568     17  0.021983  0.041454    1025.3
GGA       239     192.7     11       183     17  0.019605  0.013356    1002.7
TCC       222     174.6     11       167     17  0.018210  0.012188     996.6
TTC       235     183.6     11       236     17  0.019277  0.017224     946.2
GCA       234     184.3     11       236     17  0.019194  0.017224     915.3
GAA       243     195.7     11       239     17  0.019933  0.017443     885.2
AGC       229     184.3     11       207     17  0.018784  0.015107     847.7
GCT       227     182.5     11       222     17  0.018620  0.016202     805.0
ATC       223     185.4     11       204     17  0.018292  0.014888     695.8
CAT       224     185.4     11       233     17  0.018374  0.017005     675.8
GAG       223     192.7     11       161     17  0.018292  0.011750     627.2
CAA       228     187.2     11       315     17  0.018702  0.022989     620.2
ATG       226     193.8     11       247     17  0.018538  0.018027     527.2
AAG       227     195.7     11       273     17  0.018620  0.019924     505.0
GCC       202     173.6     11       215     17  0.016570  0.015691     456.8
TCA       210     185.4     11       210     17  0.017226  0.015326     401.4
GAT       214     193.8     11       204     17  0.017554  0.014888     349.7
CGA       202     184.3     11       184     17  0.016570  0.013429     293.3
ATT       216     194.9     11       341     17  0.017718  0.024887     277.3
CTT       202     183.6     11       245     17  0.016570  0.017881     272.4
GTG       207     190.9     11       205     17  0.016980  0.014961     265.2
TGA       207     193.8     11       206     17  0.016980  0.015034     220.4
TTG       206     191.9     11       292     17  0.016898  0.021311     184.7
TGT       204     191.9     11       245     17  0.016734  0.017881     177.7
AGG       198     192.7     11       161     17  0.016241  0.011750      94.3
CGC       177     173.6     11       160     17  0.014519  0.011677      59.6
ACA       190     187.2     11       248     17  0.015585  0.018100      35.4
AAT       200     196.8     11       340     17  0.016406  0.024814      33.2
GGC       183     181.5     11       202     17  0.015011  0.014742      18.5

Detailed description of the output data:

The program name and version are shown in the first line:

Oligs2 1.1  Copyright (c) 2005-2006 Softberry

Num seqs=11 Nucleotides=12191 Average seq length=1108.3
A=25.4% C=23.9% G=25.0% T=25.1% N=0.623411% Other=0.000000%
Output most frequent oligs, direction=direct, seq_shift=0, seq_step=1
deviation multiplier=0.000000

This is the header of the first program run. It gives information on the 1st input file:
Number of FASTA sequences - 11
Number of nucleotides - 12191
Average length of sequence - 1108.3


Num seqs=17 Nucleotides=13702 Average seq length=806.0
A=28.8% C=21.4% G=21.8% T=28.0% N=0.000000% Other=0.000000%
Output most frequent oligs, direction=direct, seq_shift=0, seq_step=1
all by distant

This is the header of the second program run. It gives information on the 2nd input file:
Number of FASTA sequences - 17
Number of nucleotides - 13702
Average length of sequence - 806.0
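For readers who post-process the output, the two header lines can be read with a small Python sketch like the one below. The regular expression and function name are assumptions based only on the sample lines shown in this document, not a documented file format.

```python
import re

# Pattern matching the first header line of each program run in the sample.
HEADER_RE = re.compile(
    r"Num seqs=(\d+) Nucleotides=(\d+) Average seq length=([\d.]+)"
)


def parse_header(line):
    """Return (number of sequences, number of nucleotides, average length)."""
    match = HEADER_RE.search(line)
    if match is None:
        raise ValueError(f"not a header line: {line!r}")
    return int(match.group(1)), int(match.group(2)), float(match.group(3))


seqs1, nuc1, avg1 = parse_header(
    "Num seqs=11 Nucleotides=12191 Average seq length=1108.3"
)
seqs2, nuc2, avg2 = parse_header(
    "Num seqs=17 Nucleotides=13702 Average seq length=806.0"
)

# The average length is the nucleotide total divided by the sequence count.
print(round(nuc1 / seqs1, 1), round(nuc2 / seqs2, 1))
# Ratio of nucleotide totals, as used in the "sorter" calculation.
print(nuc1 / nuc2)
```

The two nucleotide totals are the inputs to div_sum_len described in the "Algorithm" section.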


#olig,total olig counter1,expected number1,unique sequences counter1,total olig counter2,
unique sequences counter2,norm deviate1,norm deviate 2,sorter

Next, the column legend for the tables of oligos is shown:
1 column - the oligo itself (olig)
2 column - counter for the current oligo in the 1st file, i.e. how many times this oligo occurs in the 1st file (total olig counter1)
3 column - expected counter for the 1st file, i.e. the expected mean number of occurrences of this oligo in the 1st file (expected number1)
4 column - number of sequences from the 1st file in which this oligo occurs (unique sequences counter1)
5 column - counter for the current oligo in the 2nd file, i.e. how many times this oligo occurs in the 2nd file (total olig counter2)
6 column - number of sequences from the 2nd file in which this oligo occurs (unique sequences counter2)
7 column - normalized deviation of this oligo for the 1st file (norm deviate1)
8 column - normalized deviation of this oligo for the 2nd file (norm deviate 2)
9 column - "sorter" value for the current oligo (sorter)
For more details on how the values are calculated, see the "Algorithm" section.

Next come the tables of oligos, one table for each oligo length. Each table starts with a line giving the length of its oligos. Example for the table of oligos of length 3:
Length 3


CTG       260     182.5     11       210     17  0.021327  0.015326    1803.2
TTT       278     193.0     11       482     17  0.022804  0.035177    1420.5
CAG       247     184.3     11       207     17  0.020261  0.015107    1358.9

The table itself follows, sorted in descending order of the 9th column.
The columns are described above.
Description of the first line:
1 column - the oligo 'CTG'
2 column - counter for the current oligo in the 1st file, 260
3 column - expected counter for the 1st file, 182.5
4 column - number of sequences from the 1st file in which this oligo occurs, 11
5 column - counter for the current oligo in the 2nd file, 210
6 column - number of sequences from the 2nd file in which this oligo occurs, 17
7 column - normalized deviation of this oligo for the 1st file, 0.021327
8 column - normalized deviation of this oligo for the 2nd file, 0.015326
9 column - "sorter" value for the current oligo, 1803.2
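A table line can be split into the nine columns described above with a short Python sketch. The class and field names are hypothetical, and whitespace-separated columns are assumed from the sample output.

```python
from typing import NamedTuple


class OligRow(NamedTuple):
    olig: str            # column 1
    count1: int          # column 2, total olig counter1
    expected1: float     # column 3, expected number1
    seqs1: int           # column 4, unique sequences counter1
    count2: int          # column 5, total olig counter2
    seqs2: int           # column 6, unique sequences counter2
    norm_dev1: float     # column 7, norm deviate1
    norm_dev2: float     # column 8, norm deviate 2
    sorter: float        # column 9


def parse_row(line: str) -> OligRow:
    """Split one whitespace-separated table line into typed columns."""
    f = line.split()
    if len(f) != 9:
        raise ValueError(f"expected 9 columns, got {len(f)}")
    return OligRow(f[0], int(f[1]), float(f[2]), int(f[3]), int(f[4]),
                   int(f[5]), float(f[6]), float(f[7]), float(f[8]))


row = parse_row(
    "CTG       260     182.5     11       210     17  0.021327  0.015326    1803.2"
)
print(row.olig, row.count1, row.count2, row.sorter)
```

In this sample, columns 7 and 8 coincide with the oligo count divided by the total number of nucleotides in the corresponding file (260/12191 and 210/13702). This is an observation about the example values, not a documented definition.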