Oligs2
Searches for oligos (4-nucleotide oligos) that occur often in the 1st file and whose counts differ significantly between the 1st and 2nd sequence files.
The input files should be in FASTA format and may contain several sequences. Alphabet: the allowed symbols are "ACGTUacgtu" and "NnyYrRBbDdHhKkWwSsMmVv". The symbols "0123456789; \n\r\t\0-" are skipped. All other symbols are not allowed.
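The alphabet rule above can be sketched as a small filter. This is only an illustration of the rule as stated, not the program's own code; the names `ALLOWED`, `SKIPPED` and `clean_sequence` are hypothetical, and the function is meant for sequence text only (FASTA header lines are assumed to be handled separately).

```python
# Symbol sets copied from the description above.
ALLOWED = set("ACGTUacgtu" "NnyYrRBbDdHhKkWwSsMmVv")
SKIPPED = set("0123456789; \n\r\t\0-")


def clean_sequence(raw: str) -> str:
    """Keep allowed symbols, drop skipped ones, reject everything else."""
    kept = []
    for pos, ch in enumerate(raw):
        if ch in ALLOWED:
            kept.append(ch)
        elif ch in SKIPPED:
            continue  # digits, whitespace, ';', '-' and NUL are ignored
        else:
            raise ValueError(f"symbol {ch!r} at position {pos} is not allowed")
    return "".join(kept)


if __name__ == "__main__":
    print(clean_sequence("ACG T-12\nnn"))
```

Raising an error on a disallowed symbol is an assumption made for this sketch; the description does not say how the program itself reports such input.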
The program processes all oligonucleotides of length L, where L takes every value in the range from L1 to L2. | ||
Minimal olig length (L1) | - | Minimal olig length |
Maximal olig length (L2) | - | Maximal olig length |
Restrictions for L1, L2: 1<=L1 && L1<=L2 && L2<=13. | ||
The computer must have enough memory installed; the amount required depends on the oligo length. | ||
Input file 1 | - | The first input file in FASTA-format. |
Input file 2 | - | The second input file in FASTA-format. |
Coefficient k defines which of the two files is more important when the found oligos are sorted. It affects only the sorting order of the found oligos. The default value 1.0 means equal importance. If k is greater than 1.0, the first file is more important; if it is less than 1.0, the second file is more important. | ||
Coefficient k | - | Relative importance of the two input files when sorting oligos (default 1.0) |
Output file | - | Output file's name. |
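The restriction on L1 and L2 and the remark about memory can be illustrated with a short sketch. The function names are hypothetical. The count of 4^L distinct oligos over the four-letter alphabet A, C, G, T is a plain combinatorial fact; how Oligs2 actually stores its counters is not described here, so the figures only indicate how quickly the table of possible oligos grows with length.

```python
MAX_OLIG_LENGTH = 13  # upper bound for L2 stated in the restrictions


def check_length_range(l1: int, l2: int) -> None:
    """Raise ValueError unless 1 <= L1 <= L2 <= 13."""
    if not (1 <= l1 <= l2 <= MAX_OLIG_LENGTH):
        raise ValueError(f"invalid range: need 1 <= L1 <= L2 <= 13, got L1={l1}, L2={l2}")


def distinct_oligos(length: int) -> int:
    """Number of different oligos of the given length over A, C, G, T."""
    return 4 ** length


if __name__ == "__main__":
    l1, l2 = 2, 13
    check_length_range(l1, l2)
    for length in range(l1, l2 + 1):
        print(length, distinct_oligos(length))
```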
For the 1st input file, the oligs program searches for the most frequent oligos with deviation multiplier = 0.0.
The result is saved in a temporary file.
For the 2nd input file, the oligs program is run with the "Print all oligs" option to find all oligos.
The result is saved in a second temporary file.
It is important to find every oligo in the 2nd file, because an oligo present in the 1st file may
occur only rarely in the 2nd file; without the complete list, the oligo counts
in the two files could not be compared correctly.
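The counting done in these two runs can be pictured with the following sketch. It is not the oligs program itself: the function name `count_oligos` is hypothetical, the scan is over the direct strand with a step of 1 (matching "direction=direct" and "seq_step=1" in the sample output), and skipping windows that contain anything other than A, C, G, T is an assumption made for this example.

```python
from collections import Counter


def count_oligos(sequences, length):
    """Return (total occurrences, number of sequences containing) per oligo."""
    total = Counter()    # how many times each oligo occurs in the file
    in_seqs = Counter()  # in how many sequences each oligo occurs
    for seq in sequences:
        seq = seq.upper()
        seen = set()
        for i in range(len(seq) - length + 1):
            olig = seq[i:i + length]
            # Assumption: windows with ambiguous symbols are not counted.
            if set(olig) <= set("ACGT"):
                total[olig] += 1
                seen.add(olig)
        in_seqs.update(seen)
    return total, in_seqs


if __name__ == "__main__":
    total, in_seqs = count_oligos(["ACGACG", "ACGT"], 3)
    for olig in sorted(total):
        print(olig, total[olig], in_seqs[olig])
```

Because a `Counter` returns 0 for a missing key, an oligo that is frequent in the 1st file but absent from the 2nd can still be looked up, which is the reason the full list is needed for the 2nd file.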
For every oligo in the 1st temporary file, the program searches for its counterpart in the 2nd temporary file.
For each oligo (taken from the 1st file) the program calculates the "sorter" value.
div_sum_len - the ratio of the numbers of nucleotides in the two files:
div_sum_len = number of nucleotides in the 1st file / number of nucleotides in the 2nd file;
k - coefficient k, an input parameter.
olig1_count - how many times the oligo occurs in the 1st file.
olig2_count - how many times the oligo occurs in the 2nd file.
z = 0.5*olig1_count*(1 + k*olig1_count/(olig2_count*div_sum_len))
olig1_derivat_mult - the "deviation multiplier" value for the oligo from the 1st temporary file.
sorter = olig1_derivat_mult*z;
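The formulas above translate directly into code. The sketch below uses the variable names from the text; the function name `sorter_value` is hypothetical. How olig1_derivat_mult is obtained is not defined in this section, so it is passed in as an argument.

```python
def sorter_value(olig1_count, olig2_count, nucleotides1, nucleotides2,
                 olig1_derivat_mult, k=1.0):
    """Compute the "sorter" value for one oligo from the formulas above."""
    # Ratio of the file sizes, in nucleotides.
    div_sum_len = nucleotides1 / nucleotides2
    # Note: olig2_count == 0 raises ZeroDivisionError here; the text does not
    # say how the program treats an oligo that is absent from the 2nd file.
    z = 0.5 * olig1_count * (1 + k * olig1_count / (olig2_count * div_sum_len))
    return olig1_derivat_mult * z


if __name__ == "__main__":
    # Counts for an oligo seen 260 times in a 12191-nucleotide file and
    # 210 times in a 13702-nucleotide file; a multiplier of 1.0 yields z itself.
    print(round(sorter_value(260, 210, 12191, 13702, 1.0), 2))
```

With the multiplier fixed, raising k above 1.0 increases the weight of the term that compares the two files, so the result grows for oligos that are relatively more frequent in the 1st file.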
The program prints the title from the 1st temporary file, then the title from the 2nd one,
and then all oligos in descending order of "sorter".
Example of program output:
Oligs2 1.1 Copyright (c) 2005-2006 Softberry
Num seqs=11 Nucleotides=12191 Average seq length=1108.3 A=25.4% C=23.9% G=25.0% T=25.1% N=0.623411% Other=0.000000% Output most frequent oligs, direction=direct, seq_shift=0, seq_step=1 deviation multiplier=0.000000
Num seqs=17 Nucleotides=13702 Average seq length=806.0 A=28.8% C=21.4% G=21.8% T=28.0% N=0.000000% Other=0.000000% Output most frequent oligs, direction=direct, seq_shift=0, seq_step=1 all by distant
#olig,total olig counter1,expected number1,unique sequences counter1,total olig counter2, unique sequences counter2,norm deviate1,norm deviate 2,sorter
Length 2
TG 899 764.6 11 954 17 0.073743 0.069625 4627.9
CA 873 738.4 11 927 17 0.071610 0.067654 4582.5
GC 832 727.2 11 830 17 0.068247 0.060575 3538.7
TT 871 768.9 11 1296 17 0.071446 0.094585 2905.0
AA 875 784.0 11 1414 17 0.071774 0.103197 2522.1
GA 842 772.1 11 759 17 0.069067 0.055393 2459.4
TC 788 731.2 11 744 17 0.064638 0.054299 1898.7
AT 804 776.4 11 1067 17 0.065950 0.077872 742.5
AG 786 772.1 11 755 17 0.064474 0.055101 426.4
Length 3
CTG 260 182.5 11 210 17 0.021327 0.015326 1803.2
TTT 278 193.0 11 482 17 0.022804 0.035177 1420.5
CAG 247 184.3 11 207 17 0.020261 0.015107 1358.9
CCA 237 176.3 11 232 17 0.019441 0.016932 1171.0
TGC 242 182.5 11 261 17 0.019851 0.019048 1087.2
TGG 246 190.9 11 242 17 0.020179 0.017662 1054.1
AAA 268 198.7 11 568 17 0.021983 0.041454 1025.3
GGA 239 192.7 11 183 17 0.019605 0.013356 1002.7
TCC 222 174.6 11 167 17 0.018210 0.012188 996.6
TTC 235 183.6 11 236 17 0.019277 0.017224 946.2
GCA 234 184.3 11 236 17 0.019194 0.017224 915.3
GAA 243 195.7 11 239 17 0.019933 0.017443 885.2
AGC 229 184.3 11 207 17 0.018784 0.015107 847.7
GCT 227 182.5 11 222 17 0.018620 0.016202 805.0
ATC 223 185.4 11 204 17 0.018292 0.014888 695.8
CAT 224 185.4 11 233 17 0.018374 0.017005 675.8
GAG 223 192.7 11 161 17 0.018292 0.011750 627.2
CAA 228 187.2 11 315 17 0.018702 0.022989 620.2
ATG 226 193.8 11 247 17 0.018538 0.018027 527.2
AAG 227 195.7 11 273 17 0.018620 0.019924 505.0
GCC 202 173.6 11 215 17 0.016570 0.015691 456.8
TCA 210 185.4 11 210 17 0.017226 0.015326 401.4
GAT 214 193.8 11 204 17 0.017554 0.014888 349.7
CGA 202 184.3 11 184 17 0.016570 0.013429 293.3
ATT 216 194.9 11 341 17 0.017718 0.024887 277.3
CTT 202 183.6 11 245 17 0.016570 0.017881 272.4
GTG 207 190.9 11 205 17 0.016980 0.014961 265.2
TGA 207 193.8 11 206 17 0.016980 0.015034 220.4
TTG 206 191.9 11 292 17 0.016898 0.021311 184.7
TGT 204 191.9 11 245 17 0.016734 0.017881 177.7
AGG 198 192.7 11 161 17 0.016241 0.011750 94.3
CGC 177 173.6 11 160 17 0.014519 0.011677 59.6
ACA 190 187.2 11 248 17 0.015585 0.018100 35.4
AAT 200 196.8 11 340 17 0.016406 0.024814 33.2
GGC 183 181.5 11 202 17 0.015011 0.014742 18.5
The program name and version are shown in the first line:
Oligs2 1.1 Copyright (c) 2005-2006 Softberry
Num seqs=11 Nucleotides=12191 Average seq length=1108.3 A=25.4% C=23.9% G=25.0% T=25.1% N=0.623411% Other=0.000000% Output most frequent oligs, direction=direct, seq_shift=0, seq_step=1 deviation multiplier=0.000000
This is the title of the first program run. It gives information on the 1st input file:
Number of FASTA sequences - 11
Number of nucleotides - 12191
Average sequence length - 1108.3
Num seqs=17 Nucleotides=13702 Average seq length=806.0 A=28.8% C=21.4% G=21.8% T=28.0% N=0.000000% Other=0.000000% Output most frequent oligs, direction=direct, seq_shift=0, seq_step=1 all by distant
This is the title of the second program run. It gives information on the 2nd input file:
Number of FASTA sequences - 17
Number of nucleotides - 13702
Average sequence length - 806.0
#olig,total olig counter1,expected number1,unique sequences counter1,total olig counter2, unique sequences counter2,norm deviate1,norm deviate 2,sorter
Next, the header line naming the columns of the oligo tables is shown. The columns are:
Column 1 - the oligo itself (olig)
Column 2 - counter for the current oligo in the 1st file, i.e. how many times this oligo occurs in the 1st file (total olig counter1)
Column 3 - expected count for the 1st file, i.e. the expected average number of occurrences of this oligo in the 1st file (expected number1)
Column 4 - number of sequences from the 1st file in which this oligo occurs (unique sequences counter1)
Column 5 - counter for the current oligo in the 2nd file, i.e. how many times this oligo occurs in the 2nd file (total olig counter2)
Column 6 - number of sequences from the 2nd file in which this oligo occurs (unique sequences counter2)
Column 7 - normalized deviation of this oligo for the 1st file (norm deviate1)
Column 8 - normalized deviation of this oligo for the 2nd file (norm deviate 2)
Column 9 - "sorter" value for the current oligo (sorter)
For more details on how various values are calculated see chapter "algorithm".
Length 3
Next come the tables of oligos of different lengths.
Example of a table of oligos of length 3:
The line above gives the length of the oligos in the table (Length 3).
CTG 260 182.5 11 210 17 0.021327 0.015326 1803.2
TTT 278 193.0 11 482 17 0.022804 0.035177 1420.5
CAG 247 184.3 11 207 17 0.020261 0.015107 1358.9
Next comes the table itself, sorted in descending order of the 9th column.
The columns are described above.
Description of the first row:
Column 1 - the oligo, 'CTG'
Column 2 - counter for the current oligo in the 1st file, 260
Column 3 - expected count for the 1st file, 182.5
Column 4 - number of sequences from the 1st file in which this oligo occurs, 11
Column 5 - counter for the current oligo in the 2nd file, 210
Column 6 - number of sequences from the 2nd file in which this oligo occurs, 17
Column 7 - normalized deviation of this oligo for the 1st file, 0.021327
Column 8 - normalized deviation of this oligo for the 2nd file, 0.015326
Column 9 - "sorter" value for the current oligo, 1803.2
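For downstream processing, a table row can be split into the nine columns described above. The sketch below assumes that each row sits on its own line with whitespace-separated fields, as in the example; the class and field names are hypothetical and chosen only to mirror the column descriptions.

```python
from typing import NamedTuple


class OligRow(NamedTuple):
    olig: str             # column 1
    count1: int           # column 2, total olig counter1
    expected1: float      # column 3, expected number1
    seqs1: int            # column 4, unique sequences counter1
    count2: int           # column 5, total olig counter2
    seqs2: int            # column 6, unique sequences counter2
    norm_deviate1: float  # column 7
    norm_deviate2: float  # column 8
    sorter: float         # column 9


def parse_row(line: str) -> OligRow:
    """Parse one whitespace-separated table row into an OligRow."""
    f = line.split()
    if len(f) != 9:
        raise ValueError(f"expected 9 fields, got {len(f)}: {line!r}")
    return OligRow(f[0], int(f[1]), float(f[2]), int(f[3]), int(f[4]),
                   int(f[5]), float(f[6]), float(f[7]), float(f[8]))


if __name__ == "__main__":
    row = parse_row("CTG 260 182.5 11 210 17 0.021327 0.015326 1803.2")
    print(row.olig, row.count1, row.count2, row.sorter)
```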