OligsR

The program computes statistics on redundant oligos (15-mer oligos) and reports the oligos whose counts differ significantly from the expected mean.

Input data

The input file should be in FASTA format and may contain several sequences. Alphabet: the allowed symbols are "ACGTUacgtu" and "NnyYrRBbDdHhKkWwSsMmVv". The symbols "0123456789; \n\r\t\0-" are skipped. All other symbols are not allowed.
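The alphabet rules above can be sketched as a small filter. This is only an illustration, not the program's own code: the function name clean_sequence is hypothetical, and the sketch assumes it is applied to sequence lines only (not to FASTA header lines).

```python
# Symbol classes copied from the text above.
ALLOWED = set("ACGTUacgtu" "NnyYrRBbDdHhKkWwSsMmVv")
SKIPPED = set("0123456789; \n\r\t\0-")


def clean_sequence(raw: str) -> str:
    """Drop skipped symbols, keep allowed ones, reject everything else."""
    kept = []
    for pos, ch in enumerate(raw):
        if ch in ALLOWED:
            kept.append(ch)
        elif ch in SKIPPED:
            continue  # digits, separators and whitespace are ignored
        else:
            raise ValueError(f"symbol {ch!r} at offset {pos} is not allowed")
    return "".join(kept)
```

For example, clean_sequence("ACG T-12\nnN") keeps only the six sequence letters, while a string containing 'X' raises an error.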

Input parameters

The program processes all oligonucleotides of length L, where L takes every value in the range from L1 to L2.
Minimal olig length (L1) - Minimal olig length
Maximal olig length (L2) - Maximal olig length
Restrictions for L1, L2: 1<=L1 && L1<=L2 && L2<=6.
The computer must have enough memory installed; the amount required depends on the oligo length.
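To give a rough feel for how the counter table could grow with L, here is a sketch under a stated assumption: one counter per possible redundant oligo over a 15-symbol alphabet (A, C, G, T plus the 11 ambiguity codes). The actual memory layout of the program is not documented here, and the names check_lengths and table_sizes are hypothetical.

```python
ALPHABET_SIZE = 15  # assumption: A, C, G, T plus 11 ambiguity codes


def check_lengths(l1: int, l2: int) -> None:
    """Enforce the documented restriction 1 <= L1 <= L2 <= 6."""
    if not (1 <= l1 <= l2 <= 6):
        raise ValueError("expected 1 <= L1 <= L2 <= 6")


def table_sizes(l1: int, l2: int) -> dict:
    """Number of counters per oligo length under the assumption above."""
    check_lengths(l1, l2)
    return {length: ALPHABET_SIZE ** length for length in range(l1, l2 + 1)}
```

Under this assumption the table for L=6 would hold 15**6 = 11390625 counters, compared with 225 for L=2.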
 
Input file - Input file in FASTA-format.
 
A special mode prints all oligos, ignoring any additional conditions. In this mode a very large output file can be generated.
Print all oligs - Print all oligs, ignore conditions
 
The program can process not only the given sequence but also, in the same run, build and process the reverse sequence.
Scan target sequence in different chain - Scan target sequence in different chain:
In direct chain only (default)
In reverse chain only
In both chains
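As an illustration only, the sketch below assumes that the "reverse chain" is the reverse complement of the input sequence; the text does not state this explicitly. The function name reverse_complement is hypothetical; the complement table follows the standard IUPAC nucleotide codes.

```python
# Standard IUPAC complements: A<->T, C<->G, R<->Y, K<->M, B<->V, D<->H;
# S, W and N are their own complements. U is treated like T.
COMPLEMENT = str.maketrans("ACGTURYKMSWBDHVN", "TGCAAYRMKSWVHDBN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-cased IUPAC sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]
```

With this table, an oligo such as TKB read on the opposite chain becomes VMA.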
 
As with the two tails of a normal distribution, the program can output either the most frequent oligos or the rarest ones. The following parameter selects which:
Frequency - Most frequent or least frequent:
most frequent (default)
least frequent
 
To determine which oligos are output and which are not, a threshold value for the deviation multiplier must be defined.
The deviation multiplier is the difference between the observed and the expected number of occurrences of an oligo, expressed in sigma units. For more details see the algorithm description chapter.
Deviation multiplier fence - Use the value 3.0 to output 5% of oligos.
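A minimal sketch of how the fence could be applied, assuming inclusive comparisons and a mirrored (negative) threshold for the least frequent mode. The function name passes_fence and its parameters are hypothetical.

```python
def passes_fence(deviation_multiplier: float, fence: float,
                 most_frequent: bool = True) -> bool:
    """Decide whether an oligo is extreme enough to be output.

    Assumption: over-represented oligos need a multiplier >= fence,
    under-represented ones need a multiplier <= -fence.
    """
    if most_frequent:
        return deviation_multiplier >= fence
    return deviation_multiplier <= -fence
```

For instance, with a fence of 10.0 in most frequent mode, an oligo with a multiplier of 13.2 is kept and one with 9.9 is dropped.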
 
When oligos are output, an additional filter is applied. For each oligo, the percentage of 'N' letters among all letters of the oligo is calculated. Oligos for which this percentage does not exceed the "Percent of N" parameter are output.
Percent of N - Olig have no more # % of 'N', default is 100.
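The "Percent of N" filter described above can be sketched as follows. The function name passes_n_filter is hypothetical, and counting a lower-case 'n' as 'N' is an assumption.

```python
def passes_n_filter(oligo: str, max_percent_n: float = 100.0) -> bool:
    """True if the share of 'N' letters does not exceed max_percent_n."""
    n_count = sum(1 for ch in oligo if ch in "Nn")
    percent_n = 100.0 * n_count / len(oligo)
    return percent_n <= max_percent_n
```

With a limit of 50, the oligo TKN (one 'N' out of three letters) passes, while NNA (two out of three) does not.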
 
Output file - Output file name.
 
The "shift" parameter sets the value (in nucleotides) of shifting from the sequence start to the position from which oligos are to be generated. If there are several sequences in a file, the shift value affects each of them. The default value is 0.
Shift in sequence - Shift in sequence, default value is 0.
 
The "step" parameter sets the value (in nucleotides) of shifting for generating oligos. In order to get all oligos, this parameter should be set to 1, which is default value.
Step in sequence - Step in sequence (default value is 1)
Sometimes it is necessary to check all three reading frames. To do this, run the program three times with the following values for "shift" and "step":
1) step=3 shift=0
2) step=3 shift=1
3) step=3 shift=2
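The effect of "shift" and "step" can be illustrated with a short sketch. It assumes that oligos start at positions shift, shift+step, shift+2*step and so on, and that only complete oligos are taken; the function name oligos is hypothetical.

```python
def oligos(seq: str, length: int, shift: int = 0, step: int = 1) -> list:
    """List the oligos of the given length starting at shift, every step."""
    last_start = len(seq) - length  # last position where a full oligo fits
    return [seq[i:i + length] for i in range(shift, last_start + 1, step)]


# The three reading frames of a short hypothetical sequence:
for frame_shift in (0, 1, 2):
    print(frame_shift, oligos("ATGCGTAA", 3, shift=frame_shift, step=3))
```

With step=1 and shift=0 the same function returns every overlapping oligo of the sequence.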
 
Input sequences may be either in FASTA format or in a specially packed format. Softberry products frequently pack large chromosomes into their own "nucfile" (nf) format. A sequence file in this format has the .nf extension.
If the "Packed file" parameter is not defined, the program treats the input file as FASTA. Otherwise the input file is treated as a "nucfile".
Packed file - Input file is packed file (nucfile, nf).
A FASTA file can be converted to a nucfile using the cvtseq utility.
For example, to convert the FASTA file chr22.fa to the nucfile chr22.nf, use the following command line:

cvtseq chr22.fa chr22.nf -fi -do -t "chr22" -n5gc

Use the following command to check the information on a packed file:


cvtseq chr22.nf -e

Command output:


filename: chr22.nf
pack_mode: PACK_5

size: 49476972 from: 0 nonstandard: 1
title_size: 5 title: chr22

Algorithm

For each defined L, an array of oligo counters is built. The sequential number of an oligo is used as its index in this array, and the value stored at that index is the total number of occurrences of that oligo.
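One possible way to build such an array is sketched below. Everything specific in it is an assumption made for illustration: the symbol order used for numbering, the base-15 index, and the idea that each window of the sequence increments the counter of every redundant oligo it matches. The sketch also assumes the input contains only upper-case A, C, G and T; the names oligo_index and count_oligos are hypothetical.

```python
from itertools import product

# IUPAC codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "K": "GT", "M": "AC", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
SYMBOLS = "".join(IUPAC)  # hypothetical numbering order: ACGTRYKMSWBDHVN
# For each plain base, the codes that can stand for it (8 codes per base).
CODES_FOR_BASE = {b: [s for s in SYMBOLS if b in IUPAC[s]] for b in "ACGT"}


def oligo_index(oligo) -> int:
    """Sequential number of an oligo, read as a base-15 number."""
    index = 0
    for symbol in oligo:
        index = index * len(SYMBOLS) + SYMBOLS.index(symbol)
    return index


def count_oligos(seq: str, length: int) -> list:
    """Count, for every redundant oligo, the windows of seq that match it."""
    counts = [0] * (len(SYMBOLS) ** length)
    for start in range(len(seq) - length + 1):
        window = seq[start:start + length]
        # Every combination of codes covering the window's bases matches it.
        for pattern in product(*(CODES_FOR_BASE[b] for b in window)):
            counts[oligo_index(pattern)] += 1
    return counts
```

For the short sequence TGTT and length 2, the windows are TG, GT and TT, so the counter for TK (T followed by G or T) ends up at 2. Expanding each window into all matching patterns is simple but grows quickly with the oligo length, so a real implementation may well be organized differently.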

Then, using this array and the defined parameters, the program builds a table of oligos that contains more information (mean, deviation multiplier, etc.). This table is printed to the output file.

Total number of all oligos - oligs_sum_count.
Total number of nucleotides - seqs_sum_length.
The frequency of an oligo is the product of the frequencies of the nucleotides it consists of.
The expected mean of the counter (the expected number of occurrences of the oligo) is calculated as follows:
average = oligs_sum_count*frequence;
The deviation is calculated with the formula:
deviation = sqrt( oligs_sum_count*frequence*(1-frequence) );
The oligo's counter, olig_count, is the number of times this oligo occurs in the sequences.
The deviation multiplier is calculated with the formula:
Deviation_multiplier = (olig_count-average)/deviation;
The normalized deviation (norm deviate) of the given oligo is calculated with the formula:
Norm_deviate = olig_count/seqs_sum_length;
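The formulas above can be checked against the TK row of the example output in the "Output data" chapter. The sketch below uses the rounded percentages printed in that example (T=25.4%, GT=50.2%), so it only approximately reproduces the printed values. It also assumes that oligs_sum_count for length 2 equals the number of nucleotides minus one per sequence (46705 - 32 = 46673); this is not stated in the text. The function name oligo_stats is hypothetical.

```python
from math import sqrt


def oligo_stats(olig_count, frequence, oligs_sum_count, seqs_sum_length):
    """Apply the four formulas from the text to one oligo."""
    average = oligs_sum_count * frequence
    deviation = sqrt(oligs_sum_count * frequence * (1 - frequence))
    deviation_multiplier = (olig_count - average) / deviation
    norm_deviate = olig_count / seqs_sum_length
    return average, deviation, deviation_multiplier, norm_deviate


# Oligo "TK": T followed by K (G or T), counted 6906 times in the example.
frequence_tk = 0.254 * 0.502  # product of the symbol frequencies
average, deviation, multiplier, norm = oligo_stats(
    olig_count=6906,
    frequence=frequence_tk,
    oligs_sum_count=46705 - 32,  # assumption, see the note above
    seqs_sum_length=46705,
)
print(f"{average:.1f} {deviation:.1f} {multiplier:.1f} {norm:.6f}")
```

The results come out close to the printed row (5952.4, 72.1, 13.2, 0.147864); small differences in the first columns are expected because the input percentages are rounded.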

Output data

Example for program output:


Oligsr 1.4  Copyright (c) 2005-2006 Softberry
Num seqs=32 Nucleotides=46705 Average seq length=1459.5
A=25.1% C=24.7% G=24.8% T=25.4%
AC=49.8% AG=49.9% AT=50.5% CG=49.5% CT=50.1% GT=50.2%
ACG=74.6% ACT=75.2% AGT=75.3% CGT=74.9% N=100.0%
Output most frequent oligs, direction=direct, deviation multiplier=10.000000, no more 50.0 % of 'N'
#olig,total olig counter,expected number,deviation,deviation multiplier,unique sequences counter,norm deviate
Length 1

Length 2
TK      6906    5952.4     72.1     13.2     32  0.147864
TG      3544    2939.8     52.5     11.5     32  0.075881
MA      6654    5834.8     71.5     11.5     32  0.142469
GC      3409    2858.0     51.8     10.6     32  0.072990

Length 3
TKB      5574    4455.2     63.5     17.6     32  0.119345
VMA      5390    4349.4     62.8     16.6     32  0.115405
TKS      3731    2943.9     52.5     15.0     32  0.079884
YTK      3772    2980.5     52.8     15.0     32  0.080762
TGS      1993    1453.9     37.5     14.4     32  0.042672
TBB      7724    6647.3     75.5     14.3     32  0.165378
VMW      9944    8751.5     84.3     14.1     32  0.212911
MMA      3639    2903.8     52.2     14.1     32  0.077915
MAR      3639    2909.2     52.2     14.0     32  0.077915
VVA      7555    6514.6     74.9     13.9     32  0.161760
WKB     10034    8857.0     84.7     13.9     32  0.214838
TKY      3711    2980.5     52.8     13.8     32  0.079456
BTK      5330    4455.2     63.5     13.8     32  0.114121
YTB      5315    4447.0     63.4     13.7     32  0.113799
HTK      5343    4473.6     63.6     13.7     32  0.114399
VAR      5214    4357.4     62.9     13.6     32  0.111637
TKK      3706    2986.0     52.9     13.6     32  0.079349
TGB      2820    2200.3     45.8     13.5     32  0.060379
GCH      2754    2148.0     45.3     13.4     32  0.058966
WGC      1942    1442.6     37.4     13.4     32  0.041580
TKN      6904    5948.3     72.0     13.3     32  0.147821
NTK      6901    5948.3     72.0     13.2     32  0.147757
CWG      1936    1442.6     37.4     13.2     32  0.041452
GCW      1936    1442.6     37.4     13.2     32  0.041452
YKB      9894    8786.4     84.4     13.1     32  0.211840
RMA      3590    2909.2     52.2     13.0     32  0.076865
MAV      5157    4349.4     62.8     12.9     32  0.110416
RMW      6771    5853.7     71.5     12.8     32  0.144974
SMA      3551    2885.7     52.0     12.8     32  0.076030
WKS      6767    5852.6     71.5     12.8     32  0.144888
SCW      3540    2879.8     52.0     12.7     32  0.075795
YKS      6708    5806.0     71.3     12.7     32  0.143625
SWG      3548    2890.5     52.1     12.6     32  0.075966
MAA      1937    1463.7     37.7     12.6     32  0.041473
WGS      3545    2890.5     52.1     12.6     32  0.075902
VMR      9694    8645.0     83.9     12.5     32  0.207558
TBS      5180    4392.5     63.1     12.5     32  0.110909
DGC      2716    2150.6     45.3     12.5     32  0.058152
TGC      1057     725.6     26.7     12.4     32  0.022631
VMD     14248   13047.1     96.9     12.4     32  0.305064
HKS      9744    8714.7     84.2     12.2     32  0.208629
SCA      1886    1431.2     37.2     12.2     32  0.040381
YTG      1932    1472.0     37.8     12.2     32  0.041366
BTG      2755    2200.3     45.8     12.1     32  0.058987
TBY      5213    4447.0     63.4     12.1     32  0.111615
HTB      7583    6674.9     75.6     12.0     32  0.162359
HKB     14354   13188.3     97.3     12.0     32  0.307333
VWG      5106    4356.5     62.8     11.9     32  0.109324
SMW      6654    5806.4     71.3     11.9     32  0.142469
AAA      1058     737.7     26.9     11.9     32  0.022653
VAD      7463    6576.3     75.2     11.8     32  0.159790
MAD      5129    4390.6     63.1     11.7     32  0.109817
SMD      9638    8656.5     84.0     11.7     32  0.206359
VAA      2723    2192.3     45.7     11.6     32  0.058302
TGN      3542    2937.8     52.5     11.5     32  0.075838
NTG      3542    2937.8     52.5     11.5     32  0.075838
TGV      2715    2191.4     45.7     11.5     32  0.058131
NMA      6648    5830.8     71.4     11.4     32  0.142340
MAN      6647    5830.8     71.4     11.4     32  0.142319
KSC      3450    2862.0     51.8     11.3     32  0.073868
TTK      1943    1511.3     38.2     11.3     32  0.041602
CWS      3466    2879.8     52.0     11.3     32  0.074210
SMR      6535    5735.8     70.9     11.3     32  0.139921
VCA      2667    2157.1     45.4     11.2     32  0.057103
MWG      3494    2908.6     52.2     11.2     32  0.074810
HTG      2719    2209.4     45.9     11.1     32  0.058216
RVA      5055    4357.4     62.9     11.1     32  0.108233
MVA      5045    4349.4     62.8     11.1     32  0.108018
KSH      9645    8714.7     84.2     11.1     32  0.206509
WKY      6717    5925.3     71.9     11.0     32  0.143818
SVA      5010    4322.3     62.6     11.0     32  0.107269
GMW      3481    2908.6     52.2     11.0     32  0.074532
TSC      1858    1448.5     37.5     10.9     32  0.039782
TGY      1884    1472.0     37.8     10.9     32  0.040338
TTB      2754    2254.9     46.3     10.8     32  0.058966
HGC      2632    2148.0     45.3     10.7     32  0.056354
KSY      6568    5806.0     71.3     10.7     32  0.140627
KGC      1831    1433.7     37.3     10.7     32  0.039204
GCN      3407    2856.1     51.8     10.6     32  0.072947
KSM      6527    5770.7     71.1     10.6     32  0.139749
NGC      3406    2856.1     51.8     10.6     32  0.072926
KBB     14164   13133.9     97.1     10.6     32  0.303265
TKC      1868    1469.2     37.7     10.6     32  0.039996
MAM      3455    2903.8     52.2     10.6     32  0.073975
CTG      1005     725.6     26.7     10.5     32  0.021518
KBY      9669    8786.4     84.4     10.5     32  0.207023
TBC      2669    2192.1     45.7     10.4     32  0.057146
VVM     13931   12924.7     96.7     10.4     32  0.298276
VWK      9698    8821.1     84.6     10.4     32  0.207644
TSS      3442    2902.5     52.2     10.3     32  0.073697
TKG      1863    1474.7     37.8     10.3     32  0.039889
VAV      7283    6514.6     74.9     10.3     32  0.155936
MMR      6501    5771.8     71.1     10.3     32  0.139193
YTS      3475    2938.5     52.5     10.2     32  0.074403
DSC      4930    4293.3     62.4     10.2     32  0.105556
BTB      7412    6647.3     75.5     10.1     32  0.158698
WGB      5012    4374.3     63.0     10.1     32  0.107312
CWK      3450    2920.9     52.3     10.1     32  0.073868
WKC      3450    2920.9     52.3     10.1     32  0.073868
VCW      4972    4340.4     62.7     10.1     32  0.106455
RAA      1844    1466.4     37.7     10.0     32  0.039482
VHD     20770   19703.0    106.7     10.0     32  0.444706

Detailed description for output data:

The program name and version are shown in the first line:

Oligsr 1.4  Copyright (c) 2005-2006 Softberry

Num seqs=32 Nucleotides=46705 Average seq length=1459.5
A=25.1% C=24.7% G=24.8% T=25.4%
AC=49.8% AG=49.9% AT=50.5% CG=49.5% CT=50.1% GT=50.2%
ACG=74.6% ACT=75.2% AGT=75.3% CGT=74.9% N=100.0%

Next comes information on the input file:
Number of fasta-sequences - 32
Number of nucleotides - 46705
Average length of sequence - 1459.5
Percentage of letters 'A' - 25.1
Percentage of letters 'C' - 24.7
Percentage of letters 'G' - 24.8
Percentage of letters 'T' - 25.4
Percentage of letters 'A or C' - 49.8
Percentage of letters 'A or G' - 49.9
Percentage of letters 'A or T' - 50.5
Percentage of letters 'C or G' - 49.5
Percentage of letters 'C or T' - 50.1
Percentage of letters 'G or T' - 50.2
Percentage of letters 'A or C or G' - 74.6
Percentage of letters 'A or C or T' - 75.2
Percentage of letters 'A or G or T' - 75.3
Percentage of letters 'C or G or T' - 74.9
Percentage of letters 'A or C or G or T' - 100.0
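The composition lines can be reproduced in a few lines of code. The sketch assumes that the percentages are taken relative to the total number of A, C, G and T letters and that each group percentage is simply the sum of its single-letter percentages, which agrees with the example values (for instance 25.1 + 24.7 = 49.8 for 'A or C'). The function name composition is hypothetical.

```python
from itertools import combinations


def composition(seq: str) -> dict:
    """Percentage of each non-empty group of the letters A, C, G, T."""
    seq = seq.upper()
    total = sum(seq.count(base) for base in "ACGT")  # must be non-zero
    single = {base: 100.0 * seq.count(base) / total for base in "ACGT"}
    report = {}
    for size in range(1, 5):
        for group in combinations("ACGT", size):
            report["".join(group)] = sum(single[base] for base in group)
    return report
```

The result has 15 entries, one per group; the full group ACGT corresponds to the N=100.0% entry of the header.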


Output most frequent oligs, direction=direct, deviation multiplier=10.000000, 
no more 50.0 % of 'N'

Next come the defined input parameters:
To output the most frequent oligos - Output most frequent oligs.
To process the direct chain only - direction=direct
Defined range for deviation multiplier - 10.0
To output oligos containing not more than 50% of letters 'N'.


#olig, total olig counter, expected number, deviation, deviation multiplier, 
unique sequences counter, norm deviate

Next comes the column legend for the table of oligos:
1 column - certain oligo (olig)
2 column - counter for current oligo, i.e. how many times this oligo occurs (total olig counter)
3 column - expected counter mean, i.e. an expected average number of oligos (expected number)
4 column - deviation of current oligo (deviation)
5 column - deviation multiplier value for current oligo (deviation multiplier)
Note that in the given example the deviation multiplier threshold was set to 10.0 and the option to output the most frequent oligos was selected, so the values in the 5th column are greater than or equal to 10.0. With the least frequent option and a threshold of 3.0, for instance, the values would be less than or equal to -3.0.
6 column - number of sequences, in which this oligo occurs.
7 column - normalized deviation of this oligo.
For more details on values calculation see the chapter "Algorithm"

Length 3

Next come the tables of oligos for the various length values.
Below is an excerpt of the table of oligos of length 3.
Each table begins with the length of the examined oligos (Length 3).


TKB      5574    4455.2     63.5     17.6     32  0.119345
VMA      5390    4349.4     62.8     16.6     32  0.115405
TKS      3731    2943.9     52.5     15.0     32  0.079884

The table is sorted by the 5th column in descending order.
If the option to output the least frequent oligos is on, the table is sorted by the 5th column in ascending order.
The values in the columns are described above.
Description of the first line:
1 column - certain oligo 'TKB'
2 column - counter for current oligo 5574
3 column - expected mean for oligo 4455.2
4 column - deviation of current oligo 63.5
5 column - deviation multiplier value for current oligo 17.6
6 column - number of sequences, in which this oligo occurs 32
7 column - normalized deviation of this oligo 0.119345
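For downstream processing, a table row can be split into its seven columns. The sketch assumes whitespace-separated fields exactly as in the example; the names OligoRow and parse_row are hypothetical.

```python
from typing import NamedTuple


class OligoRow(NamedTuple):
    olig: str
    total_count: int
    expected_number: float
    deviation: float
    deviation_multiplier: float
    unique_sequences: int
    norm_deviate: float


def parse_row(line: str) -> OligoRow:
    """Parse one whitespace-separated row of the oligo table."""
    fields = line.split()
    if len(fields) != 7:
        raise ValueError(f"expected 7 columns, got {len(fields)}")
    return OligoRow(
        fields[0], int(fields[1]), float(fields[2]), float(fields[3]),
        float(fields[4]), int(fields[5]), float(fields[6]),
    )


row = parse_row("TKB      5574    4455.2     63.5     17.6     32  0.119345")
print(row.olig, row.total_count, row.deviation_multiplier)
```

Lines such as "Length 3" and the header lines do not have seven columns and would need to be skipped before calling parse_row.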