|
OligsR |
The program performs statistical calculations on redundant oligos (15-mer oligos) and displays the oligos that differ significantly from the expected mean.
The input file should be in FASTA format and may contain several sequences. Alphabet: the allowed symbols are "ACGTUacgtu" and "NnyYrRBbDdHhKkWwSsMmVv"; the symbols to be skipped are "0123456789; \n\r\t\0-". All other symbols are not allowed.
The program processes all oligonucleotides of length L, where L takes every value in the range from L1 to L2. | ||
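The alphabet rules above can be sketched in Python as follows. This is only an illustration, not the program's own code: the function name `clean_sequence` is hypothetical, and converting the result to upper case is an assumption that the text does not state.

```python
# Symbol sets copied from the description of the input alphabet.
ALLOWED = set("ACGTUacgtu") | set("NnyYrRBbDdHhKkWwSsMmVv")
SKIPPED = set("0123456789; \n\r\t\0-")


def clean_sequence(raw: str) -> str:
    """Drop skipped symbols and reject anything outside the allowed alphabet.

    Upper-casing the kept symbols is an assumption made for this sketch.
    """
    kept = []
    for pos, ch in enumerate(raw):
        if ch in SKIPPED:
            continue  # digits, whitespace, ';', '-' and NUL are ignored
        if ch not in ALLOWED:
            raise ValueError(f"symbol {ch!r} at position {pos} is not allowed")
        kept.append(ch.upper())
    return "".join(kept)
```

For instance, a numbered and gapped line such as `"ac-gt 12\nNN"` would be reduced to its nucleotide symbols, while a line containing a letter such as `X` would be rejected.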
Minimal olig length (L1) | - | Minimal olig length |
Maximal olig length (L2) | - | Maximal olig length |
Restrictions for L1, L2: 1<=L1 && L1<=L2 && L2<=6. | ||
The computer must have enough memory installed; the amount required depends on the oligo length. | ||
Input file | - | Input file in FASTA-format. |
A special mode prints all oligos, ignoring any additional conditions. In this mode a very large output file can be generated. | ||
Print all oligs | - | Print all oligs, ignore conditions |
The program can process not only the given sequence but also, in the same run, build and process the reverse sequence. | ||
Scan target sequence in different chain | - | Scan target sequence in different chain: in direct chain only (default); in reverse chain only; in both chains |
As with the two tails of a normal distribution, the program can output either the most frequent oligos or the rarest ones. The following parameter selects which: | ||
Frequency | - | Most frequent or least frequent: most frequent (default); least frequent |
To determine which oligos are output and which are not, a threshold (fence) for the deviation multiplier must be defined. | ||
The deviation multiplier is the difference between the observed number of an oligo and its expected number, expressed in sigma units. For more details see the algorithm description chapter. | ||
Deviation multiplier fence | - | Use the value 3.0 to output 5% of oligos. |
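A minimal sketch of how such a fence could be applied is given below. The function `passes_fence` is hypothetical, and the inclusive comparisons are an assumption based on the sample output, where rows with a multiplier of 10.0 appear for a fence of 10.0.

```python
def passes_fence(multiplier: float, fence: float, most_frequent: bool = True) -> bool:
    """Decide whether an oligo is reported for a given deviation multiplier fence.

    Assumption: over-represented oligos need multiplier >= fence,
    under-represented ones need multiplier <= -fence.
    """
    if most_frequent:
        return multiplier >= fence
    return multiplier <= -fence
```

With a fence of 10.0 and the "most frequent" setting, an oligo with multiplier 13.2 would be kept and one with 9.9 would be dropped.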
On oligo output, an additional filter is applied. For each oligo, the percentage of 'N' letters among all letters of the oligo is calculated. Only oligos for which this percentage does not exceed the "Percent of N" parameter are output. | ||
Percent of N | - | Oligos may contain no more than # % of 'N'; default is 100. |
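The "Percent of N" filter described above can be illustrated with a short sketch. The function names are hypothetical; only the rule itself (percentage of 'N' letters must not exceed the parameter) comes from the text.

```python
def percent_n(oligo: str) -> float:
    """Percentage of 'N' letters among all letters of the oligo."""
    return 100.0 * oligo.upper().count("N") / len(oligo)


def passes_n_filter(oligo: str, max_percent_n: float = 100.0) -> bool:
    """An oligo is output only if its share of 'N' does not exceed the limit."""
    return percent_n(oligo) <= max_percent_n
```

With a limit of 50, an oligo such as TKN (one 'N' in three letters) passes, while NNA (two in three) does not.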
Output file | - | Output file name. |
The "shift" parameter sets the offset (in nucleotides) from the sequence start to the position at which oligo generation begins. If there are several sequences in a file, the shift value is applied to each of them. The default value is 0. | ||
Shift in sequence | - | Shift in sequence, default value is 0. |
The "step" parameter sets the distance (in nucleotides) between the start positions of consecutive oligos. To get all oligos, set this parameter to 1, which is the default value. | ||
Step in sequence | - | Step in sequence (default value is 1) |
Sometimes it is necessary to check all three reading frames. To do this, run the program three times with the following values for "shift" and "step":
1) step=3 shift=0
2) step=3 shift=1
3) step=3 shift=2 | ||
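The effect of "shift" and "step" on the positions at which oligos are taken can be sketched as follows. The generator `oligo_windows` is a hypothetical helper written for illustration; it assumes that an oligo is only produced when it fits entirely inside the sequence.

```python
from typing import Iterator, Tuple


def oligo_windows(seq: str, length: int, shift: int = 0, step: int = 1) -> Iterator[Tuple[int, str]]:
    """Yield (start, oligo) pairs beginning at `shift` and advancing by `step`."""
    for start in range(shift, len(seq) - length + 1, step):
        yield start, seq[start:start + length]


# The three reading frames of a short sequence: step=3 with shift=0, 1 and 2.
frames = {
    shift: [oligo for _, oligo in oligo_windows("ATGCCGTAA", 3, shift=shift, step=3)]
    for shift in (0, 1, 2)
}
```

Here `frames[0]` holds the codons of the first frame, and `frames[1]` and `frames[2]` hold those of the two shifted frames.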
Input sequences may be either in FASTA format or in a special packed format. Softberry products frequently pack large chromosomes into their own "nucfile" (nf) format. In this case the sequence file has the .nf extension. | ||
If the "Packed file" parameter is not defined, the program treats the input file as a FASTA file. Otherwise the input file is treated as a nucfile. | ||
Packed file | - | Input file is packed file (nucfile, nf). |
The FASTA file can be converted to the nucfile one using the cvtseq utility.
For example, to convert the FASTA file chr22.fa to the nucfile chr22.nf, use the following command string:
cvtseq chr22.fa chr22.nf -fi -do -t "chr22" -n5gc
Use the following command to check the information on a packed file:
cvtseq chr22.nf -e
Command output:
filename: chr22.nf
pack_mode: PACK_5
size: 49476972
from: 0
nonstandard: 1
title_size: 5
title: chr22 |
For each defined L, an array of oligo counters is built. The sequential number of an oligo is used as the index into this array, and the array element holds the number of occurrences of that oligo.
Then, using this array and the defined parameters, the program builds a table of oligos that contains more information (mean, deviation multiplier, etc.). This table is printed to the output file.
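One way such a counter array could be organized is sketched below. Everything specific here is an assumption made for illustration: the ordering of the 15 symbols, the base-15 numbering of oligos, and the rule that each window of the sequence increments every redundant oligo that covers it. The symbol meanings follow the standard IUPAC nucleotide codes.

```python
from itertools import product

# Standard IUPAC nucleotide codes: each symbol stands for a set of bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
SYMBOLS = "ACGTRYSWKMBDHVN"  # hypothetical ordering of the 15 symbols
CODE = {s: i for i, s in enumerate(SYMBOLS)}
# For each concrete base, the redundant symbols that cover it.
COVERING = {b: [s for s in SYMBOLS if b in IUPAC[s]] for b in "ACGT"}


def oligo_index(oligo) -> int:
    """Sequential number of an oligo, read as a base-15 number (assumed scheme)."""
    idx = 0
    for ch in oligo:
        idx = idx * len(SYMBOLS) + CODE[ch]
    return idx


def count_oligos(seq: str, length: int) -> list:
    """Count every redundant oligo of the given length that covers a window."""
    counts = [0] * (len(SYMBOLS) ** length)
    for start in range(len(seq) - length + 1):
        window = seq[start:start + length]
        if any(b not in COVERING for b in window):
            continue  # this sketch skips windows with ambiguous input bases
        for combo in product(*(COVERING[b] for b in window)):
            counts[oligo_index(combo)] += 1
    return counts
```

Under this scheme the array for length L has 15**L elements (11,390,625 for L=6), which is consistent with the remark that memory use depends on the oligo length.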
Total number of all oligos - oligs_sum_count.
Total number of nucleotides - seqs_sum_length.
The oligo's frequency is the product of the frequencies of the nucleotides it consists of.
The expected mean of the counter (which is equal to the oligo's mean) is calculated as follows:
average= oligs_sum_count*frequence;
The deviation is calculated using the formula:
deviation = sqrt( oligs_sum_count*frequence*(1-frequence) );
The oligo's counter - olig_count - is the number of times this oligo occurs in a sequence.
The deviation multiplier is calculated using the formula:
Deviation_multiplier= (olig_count-average)/deviation;
The normalized deviation (norm deviate) of the given oligo is calculated using the formula:
Norm_deviate= olig_count/seqs_sum_length;
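The formulas above can be checked against the sample output with a short sketch. The function `oligo_statistics` is a hypothetical wrapper around the four formulas; the assumption that oligs_sum_count for length 2 equals the number of nucleotides minus the number of sequences (46705 - 32) is an inference, not something the text states.

```python
import math


def oligo_statistics(olig_count, frequence, oligs_sum_count, seqs_sum_length):
    """Apply the four formulas from the algorithm description."""
    average = oligs_sum_count * frequence
    deviation = math.sqrt(oligs_sum_count * frequence * (1.0 - frequence))
    deviation_multiplier = (olig_count - average) / deviation
    norm_deviate = olig_count / seqs_sum_length
    return average, deviation, deviation_multiplier, norm_deviate


# Oligo "TK" from the sample output: T=25.4%, K (G or T)=50.2%.
frequence = 0.254 * 0.502
# Assumed: 46705 nucleotides in 32 sequences give 46705 - 32 windows of length 2.
stats = oligo_statistics(6906, frequence, 46705 - 32, 46705)
```

The resulting values are close to the TK row of the sample output (5952.4, 72.1, 13.2, 0.147864); small differences in the expected number come from the rounded percentages used as input.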
Example of program output:
Oligsr 1.4 Copyright (c) 2005-2006 Softberry
Num seqs=32 Nucleotides=46705 Average seq length=1459.5 A=25.1% C=24.7% G=24.8% T=25.4% AC=49.8% AG=49.9% AT=50.5% CG=49.5% CT=50.1% GT=50.2% ACG=74.6% ACT=75.2% AGT=75.3% CGT=74.9% N=100.0%
Output most frequent oligs, direction=direct, deviation multiplier=10.000000, no more 50.0 % of 'N'
#olig,total olig counter,expected number,deviation,deviation multiplier,unique sequences counter,norm deviate
Length 1
Length 2
TK 6906 5952.4 72.1 13.2 32 0.147864
TG 3544 2939.8 52.5 11.5 32 0.075881
MA 6654 5834.8 71.5 11.5 32 0.142469
GC 3409 2858.0 51.8 10.6 32 0.072990
Length 3
TKB 5574 4455.2 63.5 17.6 32 0.119345
VMA 5390 4349.4 62.8 16.6 32 0.115405
TKS 3731 2943.9 52.5 15.0 32 0.079884
YTK 3772 2980.5 52.8 15.0 32 0.080762
TGS 1993 1453.9 37.5 14.4 32 0.042672
TBB 7724 6647.3 75.5 14.3 32 0.165378
VMW 9944 8751.5 84.3 14.1 32 0.212911
MMA 3639 2903.8 52.2 14.1 32 0.077915
MAR 3639 2909.2 52.2 14.0 32 0.077915
VVA 7555 6514.6 74.9 13.9 32 0.161760
WKB 10034 8857.0 84.7 13.9 32 0.214838
TKY 3711 2980.5 52.8 13.8 32 0.079456
BTK 5330 4455.2 63.5 13.8 32 0.114121
YTB 5315 4447.0 63.4 13.7 32 0.113799
HTK 5343 4473.6 63.6 13.7 32 0.114399
VAR 5214 4357.4 62.9 13.6 32 0.111637
TKK 3706 2986.0 52.9 13.6 32 0.079349
TGB 2820 2200.3 45.8 13.5 32 0.060379
GCH 2754 2148.0 45.3 13.4 32 0.058966
WGC 1942 1442.6 37.4 13.4 32 0.041580
TKN 6904 5948.3 72.0 13.3 32 0.147821
NTK 6901 5948.3 72.0 13.2 32 0.147757
CWG 1936 1442.6 37.4 13.2 32 0.041452
GCW 1936 1442.6 37.4 13.2 32 0.041452
YKB 9894 8786.4 84.4 13.1 32 0.211840
RMA 3590 2909.2 52.2 13.0 32 0.076865
MAV 5157 4349.4 62.8 12.9 32 0.110416
RMW 6771 5853.7 71.5 12.8 32 0.144974
SMA 3551 2885.7 52.0 12.8 32 0.076030
WKS 6767 5852.6 71.5 12.8 32 0.144888
SCW 3540 2879.8 52.0 12.7 32 0.075795
YKS 6708 5806.0 71.3 12.7 32 0.143625
SWG 3548 2890.5 52.1 12.6 32 0.075966
MAA 1937 1463.7 37.7 12.6 32 0.041473
WGS 3545 2890.5 52.1 12.6 32 0.075902
VMR 9694 8645.0 83.9 12.5 32 0.207558
TBS 5180 4392.5 63.1 12.5 32 0.110909
DGC 2716 2150.6 45.3 12.5 32 0.058152
TGC 1057 725.6 26.7 12.4 32 0.022631
VMD 14248 13047.1 96.9 12.4 32 0.305064
HKS 9744 8714.7 84.2 12.2 32 0.208629
SCA 1886 1431.2 37.2 12.2 32 0.040381
YTG 1932 1472.0 37.8 12.2 32 0.041366
BTG 2755 2200.3 45.8 12.1 32 0.058987
TBY 5213 4447.0 63.4 12.1 32 0.111615
HTB 7583 6674.9 75.6 12.0 32 0.162359
HKB 14354 13188.3 97.3 12.0 32 0.307333
VWG 5106 4356.5 62.8 11.9 32 0.109324
SMW 6654 5806.4 71.3 11.9 32 0.142469
AAA 1058 737.7 26.9 11.9 32 0.022653
VAD 7463 6576.3 75.2 11.8 32 0.159790
MAD 5129 4390.6 63.1 11.7 32 0.109817
SMD 9638 8656.5 84.0 11.7 32 0.206359
VAA 2723 2192.3 45.7 11.6 32 0.058302
TGN 3542 2937.8 52.5 11.5 32 0.075838
NTG 3542 2937.8 52.5 11.5 32 0.075838
TGV 2715 2191.4 45.7 11.5 32 0.058131
NMA 6648 5830.8 71.4 11.4 32 0.142340
MAN 6647 5830.8 71.4 11.4 32 0.142319
KSC 3450 2862.0 51.8 11.3 32 0.073868
TTK 1943 1511.3 38.2 11.3 32 0.041602
CWS 3466 2879.8 52.0 11.3 32 0.074210
SMR 6535 5735.8 70.9 11.3 32 0.139921
VCA 2667 2157.1 45.4 11.2 32 0.057103
MWG 3494 2908.6 52.2 11.2 32 0.074810
HTG 2719 2209.4 45.9 11.1 32 0.058216
RVA 5055 4357.4 62.9 11.1 32 0.108233
MVA 5045 4349.4 62.8 11.1 32 0.108018
KSH 9645 8714.7 84.2 11.1 32 0.206509
WKY 6717 5925.3 71.9 11.0 32 0.143818
SVA 5010 4322.3 62.6 11.0 32 0.107269
GMW 3481 2908.6 52.2 11.0 32 0.074532
TSC 1858 1448.5 37.5 10.9 32 0.039782
TGY 1884 1472.0 37.8 10.9 32 0.040338
TTB 2754 2254.9 46.3 10.8 32 0.058966
HGC 2632 2148.0 45.3 10.7 32 0.056354
KSY 6568 5806.0 71.3 10.7 32 0.140627
KGC 1831 1433.7 37.3 10.7 32 0.039204
GCN 3407 2856.1 51.8 10.6 32 0.072947
KSM 6527 5770.7 71.1 10.6 32 0.139749
NGC 3406 2856.1 51.8 10.6 32 0.072926
KBB 14164 13133.9 97.1 10.6 32 0.303265
TKC 1868 1469.2 37.7 10.6 32 0.039996
MAM 3455 2903.8 52.2 10.6 32 0.073975
CTG 1005 725.6 26.7 10.5 32 0.021518
KBY 9669 8786.4 84.4 10.5 32 0.207023
TBC 2669 2192.1 45.7 10.4 32 0.057146
VVM 13931 12924.7 96.7 10.4 32 0.298276
VWK 9698 8821.1 84.6 10.4 32 0.207644
TSS 3442 2902.5 52.2 10.3 32 0.073697
TKG 1863 1474.7 37.8 10.3 32 0.039889
VAV 7283 6514.6 74.9 10.3 32 0.155936
MMR 6501 5771.8 71.1 10.3 32 0.139193
YTS 3475 2938.5 52.5 10.2 32 0.074403
DSC 4930 4293.3 62.4 10.2 32 0.105556
BTB 7412 6647.3 75.5 10.1 32 0.158698
WGB 5012 4374.3 63.0 10.1 32 0.107312
CWK 3450 2920.9 52.3 10.1 32 0.073868
WKC 3450 2920.9 52.3 10.1 32 0.073868
VCW 4972 4340.4 62.7 10.1 32 0.106455
RAA 1844 1466.4 37.7 10.0 32 0.039482
VHD 20770 19703.0 106.7 10.0 32 0.444706
The program name and version are shown in the first line:
Oligsr 1.4 Copyright (c) 2005-2006 Softberry
Num seqs=32 Nucleotides=46705 Average seq length=1459.5 A=25.1% C=24.7% G=24.8% T=25.4% AC=49.8% AG=49.9% AT=50.5% CG=49.5% CT=50.1% GT=50.2% ACG=74.6% ACT=75.2% AGT=75.3% CGT=74.9% N=100.0%
Next comes information on the input file:
Number of FASTA sequences - 32
Number of nucleotides - 46705
Average length of sequence - 1459.5
Percentage of letters 'A' - 25.1
Percentage of letters 'C' - 24.7
Percentage of letters 'G' - 24.8
Percentage of letters 'T' - 25.4
Percentage of letters 'A or C' - 49.8
Percentage of letters 'A or G' - 49.9
Percentage of letters 'A or T' - 50.5
Percentage of letters 'C or G' - 49.5
Percentage of letters 'C or T' - 50.1
Percentage of letters 'G or T' - 50.2
Percentage of letters 'A or C or G' - 74.6
Percentage of letters 'A or C or T' - 75.2
Percentage of letters 'A or G or T' - 75.3
Percentage of letters 'C or G or T' - 74.9
Percentage of letters 'A or C or G or T' - 100.0
Output most frequent oligs, direction=direct, deviation multiplier=10.000000, no more 50.0 % of 'N'
Next come the defined input parameters:
Output the most frequent oligos - Output most frequent oligs
Process the direct chain only - direction=direct
Defined fence for the deviation multiplier - 10.0
Output oligos containing no more than 50% of 'N' letters - no more 50.0 % of 'N'
#olig, total olig counter, expected number, deviation, deviation multiplier, unique sequences counter, norm deviate
Next comes a hint on the columns of the oligo table:
Column 1 - the oligo itself (olig)
Column 2 - counter for the current oligo, i.e. how many times this oligo occurs (total olig counter)
Column 3 - expected counter mean, i.e. the expected number of occurrences of the oligo (expected number)
Column 4 - deviation for the current oligo (deviation)
Column 5 - deviation multiplier value for the current oligo (deviation multiplier)
As a reminder, in the given example the deviation multiplier fence was set to 10.0, and since the option to output the most frequent oligos was selected, the values in the 5th column are greater than or equal to 10.0.
Column 6 - number of sequences in which this oligo occurs (unique sequences counter)
Column 7 - normalized deviation of this oligo (norm deviate)
For more details on how the values are calculated, see the chapter "Algorithm".
Length 3
Next come the tables of oligos, one for each length value.
Below is an example of the table for oligos of length 3.
The length of the examined oligos (Length 3) is shown first.
TKB 5574 4455.2 63.5 17.6 32 0.119345
VMA 5390 4349.4 62.8 16.6 32 0.115405
TKS 3731 2943.9 52.5 15.0 32 0.079884
Next comes the table itself, sorted by the 5th column in descending order.
If the option to output the least frequent oligos is on, the table is sorted by the 5th column in ascending order.
The values in the columns are described above.
Description of the first row:
Column 1 - the oligo itself: 'TKB'
Column 2 - counter for the current oligo: 5574
Column 3 - expected mean for the oligo: 4455.2
Column 4 - deviation for the current oligo: 63.5
Column 5 - deviation multiplier value for the current oligo: 17.6
Column 6 - number of sequences in which this oligo occurs: 32
Column 7 - normalized deviation of this oligo: 0.119345