Oligs |
The program computes statistics on oligonucleotides (for example, tetranucleotides) and reports those whose counts differ significantly from the expected mean.
The input file should be in FASTA format and may contain several sequences. Alphabet: the allowed symbols are "ACGTUacgtu" and "NnyYrRBbDdHhKkWwSsMmVv". The symbols to be skipped are "0123456789; \n\r\t\0-". All other symbols are not allowed.
The program processes all oligonucleotides of length L, where L takes every value in the range from L1 to L2. | ||
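The alphabet rules above can be sketched as a small cleaning step. This is only an illustration of the stated rules, not the program's own code; the function name `clean_sequence` is hypothetical.

```python
# Symbol sets copied from the description above.
ALLOWED = set("ACGTUacgtu" "NnyYrRBbDdHhKkWwSsMmVv")
SKIPPED = set("0123456789; \n\r\t\0-")


def clean_sequence(raw):
    """Drop skipped symbols and reject anything outside the allowed alphabet.

    Hypothetical helper: how Oligs itself reports a bad symbol is not documented.
    """
    kept = []
    for pos, ch in enumerate(raw):
        if ch in SKIPPED:
            continue  # digits, blanks, line breaks, ';' and '-' are ignored
        if ch not in ALLOWED:
            raise ValueError(f"symbol {ch!r} at position {pos} is not allowed")
        kept.append(ch)
    return "".join(kept)
```

For instance, a sequence line that carries position numbers and gaps is reduced to its nucleotide letters, while a letter such as X is rejected.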
Minimal olig length (L1) | - | Minimal olig length |
Maximal olig length (L2) | - | Maximal olig length |
Restrictions for L1, L2: 1<=L1 && L1<=L2 && L2<=13. | ||
The computer must have enough memory installed; the amount required depends on the oligo length. | ||
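As a rough illustration of why memory grows with oligo length: if one counter is kept per possible oligo over A, C, G, T (as the algorithm description suggests), the table for length L has 4^L entries. The 4-byte counter width and both function names below are assumptions, not documented properties of Oligs.

```python
def check_lengths(l1, l2):
    """Return True when the documented restriction 1 <= L1 <= L2 <= 13 holds."""
    return 1 <= l1 <= l2 <= 13


def counter_table_size(length, bytes_per_counter=4):
    """Return (number of counters, bytes) for all oligos of one length.

    bytes_per_counter=4 is an assumed counter width, not taken from Oligs.
    """
    if not 1 <= length <= 13:
        raise ValueError("length must satisfy 1 <= length <= 13")
    entries = 4 ** length  # one counter per possible oligo over A, C, G, T
    return entries, entries * bytes_per_counter


for length in (4, 8, 13):
    entries, size = counter_table_size(length)
    print(f"L={length}: {entries} counters, about {size / 2**20:.1f} MiB")
```

Under these assumptions the largest allowed length, L=13, needs 4^13 = 67,108,864 counters, about 256 MiB.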
Input file | - | Input file in FASTA-format. |
A special mode prints all oligos, ignoring any additional conditions. In this mode a very large output file can be generated. | ||
Print all oligs | - | Print all oligs, ignore conditions |
The program can process not only the given sequence but also, in the same run, build and process the reverse sequence. | ||
Scan target sequence in different chain | - | Scan target sequence in different chain:
In direct chain only (default)
In reverse chain only
In both chains |
By analogy with the two tails of a normal distribution, the program can output either the most frequent oligos or the rarest ones. The following parameter selects which: | ||
Frequency | - | Most frequent or least frequent:
most frequent (default)
least frequent |
To determine which oligos are output and which are not, a value for the deviation multiplier range must be defined. | ||
The deviation multiplier is the difference between the observed and the expected number of occurrences of an oligo, expressed in sigma units. For more details see the algorithm description chapter. | ||
Deviation multiplier fence | - | Use the value 3.0 to output 5% of oligos. |
Output file | - | Output file name. |
The "shift" parameter sets the offset (in nucleotides) from the sequence start to the position from which oligos are generated. If a file contains several sequences, the shift is applied to each of them. The default value is 0. | ||
Shift in sequence | - | Shift in sequence, default value is 0. |
The "step" parameter sets the distance (in nucleotides) between the start positions of consecutive oligos. To get all oligos, set this parameter to 1, which is the default value. | ||
Step in sequence | - | Step in sequence (default value is 1) |
Sometimes it is necessary to check all three reading frames. To do this, run the program three times with the following values for "shift" and "step":
1) step=3 shift=0
2) step=3 shift=1
3) step=3 shift=2 | ||
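The effect of "shift" and "step" on the three reading frames can be sketched as follows. This assumes that the first oligo starts at position "shift" and each next one starts "step" nucleotides later; the function name `iter_oligos` is hypothetical, and how the program treats an incomplete oligo at the sequence end is not documented.

```python
def iter_oligos(seq, oligo_len, shift=0, step=1):
    """Yield oligos of oligo_len starting at shift and advancing by step.

    Assumption: windows that would run past the sequence end are dropped.
    """
    for start in range(shift, len(seq) - oligo_len + 1, step):
        yield seq[start:start + oligo_len]


seq = "ATGCCGTAA"
for shift in (0, 1, 2):
    # step=3 with shift=0, 1, 2 walks the three reading frames in turn
    print(shift, list(iter_oligos(seq, 3, shift=shift, step=3)))
```

With step=1 and shift=0 the same function yields every overlapping oligo, which corresponds to the default settings.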
Input sequences may be either in FASTA format or in a special packed format. "Softberry" products frequently pack large chromosomes into their own "nucfile" (nf) format. In this case the sequence file has the .nf extension. | ||
If the "Packed file" parameter is not set, the program treats the input file as a FASTA file. Otherwise the input file is treated as a "nucfile". | ||
Packed file | - | Input file is packed file (nucfile, nf). |
A FASTA file can be converted to a nucfile using the cvtseq utility.
For example, to convert the FASTA file chr22.fa to the nucfile chr22.nf, use the following command:
cvtseq chr22.fa chr22.nf -fi -do -t "chr22" -n5gc
Use the following command to check the information on a packed file:
cvtseq chr22.nf -e
Command output:
filename: chr22.nf
pack_mode: PACK_5
size: 49476972
from: 0
nonstandard: 1
title_size: 5
title: chr22 |
For each defined L, an array of oligo counters is built. The sequential number of an oligo is used as its index in this array, and the array element holds the number of occurrences of that oligo.
Then, using this array and the defined parameters, the program builds a table of oligos that contains more information (mean, deviation multiplier, etc.). This table is printed to the output file.
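A minimal sketch of such a counter array follows. The numbering of oligos is not documented, so the base-4 encoding with A=0, C=1, G=2, T=3 is an assumption, as are the function names and the choice to skip windows containing any other symbol. Windows are counted within each sequence separately, which agrees with the example output further down (46705 nucleotides in 32 sequences give 46673 oligos of length 2).

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed numbering of nucleotides


def oligo_index(oligo):
    """Sequential number of an oligo, read as a base-4 number (assumed scheme)."""
    index = 0
    for ch in oligo:
        index = index * 4 + CODE[ch]
    return index


def count_oligos(seqs, length):
    """Return a list of 4**length counters indexed by oligo_index."""
    counts = [0] * (4 ** length)
    for seq in seqs:
        for start in range(len(seq) - length + 1):
            window = seq[start:start + length].upper()
            # Assumption: windows with symbols other than A, C, G, T are skipped.
            if all(ch in CODE for ch in window):
                counts[oligo_index(window)] += 1
    return counts


counts = count_oligos(["ACGT", "ACNA"], 2)
print(counts[oligo_index("AC")])  # AC occurs once in each sequence
```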
Total number of all oligos - oligs_sum_count.
Total number of nucleotides - seqs_sum_length.
The oligo's frequency (frequence) is the product of the frequencies of the nucleotides it consists of.
The expected mean of the counter (equal to the oligo's mean) is calculated as follows:
average= oligs_sum_count*frequence;
The deviation is calculated with the formula:
deviation = sqrt( oligs_sum_count*frequence*(1-frequence) );
The oligo's counter, olig_count, is the number of times this oligo occurs in the sequences.
The deviation multiplier is calculated with the formula:
Deviation_multiplier= (olig_count-average)/deviation;
The normalized deviation (norm deviate) of the given oligo is calculated with the formula:
Norm_deviate= olig_count/seqs_sum_length;
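The formulas above can be checked against the example output with a short sketch. The function name `oligo_stats` and the dictionary layout are hypothetical. The nucleotide frequencies are taken from the rounded percentages printed in the example (A=25.1%, T=25.4%), so the results only approximate the printed values.

```python
import math


def oligo_stats(oligo, olig_count, oligs_sum_count, seqs_sum_length, base_freq):
    """Apply the documented formulas to one oligo (illustrative helper)."""
    # Oligo frequency: product of the frequencies of its nucleotides.
    frequence = math.prod(base_freq[ch] for ch in oligo)
    average = oligs_sum_count * frequence
    deviation = math.sqrt(oligs_sum_count * frequence * (1 - frequence))
    return {
        "average": average,
        "deviation": deviation,
        "deviation_multiplier": (olig_count - average) / deviation,
        "norm_deviate": olig_count / seqs_sum_length,
    }


# Values for the oligo TA from the example output (Length 2, oligs=46673).
stats = oligo_stats("TA", 2174, 46673, 46705, {"A": 0.251, "T": 0.254})
for name, value in stats.items():
    print(f"{name} = {value:.6f}")
```

The example output lists 2976.6, 52.8, -15.2 and 0.046547 for this oligo; the small difference in the expected number comes from using the rounded percentages as input.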
Example for program output:
Oligs 1.6 Copyright (c) 2005-2006 Softberry
Num seqs=32 Nucleotides=46705 Average seq length=1459.5
A=25.1% C=24.7% G=24.8% T=25.4% N=0.000000% Other=0.000000%
Output least frequent oligs, direction=direct, seq_shift=0, seq_step=1 deviation multiplier=3.000000
#olig,total olig counter,expected number,deviation,deviation multiplier,unique sequences counter,norm deviate
Length 2 oligs=46673
TA 2174 2976.6 52.8 -15.2 32 0.046547
CG 2461 2858.0 51.8 -7.7 32 0.052692
GT 2609 2939.8 52.5 -6.3 32 0.055861
AC 2579 2893.8 52.1 -6.0 32 0.055219
GG 2662 2868.7 51.9 -4.0 32 0.056996
Length 3 oligs=46641
TAG 412 737.4 26.9 -12.1 32 0.008821
CTA 446 734.7 26.9 -10.7 32 0.009549
GTA 511 737.4 26.9 -8.4 32 0.010941
TAC 509 734.7 26.9 -8.4 31 0.010898
CGT 519 725.6 26.7 -7.7 32 0.011112
GGG 508 710.7 26.5 -7.7 32 0.010877
GTC 539 725.6 26.7 -7.0 32 0.011541
ACG 549 716.9 26.6 -6.3 32 0.011755
GAC 551 716.9 26.6 -6.2 32 0.011797
CCC 545 702.8 26.3 -6.0 32 0.011669
CGG 550 708.1 26.4 -6.0 32 0.011776
TTA 608 755.7 27.3 -5.4 32 0.013018
ATA 607 746.7 27.1 -5.2 31 0.012996
TAT 626 755.7 27.3 -4.8 32 0.013403
ACC 595 714.3 26.5 -4.5 32 0.012740
TAA 627 746.7 27.1 -4.4 32 0.013425
GGT 619 728.3 26.8 -4.1 32 0.013253
TCA 631 734.7 26.9 -3.9 32 0.013510
AGT 640 737.4 26.9 -3.6 32 0.013703
CCG 611 705.4 26.4 -3.6 32 0.013082
ACT 651 734.7 26.9 -3.1 32 0.013939
Length 4 oligs=46609
CTAG 73 182.0 13.5 -8.1 26 0.001563
GGGG 71 176.1 13.2 -7.9 24 0.001520
TAGG 83 182.7 13.5 -7.4 24 0.001777
CCTA 85 181.3 13.4 -7.2 26 0.001820
CGTA 92 182.0 13.5 -6.7 26 0.001970
TAGT 104 187.2 13.7 -6.1 26 0.002227
TTAG 105 187.2 13.7 -6.0 25 0.002248
ACGT 101 182.0 13.5 -6.0 29 0.002163
TACG 104 182.0 13.5 -5.8 22 0.002227
TAGA 108 185.0 13.6 -5.7 27 0.002312
TCTA 111 186.5 13.6 -5.5 27 0.002377
GGTA 110 182.7 13.5 -5.4 24 0.002355
ACTA 112 184.3 13.5 -5.3 29 0.002398
ACCC 106 176.3 13.3 -5.3 26 0.002270
GTCA 111 182.0 13.5 -5.3 26 0.002377
TAAC 113 184.3 13.5 -5.3 29 0.002419
CTAT 115 186.5 13.6 -5.2 29 0.002462
ATAG 115 185.0 13.6 -5.2 26 0.002462
CGGT 111 179.8 13.4 -5.1 30 0.002377
CGTC 111 179.1 13.4 -5.1 29 0.002377
CGGG 109 175.4 13.2 -5.0 29 0.002334
GATA 118 185.0 13.6 -4.9 27 0.002526
TATC 120 186.5 13.6 -4.9 30 0.002569
TACC 116 181.3 13.4 -4.9 26 0.002484
TAGC 117 182.0 13.5 -4.8 27 0.002505
TTAC 121 186.5 13.6 -4.8 28 0.002591
GTAG 119 182.7 13.5 -4.7 28 0.002548
ATAC 123 184.3 13.5 -4.5 26 0.002634
GGGT 121 180.4 13.4 -4.4 26 0.002591
CCCT 120 178.4 13.3 -4.4 29 0.002569
CGCG 117 174.8 13.2 -4.4 26 0.002505
GGTC 122 179.8 13.4 -4.3 29 0.002612
CTAA 126 184.3 13.5 -4.3 31 0.002698
GACC 120 177.0 13.3 -4.3 27 0.002569
TAAG 127 185.0 13.6 -4.3 30 0.002719
GTCT 127 184.2 13.5 -4.2 30 0.002719
CTTA 129 186.5 13.6 -4.2 31 0.002762
GTAA 128 185.0 13.6 -4.2 28 0.002741
ACGG 122 177.6 13.3 -4.2 30 0.002612
GACT 126 182.0 13.5 -4.2 31 0.002698
TCAT 130 186.5 13.6 -4.1 29 0.002783
AGAC 125 179.8 13.4 -4.1 28 0.002676
GTAT 132 187.2 13.7 -4.0 25 0.002826
CCCG 121 174.1 13.2 -4.0 28 0.002591
TACT 132 186.5 13.6 -4.0 29 0.002826
TGAC 129 182.0 13.5 -3.9 30 0.002762
CCGG 123 174.8 13.2 -3.9 27 0.002634
ACCG 125 177.0 13.3 -3.9 29 0.002676
ATTA 136 189.6 13.7 -3.9 29 0.002912
CCCC 123 173.5 13.1 -3.8 25 0.002634
AGTC 132 182.0 13.5 -3.7 26 0.002826
GTAC 132 182.0 13.5 -3.7 26 0.002826
CTAC 132 181.3 13.4 -3.7 31 0.002826
TCAC 132 181.3 13.4 -3.7 30 0.002826
CATA 135 184.3 13.5 -3.6 27 0.002890
AGTA 137 185.0 13.6 -3.5 29 0.002933
GCGT 136 179.8 13.4 -3.3 29 0.002912
GCTA 138 182.0 13.5 -3.3 28 0.002955
TCGT 140 184.2 13.5 -3.3 31 0.002998
GTTA 143 187.2 13.7 -3.2 29 0.003062
GAGT 140 182.7 13.5 -3.2 29 0.002998
TCGG 138 179.8 13.4 -3.1 31 0.002955
The program name and version are shown in the first line:
Oligs 1.6 Copyright (c) 2005-2006 Softberry
Num seqs=32 Nucleotides=46705 Average seq length=1459.5
A=25.1% C=24.7% G=24.8% T=25.4% N=0.000000% Other=0.000000%
Next comes information on the input file:
Number of fasta-sequences - 32
Number of nucleotides - 46705
Average length of sequence - 1459.5
Percentage of 'A' - 25.1
Percentage of 'C' - 24.7
Percentage of 'G' - 24.8
Percentage of 'T' - 25.4
Percentage of 'N' - 0.0
Percentage of other letters (except A, C, G, T, N) - 0.0
Output least frequent oligs, direction=direct, seq_shift=0, seq_step=1 deviation multiplier=3.000000
Next come the input parameters that were defined:
Show the rarest oligos - Output least frequent oligs
Process the direct chain only - direction=direct
The "Shift" parameter - 0
The "Step" parameter - 1
Defined deviation multiplier range - 3.0
#olig,total olig counter,expected number,deviation,deviation multiplier,unique sequences counter,norm deviate
Next comes a header line naming each column of the oligo table:
Column 1 - the specific oligo (olig)
Column 2 - the counter of this oligo, i.e. how many times it occurs (total olig counter)
Column 3 - the expected mean value of the counter, i.e. the expected number of occurrences (expected number)
Column 4 - the deviation of the current oligo (deviation)
Column 5 - the value of the deviation multiplier for the current oligo (deviation multiplier)
Note that in this example the deviation multiplier range was set to 3.0. Since the mode that outputs the rarest oligos was chosen, the values in column 5 are less than or equal to -3.0.
Column 6 - the number of sequences containing the current oligo (unique sequences counter)
Column 7 - the normalized deviation of the current oligo (norm deviate)
For more details on how the values are calculated, see the "algorithm" chapter.
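The selection rule described in the note on column 5 can be sketched as a simple filter. Only the least-frequent case (multiplier less than or equal to minus the fence) is documented; the symmetric rule for the most-frequent mode is an assumption, and the function name `select_oligos` is hypothetical.

```python
def select_oligos(rows, fence=3.0, most_frequent=True):
    """Keep (oligo, deviation_multiplier) pairs beyond the fence.

    Least frequent: multiplier <= -fence (as in the example output).
    Most frequent: multiplier >= fence (assumed by symmetry, not documented).
    """
    if most_frequent:
        return [row for row in rows if row[1] >= fence]
    return [row for row in rows if row[1] <= -fence]


rows = [
    ("TA", -15.2),  # taken from the example output
    ("GG", -4.0),   # taken from the example output
    ("AA", 1.2),    # made-up value for illustration only
    ("CA", 4.5),    # made-up value for illustration only
]
print(select_oligos(rows, fence=3.0, most_frequent=False))
```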
Length 3 oligs=46641
Next come the tables of oligos of different lengths.
Example: the table of oligos of length 3.
This line shows the length of the current oligos (Length 3) and the total number of oligos of this length (oligs=46641).
TAG 412 737.4 26.9 -12.1 32 0.008821
CTA 446 734.7 26.9 -10.7 32 0.009549
GTA 511 737.4 26.9 -8.4 32 0.010941
Next comes the table itself, sorted by the values in column 5. In this example (least frequent oligos) the most negative value comes first.
If the option to output the most frequent oligos is chosen, the values in column 5 are sorted in the opposite direction.
The meaning of each value is described earlier in the text.
Description of the first row:
Column 1 - The current oligo is 'TAG'
Column 2 - The counter of the current oligo is 412
Column 3 - The expected mean of the oligo is 737.4
Column 4 - The deviation for the current oligo is 26.9
Column 5 - The deviation multiplier for the current oligo is -12.1
Column 6 - The total number of sequences containing the current oligo is 32
Column 7 - The normalized deviation is 0.008821