ABSplit

The program determines whether a nucleotide sequence of approximately 300-600 bp belongs to an archaeal or a bacterial genome.

Sequences are classified by linear discriminant analysis. Each sequence is represented by a number of statistical parameters: mono-, di-, and trinucleotide frequencies, plus the linear correlation coefficients (2 additional parameters) and the mean absolute deviations (2 additional parameters) between the codon frequencies in the longest ORF found in the query sequence and the codon frequencies in archaeal and bacterial genomes.
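To illustrate the parameters described above, here is a minimal Python sketch of the composition features and of the two codon-usage comparison statistics. It is not the ABSplit implementation. The function names, the use of overlapping windows, the skipping of ambiguous characters, and the reading of "linear correlation coefficient" as the Pearson coefficient are all assumptions. The longest-ORF search and the reference codon tables are not shown.

```python
from itertools import product
from math import sqrt

ALPHABET = "acgt"  # assumed alphabet; other characters are skipped


def kmer_frequencies(seq, k):
    """Frequencies of overlapping k-mers, in fixed lexicographic order."""
    seq = seq.lower()
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # ignore windows containing n or other codes
            counts[word] += 1
            total += 1
    return [counts[w] / total if total else 0.0 for w in kmers]


def composition_features(seq):
    """4 mono-, 16 di- and 64 trinucleotide frequencies (84 values)."""
    features = []
    for k in (1, 2, 3):
        features.extend(kmer_frequencies(seq, k))
    return features


def pearson(x, y):
    """Pearson correlation of two equal-length frequency vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant vector
    return cov / sqrt(var_x * var_y)


def mean_abs_deviation(x, y):
    """Mean absolute difference between two frequency vectors."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)
```

Under these assumptions, a query would contribute the 84 composition values plus `pearson` and `mean_abs_deviation` evaluated once against an archaeal and once against a bacterial reference codon table, which gives the four additional parameters.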

The training and testing data were taken from the sequences of 157 genomes (21 archaeal and 136 bacterial). The genomes were split into fragments of length 630, and every 7th fragment was put in the testing set. This gave 651612 fragments for training and 93008 fragments for testing. The parameters of the linear discriminant function were obtained on the training set. Testing gave the following error estimates:

Number of sequences=93008 (class(A)=9158;class(B)=83850)
Archea(number/fraction)=18123/0.194854; mean_score=929428.413570
Bacteria(number/fraction)=74885/0.805146; mean_score=-1295582.386205
Test results:
Fraction of true predictions: 0.865141[80465]
Class 0: (Archea)
Fraction of true positives : 0.804652[7369]
Fraction of false negatives : 0.195348[1789]
Class 1: (Bacteria)
Fraction of true positives : 0.871747[73096]
Fraction of false negatives : 0.128253[10754]
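
The data preparation and the error estimates above can be cross-checked with a short Python sketch. The splitting functions are hypothetical: they assume consecutive non-overlapping fragments, a discarded trailing remainder, and counting from 1, none of which is stated in the description. The reported counts give roughly seven training fragments per test fragment, so the exact hold-out rule may differ, and `every` is left as a parameter. The second half only recomputes the reported fractions from the reported counts.

```python
def split_genome(genome, length=630):
    """Cut a genome into consecutive non-overlapping fragments (assumed layout)."""
    return [genome[i:i + length]
            for i in range(0, len(genome) - length + 1, length)]


def holdout_split(fragments, every=7):
    """Send every `every`-th fragment (counting from 1) to the test set."""
    train, test = [], []
    for index, fragment in enumerate(fragments, start=1):
        (test if index % every == 0 else train).append(fragment)
    return train, test


# Counts copied from the test report above.
tp_arch, fn_arch = 7369, 1789      # true archaeal fragments
tp_bact, fn_bact = 73096, 10754    # true bacterial fragments

n_arch = tp_arch + fn_arch         # class(A)
n_bact = tp_bact + fn_bact         # class(B)
total = n_arch + n_bact

accuracy = (tp_arch + tp_bact) / total
recall_arch = tp_arch / n_arch
recall_bact = tp_bact / n_bact

# With two classes, a missed bacterial fragment is called archaeal.
called_arch = tp_arch + fn_bact
precision_arch = tp_arch / called_arch

print(f"total={total} accuracy={accuracy:.6f}")
print(f"archaea: recall={recall_arch:.6f} precision={precision_arch:.4f}")
print(f"bacteria: recall={recall_bact:.6f}")
```

The recomputed totals and fractions agree with the report (9158 + 83850 = 93008 sequences, 18123 archaeal calls). The same counts also imply that only about 41% of the archaeal calls in the test set are true archaeal fragments, a consequence of the strong class imbalance.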

OUTPUT EXAMPLE

LDF discrimination threshold=0.000000
Prediction results:
Number of sequences=129 
Arch(num/fract)=64/0.496124; mean_score=1173110.225735
Bact(num/fract)=65/0.503876; mean_score=-679245.160401

Histogram:
1	-1653112.270017	-1492294.115256	0.007752
2	-1492294.115256	-1331475.960496	0.015504
3	-1331475.960496	-1170657.805735	0.015504
4	-1170657.805735	-1009839.650974	0.038760
5	-1009839.650974	 -849021.496214	0.069767
6	 -849021.496214	 -688203.341453	0.085271
7	 -688203.341453	 -527385.186693	0.093023
8	 -527385.186693	 -366567.031932	0.108527
9	 -366567.031932	 -205748.877172	0.023256
10	 -205748.877172	  -44930.722411	0.038760
11	  -44930.722411	  115887.432349	0.031008
12	  115887.432349	  276705.587110	0.054264
13	  276705.587110	  437523.741870	0.015504
14	  437523.741870	  598341.896631	0.023256
15	  598341.896631	  759160.051392	0.062016
16	  759160.051392	  919978.206152	0.023256
17	  919978.206152	 1080796.360913	0.015504
18	 1080796.360913	 1241614.515673	0.038760
19	 1241614.515673	 1402432.670434	0.046512
20	 1402432.670434	 1563266.457703	0.038760

Predicted archaeal sequences:
>AB001339|seq56|1
ttagtcagggggccccgccgatgaaaccggggacagctactaaacccattgccagtggtgg
tggtagctctggccctagtctgggctccggccaacccagagcagaacggcccggtggcggc
aatgcaggggcaaatgttggtcccattgcggccaatcccgttgctagtagtgctcccccta
aaccgaaaccaactcccagttcccccgctaagccagaccccttaaagtgcgttagccaatg
taaacccagttatccctccatcctccagggggaagaaggtagtgctacagtattaatttca
gtaaatgatagtggtggtgtgaccagcgtaaccatcaccaatgcccacggcaacagcgagg
tcaaccgccaggccctattggcagccagaaaaatgcagtttacggcccccgccagtggtca
atccaaatcagtccctgtggtgattcacttcaccgttgctggttcagactttgatcgtcag
gcgagggagcgtcagcaacagcaggaagagttgcgtcaggccgcccgcagagcagaagagg
aaaaggcaaatcaagcccgtcagagacagttggaagaggagcgtcaagcccgccaacggca
attagagaaagaacgggaag
>AB001339|seq128|1
aggcttccaagcaagcttcaattaaggatttttccagaaagggatcccccacctgcaccgc
tgggcgatcgtccatggactgatccgttaactcagcactggcaaaactggctccccccatg
ccatcccgtcccgtggtggaaccgacatataaaactggattgcctatcccagaagccccag
ctttgacaatttcttccgtttccatcaaacccaaggccatggcgttgacgaggggattacc
ggagtaagccggatcaaagtagatttccccgcccacagtgggcacaccaacacaattaccg
taatgactgatcccatccactaccccggtgaaaatacgtcgattcctagcatcgtccaaat
taccgaaccgtagggaatttaaaatggcgatcggcctcgctcccatggtgaaaatatcccg
cagaatcccccctactccggtggcggctccctggaatggctccactgcggaaggatggtta
tgggattcgattttaaacgccaatctcaggccatcccccaaatctacgaccccggcatttt
ccccaggccccactaaaatgcgttctccttcggtgggaaagttactcagtaggggacggga
atttttataacaacaatgtt
>AB001339|seq184|1
attttcccgaagaaactacctccgatgcttggctgaccccagcagatgccggccaggatgg
tgatgcccaggaaccggcggaagatgggggagaagaaggagtagtgtcggaagaactggcc
ctgcctgaggacttacctcctatggatgccatggtggcggcagtggaagaaatgactccgg
tggtggtgcccgaaactgtaccagaaacagaaaccccagccttagaggatttggtcgccca
aaagaccgccctggaaaaggacattgccgctctgcaacgggaaaaagcccagtggtatggc
cagcagttccagcaattacagcgggaaatggcccggttagtggaggaaggcaccagggaat
tagggcaaagaaaagcagctctggaaaaggaaattgagaagttagagcgccgtcaggaacg
gattcaacaggaaatgcgtaccacttttgccggggcttcccaggagttggccatccgcgtg
cagggctttaaggattatttggtggggagtttgcaggatttggtttccgccgccgaccagt
tggaattaggggtgggggacagttgggagtcttcctctacccatggggatgcgattattga
aaatgccgacccaactccgg
>AB001339|seq336|1
tctgccagctttgccattaatttccgcctcgatcccaccgaggtcgttaccattcgccgca
cccaaggcacgttacaaaatattgtcgccaagattattgctccccaaacccaggaatcttt
taaaattgccgccgcgcgacgcacagtggaagaagccatcaccaaacggagcgagttgaag
gaagactttgataacgcccttaattcccgcctggagaaatacggcatcattgttctggaca
ccagtgtggtggatttagccttctcccccgaatttgccaaggcggtggaggaaaaacaaat
tgctgagcagagagcccagcgggcagtgtatgtggcccaggaagcggaacaacaggcccag
gcggacatcaaccgagccaaggggaaggcagaagcccaacggttactggcggaaactttaa
aagctcaggggggggaattagtcctacaaaaagaggcgatcgaagcttggcgggaaggggg
ggctcccatgcccaaggttttggtgatggggggagaaggcaaggggtctgcggttcccttt
atgtttaacctaactgacctggctaactagcggcagcggggaagttataggtcccagggct
cctgcctgacctttaggtcc
…
Predicted bacterial sequences:

>AB001339|seq8|1
ctgttacgtgttttgttgcaaacggaactttttgcagtagttagctccgttgttgccgata
ccagtcaatggtatttttcaatccttcccgcaagctcacctgggcttcaaacccaaattct
gctttagctttggtggtgtctaaacagcgacggggctggccgttgggttgatcggtttccc
aaataatgtccccctcaaactccatcagttcacagattaattccgttaagtctttgatgga
aatttcaaaattggtgcctaggttaaccggatcggctttgtcgtaggcttgggttcccatc
acaatgccccgggccgcatcagtggagtaaagaaattccctggtgggactgccgtcgcccc
aaacgggtaattgtttttgtccagctttttgcgcttcgtaaaccttatggatcaaggcagg
aatcacgtgggaactgcggggatcgaagttatcttctgggccgtaaagatttactggcaag
aggtaaatgccattaaagccatactgcaagcggtaggattccagttgcaccaacaatgctt
tcttggccacgccgtagggagcgttggtttcttcaggataaccgttccataagtcttcttc
cttaaagggtacaggggtaa
>AB001339|seq24|1
cctttttttatttatcttgcccgctcccaaattaaataatcaaacctaacgggtcaactcc
aaagacaacccaaggccattccaggctaattgattgaatcccgaattttattaactgtttg
ttccatttgtgccatgtttgcccctcgaccttggattgtggtccgtctccggtctttaccc
ctatcgtttcgcctcgatcgccatgtccccttggtaatgggattacttactgctctagcat
tattactatttattctcaatattagttggggggaatatcctgtccctcccttggcgatgct
ccaggccatctttgggctatctaccgatgctgaccatgaatttgtggtgcgtactctgcga
ttaccccggtccttggtggcattgttggtgggtatgggtttggcgatcgccggagggattt
tgcaaggcattacccgcaatcctttggcagcccctgaaattattggtgtcaatgcgggggc
tagtttggtggcggttaccttcatcgttttgctaccgggtatttctccttccttgctgcca
gtggccgctttttgcggtggtttaacagcggcgatcgccatttatgtgctggcttggaatc
agggcagtgcccccgtccgg
>AB001339|seq32|1
atgatgttgattactcctccagtggcaccatccccgtaaatggccgttggcccctggatca
cttcaatccgttcaatggcactgggagcaatggtttgcaaatctcggaaggcattacggtt
ggtggtttggggcacaccgtcaatcaaaaccaaaacgttacgtcctcgcaaagcctggcca
aattgactggcactcccggtgctgggggctaagcctggcactagttgacccaaaatatccg
ccaaggaagagtaaacctgggtttgttgctcaatttctgcccgttcaattaccgttaccga
ccggggaatgttagcgatttcctcctctgtacgggtggcggaaaccacaatttgtagggcc
tcactttcctctatctcggcggttgtcccggcaacccctggtcgaatcagcaattgtaacc
cttgcgagttaggctttacttcggcttccggtggcccatttacccccgtgatagctaagcg
cacttggttatcggtcatttgggtaacactgacaaacgcaatgtccgcagtggggctcact
tcttcaaacccctggcccccaggtaaggccatcaaagtattgggaagatcaataattaagg
cattgcccaccgtttgtagg
>AB001339|seq64|1
ccgtccccgtcttaccggtaaagtatttgagaattagttgcagttaaggttgttcctcctg
tgttatcagatgccatggccggctgtctcaactaagaatttcaagctttggtgcaaggagt
gattatgaatcaagtacagtggtcggttttgttgatgggtatagtttcgctactatgtgct
cccagggcgtgggccgaaactaatccgaaccaattgaacaggacgaatattttagaatctg
gtaacttagaacgcaccaaagccggtgatttgctcccagttgcaaccactgttgatgagtg
gataacccaaattgcccaagcttcgatcatcgaaatcaaggaagcccggatcaatttgacc
gaagctggactggaactgaccctggctaccacgggccgcttatcaacaccaaccacttccg
tagtgggcaatgcactaattgtagatattcccaatgccatcctagccttgccggatagtga
cggactgcaacaggaaaaccccaccgaagaaattgccctagtgagcgttacagcattacct
gataatattgttcgcattgccattaccggggtcaatgtgccgccgacggttgaagttaatg
ccacagaccaatccctggta
…