ABSplit
The program determines whether a nucleotide sequence of approximately 300-600 bp belongs to an archaeal or a bacterial genome.
Sequences are classified by linear discriminant analysis. Each sequence is represented by a set of statistical parameters: mono-, di-, and trinucleotide frequencies, plus the linear correlation coefficients (2 additional parameters) and the mean absolute deviations (2 additional parameters) between the codon frequencies in the longest ORF found in the query sequence and the codon frequencies in archaeal and bacterial genomes.
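The parameters listed above can be illustrated with a minimal Python sketch. It is not the ABSplit implementation: the function names, the feature order, the use of overlapping windows for k-mer counting, and the reading of the "2 additional parameters" as one value per reference group (archaeal, bacterial) are all assumptions. The longest ORF is taken as given (`orf`), and the reference codon tables are hypothetical inputs.

```python
from collections import Counter
from itertools import product
from math import sqrt

ALPHABET = "acgt"

def kmer_frequencies(seq, k):
    """Relative frequency of every k-mer over acgt (overlapping windows)."""
    seq = seq.lower()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]

def codon_frequencies(orf):
    """Relative frequency of the 64 codons, read in frame from the ORF start."""
    orf = orf.lower()
    counts = Counter(orf[i:i + 3] for i in range(0, len(orf) - 2, 3))
    codons = ["".join(p) for p in product(ALPHABET, repeat=3)]
    total = sum(counts[c] for c in codons)
    return [counts[c] / total if total else 0.0 for c in codons]

def pearson(x, y):
    """Linear (Pearson) correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx and vy else 0.0

def mean_abs_dev(x, y):
    """Mean absolute difference between two frequency vectors."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)

def feature_vector(seq, orf, archaeal_codons, bacterial_codons):
    """4 + 16 + 64 k-mer frequencies followed by 4 codon-usage parameters."""
    q = codon_frequencies(orf)
    return (
        kmer_frequencies(seq, 1)
        + kmer_frequencies(seq, 2)
        + kmer_frequencies(seq, 3)
        + [pearson(q, archaeal_codons), pearson(q, bacterial_codons),
           mean_abs_dev(q, archaeal_codons), mean_abs_dev(q, bacterial_codons)]
    )
```

Under these assumptions each sequence maps to a vector of 88 numbers, which would then be passed to a linear discriminant function.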
The training and test data were drawn from 157 genomes (21 archaeal and 136 bacterial). Each genome was split into fragments of 630 nucleotides, and every 7th fragment was assigned to the test set. This gave 651612 training fragments and 93008 test fragments. The parameters of the linear discriminant function were estimated on the training set. Testing produced the following error estimates:
Number of sequences=93008 (class(A)=9158;class(B)=83850)
Archea(number/fraction)=18123/0.194854; mean_score=929428.413570
Bacteria(number/fraction)=74885/0.805146; mean_score=-1295582.386205
Test results:
Fraction of true predictions: 0.865141[80465]
Class 0: (Archea)
Fraction of true positives : 0.804652[7369]
Fraction of false negatives : 0.195348[1789]
Class 1: (Bacteria)
Fraction of true positives : 0.871747[73096]
Fraction of false negatives : 0.128253[10754]
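The hold-out split described before these results, and the fractions reported in them, can be sketched as follows. This is an illustration under stated assumptions, not the original procedure: `split_fragments` assumes non-overlapping fragments and sends exactly every `period`-th fragment to the test set, whereas the reported counts (651612 training, 93008 test) give a ratio of roughly 7 to 1, so the actual scheme may differ in detail. `error_estimates` only recomputes the fractions from the bracketed counts.

```python
def split_fragments(genome, length=630, period=7):
    """Cut a genome into non-overlapping fragments of `length`;
    every `period`-th fragment goes to the test set (assumed scheme)."""
    fragments = [genome[i:i + length]
                 for i in range(0, len(genome) - length + 1, length)]
    train, test = [], []
    for index, fragment in enumerate(fragments, start=1):
        (test if index % period == 0 else train).append(fragment)
    return train, test

def error_estimates(true_pos_a, false_neg_a, true_pos_b, false_neg_b):
    """Recompute the reported fractions from per-class counts."""
    total_a = true_pos_a + false_neg_a
    total_b = true_pos_b + false_neg_b
    return {
        "accuracy": (true_pos_a + true_pos_b) / (total_a + total_b),
        "archaea_tp": true_pos_a / total_a,
        "archaea_fn": false_neg_a / total_a,
        "bacteria_tp": true_pos_b / total_b,
        "bacteria_fn": false_neg_b / total_b,
    }

# Counts taken from the test results above.
report = error_estimates(7369, 1789, 73096, 10754)
```

Because the test set is dominated by bacterial fragments (83850 of 93008), the overall fraction of true predictions mostly reflects the bacterial class; the per-class fractions are the more informative figures for archaeal sequences.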
LDF discrimination threshold=0.000000
Prediction results:
Number of sequences=129
Arch(num/fract)=64/0.496124; mean_score=1173110.225735
Bact(num/fract)=65/0.503876; mean_score=-679245.160401
Histogram:
1 -1653112.270017 -1492294.115256 0.007752
2 -1492294.115256 -1331475.960496 0.015504
3 -1331475.960496 -1170657.805735 0.015504
4 -1170657.805735 -1009839.650974 0.038760
5 -1009839.650974 -849021.496214 0.069767
6 -849021.496214 -688203.341453 0.085271
7 -688203.341453 -527385.186693 0.093023
8 -527385.186693 -366567.031932 0.108527
9 -366567.031932 -205748.877172 0.023256
10 -205748.877172 -44930.722411 0.038760
11 -44930.722411 115887.432349 0.031008
12 115887.432349 276705.587110 0.054264
13 276705.587110 437523.741870 0.015504
14 437523.741870 598341.896631 0.023256
15 598341.896631 759160.051392 0.062016
16 759160.051392 919978.206152 0.023256
17 919978.206152 1080796.360913 0.015504
18 1080796.360913 1241614.515673 0.038760
19 1241614.515673 1402432.670434 0.046512
20 1402432.670434 1563266.457703 0.038760
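The prediction summary can be read as a thresholded score followed by a histogram of the scores. The sketch below is an assumption about how such output could be produced, not ABSplit's code: the sign convention (score above the threshold means archaeal) is inferred from the positive mean score of the Arch group, and the equal-width binning with rows of lower bound, upper bound, and fraction is a guess at the histogram layout.

```python
def classify(score, threshold=0.0):
    """Assumed convention: scores above the threshold are called archaeal."""
    return "archaeal" if score > threshold else "bacterial"

def histogram(scores, bins=20):
    """Equal-width histogram as (lower, upper, fraction of sequences) rows."""
    lo, hi = min(scores), max(scores)
    width = (hi - lo) / bins
    counts = [0] * bins
    for s in scores:
        # The maximum score falls into the last bin rather than past it.
        idx = min(int((s - lo) / width), bins - 1) if width else 0
        counts[idx] += 1
    return [(lo + i * width, lo + (i + 1) * width, counts[i] / len(scores))
            for i in range(bins)]
```

For example, `classify(929428.4)` gives `"archaeal"` and `classify(-1295582.4)` gives `"bacterial"` under this convention.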
Predicted archaeal sequences: >AB001339|seq56|1 ttagtcagggggccccgccgatgaaaccggggacagctactaaacccattgccagtggtgg tggtagctctggccctagtctgggctccggccaacccagagcagaacggcccggtggcggc aatgcaggggcaaatgttggtcccattgcggccaatcccgttgctagtagtgctcccccta aaccgaaaccaactcccagttcccccgctaagccagaccccttaaagtgcgttagccaatg taaacccagttatccctccatcctccagggggaagaaggtagtgctacagtattaatttca gtaaatgatagtggtggtgtgaccagcgtaaccatcaccaatgcccacggcaacagcgagg tcaaccgccaggccctattggcagccagaaaaatgcagtttacggcccccgccagtggtca atccaaatcagtccctgtggtgattcacttcaccgttgctggttcagactttgatcgtcag gcgagggagcgtcagcaacagcaggaagagttgcgtcaggccgcccgcagagcagaagagg aaaaggcaaatcaagcccgtcagagacagttggaagaggagcgtcaagcccgccaacggca attagagaaagaacgggaag >AB001339|seq128|1 aggcttccaagcaagcttcaattaaggatttttccagaaagggatcccccacctgcaccgc tgggcgatcgtccatggactgatccgttaactcagcactggcaaaactggctccccccatg ccatcccgtcccgtggtggaaccgacatataaaactggattgcctatcccagaagccccag ctttgacaatttcttccgtttccatcaaacccaaggccatggcgttgacgaggggattacc ggagtaagccggatcaaagtagatttccccgcccacagtgggcacaccaacacaattaccg taatgactgatcccatccactaccccggtgaaaatacgtcgattcctagcatcgtccaaat taccgaaccgtagggaatttaaaatggcgatcggcctcgctcccatggtgaaaatatcccg cagaatcccccctactccggtggcggctccctggaatggctccactgcggaaggatggtta tgggattcgattttaaacgccaatctcaggccatcccccaaatctacgaccccggcatttt ccccaggccccactaaaatgcgttctccttcggtgggaaagttactcagtaggggacggga atttttataacaacaatgtt >AB001339|seq184|1 attttcccgaagaaactacctccgatgcttggctgaccccagcagatgccggccaggatgg tgatgcccaggaaccggcggaagatgggggagaagaaggagtagtgtcggaagaactggcc ctgcctgaggacttacctcctatggatgccatggtggcggcagtggaagaaatgactccgg tggtggtgcccgaaactgtaccagaaacagaaaccccagccttagaggatttggtcgccca aaagaccgccctggaaaaggacattgccgctctgcaacgggaaaaagcccagtggtatggc cagcagttccagcaattacagcgggaaatggcccggttagtggaggaaggcaccagggaat tagggcaaagaaaagcagctctggaaaaggaaattgagaagttagagcgccgtcaggaacg gattcaacaggaaatgcgtaccacttttgccggggcttcccaggagttggccatccgcgtg cagggctttaaggattatttggtggggagtttgcaggatttggtttccgccgccgaccagt tggaattaggggtgggggacagttgggagtcttcctctacccatggggatgcgattattga 
aaatgccgacccaactccgg >AB001339|seq336|1 tctgccagctttgccattaatttccgcctcgatcccaccgaggtcgttaccattcgccgca cccaaggcacgttacaaaatattgtcgccaagattattgctccccaaacccaggaatcttt taaaattgccgccgcgcgacgcacagtggaagaagccatcaccaaacggagcgagttgaag gaagactttgataacgcccttaattcccgcctggagaaatacggcatcattgttctggaca ccagtgtggtggatttagccttctcccccgaatttgccaaggcggtggaggaaaaacaaat tgctgagcagagagcccagcgggcagtgtatgtggcccaggaagcggaacaacaggcccag gcggacatcaaccgagccaaggggaaggcagaagcccaacggttactggcggaaactttaa aagctcaggggggggaattagtcctacaaaaagaggcgatcgaagcttggcgggaaggggg ggctcccatgcccaaggttttggtgatggggggagaaggcaaggggtctgcggttcccttt atgtttaacctaactgacctggctaactagcggcagcggggaagttataggtcccagggct cctgcctgacctttaggtcc … Predicted bacterial sequences: >AB001339|seq8|1 ctgttacgtgttttgttgcaaacggaactttttgcagtagttagctccgttgttgccgata ccagtcaatggtatttttcaatccttcccgcaagctcacctgggcttcaaacccaaattct gctttagctttggtggtgtctaaacagcgacggggctggccgttgggttgatcggtttccc aaataatgtccccctcaaactccatcagttcacagattaattccgttaagtctttgatgga aatttcaaaattggtgcctaggttaaccggatcggctttgtcgtaggcttgggttcccatc acaatgccccgggccgcatcagtggagtaaagaaattccctggtgggactgccgtcgcccc aaacgggtaattgtttttgtccagctttttgcgcttcgtaaaccttatggatcaaggcagg aatcacgtgggaactgcggggatcgaagttatcttctgggccgtaaagatttactggcaag aggtaaatgccattaaagccatactgcaagcggtaggattccagttgcaccaacaatgctt tcttggccacgccgtagggagcgttggtttcttcaggataaccgttccataagtcttcttc cttaaagggtacaggggtaa >AB001339|seq24|1 cctttttttatttatcttgcccgctcccaaattaaataatcaaacctaacgggtcaactcc aaagacaacccaaggccattccaggctaattgattgaatcccgaattttattaactgtttg ttccatttgtgccatgtttgcccctcgaccttggattgtggtccgtctccggtctttaccc ctatcgtttcgcctcgatcgccatgtccccttggtaatgggattacttactgctctagcat tattactatttattctcaatattagttggggggaatatcctgtccctcccttggcgatgct ccaggccatctttgggctatctaccgatgctgaccatgaatttgtggtgcgtactctgcga ttaccccggtccttggtggcattgttggtgggtatgggtttggcgatcgccggagggattt tgcaaggcattacccgcaatcctttggcagcccctgaaattattggtgtcaatgcgggggc tagtttggtggcggttaccttcatcgttttgctaccgggtatttctccttccttgctgcca 
gtggccgctttttgcggtggtttaacagcggcgatcgccatttatgtgctggcttggaatc agggcagtgcccccgtccgg >AB001339|seq32|1 atgatgttgattactcctccagtggcaccatccccgtaaatggccgttggcccctggatca cttcaatccgttcaatggcactgggagcaatggtttgcaaatctcggaaggcattacggtt ggtggtttggggcacaccgtcaatcaaaaccaaaacgttacgtcctcgcaaagcctggcca aattgactggcactcccggtgctgggggctaagcctggcactagttgacccaaaatatccg ccaaggaagagtaaacctgggtttgttgctcaatttctgcccgttcaattaccgttaccga ccggggaatgttagcgatttcctcctctgtacgggtggcggaaaccacaatttgtagggcc tcactttcctctatctcggcggttgtcccggcaacccctggtcgaatcagcaattgtaacc cttgcgagttaggctttacttcggcttccggtggcccatttacccccgtgatagctaagcg cacttggttatcggtcatttgggtaacactgacaaacgcaatgtccgcagtggggctcact tcttcaaacccctggcccccaggtaaggccatcaaagtattgggaagatcaataattaagg cattgcccaccgtttgtagg >AB001339|seq64|1 ccgtccccgtcttaccggtaaagtatttgagaattagttgcagttaaggttgttcctcctg tgttatcagatgccatggccggctgtctcaactaagaatttcaagctttggtgcaaggagt gattatgaatcaagtacagtggtcggttttgttgatgggtatagtttcgctactatgtgct cccagggcgtgggccgaaactaatccgaaccaattgaacaggacgaatattttagaatctg gtaacttagaacgcaccaaagccggtgatttgctcccagttgcaaccactgttgatgagtg gataacccaaattgcccaagcttcgatcatcgaaatcaaggaagcccggatcaatttgacc gaagctggactggaactgaccctggctaccacgggccgcttatcaacaccaaccacttccg tagtgggcaatgcactaattgtagatattcccaatgccatcctagccttgccggatagtga cggactgcaacaggaaaaccccaccgaagaaattgccctagtgagcgttacagcattacct gataatattgttcgcattgccattaccggggtcaatgtgccgccgacggttgaagttaatg ccacagaccaatccctggta …