Program for mapping low complexity regions in nucleotide sequences.

Algorithm description

The search for low complexity regions uses Shannon's information measure, which is defined as follows:

$$H = -\sum_{i=1}^{k} P(a_i) \log P(a_i)$$

where {a1, ..., ak} is an alphabet of size k, and P(ai) is the fractional composition of ai, i.e. the fraction of positions occupied by ai.
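As an illustration of the measure, here is a minimal Python sketch that computes Shannon's information for one window of sequence. The function name `shannon_information` is hypothetical, and the base-2 logarithm (result in bits) is an assumption, since the base is not stated here. It is not the program's own code.

```python
import math
from collections import Counter


def shannon_information(window: str) -> float:
    """Shannon's information of a window, in bits (base 2 is an assumption).

    P(a) is estimated as the fraction of positions in the window occupied
    by symbol a. Symbols that do not occur contribute nothing to the sum.
    """
    n = len(window)
    if n == 0:
        return 0.0
    counts = Counter(window.upper())
    # P * log2(1 / P) is the same term as -P * log2(P).
    return sum((c / n) * math.log2(n / c) for c in counts.values())


print(shannon_information("AAAAAAAA"))
print(shannon_information("ACGTACGT"))
```

A window made of a single repeated nucleotide gives 0, and a window with all four nucleotides in equal proportion gives the maximum of 2 bits, so lower values indicate lower complexity.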

The search is carried out as follows. For each position i of the sequence S, Shannon's information H(i, l) is calculated in a window of size l, for every l in the range [lbegin, lend]. If H(i, l) falls below a prespecified threshold Hthr(l), the fragment [i, i+l] is declared low complexity. Merging all such fragments, which generally overlap, at the end of the calculation gives a map of the low complexity regions of the sequence S.
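The search described above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the program's implementation: the names `find_low_complexity`, `l_begin`, `l_end` and `h_thr` are hypothetical, the logarithm is taken in base 2, each flagged fragment covers exactly the window (1-based, inclusive), and the final map is built by merging overlapping or adjacent fragments.

```python
import math
from collections import Counter


def shannon_information(window):
    """Shannon's information of a window in bits (base 2 is an assumption)."""
    n = len(window)
    counts = Counter(window.upper())
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def find_low_complexity(seq, l_begin, l_end, h_thr):
    """Return merged low complexity regions as 1-based inclusive (p1, p2) pairs.

    h_thr is a callable mapping a window size l to its threshold Hthr(l).
    """
    fragments = []
    for l in range(l_begin, l_end + 1):
        for i in range(len(seq) - l + 1):
            if shannon_information(seq[i:i + l]) < h_thr(l):
                # Assumption: the fragment spans exactly the window.
                fragments.append((i + 1, i + l))

    # Merge overlapping or adjacent fragments into the final map.
    fragments.sort()
    merged = []
    for p1, p2 in fragments:
        if merged and p1 <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], p2))
        else:
            merged.append((p1, p2))
    return merged


seq = "ACGTACGTAC" + "A" * 12 + "CGTACGTACG"
for p1, p2 in find_low_complexity(seq, 8, 10, lambda l: 0.3):
    print(f"p1: {p1} p2: {p2} l: {p2 - p1 + 1}")
```

The sketch recomputes the symbol counts for every window, which is simple but slow on long sequences; updating the counts incrementally as the window slides would avoid that.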

Output examples


>c20
Masked regions:
p1:  90       p2:  115      l: 26        chain(+) [Low Complexity Region]
p1: 220       p2:  240      l: 21        chain(+) [Low Complexity Region]
....


>c20
GCCAAGAAGATATGTAGCATTAAGGTTTAGAATACAGGCTTTGAAGTCAAACAGACCAGAGTTAACAACCTCATTTTGTT
TTTATTTTCNNNNNNNNNNNNNNNNNNNNNNNNNNCTTTAAGTTCTAGGGTACATGTGCACAACGTGCAGGTTTGTTACA
TATGTATACATGTGCCATGTTGGTGTGCTGCACCCATTAACTGGACATTTACATTAGGTNNNNNNNNNNNNNNNNNNNNN
CCCTCCTCCCCTTACCCCACAACAGGCCCCGGTGTGTGATGTTCCCCTTCCTGTGTCCAAGTGTTCTCATTGTTCAGTTC
....


>c20
gccaagaagatatgtagcattaaggtttagaatacaggctttgaagtcaaacagaccagagttaacaacctcattttgtt
tttattttcTTTTTTAAAATTTTTTTAAAATTATActttaagttctagggtacatgtgcacaacgtgcaggtttgttaca
tatgtatacatgtgccatgttggtgtgctgcacccattaactggacatttacattaggtAAAAAAAAAAAAAAAAAAAAA
ccctcctccccttaccccacaacaggccccggtgtgtgatgttccccttcctgtgtccaagtgttctcattgttcagttc
....
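The two sequence outputs above (regions replaced by N, and regions shown in upper case within a lower-case sequence) can be derived from a region list along these lines. This is a sketch only: the function names `mask_with_n` and `mask_by_case` are hypothetical, and p1 and p2 are treated as 1-based inclusive coordinates, which matches the positions of the masked stretches in the examples.

```python
def mask_with_n(seq, regions):
    """Upper-case the sequence and replace each region with N."""
    chars = list(seq.upper())
    for p1, p2 in regions:
        for pos in range(p1 - 1, p2):  # 1-based inclusive -> 0-based
            chars[pos] = "N"
    return "".join(chars)


def mask_by_case(seq, regions):
    """Lower-case the sequence and show each region in upper case."""
    chars = list(seq.lower())
    for p1, p2 in regions:
        for pos in range(p1 - 1, p2):
            chars[pos] = chars[pos].upper()
    return "".join(chars)


regions = [(5, 9)]
print(mask_with_n("acgtaaaaacgt", regions))
print(mask_by_case("ACGTAAAAACGT", regions))
```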