Abstract—In this paper, we propose an efficient way of finding the exact distance in sequence comparison by using Huffman coding method for alphabets with uniform symbol probabilities. The approach is proposed as a refinement for word pair comparison in D2 statistics, though it can readily be generalised. Two given sequences with identical lengths are encoded to Huffman binary codes by which we are able to calculate Hamming Distance using binary operations efficiently. This method is applied on D2 statistics to compare
k-tuples faster than its original version. The evaluation on emprical sequences showed that the method is faster than original D2; especially, when re-using the encoded sequences which resulted in better performance.
Index Terms—Performance, experimentation, string and sequence processing, huffman encoding, hamming distance, D2 statistics, string comparison, binary codes, bitwise operation.
The author is with the Faculty of Science and Engineering, Queensland University of Technology, Australia (e-mail: H.Kamelrahimi@qut.edu.au, David.Poxon@qut.edu.au).
[PDF]
Cite: Hossein Kamel Rahimi, "Efficient Sequence Comparison Using Binary Codes," International Journal of Future Computer and Communication vol. 4, no. 3, pp. 165-169, 2015.