KMerHuffman upon Biological Sequence Compression
DOI:
https://doi.org/10.26438/ijcse/v5i12.201206Keywords:
Storage, Transmission, Alignment, Analysis, Compression, HuffmanAbstract
Huge amount of genomic data are produced due to high-throughput sequencing technology. Those enormous volumes of sequence data require effective storage, fast transmission and provision of quick access for alignment and analysis to any record. It has been proved that standard general purpose lossless compression techniques failed to compress these sequences rather they may increase the size. But some general purpose compression method may be useful with a modification for genome compression. In this paper, a variation of statistical Huffman algorithm have been proposed named KMerHuffman, which instead of calculating frequency of individual character it comes as a substring of length four which we have experiment to be optimal due to redundancy of genome sequence. Then KMerHuffman result on benchmark sequence has been compare with the other biological sequence specific compression algorithm. The result shows that KMerHuffman is competitive with other method. Another important aspect is that there is no need of any reference sequence so it is useful for upcoming sequence.
References
[1] D.A. Huffman, “A method for the construction of minimum-redundancy codes”, Proc. Inst. Radio Eng., vol. 40, pp. 1098–1101, 1952.
[2] D.A. Lelewer, D.S. Hirschberg, “Data compression. Computing Surveys”, vol. 19, no. 3, pp. 261–296, 1987.
[3] A. Cannane, H.E. Williams, “General-Purpose Compression for Efficient Retrieval”, Journal of the American Society for Information Science & Technology, vol. 52, no. 5, pp. 430–437, 2001.
[4] Department of Chemistry, Queen Mary University of London, “Nomenclature for Incompletely Specified Bases in Nucleic Acid Sequences”.
[5] S. Wandelt, M. Bux, U. Leser, “Trends in Genome Compression”, June 4, 2013.
[6] P. R. Rajeswari and Dr. A. AppaRao, “DNABIT Compress – Genome compression algorithm”, Bioinformation,vol. 5 no. 8, (2011) January, pp. 350-360.
[7] S. Roy, A. Bhagat, K.A. Sharma, S. Khatua”, SBVRLDNACOMP: AN EFFECTIVE DNA SEQUENCE COMPRESSION ALGORITHM”, IJCSA, Vol.5, No.4, pp. 73-85, August 2015.
[8] S. Roy, S. Mondal, S. Khatua, M. Biswas, “An Efficient Compression Algorithm for Forthcoming New Species”, IJHIT, Vol.8, No.11, pp.323-332, November 2015.
[9] J. Ziv and A. Lempel, “A Universal Algorithm for Sequential Data Compression,” IEEE Trans. Information Theory, vol. IT-23, no. 3, pp. 337-343, May 1977.
[10] B. G. Chern, I. Ochoa, A. Manolakos, A. No, K. Venkat and T. Weissman, “Reference Based Genome Compression”, Information Systems Laboratory, Stanford University.
[11] M.C. Brandon, D.C. Wallace, P. Baldi, “Data structures and compression algorithms for genomic sequence data”, Bioinformatics, Vol. 25, no. 14, pages 1731–1738, May, 2009.
[12] http://site.iugaza.edu.ps/jroumy/files/Shanon-Fano.pdf
[13] A.S.E. Campos, “Arithmetic coding “http://www.arturocampos.com/ac_arithmetic.html. (Accessed 02 February 2009)
[14] G. Cormack, N. Horspool. Data compression using dynamic markov modelling. Comput. J., vol. 30: pp. 541-550, 1987.
[15] Ziv, Jacob and Lempel, Abraham, "A Universal Algorithm for Sequential Data Compression", IEEE Transactions on Information Theory, Vol. 23, no. 3, pp. 337–343, May 1977.
[16] Ziv, Jacob and Lempel, Abraham, "Compression of Individual Sequences via Variable-Rate Coding", IEEE Transactions on Information Theory, vol. 24, no. 5, pp. 530–536, September 1978.
[17] S. Kuruppu, S.J. Puglisi, and J. Zobel, “Relative Lempel-Ziv Compression of Genomes for Large-Scale Storage and Retrieval,” Proc. 17th Int’l Conf. String Processing and Information Retrieval (SPIRE ’10), pp. 201-206, 2010.
[18] S. Kuruppu, S. Puglisi, and J. Zobel, “Optimized Relative Lempel- Ziv Compression of Genomes”, Australasian Computer Science Conf., 2011.
[19] S. Deorowicz and S. Grabowski, “Robust Relative Compression of Genomes with Random Access”, Bioinformatics, vol. 27, pp. 2979- 2986, Nov. 2011.
[20] S. Kuruppu, B. Beresford-Smith, T. Conway, and J. Zobel, “Iterative Dictionary Construction for Compression of Large DNA Data Sets”, IEEE/ACM Trans. Computational Biology and Bioinformatics, vol. 9, no. 1, Jan./Feb. 2012.
[21] S. Grumbach and F. Tahi, “Compression of DNA sequences”, IEEE Symp. on the Data Compression Conf., DCC-93, Snowbird, UT, (1993), pp. 340–350.
[22] S. Grumbach and F. Tahi, “A new challenge for compression algorithms: genetic sequences”, Info. Process. & Manage, Elsevier, (1994), pp.875-866.
[23] X. Chen, S. Kwong, M. Li, “A compression algorithm for DNA sequences and its applications in genome comparison”, In Proc. 4th Annual Int. Conf. Computation. Molecular Biol. (RECOMB), pp.107 – 117, 2000.
[24] X. Chen, M. Li, B. Ma and J. Tromp, “DNACompress: Fast and Effective DNA Sequence Compression”, Bioinformatics, vol. 18, (2002) June, pp. 1696-1698.
[25] G. Korodi and I. Tabus, “An Efficient Normalized MaximumLikelihood Algorithm for DNA Sequence Compression”, ACM Trans. Information Systems, vol. 23, no. 1, pp. 3-34, 2005.
[26] J. Zhen, J. Zhou, L. Jiang, Q. H. Wu. Overview of DNA sequence data compression techniques. Acta Electronica Sinica, pp. 1113 – 1121, 2010.
[27] L. TAN, J. SUN, W. XIONG, “A Compression Algorithm for DNA Sequence Using Extended Operations”, Journal of Computational Information Systems, vol. 8, no. 18 pp. 7685–7691, 2012.
[28] The GeNML homepage: http://www.cs.tut.fi/~tabus/genml/results.html.
Downloads
Published
How to Cite
Issue
Section
License
Copyright (c) 2025 S Roy, S Khatua

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors contributing to this journal agree to publish their articles under the Creative Commons Attribution 4.0 International License, allowing third parties to share their work (copy, distribute, transmit) and to adapt it, under the condition that the authors are given credit and that in the event of reuse or distribution, the terms of this license are made clear.
