Reference Based Genomic Data Compression Using R Programming

Authors

  • Rani MS, Dept. of Computer Science and Applications, The Gandhigram Rural Institute (Deemed to be University), Dindigul, India
  • Bose JC, Dept. of Computer Science and Applications, The Gandhigram Rural Institute (Deemed to be University), Dindigul, India

Keywords:

FASTA file, Genomic data compression, R programming, Huffman coding, Big data

Abstract

Genomics has become an active research area in medicine, with applications in diagnosing monogenic disorders, pharmacogenetics, targeted therapy, genome editing, and personalized medicine. Each human genome consists of about 3 billion base pairs, which must be stored and transmitted efficiently for analysis. This necessitates the development of novel genomic data compression algorithms. In this paper, a reference-based method for compressing genomes is proposed. The input genome is compared with a reference genome, and the dissimilarities are then entropy coded to achieve a high compression ratio.
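
The general idea named in the abstract (record where the input differs from a reference, then entropy code those differences) can be illustrated with a minimal sketch. This is not the authors' R implementation, which is not reproduced on this page. It is written in Python, the function names (`diff_against_reference`, `huffman_code`, `compress`, `decompress`) are hypothetical, and it assumes equal-length sequences that differ only by substitutions. Huffman coding is used as the entropy coder because it appears in the keywords, and the output is kept as a string of "0"/"1" characters rather than packed bits.

```python
import heapq
from collections import Counter


def diff_against_reference(target, reference):
    """Return (position, base) pairs where target differs from reference.

    Assumption for this sketch: equal-length sequences, substitutions only.
    """
    if len(target) != len(reference):
        raise ValueError("this sketch only handles equal-length sequences")
    return [(i, t) for i, (t, r) in enumerate(zip(target, reference)) if t != r]


def huffman_code(symbols):
    """Build a Huffman code table {symbol: bitstring} from a symbol stream."""
    freq = Counter(symbols)
    if not freq:
        return {}
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    # Heap entries are (weight, tie_breaker, {symbol: partial_code}).
    # The unique tie_breaker ensures the dicts are never compared.
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]


def compress(target, reference):
    """Encode target as Huffman-coded differences from reference."""
    tokens, prev = [], 0
    for pos, base in diff_against_reference(target, reference):
        tokens.append(pos - prev)  # gap since the previous mismatch
        tokens.append(base)        # substituted base
        prev = pos
    table = huffman_code(tokens)
    bits = "".join(table[t] for t in tokens)
    return bits, table


def decompress(bits, table, reference):
    """Rebuild the target from the bitstring, code table, and reference."""
    decode = {code: sym for sym, code in table.items()}
    tokens, current = [], ""
    for b in bits:
        current += b
        if current in decode:  # Huffman codes are prefix-free
            tokens.append(decode[current])
            current = ""
    out, pos = list(reference), 0
    for gap, base in zip(tokens[0::2], tokens[1::2]):
        pos += gap
        out[pos] = base
    return "".join(out)
```

Calling `compress(target, reference)` returns the bitstring together with the code table, and `decompress(bits, table, reference)` restores the original sequence. A practical compressor would also need to handle insertions and deletions and to store the code table compactly, both of which this sketch leaves out.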

Published

2025-11-13

How to Cite

[1]
M. M. S. Rani and S. J. C. Bose, “Reference Based Genomic Data Compression Using R Programming”, Int. J. Comp. Sci. Eng., vol. 6, no. 4, pp. 328–332, Nov. 2025.