Reference Based Genomic Data Compression Using R Programming
Keywords:
FASTA file, Genomic data compression, R-Programming, Huffman coding, Big Data
Abstract
Genomics has become an active research area in medicine, with applications in the diagnosis of monogenic disorders, pharmacogenetics, targeted therapy, genome editing, and personalized medicine. Each human genome consists of about 3 billion base pairs, which must be stored and transmitted efficiently for analysis. This need motivates the development of new genomic data compression algorithms. This paper proposes a reference-based method for compressing genomes: the input genome is compared with a reference genome, and the differences between them are entropy coded to achieve a high compression ratio.
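The idea described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation, which is written in R and is not reproduced on this page; Python is used here purely for illustration. The sketch assumes that the two sequences have equal length (so only substitutions are handled), that each difference is stored as a hypothetical (gap since previous mismatch, new base) pair, and that Huffman coding serves as the entropy coder, as the keywords suggest. All function names are illustrative.

```python
import heapq
from collections import Counter
from itertools import count


def diff_against_reference(reference, target):
    """Return (gap, base) pairs for positions where target differs from reference.

    Assumption: equal-length sequences, so only substitutions are handled.
    """
    if len(reference) != len(target):
        raise ValueError("this sketch handles equal-length sequences only")
    diffs, last = [], 0
    for i, (r, t) in enumerate(zip(reference, target)):
        if r != t:
            diffs.append((i - last, t))  # gap since the previous mismatch
            last = i
    return diffs


def huffman_code(symbols):
    """Build a prefix code (symbol -> bit string) from symbol frequencies."""
    freq = Counter(symbols)
    if not freq:
        return {}
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    tie = count()  # unique tie-breaker so the dicts are never compared
    heap = [(n, next(tie), {s: ""}) for s, n in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)
        n2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in c1.items()}
        merged.update({s: "1" + b for s, b in c2.items()})
        heapq.heappush(heap, (n1 + n2, next(tie), merged))
    return heap[0][2]


def encode(symbols, code):
    """Concatenate the code words of all symbols into one bit string."""
    return "".join(code[s] for s in symbols)


def decode(bits, code):
    """Invert encode() by matching prefix-free code words bit by bit."""
    inverse = {b: s for s, b in code.items()}
    out, cur = [], ""
    for bit in bits:
        cur += bit
        if cur in inverse:
            out.append(inverse[cur])
            cur = ""
    return out


def reconstruct(reference, diffs):
    """Apply the (gap, base) substitutions to the reference."""
    seq, pos = list(reference), 0
    for gap, base in diffs:
        pos += gap
        seq[pos] = base
    return "".join(seq)


# Usage: toy sequences, not real genomic data.
reference = "ACGTACGTAC"
target = "ACGAACGTTC"
diffs = diff_against_reference(reference, target)
code = huffman_code(diffs)
bits = encode(diffs, code)
restored = reconstruct(reference, decode(bits, code))
```

Only the differences and the code table need to be stored alongside a pointer to the shared reference, which is why closely related genomes compress well under this kind of scheme. A practical compressor would also have to handle insertions and deletions, which this sketch leaves out.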
License

This work is licensed under a Creative Commons Attribution 4.0 International License.
