FCompress: An Algorithm for FASTQ Sequence Data Compression

Muhammad Sardaraz; Muhammad Tahir

doi:10.2174/1574893613666180322125337

Abstract

Background: Biological sequence data have grown rapidly owing to advances in sequencing technologies and the falling cost of sequencing. This growth presents significant research challenges. Beyond meaningful analysis, storage is itself a challenge, because data production is outpacing storage capacity. Data compression reduces the size of the data and therefore lowers both storage requirements and the cost of transmission over the internet.

Objective: This article presents a novel compression algorithm (FCompress) for Next Generation Sequencing (NGS) data in FASTQ format.

Method: The proposed algorithm compresses bases using bit manipulation and dictionary-based compression. Headers are compressed with reference-based compression, and quality scores are compressed with Huffman coding.
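The abstract does not give implementation details, so the following is only a minimal sketch of two of the generic techniques named above: packing bases into 2 bits each with bit manipulation, and Huffman coding of a quality string. The base-to-code mapping, all function names, and the sample strings in the usage note are assumptions made for illustration, not FCompress's actual design. Ambiguous bases such as N, the dictionary-based stage, and reference-based header compression are not covered.

```python
import heapq
from collections import Counter

# Assumed 2-bit codes for the four bases; the paper's actual mapping is not stated.
BASE_CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASES = "ACGT"


def pack_bases(seq):
    """Pack an A/C/G/T string into bytes, four bases per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | BASE_CODES[base]
        # Left-align a short final chunk so unpacking reads from the top bits.
        byte <<= 2 * (4 - len(chunk))
        out.append(byte)
    return bytes(out)


def unpack_bases(packed, length):
    """Recover the first `length` bases from bytes produced by pack_bases."""
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            bases.append(CODE_BASES[(byte >> shift) & 3])
    return "".join(bases[:length])


def huffman_codes(text):
    """Build a prefix-free Huffman code table mapping symbol -> bit string."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs a one-bit code.
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie-breaker, {symbol: code built so far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Merge the two lightest subtrees, prefixing their codes with 0 and 1.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]


def huffman_encode(text, codes):
    """Concatenate the code of each symbol into one bit string."""
    return "".join(codes[ch] for ch in text)


def huffman_decode(bits, codes):
    """Decode a bit string by matching prefix-free codes left to right."""
    inverse = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)
```

As a usage example, `unpack_bases(pack_bases("ACGTACGTTG"), 10)` returns the original read, and `huffman_decode(huffman_encode(q, codes), codes)` with `codes = huffman_codes(q)` returns the original quality string `q`. A real encoder would also store the read length and the code table alongside the payload, and would write the Huffman bits as packed bytes rather than as a character string.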

Results: The proposed algorithm is validated experimentally on real datasets, and the results are compared with both general-purpose and specialized compression programs.

Conclusion: The proposed algorithm achieves a better compression ratio than the other algorithms in comparable time.

Keywords: High-throughput sequencing, NGS technologies, NGS sequence compression, Huffman coding, FCompress, algorithm.




DOI: https://dx.doi.org/10.2174/1574893613666180322125337
Print ISSN: 1574-8936
Online ISSN: 2212-392X
Publisher: Bentham Science Publisher
Journal: Current Bioinformatics
