ESREEM: Efficient Short Reads Error Estimation Computational Model for Next-generation Genome Sequencing

Muhammad Tahir; Muhammad Sardaraz; Zahid Mehmood; Muhammad Saud Khan

doi:10.2174/1574893615999200614171832

Abstract

Aims: To assess the error profile of NGS data generated by high-throughput sequencing machines.

Background: Short-read sequencing data from Next Generation Sequencing (NGS) are currently being generated by a number of research projects. Characterizing the errors produced by NGS platforms and deriving accurate genetic variation from reads are two interdependent tasks. Both are highly significant in various analyses, such as genome sequence assembly, SNP calling, evolutionary studies, and haplotype inference. Systematic and random errors show a characteristic incidence profile for each sequencing platform, i.e., Illumina sequencing, Pacific Biosciences, 454 pyrosequencing, Complete Genomics DNA nanoball sequencing, Ion Torrent sequencing, and Oxford Nanopore sequencing. Advances in NGS deliver vast volumes of data, along with additional errors. A fraction of these errors may mimic genuine biological signals, e.g., mutations, and may subsequently invalidate the results. Various independent applications have been proposed to correct sequencing errors. A systematic analysis of these algorithms shows that state-of-the-art models are missing.

Objective: In this paper, an efficient error estimation computational model called ESREEM is proposed to assess the error rates in NGS data.

Methods: The proposed model rests on the premise that there is a linear regression relationship between the number of reads containing errors and the number of reads sequenced. The model is based on a probabilistic error model integrated with the Hidden Markov Model (HMM).
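
To illustrate the linear relationship mentioned in the Methods, the following is a minimal sketch of an ordinary least-squares fit of erroneous-read counts against total read counts. It is not the ESREEM implementation: the function name `fit_line` and the toy counts are hypothetical and were chosen to lie exactly on the line y = 2 + 0.01x, and the HMM component of the model is not shown.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    if n < 2 or n != len(y):
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squared deviations of x, and of cross-deviations of x and y.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical toy counts (not data from the paper): number of reads
# sequenced and number of those reads that contain at least one error.
reads_sequenced = [1000, 2000, 4000, 8000]
reads_with_errors = [12, 22, 42, 82]

slope, intercept = fit_line(reads_sequenced, reads_with_errors)
print(f"slope={slope:.4f}, intercept={intercept:.2f}")
```

Under this simplified reading, the slope is the estimated fraction of additional reads that contain errors as more reads are sequenced.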

Results: The proposed model is evaluated on several benchmark datasets, and the results are compared with those of state-of-the-art algorithms.

Conclusion: Analysis of the experimental results shows that the proposed model estimates errors efficiently and runs in less time than the compared algorithms.

Keywords: NGS, genome, sequencing, error analysis, computational, algorithms.




Journal: Current Bioinformatics
Publisher: Bentham Science Publishers
Print ISSN: 1574-8936
Online ISSN: 2212-392X
