MaxDEL: Accurate and Efficient Calling of Genomic Deletions from
Single Molecular Real-time Sequencing Using Integrated Method

Xinyu Yu; Yaoxian Lv; Lei Cai; Jingyang Gao

doi:10.2174/1574893618666230224160716

Abstract

Background: Single-molecule real-time (SMRT) sequencing data are characterized by long reads and high read depth. Compared with next-generation sequencing (NGS), SMRT sequencing data can reveal more structural variations (SVs) and offer greater advantages for variant calling. However, SMRT sequencing data have a high error rate and considerable noise, which makes SV calling inaccurate. Most existing tools cannot overcome these sequencing errors when detecting genomic deletions.

Objective: We propose MaxDEL, a new method for calling deletions from SMRT sequencing data.

Methods: First, MaxDEL uses a machine learning method to calibrate the deletion regions in the variant call format (VCF) file. Second, it applies a novel feature visualization method that converts the variant features into images, and it uses a convolutional neural network (CNN) to call deletions accurately from these images.
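
The abstract does not specify how variant features are drawn as images, so the following is only a minimal sketch of the general idea, not MaxDEL's actual encoding. It assumes three hypothetical per-position feature tracks (read depth, deletion-supporting reads, clipped reads), resamples each to a fixed width, and scales it to 0..255 so that every candidate region becomes a one-row, multi-channel image of the same size. The function names and the choice of tracks are illustrative assumptions.

```python
def resample(values, width):
    """Linearly resample a list of numbers to exactly `width` samples."""
    n = len(values)
    if n == 1:
        return [float(values[0])] * width
    out = []
    for i in range(width):
        # Map output pixel i onto a fractional index of the input track.
        pos = i * (n - 1) / (width - 1) if width > 1 else 0.0
        left = int(pos)
        right = min(left + 1, n - 1)
        frac = pos - left
        out.append(values[left] * (1 - frac) + values[right] * frac)
    return out


def to_byte_scale(values):
    """Min-max scale numbers to integers in 0..255; a constant track maps to 0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    return [round((v - lo) / (hi - lo) * 255) for v in values]


def region_to_image(tracks, width=64):
    """Encode feature tracks for one candidate region as a 1 x width image.

    tracks: list of equal-length lists, one per (hypothetical) feature.
    Returns a single row of `width` pixels; each pixel is a tuple holding
    one channel value per track.
    """
    channels = [to_byte_scale(resample(t, width)) for t in tracks]
    return [list(zip(*channels))]


# Toy region of 200 positions with a deletion-like signal in the middle.
depth = [30] * 80 + [5] * 40 + [30] * 80
del_reads = [0] * 80 + [20] * 40 + [0] * 80
clipped = [0] * 200
image = region_to_image([depth, del_reads, clipped], width=64)
```

Resampling to a fixed width is one simple way to give a CNN inputs of constant shape regardless of deletion length; a real pipeline would more likely build these arrays with an array library and stack several rows per region.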

Results: MaxDEL performs better in accuracy and recall for calling variants than existing methods on both real and simulated data.

Conclusion: MaxDEL can effectively overcome the noise in SMRT sequencing data and integrates new machine learning and deep learning technologies. The method captures the variant features of deletions and establishes a learning model between images and gene data. In our experiments, MaxDEL is superior to NextSV, SVIM, Sniffles, Picky and SMRT-SV, especially in recall and F1-score.
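
For readers unfamiliar with the metrics named above, the sketch below shows one common way to compute precision, recall and F1-score for deletion calls against a truth set. The matching rule used here (same chromosome and at least 50% reciprocal overlap) is an assumption for illustration; the abstract does not state which criterion the authors used. Precision is included because F1-score is defined from precision and recall.

```python
def reciprocal_overlap(a, b):
    """Smaller of the two overlap fractions for intervals (chrom, start, end)."""
    if a[0] != b[0]:
        return 0.0
    shared = min(a[2], b[2]) - max(a[1], b[1])
    if shared <= 0:
        return 0.0
    return min(shared / (a[2] - a[1]), shared / (b[2] - b[1]))


def evaluate(calls, truth, min_overlap=0.5):
    """Return (precision, recall, f1) for called vs. true deletions.

    Each truth deletion can be matched by at most one call.
    """
    matched = set()
    tp = 0
    for call in calls:
        for i, true_del in enumerate(truth):
            if i not in matched and reciprocal_overlap(call, true_del) >= min_overlap:
                matched.add(i)
                tp += 1
                break
    precision = tp / len(calls) if calls else 0.0
    recall = tp / len(truth) if truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: 3 calls against 4 true deletions (coordinates are made up).
truth = [("chr1", 1000, 2000), ("chr1", 5000, 5600),
         ("chr2", 300, 900), ("chr2", 4000, 4500)]
calls = [("chr1", 1050, 1980), ("chr1", 5400, 6400), ("chr2", 310, 905)]
precision, recall, f1 = evaluate(calls, truth)
```

In this toy example two of the three calls match a true deletion, so precision is 2/3, recall is 2/4, and F1 is their harmonic mean, 4/7.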




DOI: https://dx.doi.org/10.2174/1574893618666230224160716
Print ISSN: 1574-8936
Online ISSN: 2212-392X
Publisher Name: Bentham Science Publisher

Current Bioinformatics
