Abstract
Background: Single-molecule real-time (SMRT) sequencing data are characterized by long read length and high read depth. Compared with next-generation sequencing (NGS), SMRT sequencing data can reveal more structural variations (SVs) and offer greater advantages for variant calling. However, SMRT sequencing data contain high levels of sequencing error and noise, which makes SV calling from these data inaccurate. Most existing tools cannot overcome these sequencing errors when detecting genomic deletions.
Objective: We propose MaxDEL, a new method for calling deletions from SMRT sequencing data.
Methods: First, MaxDEL uses a machine learning method to calibrate the deletion regions reported in the variant call format (VCF) file. Second, it applies a novel feature visualization method that converts the variant features into images, and it uses these images to call deletions accurately with a convolutional neural network (CNN).
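The idea of converting variant features into images can be illustrated with a minimal sketch. It is not MaxDEL's actual encoding: the choice of per-base read depth as the feature, the function name `depth_to_image`, and the `height` parameter are all assumptions made here for illustration only.

```python
def depth_to_image(depth, height=8):
    """Render a per-base read-depth track as a binary bar-chart image.

    Hypothetical illustration only: the feature (read depth) and the
    encoding are assumptions, not the encoding used by MaxDEL.
    Returns a list of `height` rows, each with one pixel per base.
    """
    max_depth = max(depth, default=0) or 1  # avoid division by zero
    image = [[0] * len(depth) for _ in range(height)]
    for col, d in enumerate(depth):
        # Number of pixels filled from the bottom, proportional to depth.
        filled = round(height * d / max_depth)
        for row in range(height - filled, height):
            image[row][col] = 1
    return image


# A candidate deletion appears as a gap (empty columns) in the image.
window = [10, 10, 0, 0, 10]
for row in depth_to_image(window, height=4):
    print("".join("#" if px else "." for px in row))
```

In such a representation, a drop in coverage becomes a visible gap that an image classifier such as a CNN could, in principle, learn to recognize.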
Results: MaxDEL achieves higher accuracy and recall for calling variants than existing methods on both real and simulated data.
Conclusion: MaxDEL effectively overcomes the noise in SMRT sequencing data and integrates recent machine learning and deep learning techniques. The method captures the variant features of deletions and establishes a learning model between the images and the gene data. In our experiments, MaxDEL outperforms NextSV, SVIM, Sniffles, Picky and SMRT-SV, especially in recall and F1-score.
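For reference, recall and F1-score are conventionally derived from counts of true positives, false positives, and false negatives. The sketch below uses these standard definitions; the function name and the example counts are hypothetical and are not results from this study.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall and F1 from call counts.

    tp: called deletions that match a true deletion
    fp: called deletions with no matching true deletion
    fn: true deletions that were not called
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts, for illustration only.
print(precision_recall_f1(tp=8, fp=2, fn=2))
```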