Abstract
Background: Single Amino Acid Polymorphisms (SAPs), or nonsynonymous Single Nucleotide Variants (nsSNVs), are the most common genetic variations. They result from missense mutations, in which a single base-pair substitution changes a codon (a triplet of bases) so that it codes for a different amino acid at that position. Because genetic mutations sometimes cause genetic diseases, it is important to understand and predict which variations are harmful and which are neutral (causing no change in the phenotype). This can be posed as a classification problem.
Methods: Computational methods using machine intelligence are gradually replacing repetitive and very expensive mutagenesis experiments. In general, the uneven quality, gaps, and inconsistencies of nsSNV datasets limit the usefulness of artificial intelligence-based methods. Consequently, more robust and more accurate approaches are needed to address these problems. In the present work, we present a consensus classifier built on the holdout sampler, which appears robust and accurate and outperforms other popular methods.
Results: We generated 100 holdouts to test the architectures and the various classification parameters of different classifiers during the training phase. The best-performing holdouts were selected to build a consensus classifier, which was tested using k-fold (1 ≤ k ≤ 5) cross-validation. We also examined which protein properties have the greatest impact on the accurate prediction of the effects of nsSNVs.
Conclusion: Our Consensus Holdout Sampler outperforms other popular algorithms and gives excellent results: high accuracy with low standard deviation. The advantage of our method arises from using a tree of holdouts, in which diverse ML/AI-based programs are sampled in diverse ways.
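The general idea of a consensus over holdouts can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the synthetic three-feature data, the one-feature threshold classifiers (`fit_stump`), the 75/25 split, the choice to keep the 15 best holdouts, and the majority vote. It also scores the consensus on a separate held-out set rather than by k-fold cross-validation.

```python
import random
from statistics import mean

def make_toy_data(n, rng):
    """Synthetic stand-in for an nsSNV dataset (hypothetical features and labels)."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)  # toy labels: 1 = deleterious, 0 = neutral
        feats = [rng.gauss(1.5 * label, 1.0) for _ in range(3)]
        data.append((feats, label))
    return data

def fit_stump(train, feature):
    """Fit a one-feature threshold rule at the midpoint of the two class means."""
    pos = [x[feature] for x, y in train if y == 1]
    neg = [x[feature] for x, y in train if y == 0]
    return feature, (mean(pos) + mean(neg)) / 2

def predict(stump, x):
    feature, threshold = stump
    return 1 if x[feature] > threshold else 0

def accuracy(predict_fn, samples):
    return sum(predict_fn(x) == y for x, y in samples) / len(samples)

def consensus_holdout(data, n_holdouts=100, keep=15, train_frac=0.75, seed=0):
    """Train one simple model per random holdout, keep the best, vote."""
    rng = random.Random(seed)
    scored = []
    for _ in range(n_holdouts):
        shuffled = data[:]
        rng.shuffle(shuffled)
        cut = int(train_frac * len(shuffled))
        train, valid = shuffled[:cut], shuffled[cut:]
        # Each holdout samples a different model configuration (here: a feature).
        stump = fit_stump(train, feature=rng.randrange(3))
        score = accuracy(lambda x: predict(stump, x), valid)
        scored.append((score, stump))
    # Keep the models from the best-scoring holdouts.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    best = [stump for _, stump in scored[:keep]]

    def vote(x):
        # Majority vote; an odd `keep` avoids ties.
        votes = sum(predict(s, x) for s in best)
        return 1 if 2 * votes > len(best) else 0

    return vote
```

Calling `consensus_holdout(train)` returns a function that labels one feature vector at a time, so it can be scored with `accuracy(vote, test)` on samples never used for training or selection. In practice the simple threshold rules would be replaced by real classifiers and the toy features by protein-level descriptors.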