Abstract
Background: Proteins and nucleic acids are essential biomolecules. Precise and efficient identification of hot spots at protein-nucleic acid interfaces is crucial for guiding drug development, advancing protein engineering, and exploring the underlying molecular recognition mechanisms. Because experimental methods such as alanine scanning mutagenesis are time-consuming and expensive, a growing number of machine learning techniques are being employed to predict hot spots. However, existing approaches are characterized by a lack of uniform standards, a scarcity of data, and a wide variety of features. To date, there has been no comprehensive overview or evaluation of this field, so a systematic review is needed.
Methods: In this study, we review state-of-the-art machine learning approaches for hot spot prediction in protein-nucleic acid complexes. We also outline the feature categories currently in use, derived from relevant biological data sources, and assess conventional feature selection methods on 600 extracted features. In addition, we create two new benchmark datasets, PDHS87 and PRHS48, and build separate binary classification models on these datasets to evaluate the advantages and disadvantages of various machine learning techniques.
Results: Predicting hot spots at protein-nucleic acid interfaces is a challenging task. The study demonstrates that structural neighborhood features play a crucial role in identifying hot spots. Prediction performance can be improved by choosing effective feature selection and machine learning methods. Among the existing prediction methods, XGBPRH performs best.
Conclusion: It is crucial to continue studying hot spot theories, discover new and effective features, add accurate experimental data, and make use of DNA/RNA information. Semi-supervised learning, transfer learning, and ensemble learning may improve predictive ability. Combining computational docking with machine learning methods may further improve predictive performance.