Abstract
Predicting the function of proteins is a major challenge in the scientific community, particularly in the post-genomic era. Traditional methods of determining protein functions, such as experiments, are accurate but can be resource-intensive and time-consuming. The development of Next Generation Sequencing (NGS) techniques has led to the production of a large number of new protein sequences, which has increased the gap between available raw sequences and verified annotated sequences. To address this gap, automated protein function prediction (AFP) techniques have been developed as a faster and more cost-effective alternative, aiming to maintain the same accuracy level.
Several automatic computational methods for protein function prediction have recently been developed and proposed. This paper reviews the best-performing AFP methods presented in the last decade and analyzes their improvements over time to identify the most promising strategies for future methods.
Identifying the most effective method for predicting protein function is still a challenge. The Critical Assessment of Functional Annotation (CAFA) has established an international standard for evaluating and comparing the performance of various protein function prediction methods. In this study, we analyze the best-performing methods identified in recent editions of CAFA. These methods are divided into five categories based on their principles of operation: sequence-based, structure-based, combined-based, ML-based and embeddings-based.
After conducting a comprehensive analysis of the various protein function prediction methods, we observe that there has been a steady improvement in the accuracy of predictions over time, mainly due to the implementation of machine learning techniques. The present trend suggests that all the best-performing methods will use machine learning to improve their accuracy in the future.
We highlight the positive impact that the use of machine learning (ML) has had on protein function prediction. Most recent methods developed in this area use ML, demonstrating its importance in analyzing biological information and making predictions. Despite these improvements in accuracy, there is still a significant gap compared with experimental evidence. New approaches based on Deep Learning (DL) techniques will probably be necessary to close this gap. Significant progress has been made in this area, but more work is needed to fully realize the potential of DL.