Abstract
Objective: Representation learning of biological concepts involves learning numerical representations from sources of biological information such as sequences, interactions, and literature. We conducted a systematic review, analyzing both quantitative and qualitative data, to provide an overview of this field.
Methods: We searched the PubMed and EMBASE databases for articles on representation learning of biological concepts. From the 507 articles published between 2015 and 2022, we screened and selected 65 papers for inclusion. We then developed a structured workflow that involved identifying the relevant biological concepts and data types, reviewing the representation learning techniques, and examining the downstream applications used to assess the quality of the learned representations.
Results: The review focused primarily on numerical representations of gene/DNA/RNA entities. Word2Vec was the most commonly used method for biological representation learning. In addition, a growing number of studies use state-of-the-art large language models to learn numerical representations of biological concepts. We also observed that representations learned from a specific source were typically used for a single downstream application relevant to that source.
Conclusion: Existing methods for biological representation learning primarily learn representations from a single data type, with the output fed into predictive models for downstream applications. Some studies have explored the use of multiple data types to improve the learned representations, but such research remains scarce. This systematic review summarizes the data types, models, and downstream applications used in this task.