Abstract
Most of the currently available knowledge about protein structure and function has been obtained from laboratory experiments. As a complement to this classical mode of discovery, bioinformatics-assisted sequence analysis, which relies primarily on the manipulation of biological data, is becoming indispensable for discovering new knowledge, especially now that large numbers of protein-encoding sequences can be readily identified by annotating high-throughput genomic data. Here, we review the advances in bioinformatics-assisted protein sequence analysis to highlight how bioinformatics analysis will aid in understanding protein structure and function. We first discuss analyses that take an individual protein sequence as input, from which some basic parameters of a protein (e.g., amino acid composition, molecular weight (MW) and post-translational modifications (PTMs)) can be predicted. Beyond these basic parameters, which can be predicted directly from a single protein sequence, many predictions rely on principles drawn from knowledge of many well-studied proteins and take multiple sequence comparisons as input. This category includes the identification of conserved sites by comparing multiple homologous sequences; prediction of the folding, structure or function of uncharacterized proteins; construction of phylogenies of related sequences; analysis of the contribution of conserved related sites to protein function by statistical coupling analysis (SCA) or direct coupling analysis (DCA); elucidation of the significance of codon usage; and extraction of functional units from protein sequences and coding spaces. We then discuss the revolutionary invention of the "QTY code", which can be applied to convert membrane proteins into water-soluble proteins at the cost of only marginal structural and functional changes. As in other scientific fields, machine learning has profoundly impacted protein sequence analysis.
In summary, we highlight the relevance of bioinformatics-assisted analysis to protein research as a valuable guide for laboratory experiments.
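As a small illustration of the single-sequence analyses mentioned above, the following sketch computes amino acid composition and an approximate average molecular weight from a protein sequence alone. It is a minimal example under stated assumptions, not the method of any particular tool covered in the review: the function names are hypothetical, the residue masses are standard average masses, only the 20 canonical amino acids are accepted, and post-translational modifications are ignored.

```python
from collections import Counter

# Average masses (Da) of amino acid residues, i.e. each amino acid minus one water.
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528  # added once for the free N- and C-termini


def _validated(sequence):
    """Upper-case the sequence and reject empty input or non-canonical letters."""
    seq = sequence.strip().upper()
    if not seq:
        raise ValueError("sequence is empty")
    unknown = sorted(set(seq) - set(AVERAGE_RESIDUE_MASS))
    if unknown:
        raise ValueError(f"non-canonical residues: {unknown}")
    return seq


def amino_acid_composition(sequence):
    """Return the fraction of each residue type present in the sequence."""
    seq = _validated(sequence)
    counts = Counter(seq)
    return {aa: counts[aa] / len(seq) for aa in sorted(counts)}


def molecular_weight(sequence):
    """Return the approximate average molecular weight (Da) of an unmodified chain."""
    seq = _validated(sequence)
    return sum(AVERAGE_RESIDUE_MASS[aa] for aa in seq) + WATER_MASS


if __name__ == "__main__":
    peptide = "MKTAYIAKQR"  # arbitrary example peptide, not taken from the review
    print(amino_acid_composition(peptide))
    print(round(molecular_weight(peptide), 2))
```

Predictions that depend on comparing many sequences (conserved sites, phylogenies, SCA or DCA) require a multiple sequence alignment as input and are beyond the scope of this sketch.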