Abstract
Rationale: PIWI-interacting RNAs (piRNAs) are a recently discovered class of small noncoding RNAs (ncRNAs), 21-35 nucleotides in length. They play roles in regulating gene expression, silencing transposons, and inhibiting viral infection. Once considered the “dark matter” of ncRNAs, piRNAs have emerged as important players in multiple cellular functions across different organisms. However, our knowledge of piRNAs remains limited: many piRNAs have not yet been identified, owing to a lack of robust computational prediction tools.
Methods: To identify novel piRNAs, we developed piRNAPred, an integrated framework for piRNA prediction that employs hybrid features such as k-mer nucleotide composition, secondary structure, and thermodynamic and physicochemical properties. A non-redundant dataset (D3349, or D1684p+1665n) comprising 1684 experimentally verified piRNAs and 1665 non-piRNA sequences was obtained from piRBase and NONCODE, respectively. Various sequence- and structure-based features were computed for these sequences in binary format and used to train different machine learning techniques, of which the support vector machine (SVM) performed best.
Results: In ten-fold cross-validation (10-CV), piRNAPred achieved an overall accuracy of 98.60%, with a Matthews correlation coefficient (MCC) of 0.97 and an area under the receiver operating characteristic (ROC) curve of 0.99. Furthermore, we reduced the dimensionality of the feature space using an attribute selected classifier.
Conclusion: piRNAPred achieved higher performance in predicting piRNAs than current state-of-the-art piRNA predictors. It should help expand the piRNA repertoire and provide new insights into piRNA functions.
Keywords: piRNA, classification, algorithm, prediction, non-coding RNA, physicochemical.
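As an illustration of two quantities named above, the following minimal Python sketch computes a k-mer nucleotide composition vector for one sequence and the MCC from confusion-matrix counts. It is not the piRNAPred implementation: the function names `kmer_composition` and `matthews_cc` are hypothetical, and normalized frequencies are assumed here because the abstract does not specify how the binary feature encoding is built.

```python
from itertools import product
from math import sqrt

ALPHABET = "ACGU"  # assumption: RNA alphabet; DNA-style T is mapped to U


def kmer_composition(seq, k=2):
    """Return normalized k-mer frequencies in a fixed order (4**k values).

    Assumption: plain relative frequencies, not the binary encoding
    mentioned in the abstract, whose details are not given.
    """
    seq = seq.upper().replace("T", "U")
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows with ambiguous bases such as N
            counts[window] += 1
            total += 1
    return [counts[km] / total if total else 0.0 for km in kmers]


def matthews_cc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0.0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

For example, `kmer_composition("ACGU", k=1)` returns `[0.25, 0.25, 0.25, 0.25]`, and `matthews_cc(50, 50, 0, 0)` returns `1.0` for a perfect classifier. Feature vectors of this kind could then be passed to any SVM implementation and evaluated with ten-fold cross-validation.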