Abstract
Background: Identifying non-coding RNA (ncRNA) at the organelle genome level is a challenging task. In our previous work, we built an ncRNA dataset with less than 80% sequence identity and proposed a method that combines the increment of diversity with a support vector machine.
Objective: Based on the ncRNA_361 dataset, we propose a novel decision-making method, an improved K-nearest neighbor (iKNN) classifier.
Methods: For the iKNN algorithm, the physicochemical features of nucleotides, the degeneracy of genetic codons, and the topological secondary structure were selected to represent ncRNA characteristics. The incremental feature selection method was then used to optimize the feature set.
Results: The iKNN results indicate that the mean-value decision-making method is distinctly superior both to the traditional majority-vote decision-making method and to the Increment of Diversity Combining Support Vector Machine (ID-SVM) method. With k=3, the iKNN algorithm achieved an overall accuracy of 97.368% in the jackknife test.
Conclusion: The triplets of the structure-sequence mode under reading frames not only contain the entire sequence information but also reflect whether each base is paired, and the secondary structural topological parameters further describe the ncRNA secondary structure at the spatial level. The ncRNA dataset and the iKNN classifier are freely available at http://202.207.14.87:8032/fuwu/iKNN/index.asp.
Keywords: Organelle genome, non-coding RNA, open reading frame, spatial structure, feature selection, K-nearest neighbor method.
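The abstract contrasts a mean-value decision rule with the traditional majority vote and reports accuracy from a jackknife (leave-one-out) test, but it does not spell out the mean-value rule. The sketch below is an illustration only, under one plausible assumption: for each class, the distances from the query to its k nearest members of that class are averaged, and the class with the smallest mean wins. The function names, the Euclidean distance, the class labels, and the toy two-dimensional samples are all hypothetical and are not taken from the paper or from the ncRNA_361 dataset.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_majority(train, query, k):
    """Traditional KNN: majority vote among the k nearest training samples."""
    nearest = sorted(train, key=lambda s: euclidean(s[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def knn_mean_value(train, query, k):
    """ASSUMED mean-value rule (not confirmed by the source): per class,
    average the k smallest distances to the query; pick the smallest mean."""
    distances_by_class = {}
    for features, label in train:
        distances_by_class.setdefault(label, []).append(euclidean(features, query))
    mean_distance = {}
    for label, dists in distances_by_class.items():
        closest = sorted(dists)[:k]
        mean_distance[label] = sum(closest) / len(closest)
    return min(mean_distance, key=mean_distance.get)


def jackknife_accuracy(samples, classify, k):
    """Leave-one-out test: each sample is classified using all the others."""
    correct = 0
    for i, (features, label) in enumerate(samples):
        train = samples[:i] + samples[i + 1:]
        if classify(train, features, k) == label:
            correct += 1
    return correct / len(samples)


# Hypothetical toy data: two well-separated classes in two dimensions.
SAMPLES = [
    ((0.00, 0.00), "class_a"), ((0.10, 0.20), "class_a"),
    ((0.20, 0.10), "class_a"), ((0.15, 0.15), "class_a"),
    ((1.00, 1.00), "class_b"), ((0.90, 1.10), "class_b"),
    ((1.10, 0.90), "class_b"), ((1.05, 1.00), "class_b"),
]

majority_acc = jackknife_accuracy(SAMPLES, knn_majority, k=3)
mean_value_acc = jackknife_accuracy(SAMPLES, knn_mean_value, k=3)
```

On this toy data the two rules agree because the classes do not overlap. They can diverge when classes are imbalanced or interleaved, since the per-class mean does not depend on how many samples of each class fall among the overall nearest neighbors.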