Abstract
Background: Small nucleolar RNAs (snoRNAs) are small RNA molecules approximately 60-300 nucleotides long. They have been shown to play important roles in cancer occurrence and progression, so identifying new snoRNAs as quickly and accurately as possible is of great clinical importance.
Objective: A novel algorithm, ESDA (Elastically Sparse Partial Least Squares Discriminant Analysis), is proposed to improve both the speed and the accuracy of distinguishing snoRNAs from other RNAs in the human genome.
Methods: In the ESDA algorithm, kernel features are first selected from variables extracted from both primary sequences and secondary structures, so that only the most informative variables are retained. These features are then used as input variables by the SPLSDA (sparse partial least squares discriminant analysis) algorithm to train the final classification model, which distinguishes snoRNA sequences from other human RNAs. Because no prior biological knowledge is required to optimize the classification model, ESDA is a practical method, especially for completely new sequences.
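As an illustration of what "variables extracted from both primary sequences and secondary structures" could look like, the following minimal sketch builds a feature vector from an RNA sequence and its dot-bracket structure string. The specific features (GC content, dinucleotide frequencies, paired-base fraction) and all function names are assumptions made for this example; the abstract does not state which variables ESDA actually uses, and the feature selection and SPLSDA steps are not shown.

```python
from collections import Counter
from itertools import product


def sequence_features(seq: str, k: int = 2) -> dict[str, float]:
    """GC content plus normalized k-mer frequencies (hypothetical feature set)."""
    seq = seq.upper().replace("T", "U")  # accept DNA-style input as RNA
    if len(seq) < k:
        raise ValueError("sequence is shorter than k")
    features = {"gc_content": (seq.count("G") + seq.count("C")) / len(seq)}
    windows = len(seq) - k + 1
    counts = Counter(seq[i:i + k] for i in range(windows))
    # Emit every possible k-mer so all sequences share the same columns.
    for kmer in map("".join, product("ACGU", repeat=k)):
        features[f"kmer_{kmer}"] = counts[kmer] / windows
    return features


def structure_features(dot_bracket: str) -> dict[str, float]:
    """Fraction of paired bases in a dot-bracket secondary structure string."""
    if not dot_bracket:
        raise ValueError("structure string is empty")
    paired = dot_bracket.count("(") + dot_bracket.count(")")
    return {"paired_fraction": paired / len(dot_bracket)}


def feature_vector(seq: str, dot_bracket: str) -> dict[str, float]:
    """Combine primary-sequence and secondary-structure variables."""
    if len(seq) != len(dot_bracket):
        raise ValueError("sequence and structure must have equal length")
    return {**sequence_features(seq), **structure_features(dot_bracket)}
```

Calling `feature_vector("GGCAUUGCC", "(((...)))")` yields one row of candidate variables; in a pipeline of the kind described above, rows like this for every sequence would be passed to the feature selection step before classifier training.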
Results: To test the identification performance of ESDA, 89 human H/ACA snoRNAs and 269 human C/D snoRNAs were used as positive samples and 3403 non-snoRNAs as negative samples. For H/ACA snoRNA identification, the sensitivity and specificity reached 99.6% and 98.8%, respectively. For C/D snoRNAs, they were 96.1% and 98.3%, respectively. We also compared ESDA with other widely used algorithms and classifiers: SnoReport, RF (Random Forest), DWD (Distance Weighted Discrimination), and SVM (Support Vector Machine). The largest improvement in accuracy obtained by ESDA was 25.1%.
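For readers less familiar with the reported metrics, the sketch below shows how sensitivity, specificity, and accuracy are conventionally computed from confusion-matrix counts. The function name and the counts in the usage note are purely illustrative and are not taken from the study.

```python
def classification_metrics(tp: int, fn: int, tn: int, fp: int) -> dict[str, float]:
    """Standard binary-classification metrics from confusion-matrix counts.

    tp/fn: positives (e.g. snoRNAs) classified correctly / incorrectly.
    tn/fp: negatives (e.g. non-snoRNAs) classified correctly / incorrectly.
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative sample")
    return {
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }
```

For example, `classification_metrics(tp=90, fn=10, tn=950, fp=50)` uses made-up counts. With a negative class much larger than the positive class, as in the data set described here, accuracy is dominated by the negatives, which is one reason sensitivity and specificity are reported separately.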
Conclusion: These results demonstrate the superior performance of ESDA and make it a promising tool for identifying snoRNAs, supporting the further development of precision medicine for cancers.
Keywords: Human snoRNA, elastic net algorithm, sparse partial least squares discriminant analysis, identification, algorithm, development, cancer.