Current Bioinformatics

ISSN (Print): 1574-8936
ISSN (Online): 2212-392X

Research Article

Prediction of Super-enhancers Based on Mean-shift Undersampling

Author(s): Han Cheng, Shumei Ding and Cangzhi Jia*

Volume 19, Issue 7, 2024

Published on: 27 November, 2023

Page: [651 - 662] Pages: 12

DOI: 10.2174/0115748936268302231110111456

Abstract

Background: Super-enhancers are clusters of enhancers defined by the binding occupancy of master transcription factors, chromatin regulators, or chromatin marks. Super-enhancers have been reported to be more transcriptionally active and more cell-type-specific than regular enhancers, so it is important to distinguish them from regular enhancers. A variety of computational methods have been proposed as auxiliary tools for identifying super-enhancers. However, most of these methods rely on ChIP-seq data; when such data are unavailable, the predictors either cannot run or fail to achieve satisfactory performance.

Objective: This study aims to propose a stacking computational model that fuses multiple features to identify super-enhancers in both human and mouse.

Methods: Mean-shift was used to cluster the majority-class samples, and four balanced datasets for mouse and three balanced datasets for human were selected to train the stacking model. Five types of sequence information were used as input to XGBoost classifiers, and the average of the probability outputs from the classifiers was taken as the final classification result.
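The two steps described in the Methods, cluster-based undersampling of the majority class and averaging of classifier probabilities, can be illustrated with a minimal standard-library sketch. It is not the authors' implementation: the flat-kernel mean-shift, the proportional per-cluster sampling rule, the toy 2-D points, and all function names are assumptions made for illustration, and the XGBoost classifiers themselves are left out (their outputs are represented by placeholder probability lists).

```python
import math
import random


def mean_shift_labels(points, bandwidth, max_iter=100, tol=1e-6):
    """Flat-kernel mean-shift: move every point to the mean of its neighbours
    until nothing moves, then merge coinciding modes into cluster labels."""
    modes = [tuple(p) for p in points]
    for _ in range(max_iter):
        largest_move = 0.0
        shifted = []
        for m in modes:
            neighbours = [p for p in points if math.dist(m, p) <= bandwidth]
            if not neighbours:  # defensive guard; keep the mode where it is
                shifted.append(m)
                continue
            centre = tuple(sum(axis) / len(neighbours) for axis in zip(*neighbours))
            largest_move = max(largest_move, math.dist(m, centre))
            shifted.append(centre)
        modes = shifted
        if largest_move < tol:
            break
    centres, labels = [], []
    for m in modes:
        for i, c in enumerate(centres):
            if math.dist(m, c) <= bandwidth / 2:  # same mode -> same cluster
                labels.append(i)
                break
        else:
            centres.append(m)
            labels.append(len(centres) - 1)
    return labels


def undersample_by_cluster(majority, n_keep, bandwidth, seed=0):
    """Assumed rule: draw from each cluster in proportion to its size so that
    n_keep majority samples remain (n_keep must not exceed len(majority))."""
    labels = mean_shift_labels(majority, bandwidth)
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    total = len(majority)
    quotas = {lab: (n_keep * len(m)) // total for lab, m in clusters.items()}
    leftover = n_keep - sum(quotas.values())
    # Hand any rounding leftover to the largest clusters first.
    for lab in sorted(clusters, key=lambda k: len(clusters[k]), reverse=True):
        if leftover == 0:
            break
        if quotas[lab] < len(clusters[lab]):
            quotas[lab] += 1
            leftover -= 1
    rng = random.Random(seed)
    chosen = []
    for lab, members in clusters.items():
        chosen.extend(rng.sample(members, quotas[lab]))
    return sorted(chosen)


def average_probabilities(per_model_probs):
    """Soft voting: average the positive-class probability across models."""
    n_models = len(per_model_probs)
    return [sum(col) / n_models for col in zip(*per_model_probs)]


# Toy majority class: two well-separated groups of 6 and 4 points (hypothetical).
majority = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5), (0.2, 0.3), (0.3, 0.1),
            (10.0, 10.0), (10.5, 10.0), (10.0, 10.5), (10.4, 10.3)]

# One balanced subset sized to a hypothetical minority class of 5 samples.
kept = undersample_by_cluster(majority, n_keep=5, bandwidth=2.0, seed=0)

# Placeholder probabilities from two models for two samples.
final_scores = average_probabilities([[0.9, 0.2], [0.7, 0.4]])
```

Calling `undersample_by_cluster` with different `seed` values would yield several balanced subsets, loosely mirroring the multiple balanced datasets mentioned above; how the authors actually selected their subsets is not stated in the abstract.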

Results: In 10-fold cross-validation and cross-cell-line validation, our method outperformed other existing methods. The source code and datasets are available at https://github.com/Cheng-Han-max/SE_voting.

Conclusion: Feature-importance analysis indicates that Mismatch features account for the largest share of the top 20 most important features.
