Abstract
Background: Post-translational modifications (PTMs) on the side chains of conserved lysine (Lys) residues play important roles in myriad cellular processes, such as the modification of histone structures and activities, protein degradation and turnover, and the regulation of DNA damage responses. To date, several computational methods have been developed to identify different PTMs on Lys residues. However, most of these methods focus on identifying one particular PTM without considering the other types.
Method: In this study, we first conducted a computational investigation of three types of PTMs (acetylation, sumoylation, and ubiquitination) simultaneously by analyzing the protein structure and sequence factors surrounding the substrate Lys residues. To fully extract the structural and sequence information around the Lys residues, six types of features were used to encode the peptide segments containing the substrates. Next, a feature selection method, maximum relevance minimum redundancy (mRMR), was applied to obtain two feature lists: the MaxRel feature list and the mRMR feature list. The mRMR feature list was used to extract the optimal features for a random forest algorithm to distinguish the three types of PTMs.
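To illustrate how the two feature lists differ, the following is a minimal sketch of MaxRel and mRMR ranking. It is not the implementation used in this study: it assumes discrete features, uses the difference form of the mRMR criterion (relevance minus mean redundancy), and all function names and the toy data are hypothetical. The subsequent random forest step is not shown.

```python
import numpy as np

def mutual_information(x, y):
    """Mutual information (in nats) between two discrete 1-D arrays."""
    _, x_idx = np.unique(np.ravel(x), return_inverse=True)
    _, y_idx = np.unique(np.ravel(y), return_inverse=True)
    joint = np.zeros((x_idx.max() + 1, y_idx.max() + 1))
    np.add.at(joint, (x_idx, y_idx), 1)      # contingency table of counts
    joint /= joint.sum()                     # joint probabilities
    px = joint.sum(axis=1, keepdims=True)    # marginal of x (column vector)
    py = joint.sum(axis=0, keepdims=True)    # marginal of y (row vector)
    nz = joint > 0                           # skip empty cells to avoid log(0)
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz])))

def maxrel_ranking(X, y):
    """Rank features by relevance to the class label only (MaxRel list)."""
    relevance = np.array(
        [mutual_information(X[:, j], y) for j in range(X.shape[1])]
    )
    order = [int(j) for j in np.argsort(-relevance, kind="stable")]
    return order, relevance

def mrmr_ranking(X, y):
    """Greedy mRMR list: at each step pick the feature that maximizes
    relevance minus mean redundancy with the already selected features."""
    _, relevance = maxrel_ranking(X, y)
    selected = [int(np.argmax(relevance))]
    remaining = [j for j in range(X.shape[1]) if j != selected[0]]
    while remaining:
        scores = []
        for j in remaining:
            redundancy = np.mean(
                [mutual_information(X[:, j], X[:, s]) for s in selected]
            )
            scores.append(relevance[j] - redundancy)
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical toy data: 12 samples, 3 classes standing in for three PTM types.
y_toy = np.array([0] * 6 + [1] * 4 + [2] * 2)
X_toy = np.column_stack([
    (y_toy == 0).astype(int),  # feature 0: marks class 0
    (y_toy == 0).astype(int),  # feature 1: exact duplicate of feature 0
    (y_toy == 1).astype(int),  # feature 2: marks class 1
])

maxrel_order, _ = maxrel_ranking(X_toy, y_toy)
mrmr_order = mrmr_ranking(X_toy, y_toy)
```

On this toy data, MaxRel ranks the duplicated feature second because it considers relevance alone, whereas mRMR ranks it last because it adds no information beyond the feature already selected.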
Results: An optimal classification model with an overall accuracy of 0.989 was built. Using the MaxRel feature list, we investigated the top-ranked features to uncover the site preference and residue preference of Lys residues.
Conclusion: The results suggested that the disordered structure and the preference of flanking residues were the most important attributes for distinguishing the three types of PTMs, which was consistent with the results reported in previous studies.
Keywords: Post-translational modification, acetylation, sumoylation, ubiquitination, maximum relevance minimum redundancy, random forest, disordered region in protein.