Abstract
Protein signal peptides play a vital role in the targeting and translocation of most secreted proteins and many integral membrane proteins in both prokaryotes and eukaryotes. Consequently, accurate prediction of signal peptides and their cleavage sites is an important task in molecular biology. In the present study, we first develop a novel discriminative scoring method for classifying proteins as having or lacking a signal peptide. The method captures the characteristics of signal peptides and non-signal peptides by integrating hydrophobicity alignment and position-specific amino acid propensities based on the highest average positions. In leave-one-out jackknife tests on the constructed benchmark datasets, it discriminates proteins with signal peptides at overall accuracies of 96.3%, 97.0%, and 97.2% for three organism groups: eukaryotes, Gram-negative bacteria, and Gram-positive bacteria, respectively. Second, we treat the prediction of signal peptide cleavage sites as a sequence labeling problem and apply conditional random fields (CRFs) to solve it. Experimental results show that the proposed CRF-based cleavage site finding approach achieves prediction success rates of 80.8%, 89.4%, and 74.0%, respectively, for secretory proteins from the three organism groups. An online tool, LnSignal, has been established for labeling N-terminal signal peptide cleavage sites and is freely available for academic use at http://www.csbio.sjtu.edu.cn/bioinf/LnSignal.
Keywords: Conditional random fields (CRFs), hydrophobicity alignment, position-specific amino acid propensities, secretory protein, signal peptide, Eukaryotic, optimal thresholds, hydrophobicity, machine learning technique, frequency correction