Abstract
Background: The subcellular localization of a protein is closely related to its function, and correct localization helps maintain the highly ordered organization that a cell needs to operate normally. Studying protein subcellular localization helps to clarify the properties and functions of proteins, the interactions between proteins and their regulatory mechanisms, and the pathogenesis of some diseases, and it supports the development of new drugs. However, traditional biological experiments are both time-consuming and costly. Fast and effective machine learning methods for predicting protein subcellular localization are therefore needed.
Method: We propose a new feature extraction method based on pseudo amino acid composition, called the λ-order factor method, and combine it with principal component analysis (PCA). The resulting features account not only for the physicochemical properties of protein sequences but also for the sequence-order information of their sub-sequences, while PCA eliminates redundant information and reduces the dimension of the feature vectors. Finally, a support vector machine (SVM) and the 10-fold cross-validation test are employed to predict and evaluate the method on three benchmark datasets: ZD98, ZW225 and CL317.
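To make the feature extraction step concrete, the following is a minimal sketch of the classical pseudo amino acid composition on which the λ-order factor method builds. It is not the λ-order factor method itself, whose details are not given in this abstract. The function name `pseaac`, the weight `w = 0.05`, and the use of the Kyte-Doolittle hydropathy scale as the single physicochemical property are assumptions made for illustration.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy values (assumed here as the only property).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def _standardize(scale):
    """Rescale a property to zero mean and unit variance over the 20 residues."""
    vals = [scale[a] for a in AMINO_ACIDS]
    mean = sum(vals) / len(vals)
    std = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
    return {a: (scale[a] - mean) / std for a in AMINO_ACIDS}

H = _standardize(KD)

def pseaac(seq, lam=3, w=0.05):
    """Return a (20 + lam)-dimensional classical PseAAC vector.

    The first 20 components are amino acid frequencies; the last `lam`
    components are sequence-order correlation factors for gaps 1..lam.
    """
    residues = [a for a in seq.upper() if a in H]  # skip non-standard letters
    n = len(residues)
    if n <= lam:
        raise ValueError("sequence must be longer than lam")
    freqs = [residues.count(a) / n for a in AMINO_ACIDS]
    thetas = []
    for k in range(1, lam + 1):
        # Mean squared property difference between residues k positions apart.
        total = sum((H[residues[i]] - H[residues[i + k]]) ** 2
                    for i in range(n - k))
        thetas.append(total / (n - k))
    denom = sum(freqs) + w * sum(thetas)
    return [f / denom for f in freqs] + [w * t / denom for t in thetas]

# Hypothetical example sequence, not taken from any benchmark dataset.
vec = pseaac("MKTAYIAKQRQISFVKSHFSRQ", lam=3)
print(len(vec))
```

In a full pipeline of the kind described above, vectors like `vec` would be computed for every sequence, reduced with PCA, and passed to an SVM evaluated by 10-fold cross-validation.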
Results: In a comprehensive comparison with current state-of-the-art methods, the proposed method achieves superior performance. The overall success rates on the ZD98, ZW225 and CL317 datasets are 90.8%, 85.3% and 89.6%, respectively. These results indicate that our method classifies better than the methods it was compared with.
Conclusion: The numerical results show that our model successfully extracts both the physicochemical information and the sequence-order information of protein sequences based on pseudo amino acid composition (PseAAC), and that it provides a reliable PseAAC-based method as a potential candidate for predicting the subcellular localization of apoptosis proteins.
Keywords: Cell biology, protein subcellular localization, PseAAC, λ-order factor, PCA, SVM.
Graphical Abstract