Abstract
Introduction: Stemming is an important preprocessing step in text classification and can contribute to increasing classification accuracy. Although many stemmers have been proposed for English, few have been proposed for Arabic text. The Arabic language has gained increasing attention in recent decades, and with it the need to further improve Arabic text classification.
Methods: This work combined the recently proposed P-stemmer with various classifiers to identify which classifier performs best with it for Arabic text classification. As part of this work, a dataset was synthesized.
Results: The experiments show that using the P-stemmer has a positive effect on classification. The degree of improvement is classifier-dependent, which is reasonable because classifiers vary in their methodologies. Moreover, the experiments show that the best classifier with the P-stemmer is Naïve Bayes (NB). This is an interesting result, as this classifier is well known for its fast learning and classification times.
Discussion: First, continued improvement of the P-stemmer through further optimization steps is necessary to advance Arabic text categorization. This can be done by combining more classifiers with the stemmer, by optimizing the other natural language processing steps, and by improving the set of stemming rules. Second, the lack of sufficient Arabic datasets, especially large ones, remains an issue.
Conclusion: In this work, the use of the P-stemmer was improved by combining it with various classifiers. To evaluate its performance, and given the lack of Arabic datasets, a novel Arabic dataset was synthesized from various online news pages. The P-stemmer was then combined with Naïve Bayes, Random Forest, Support Vector Machines, K-Nearest Neighbor, and K-Star.
Keywords: Natural language processing, machine learning, Arabic text classification, stemming, P-stemmer, news articles.
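The stem-then-classify pipeline summarized in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: `strip_prefix` and its `PREFIXES` list are a hypothetical stand-in and not the P-stemmer's actual rule set, the `NaiveBayes` class is a bare-bones multinomial model with add-one smoothing rather than the implementation used in the experiments, and the four training sentences are toy data rather than the synthesized news dataset.

```python
import math
from collections import Counter, defaultdict

# Illustrative prefixes only (an assumption); NOT the P-stemmer's actual rules.
# Longer prefixes come first so they are tried before the shorter ones they contain.
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "لل")


def strip_prefix(token, min_stem_len=3):
    """Remove at most one listed prefix, keeping at least min_stem_len characters."""
    for prefix in PREFIXES:
        if token.startswith(prefix) and len(token) - len(prefix) >= min_stem_len:
            return token[len(prefix):]
    return token


def tokenize(text, stem=True):
    """Split on whitespace and optionally apply the toy stemmer to each token."""
    tokens = text.split()
    return [strip_prefix(t) for t in tokens] if stem else tokens


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing over token counts."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for token in tokens:
                if token in self.vocab:  # skip out-of-vocabulary tokens
                    score += math.log((counts[token] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


# Toy training data (hypothetical), two documents per class.
train_texts = [
    "الفريق فاز في المباراة",
    "اللاعب سجل هدف",
    "البنك رفع الفائدة",
    "السوق والاقتصاد في نمو",
]
train_labels = ["sports", "sports", "economy", "economy"]

stemmed_model = NaiveBayes().fit([tokenize(t, stem=True) for t in train_texts], train_labels)
raw_model = NaiveBayes().fit([tokenize(t, stem=False) for t in train_texts], train_labels)

# The query uses the same words as the training data, but without prefixes.
query = "بنك فائدة"
print("with stemming:   ", stemmed_model.predict(tokenize(query, stem=True)))
print("without stemming:", raw_model.predict(tokenize(query, stem=False)))
```

In this toy setup the unstemmed query shares no tokens with the training vocabulary, so the unstemmed model has nothing but the class prior to go on, whereas the stemmed model matches both query words against the economy documents. Comparing other classifiers would amount to swapping the `NaiveBayes` class for another model behind the same `fit`/`predict` interface.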