Improving Arabic Text Classification Using P-Stemmer

Tarek Kanan; Bilal Hawashin; Shadi Alzubi; Eyad Almaita; Ahmad Alkhatib; Khulood Abu Maria; Mohammed Elbes

doi:10.2174/2666255813999200904114023

Abstract

Introduction: Stemming is an important preprocessing step in text classification and can contribute to increasing classification accuracy. Although many stemmers have been proposed for English, few have been proposed for Arabic text. The Arabic language has gained increasing attention in recent decades, and with it the need to further improve Arabic text classification.

Methods: This work combined the recently proposed P-stemmer with various classifiers to find the classifier that works best with the P-stemmer for Arabic text classification. As part of this work, a dataset was synthesized.

Results: The experiments show that using the P-stemmer has a positive effect on classification. The degree of improvement is classifier-dependent, which is reasonable because classifiers vary in their methodologies. Moreover, the experiments show that the best classifier with the P-stemmer is Naïve Bayes (NB). This is an interesting result because this classifier is well known for its fast learning and classification time.

Discussion: First, continued improvement of the P-stemmer through further optimization is necessary to improve Arabic text categorization. This can be done by combining more classifiers with the stemmer, by optimizing the other natural language processing steps, and by improving the set of stemming rules. Second, the lack of sufficient Arabic datasets, especially large ones, remains an issue.

Conclusion: In this work, an improved use of the P-stemmer was proposed by combining it with various classifiers. To evaluate its performance, and due to the lack of Arabic datasets, a novel Arabic dataset was synthesized from various online news pages. The P-stemmer was then combined with Naïve Bayes, Random Forest, Support Vector Machines, K-Nearest Neighbor, and K-Star.
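
The pipeline summarized above, stemming followed by classification, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a light stemmer that only strips a single leading prefix, and the prefix list, the four-sentence toy corpus, and the names strip_prefix, tokenize, and NaiveBayes are all hypothetical choices made for illustration. The classifier is a plain multinomial Naïve Bayes with add-one smoothing.

```python
import math
from collections import Counter, defaultdict

# ASSUMPTION: illustrative prefix list, ordered longest first.
# These are NOT the published P-stemmer rules.
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "لل", "و")


def strip_prefix(token, min_stem_len=3):
    """Strip the first (longest) matching prefix if enough characters remain."""
    for prefix in PREFIXES:
        if token.startswith(prefix) and len(token) - len(prefix) >= min_stem_len:
            return token[len(prefix):]
    return token


def tokenize(text, stem=True):
    """Whitespace tokenization, optionally followed by prefix stripping."""
    tokens = text.split()
    return [strip_prefix(t) for t in tokens] if stem else tokens


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for t in tokens:
                if t in self.vocab:  # words unseen in training are ignored
                    score += math.log((counts[t] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy corpus (two sports and two politics sentences).
TRAIN = [
    ("الفريق فاز بالمباراة", "sports"),
    ("اللاعب سجل هدف", "sports"),
    ("الحكومة أعلنت الميزانية", "politics"),
    ("البرلمان ناقش القانون", "politics"),
]
LABELS = [label for _, label in TRAIN]

stemmed_model = NaiveBayes().fit([tokenize(t, stem=True) for t, _ in TRAIN], LABELS)
raw_model = NaiveBayes().fit([tokenize(t, stem=False) for t, _ in TRAIN], LABELS)

QUERY = "حكومة وقانون"
print("with stemming:", stemmed_model.predict(tokenize(QUERY, stem=True)))
print("raw tokens found in vocabulary:",
      set(tokenize(QUERY, stem=False)) & raw_model.vocab)
```

In this toy setting the query words carry different prefixes from the training words, so without stemming none of them match the training vocabulary and the classifier has only the class priors to go on. After prefix stripping both query words match politics terms. This shows only the mechanism by which stemming can help a classifier and says nothing about the size of the effect reported in the experiments.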

Keywords: Natural language processing, machine learning, Arabic text classification, stemming, P-stemmer, news articles.



DOI: https://dx.doi.org/10.2174/2666255813999200904114023
Print ISSN: 2666-2558
Online ISSN: 2666-2566
Publisher: Bentham Science Publisher

Recent Advances in Computer Science and Communications
