Abstract
Background: With the rapid growth of communication technologies and the pervasive use of social media, posts, reviews, comments, and other forms of expression are proliferating in many languages. This content has attracted researchers from several fields: economics, political science, social science, psychology, and, in particular, language processing. One prominent topic is the discrimination between similar languages and dialects using natural language processing and machine learning techniques. The problem is usually addressed by formulating identification as a classification task.
Methods: The approach uses machine learning classification methods to discriminate between Modern Standard Arabic (MSA) and four regional Arabic dialects: Egyptian, Levantine, Gulf, and North African. Several models were trained to discriminate between these varieties on large, manually annotated corpora mined from online Arabic newspapers.
Results: Experimental results showed that n-gram features could substantially improve performance. Logistic regression on character and word n-gram features represented as count vectors identified the studied dialects with an overall accuracy of 95%. The best results were achieved with a linear support vector classifier using TF-IDF vectors built from character unigrams, bigrams, and trigrams together with word unigrams and bigrams, with an overall accuracy of 95.1%.
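The feature families named in the Results can be illustrated with a minimal sketch. It assumes whitespace tokenization and raw counts, and it covers only feature extraction (character 1- to 3-grams plus word 1- to 2-grams), not the classifiers or the TF-IDF weighting. The function names and the sample sentence are hypothetical and are not taken from the paper or its corpus.

```python
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Yield every character n-gram of length n_min..n_max (spaces included)."""
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]


def word_ngrams(text, n_min=1, n_max=2):
    """Yield every word n-gram of length n_min..n_max, splitting on whitespace."""
    tokens = text.split()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def count_features(text):
    """Combine both feature families into one sparse count vector.

    Keys are tagged with their family so that a character n-gram
    never collides with an identical-looking word n-gram.
    """
    counts = Counter(("char", g) for g in char_ngrams(text))
    counts.update(("word", g) for g in word_ngrams(text))
    return counts


# Toy input for illustration only (assumed, not from the paper's data).
features = count_features("كيف حالك")
```

A count vector like `features` is the kind of input a logistic regression or linear support vector classifier would consume after the feature keys are mapped to column indices.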
Conclusion: N-gram features substantially improved dialect identification performance. Additionally, we observed that the choice of data representation can provide a significant performance boost over a simple representation.
Keywords: Computational linguistics, dialect identification, social media, machine learning, Arabic, logistic regression.