Abstract
Background: Text mining derives information and patterns from textual data. Online social media platforms, which have attracted great interest in recent years, generate vast amounts of text about human behavior through user interactions. This data is generally ambiguous and unstructured: typing and grammatical errors introduce lexical, syntactic, and semantic uncertainty, which leads to incorrect pattern detection and analysis. Researchers employ various text mining techniques for topic modeling, trending-topic detection, hate-speech identification, and the study of community growth in online social networks.
Objective: This review paper compares the performance of ten machine learning classification techniques on a Twitter dataset, analyzing users' sentiment in posts related to airline services.
Methods: Review and comparative analysis of Gaussian Naive Bayes, Random Forest, Multinomial Naive Bayes, Multinomial Naive Bayes with Bagging, Adaptive Boosting (AdaBoost), Optimized AdaBoost, Support Vector Machine (SVM), Optimized SVM, Logistic Regression, and Long Short-Term Memory (LSTM) for sentiment analysis.
Results: The experimental study showed that Optimized SVM outperformed the other classifiers, achieving a training accuracy of 99.73% and a testing accuracy of 89.74%.
Conclusion: Optimized SVM uses the RBF kernel function and nonlinear hyperplanes to separate the dataset into classes, correctly assigning samples to their sentiment polarity. Combined with feature engineering based on forward trigrams and weighted TF-IDF, this improved the classifier's train and test accuracy to 99.73% and 89.74%, respectively. Compared to Random Forest, Optimized SVM gains a margin of 0.09% in train accuracy and 1.73% in test accuracy; compared to LSTM, it gains 1.29% (train) and 3.63% (test). Likewise, Optimized SVM improves train accuracy by more than 10% over Gaussian Naive Bayes, Multinomial Naive Bayes, Multinomial Naive Bayes with Bagging, and Logistic Regression, with similar gains observed over the ensemble models AdaBoost and Optimized AdaBoost during the experiments. Optimized SVM also outperformed all the classification models in terms of AUC-ROC train and test scores.
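The best-performing setup described above — TF-IDF features over n-grams up to trigrams feeding an RBF-kernel SVM — can be sketched with scikit-learn. This is a minimal illustration, not the paper's exact configuration: the toy tweets, labels, and hyperparameter values (`C`, `gamma`) are placeholders, and the paper's weighting and optimization details are not reproduced.

```python
# Sketch of an RBF-kernel SVM over trigram TF-IDF features (assumed setup,
# not the paper's exact pipeline). Requires scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical stand-ins for airline tweets and their sentiment labels.
texts = [
    "great flight and friendly crew",
    "loved the on time departure",
    "terrible delay and lost luggage",
    "worst airline experience ever",
]
labels = ["positive", "positive", "negative", "negative"]

# ngram_range=(1, 3) extracts unigrams through trigrams;
# SVC with kernel="rbf" learns a nonlinear decision boundary.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 3)),
    SVC(kernel="rbf", C=1.0, gamma="scale"),
)
model.fit(texts, labels)

print(model.predict(["friendly crew and great flight"])[0])
```

In a real experiment the dataset would be split into train and test sets, and `C`/`gamma` would be tuned (e.g. via grid search) — that tuning is presumably what "Optimized SVM" refers to.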
Graphical Abstract