Abstract:
With the ever-increasing adoption of social network platforms, online hate speech has
become a pressing and growing issue. Hate speech detection in English has attracted
increasing attention, and some detection systems have shown successful results. In
contrast, hate speech detection in Arabic still faces various challenges, mainly due to
the wide variety of Arabic dialects. The main goal of this work is to build an accurate
hate speech detection system that generalizes well across different Arabic dialects. Therefore,
we conduct an extensive analysis of various preprocessing techniques (e.g., stemming,
lemmatization, and emojis translation), feature extraction techniques (e.g., frequency-based
and word embeddings), classification models (including Logistic Regression and Support
Vector Machine), and combination techniques (at the data, feature, and model level). We
fine-tune BERT models and optimize their hyperparameters for our detection tasks. Our
experiments include six datasets containing different dialects and three datasets with
the Levantine dialect, the Tunisian dialect, and a combination of several dialects. We
combine 80% of each of the six datasets for model training and validation, while the
remaining part is used for model evaluation. The three remaining datasets are kept for
testing the generalization of our best models. The results on our test sets indicate that
combining the scores of three models (logistic regression using unigram term frequency-inverse
document frequency (TF-IDF) features, logistic regression using AraVec word embedding
features, and support vector machine using TF-IDF) achieves good detection performance
across all test sets, with area under the curve (AUC) scores of 84%, 89%, and 78% on the three
unseen datasets. In addition, we find that using lemmatization and considering emojis'
meanings have a considerable impact on the results. The pre-trained AraBERT model
outperforms all other trained models, with higher generalization performance and AUC
scores of 91%, 93%, and 85% on the unseen datasets. These results indicate that the
same combination of models and AraBERT are robust to data imbalance and achieve
relatively good generalization performance.
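The score-level combination described above can be sketched as soft voting over per-model probability scores. The snippet below is a minimal illustration using scikit-learn with toy English placeholder data; the paper's actual pipelines (Arabic preprocessing, AraVec embedding features, and the real dialect datasets) are not reproduced here, and the model settings shown are assumptions for illustration only.

```python
# Hedged sketch of score-level model combination (soft voting):
# average the predicted probabilities of a TF-IDF logistic regression
# and a TF-IDF SVM. Toy data stands in for the Arabic dialect corpora.
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Placeholder documents and binary labels (1 = hateful, 0 = not).
texts = [
    "kind friendly words", "hateful vile slur", "warm nice message",
    "abusive hateful insult", "pleasant good post", "vile abusive slur",
    "friendly kind reply", "hateful abusive text",
]
labels = [0, 1, 0, 1, 0, 1, 0, 1]

# Unigram TF-IDF features feeding each classifier; probability=True
# lets the SVM expose probability scores that can be averaged.
lr = make_pipeline(TfidfVectorizer(ngram_range=(1, 1)), LogisticRegression())
svm = make_pipeline(TfidfVectorizer(ngram_range=(1, 1)), SVC(probability=True))

# "Scores combination": soft voting averages the models' probabilities.
ensemble = VotingClassifier(estimators=[("lr", lr), ("svm", svm)], voting="soft")
ensemble.fit(texts, labels)
probs = ensemble.predict_proba(["friendly kind message"])
```

In this sketch the combined score for each class is the mean of the two models' probabilities, which is one common way to realize the score-level fusion the abstract refers to.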