AUB ScholarWorks

Detecting Hate Speech Across Arabic Dialects

dc.contributor.advisor Khreich, Wael
dc.contributor.author Harba, Sara
dc.date.accessioned 2022-05-16T13:12:16Z
dc.date.available 2022-05-16T13:12:16Z
dc.date.issued 5/16/2022
dc.date.submitted 5/7/2022
dc.identifier.uri http://hdl.handle.net/10938/23394
dc.description.abstract With the ever-increasing adoption of social network platforms, online hate speech has become a pressing and growing issue. Hate speech detection in English is attracting increasing attention, and some detection systems have reported successful results. In contrast, hate speech detection in Arabic still faces various challenges, mainly due to the wide variety of Arabic dialects. The main goal of this work is to build an accurate hate speech detection system that generalizes well across different Arabic dialects. To this end, we conduct an extensive analysis of various preprocessing techniques (e.g., stemming, lemmatization, and emoji translation), feature extraction techniques (e.g., frequency-based features and word embeddings), classification models (including Logistic Regression and Support Vector Machine), and combination techniques (at the data, feature, and model levels). We fine-tune BERT models and optimize their hyperparameters for our detection tasks. Our experiments include six datasets containing different dialects and three datasets with the Levantine dialect, the Tunisian dialect, and a combination of several dialects. We combine 80% of each of the six datasets for model training and validation, and use the remaining part for model evaluation. The three remaining datasets are kept for testing the generalization of our best models. The results on our test sets indicate that combining the scores of three models (logistic regression using unigram term frequency-inverse document frequency (TF-IDF) features, logistic regression using AraVec word embedding features, and a support vector machine using TF-IDF features) achieves good detection performance across all test sets, with an area under the curve (AUC) of 84%, 89%, and 78% on the three unseen datasets. In addition, we find that using lemmatization and considering emojis' meanings have a considerable impact on the results.
The pre-trained AraBERT model outperforms all other trained models, with higher generalization performance and AUC scores of 91%, 93%, and 85% on the unseen datasets. The results indicate that both the same three-model combination and AraBERT are robust to data imbalance and achieve relatively good generalization performance.
dc.language.iso en
dc.subject Hate Speech Detection
dc.subject Social Media
dc.subject Arabic Dialects
dc.subject Machine Learning Algorithms
dc.subject Language Models
dc.title Detecting Hate Speech Across Arabic Dialects
dc.type Thesis
dc.contributor.department Business Analytics
dc.contributor.faculty Suliman S. Olayan School of Business
dc.contributor.institution American University of Beirut
dc.contributor.commembers Khreich, Wael
dc.contributor.commembers Sammouri, Wissam
dc.contributor.degree MS
dc.contributor.AUBidnumber 202124014
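
The abstract describes combining the scores of three classifiers and evaluating the result with AUC, but it does not say how the scores are combined. The following is a minimal sketch of one common approach, assuming min-max normalization followed by a simple mean; the combination rule, the function names, and the toy scores are all illustrative assumptions, not the thesis's implementation or its data.

```python
from typing import List, Sequence


def min_max(scores: Sequence[float]) -> List[float]:
    """Rescale scores to [0, 1] so models with different ranges are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores identical: no ranking information, use a neutral value.
        return [0.5 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def combine_scores(per_model_scores: Sequence[Sequence[float]]) -> List[float]:
    """Average the normalized scores of several models, sample by sample.

    Assumption: a plain mean; the abstract does not specify the rule.
    """
    normalized = [min_max(scores) for scores in per_model_scores]
    n_models = len(normalized)
    return [sum(column) / n_models for column in zip(*normalized)]


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive sample is scored
    above a random negative sample (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = hate speech, 0 = not hate speech.
labels = [1, 0, 1, 0, 1, 0]
# Hypothetical outputs of three classifiers on the same six samples.
model_a = [0.9, 0.2, 0.4, 0.5, 0.8, 0.1]      # e.g. probabilities
model_b = [0.7, 0.4, 0.6, 0.3, 0.5, 0.2]      # e.g. probabilities
model_c = [1.5, -0.5, 0.2, -1.0, -0.2, -2.0]  # e.g. SVM decision margins

combined = combine_scores([model_a, model_b, model_c])
print(f"AUC of combined scores: {auc(labels, combined):.2f}")
```

Normalizing before averaging matters here because probability outputs and SVM decision margins live on different scales; without it, the model with the widest score range would dominate the mean.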

