Abstract:
With the ever-increasing adoption of social network platforms, online hate speech has
become a pressing and growing issue. Hate speech detection in English has attracted
increasing attention, and some detection systems have shown successful results. In
contrast, hate speech detection in Arabic still faces various challenges, mainly due to
the wide variety of Arabic dialects. The main goal of this work is to build an accurate
hate speech detection system that generalizes well across different Arabic dialects. Therefore,
we conduct an extensive analysis of various preprocessing techniques (e.g., stemming,
lemmatization, and emojis translation), feature extraction techniques (e.g., frequency-based
and word embeddings), classification models (including Logistic Regression and Support
Vector Machine), and combination techniques (at the data, feature, and model level). We
fine-tune BERT models and optimize their hyperparameters for our detection tasks. Our
experiments include six datasets containing different dialects and three datasets with
the Levantine dialect, the Tunisian dialect, and a combination of several dialects. We
combine 80% of each of the six datasets for model training and validation, while the
remaining part is used for model evaluation. The three remaining datasets are kept for
testing the generalization of our best models. The results on our test sets indicate that
combining the scores of three models (logistic regression using unigram term frequency-inverse
document frequency (TF-IDF) features, logistic regression using AraVec word embedding
features, and support vector machine using TF-IDF) achieves good detection performance
across all test sets, with area under the curve (AUC) scores of 84%, 89%, and 78% on the three
unseen datasets. In addition, we find that using lemmatization and considering emojis'
meanings have a considerable impact on the results. The pre-trained AraBERT model
outperforms all other trained models, with higher generalization performance and AUC
scores of 91%, 93%, and 85% on the unseen datasets. These results indicate that the
same combination of models and AraBERT are robust to data imbalance and achieve
relatively good generalization performance.
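The score-level combination described above can be sketched as soft voting over per-model probability scores. The snippet below is a minimal illustration using scikit-learn with toy English placeholder data; the paper's actual pipelines (Arabic preprocessing, AraVec embedding features, and the real dialect datasets) are not reproduced here, and the model settings shown are assumptions for illustration only.

```python
# Hedged sketch of score-level model combination (soft voting):
# average the predicted probabilities of a TF-IDF logistic regression
# and a TF-IDF SVM. Toy data stands in for the Arabic dialect corpora.
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Placeholder documents and binary labels (1 = hateful, 0 = not).
texts = [
    "kind friendly words", "hateful vile slur", "warm nice message",
    "abusive hateful insult", "pleasant good post", "vile abusive slur",
    "friendly kind reply", "hateful abusive text",
]
labels = [0, 1, 0, 1, 0, 1, 0, 1]

# Unigram TF-IDF features feeding each classifier; probability=True
# lets the SVM expose probability scores that can be averaged.
lr = make_pipeline(TfidfVectorizer(ngram_range=(1, 1)), LogisticRegression())
svm = make_pipeline(TfidfVectorizer(ngram_range=(1, 1)), SVC(probability=True))

# "Scores combination": soft voting averages the models' probabilities.
ensemble = VotingClassifier(estimators=[("lr", lr), ("svm", svm)], voting="soft")
ensemble.fit(texts, labels)
probs = ensemble.predict_proba(["friendly kind message"])
```

In this sketch the combined score for each class is the mean of the two models' probabilities, which is one common way to realize the score-level fusion the abstract refers to.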