Building a Comprehensive Large Arabic Fact Checking Dataset Using Large Language Models
Abstract
Large-scale fact verification poses a significant challenge in Arabic natural language
processing due to limited datasets and resources. This work introduces a new large-
scale dataset for fact-checking in Modern Standard Arabic, constructed through
an automated framework leveraging large language models (LLMs). We propose a
three-step pipeline: (1) claim generation from Arabic Wikipedia articles with sup-
porting evidence, (2) systematic claim mutation to create challenging counterfactual
statements, and (3) rigorous verification and labeling. The resulting dataset com-
prises 180,000 claim-evidence pairs labeled as Supported, Refuted, or Not Enough
Info. Human evaluation shows strong inter-annotator agreement on our test sample,
with Cohen’s kappa of κ = 0.89 for the Generation Task and κ = 0.94 for the
Refutation Task, while our baseline models achieve 87% accuracy on the
verification task with respect to the expert annotator. Our approach employs specialized
prompt engineering and grammatical rules to address Arabic-specific linguistic
features. The dataset provides the first large-scale benchmark for Arabic fact verification. Our
methodology presents a scalable approach for developing similar resources for other
low-resource languages. Through this work, we aim to advance the state of auto-
mated fact verification in Arabic and provide a foundation for future research in
multilingual fact-checking.
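
For readers unfamiliar with the agreement statistic reported above, the following is a minimal sketch of how Cohen's kappa can be computed for two annotators. It is an illustration only, not the evaluation code used in this work: the function name `cohen_kappa` and the toy annotations are hypothetical, and only the three label names are taken from the abstract.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty item set.")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout; treat as perfect agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up toy annotations (NOT data from the paper), using the dataset's label set.
annotator_1 = ["Supported", "Refuted", "Supported", "Not Enough Info", "Refuted", "Supported"]
annotator_2 = ["Supported", "Refuted", "Refuted", "Not Enough Info", "Refuted", "Supported"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

Kappa corrects raw agreement for the agreement expected by chance, so it is a stricter measure than the plain percentage of matching labels.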