Building a Comprehensive Large Arabic Fact Checking Dataset Using Large Language Models

dc.contributor.advisorElbassuoni, Shady
dc.contributor.authorKhalil, Christophe
dc.contributor.commembersAssaf, Rida
dc.contributor.commembersMouawad, Amer
dc.contributor.degreeMS
dc.contributor.departmentDepartment of Computer Science
dc.contributor.facultyFaculty of Arts and Sciences
dc.contributor.institutionAmerican University of Beirut
dc.date2025
dc.date.accessioned2025-02-18T11:17:58Z
dc.date.available2025-02-18T11:17:58Z
dc.date.issued2025-02-17T22:00:00Z
dc.date.submitted2025-02-12T22:00:00Z
dc.description.abstractLarge-scale fact verification poses a significant challenge in Arabic natural language processing due to limited datasets and resources. This work introduces a new large-scale dataset for fact-checking in Modern Standard Arabic, constructed through an automated framework leveraging large language models (LLMs). We propose a three-step pipeline: (1) claim generation from Arabic Wikipedia articles with supporting evidence, (2) systematic claim mutation to create challenging counterfactual statements, and (3) rigorous verification and labeling. The resulting dataset comprises 180,000 claim-evidence pairs labeled as Supported, Refuted, or Not Enough Info. Human evaluation shows strong inter-annotator agreement on our testing sample, with Cohen’s kappa of κ = 0.89 for the Generation Task and κ = 0.94 for the Refutation Task, while our baseline models achieve 87% accuracy on the verification task with respect to the expert annotator. Our approach employs specialized prompt engineering and grammatical rules to address Arabic-specific linguistic features. This provides the first large-scale benchmark for Arabic fact verification. Our methodology presents a scalable approach for developing similar resources for other low-resource languages. Through this work, we aim to advance the state of automated fact verification in Arabic and provide a foundation for future research in multilingual fact-checking.
dc.identifier.urihttp://hdl.handle.net/10938/34784
dc.language.isoen
dc.subject.keywordsLarge Language Models (LLMs)
dc.subject.lcshDeep learning (Machine learning)
dc.subject.lcshArabic language--Data processing
dc.subject.lcshData sets
dc.subject.lcshNatural language processing (Computer science)
dc.subject.lcshComputational linguistics
dc.titleBuilding a Comprehensive Large Arabic Fact Checking Dataset Using Large Language Models
dc.typeThesis
local.AUBID202371962
