Diacritic-based matching of Arabic words

dc.contributor.author: Jarrar, Mustafa
dc.contributor.author: Zaraket, Fadi A.
dc.contributor.author: Asia, Rami
dc.contributor.author: Amayreh, Hamzeh
dc.contributor.department: Department of Electrical and Computer Engineering
dc.contributor.faculty: Maroun Semaan Faculty of Engineering and Architecture (MSFEA)
dc.contributor.institution: American University of Beirut
dc.date.accessioned: 2025-01-24T11:29:35Z
dc.date.available: 2025-01-24T11:29:35Z
dc.date.issued: 2018
dc.description.abstract: Words in Arabic consist of letters and short vowel symbols called diacritics inscribed atop regular letters. Changing diacritics may change the syntax and semantics of a word, turning it into another. This results in difficulties when comparing words based solely on string matching. Typically, Arabic NLP applications resort to morphological analysis to battle ambiguity originating from this and other challenges. In this article, we introduce three alternative algorithms to compare two words with possibly different diacritics. We propose the Subsume knowledge-based algorithm, the Imply rule-based algorithm, and the Alike machine-learning-based algorithm. We evaluated the soundness, completeness, and accuracy of the algorithms against a large dataset of 86,886 word pairs. Our evaluation shows the accuracy of Subsume (100%), Imply (99.32%), and Alike (99.53%). Although accurate, Subsume was able to judge only 75% of the data. Both Subsume and Imply are sound, while Alike is not. We demonstrate the utility of the algorithms using a real-life use case: lemma disambiguation and the linking of hundreds of Arabic dictionaries. © 2018 Association for Computing Machinery.
dc.identifier.doi: https://doi.org/10.1145/3242177
dc.identifier.eid: 2-s2.0-85058794589
dc.identifier.uri: http://hdl.handle.net/10938/27265
dc.language.iso: en
dc.publisher: Association for Computing Machinery
dc.relation.ispartof: ACM Transactions on Asian and Low-Resource Language Information Processing
dc.source: Scopus
dc.subject: Arabic
dc.subject: Diacritics
dc.subject: Disambiguation
dc.subject: Learning systems
dc.subject: Semantics
dc.subject: Alternative algorithms
dc.subject: Knowledge-based algorithms
dc.subject: Morphological analysis
dc.subject: Rule based algorithms
dc.subject: String matching
dc.subject: Knowledge based systems
dc.title: Diacritic-based matching of Arabic words
dc.type: Article

Files

Original bundle

Name: 2018-7838.pdf
Size: 5.82 MB
Format: Adobe Portable Document Format