AUB ScholarWorks

Transfer Learning Approach to Developing Large Scale Lexicon for Resource Constrained Languages


dc.contributor.advisor Hajj, Hazem
dc.contributor.author Maarouf, Alaa
dc.date.accessioned 2021-09-15T04:53:22Z
dc.date.available 2021-09-15T04:53:22Z
dc.date.issued 2021-09-15
dc.date.submitted 2021-09-14
dc.identifier.uri http://hdl.handle.net/10938/23012
dc.description.abstract Lexical resources often form critical components of computational models for natural language processing (NLP). As a result, progress in NLP for resource-constrained languages, such as Arabic, is slow due to limited resources compared with large-scale English resources such as the English WordNet (EWN), which contains a rich set of semantics and relations between words. Despite progress toward overcoming this challenge, lexical resources for non-English languages remain limited in size and in the accuracy of their semantics. In this thesis, we aim to overcome these limitations of size and accuracy by developing a method that generates a large-scale lexicon with rich semantics, transferring knowledge from a small lexical resource that has been reliably linked to EWN. Starting from a large-scale lexicon in the resource-constrained language with no prior connections to EWN, the method develops accurate links between the terms in that lexicon and EWN, thus creating the desired large-scale lexicon. While previous work explored this link prediction problem through shallow links with limited accuracy, we focus on developing links based on deeper word semantics. We combine deep learning models with feature-based machine learning models that can benefit from the rich semantics within EWN. We propose a three-step boosting approach: we first apply transfer learning by fine-tuning a BERT-based language model built for the resource-constrained language, then apply a decision tree classifier that uses the EWN semantics, and finally apply back-off prediction, using the Multilingual Universal Sentence Encoder (MUSE), for terms with missing EWN semantics.
The classifier predicts a link between two terms from the relations between the equivalent synsets in EWN, using the depth of the senses in the taxonomy, the number of edges separating the synsets, and the hypernym information within the is-a relationships between synsets. The first step of the boosting method aims for high recall, while the other two steps improve precision. The proposed method is tested on Arabic, creating a large-scale Arabic lexicon by predicting links between the Standard Arabic Morphological Analyzer (SAMA) and EWN. For the small-scale lexicon with previously established reliable connections to EWN, we use the Arabic WordNet (AWN). Compared to the state-of-the-art ArSenL 2.0, the results show relative F1 improvements in link accuracy of 4.1% for nouns, 14.5% for verbs, and 19.1% for adjectives.
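The taxonomy features the classifier draws on (sense depth, edge count between synsets, and hypernym membership), and the similarity used in the back-off step, can be sketched in a few lines of Python. This is an illustration only: the TAXONOMY dictionary and function names below are invented stand-ins for the full EWN hypernym graph, and the plain cosine function stands in for similarity over MUSE sentence embeddings.

```python
# Toy is-a taxonomy (child -> parent): a hypothetical stand-in for
# EWN's hypernym graph, used only to illustrate the three features.
TAXONOMY = {
    "dog": "canine", "wolf": "canine", "canine": "carnivore",
    "cat": "feline", "feline": "carnivore",
    "carnivore": "mammal", "mammal": "animal", "animal": None,
}

def ancestors(synset):
    """Chain from a synset up to the root, inclusive of the synset itself."""
    chain = []
    while synset is not None:
        chain.append(synset)
        synset = TAXONOMY.get(synset)
    return chain

def depth(synset):
    """Depth of a sense: number of is-a edges from the synset to the root."""
    return len(ancestors(synset)) - 1

def path_length(a, b):
    """Edges separating two synsets, via their lowest common ancestor."""
    chain_a, chain_b = ancestors(a), ancestors(b)
    common = set(chain_b)
    for i, node in enumerate(chain_a):
        if node in common:
            return i + chain_b.index(node)
    return None  # no common ancestor in this toy graph

def is_hypernym(candidate, synset):
    """True if candidate lies on the is-a chain above synset."""
    return candidate in ancestors(synset)[1:]

def cosine(u, v):
    """Cosine similarity; stands in for similarity over MUSE embeddings."""
    num = sum(x * y for x, y in zip(u, v))
    den = (sum(x * x for x in u) ** 0.5) * (sum(y * y for y in v) ** 0.5)
    return num / den
```

In the real pipeline these quantities come from EWN itself, and the back-off similarity is computed between MUSE embeddings of the candidate term pair; here they serve only to make the feature definitions concrete.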
dc.language.iso en
dc.subject transfer learning
dc.subject language model
dc.subject link prediction
dc.subject arabic wordnet expansion
dc.subject arabic natural language processing
dc.subject lexical resources
dc.subject arabic sentiment lexicon
dc.subject wordnet
dc.subject word semantics
dc.subject machine learning
dc.subject deep learning
dc.subject artificial intelligence
dc.title Transfer Learning Approach to Developing Large Scale Lexicon for Resource Constrained Languages
dc.type Thesis
dc.contributor.department Department of Electrical and Computer Engineering
dc.contributor.faculty Maroun Semaan Faculty of Engineering and Architecture
dc.contributor.institution American University of Beirut
dc.contributor.commembers Elhajj, Imad
dc.contributor.commembers Habash, Nizar
dc.contributor.degree ME
dc.contributor.AUBidnumber 201820749

