Abstract:
Open-domain dialogue agents are systems that can converse with users on any topic of the user’s choosing. Building such agents has been a long-standing objective in Artificial Intelligence, as they can make human-computer interaction much more engaging. Recent advances in English open-domain dialogue have leveraged state-of-the-art Large Language Models (LLMs) for Natural Language Generation (NLG). Such LLMs are pre-trained on massive amounts of unlabeled data in a self-supervised manner to learn abstract representations of the language. They also require large amounts of labeled open-domain dialogue data for fine-tuning on the challenging task of dialogue response generation. In low-resource settings such as Arabic and its dialects, such pre-trained LLMs and large labeled dialogue datasets are often non-existent, which hinders the development of open-domain chatbots for those languages. This limited-resource modeling problem is known as the few-shot learning problem.
In this thesis, we address multiple aspects of the few-shot learning problem for open-domain Arabic conversational bots. The first contribution is a solution to the unavailability of both LLMs and large amounts of labeled dialogue data for Modern Standard Arabic (MSA). To address the response generation problem, we propose a model that transfers knowledge from a pre-trained BERT encoder to an encoder-decoder model for dialogue response generation. The second contribution addresses a more extreme case of limited resources: Arabic dialects. To address the LLM and NLG challenges for Arabic dialects, we propose a three-stage learning framework based on warm-starting, self-supervised pre-training, and few-shot fine-tuning. The third contribution addresses the challenge of ensuring that generated responses are relevant to the user’s query, for both English and Arabic. We propose a new decoding algorithm that samples a larger number of candidate responses and then selects the candidate with the highest similarity to the user’s query. The fourth contribution is the development of new data resources for Arabic: one message-response dataset in MSA and three datasets for the most widely spoken Arabic dialects (Levantine, Egyptian, and Gulf). Experimental results show that the proposed methods are effective and achieve state-of-the-art performance for Arabic open-domain dialogue systems.
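The selection step of the decoding algorithm described in the third contribution can be illustrated with a minimal sketch. This is not the thesis implementation. It assumes the candidate responses have already been sampled from a generator, and it uses bag-of-words cosine similarity as a stand-in, since the abstract does not name the similarity measure. The function names `cosine_similarity` and `pick_most_relevant` are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Bag-of-words cosine similarity between two strings.

    Assumption: a simple stand-in for the similarity measure, which the
    abstract does not specify.
    """
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(count * vb[token] for token, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def pick_most_relevant(query: str, candidates: list[str]) -> str:
    """Return the sampled candidate most similar to the user's query."""
    if not candidates:
        raise ValueError("at least one candidate response is required")
    return max(candidates, key=lambda response: cosine_similarity(query, response))


# Hypothetical candidates standing in for samples drawn from a generator.
query = "what is your favorite food"
candidates = ["i like to travel", "my favorite food is pizza", "ok"]
best = pick_most_relevant(query, candidates)
```

In practice the candidates would come from sampling the response generator several times, and the similarity could be computed over sentence embeddings instead of word counts; both choices are left open here.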