On the Trustworthiness of Large Language Models
Abstract
Large Language Models (LLMs) have demonstrated extraordinary capabilities across a wide range of natural language processing tasks, yet their widespread deployment introduces critical privacy and safety risks. Specifically, these models are prone to memorizing sensitive training data, which malicious actors can subsequently extract, and they are susceptible to adversarial manipulations that bypass safety alignment to generate harmful content. This thesis addresses gaps in the evaluation of these risks. On the privacy side, it evaluates the role of Membership Inference Attacks (MIAs) within targeted data extraction pipelines, showing that MIA performance is highly dependent on the threat model and attack setup. On the safety side, it provides a systematic study of jailbreak attacks in multilingual settings, analyzing their transferability across languages and highlighting weaknesses in alignment for non-standard scripts and low-resource languages.
Description
Release date: 2027-05-13.