Abstract:
Static malware detection using machine learning has achieved high accuracy, but the resulting models often operate as black boxes, limiting their practical utility. This thesis addresses a key challenge in this domain: the \emph{interpretability gap}, the disconnect between a model’s low-level feature attributions and a human analyst’s need for high-level semantic understanding. In this context, an \emph{explanation} is the set of input features that a post-hoc interpretability method identifies as most influential in the model’s prediction. To bridge this gap, we propose a tag-based explanation framework that maps these influential features to 11 behavioral descriptors of malware, such as ransomware and dropper, using the SOREL dataset. We investigate three widely used explainability methods, Captum (Integrated Gradients), SHAP, and LIME, and assess how well their explanations align with these human-understandable tags. The main contributions of this thesis are: (1) a novel framework that transforms post-hoc explanations into semantically meaningful behavioral tags; (2) a systematic comparison of XAI methods with respect to their consistency and interpretability; and (3) a demonstration that explanation-derived feature vectors can support accurate tag inference through supervised learning. We also decompose explanation outputs into functional feature categories to support structured interpretation and downstream integration with language models. Captum emerges as the most effective explainability method, enabling tag prediction with a general accuracy of 0.97 (at least one predicted tag is correct) and a top-1 accuracy of 0.95 (the most confidently predicted tag matches one of the true labels).