Predicting Metabolic Syndrome for Major Depressive Disorder Patients: A Machine Learning Approach
Abstract
Metabolic syndrome is highly prevalent among patients with Major Depressive Disorder (MDD), and early identification of individuals at risk may support timely
clinical intervention. However, predictive modelling in psychiatric cohorts is challenging because clinical datasets are typically small, contain missing values, and
include heterogeneous clinical variables.
This thesis investigates the feasibility of predicting metabolic syndrome using baseline information from the METADAP cohort. Several machine learning models were
evaluated, including CatBoost, XGBoost, Random Forest, TabNet, Logistic Regression, and a multilayer perceptron. To examine how data preprocessing influences
predictive performance, three experimental setups were designed: (1) imputing missing values using MissForest, (2) preserving missing values and relying on models that
handle them natively, and (3) removing observations with missing values (complete-case analysis). Model
performance was assessed using cross-validation and multiple evaluation metrics, including AUC and F1 score. In
addition to predictive performance, feature importance and subgroup error analyses
were conducted to explore model behaviour.
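
The three missing-data setups described above can be illustrated with a minimal sketch. This is not the thesis pipeline: the toy rows, column meanings, and function names are hypothetical, and per-column median imputation is used only as a simple stand-in for MissForest, which instead imputes iteratively with random forests.

```python
import math
from statistics import median

# Hypothetical toy rows (age, waist in cm, triglycerides); not METADAP data.
# math.nan marks a missing value.
rows = [
    [45.0, 98.0, 1.7],
    [52.0, math.nan, 2.1],
    [38.0, 85.0, math.nan],
    [60.0, 102.0, 1.9],
]


def is_missing(value):
    """Return True when a cell holds NaN."""
    return isinstance(value, float) and math.isnan(value)


def impute_median(data):
    """Setup 1 (stand-in): fill each missing cell with its column median.

    The thesis uses MissForest here; the median is an assumption made
    only to keep this sketch dependency-free.
    """
    n_cols = len(data[0])
    medians = [
        median(row[j] for row in data if not is_missing(row[j]))
        for j in range(n_cols)
    ]
    return [
        [medians[j] if is_missing(v) else v for j, v in enumerate(row)]
        for row in data
    ]


def preserve_missing(data):
    """Setup 2: keep NaNs so a model with native support can handle them."""
    return [list(row) for row in data]


def complete_cases(data):
    """Setup 3: drop every row that contains at least one missing value."""
    return [list(row) for row in data if not any(is_missing(v) for v in row)]


imputed = impute_median(rows)
preserved = preserve_missing(rows)
complete = complete_cases(rows)
print(len(imputed), len(preserved), len(complete))
```

Note that the third setup shrinks the sample, which matters in a small cohort: fewer rows per cross-validation fold generally make fold-level metrics more variable.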
The results show that predictive performance varies considerably depending on the
handling of missing data. CatBoost achieved the most stable results when missing
values were preserved, reaching an AUC of 0.8327 and an F1 score of 0.7143. In the
complete-case setup, TabNet achieved the highest performance, reaching an AUC
of 0.9196 and an F1 score of 0.6667 in the best cross-validation fold. Overall, the
findings demonstrate that preprocessing strategies can substantially influence model
performance in small psychiatric datasets.
These results highlight the importance of carefully evaluating modelling and preprocessing strategies when applying machine learning to clinical cohort data.
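
For readers unfamiliar with the two metrics reported above, the following sketch shows one way AUC and F1 can be computed from scratch. The labels, scores, and the 0.5 decision threshold are hypothetical illustration values, not results from the thesis, and the function names are chosen for this example only.

```python
def f1_score(y_true, y_pred):
    """F1 is the harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = metabolic syndrome) and predicted probabilities.
y_true = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.4, 0.6, 0.7, 0.3, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

auc = roc_auc(y_true, scores)
f1 = f1_score(y_true, y_pred)
print(round(auc, 4), round(f1, 4))
```

AUC depends only on how cases are ranked, whereas F1 depends on the chosen threshold, which is one reason a model can score high on one metric and lower on the other.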