Predicting Metabolic Syndrome for Major Depressive Disorder Patients: A Machine Learning Approach

Abstract

Metabolic syndrome is highly prevalent among patients with Major Depressive Disorder (MDD), and early identification of individuals at risk may support timely clinical intervention. However, predictive modelling in psychiatric cohorts is challenging because clinical datasets are typically small, contain missing values, and include heterogeneous clinical variables. This thesis investigates the feasibility of predicting metabolic syndrome using baseline information from the METADAP cohort. Several machine learning models were evaluated, including CatBoost, XGBoost, Random Forest, TabNet, Logistic Regression, and a multilayer perceptron. To examine how data preprocessing influences predictive performance, three experimental setups were designed: (1) imputing missing values using MissForest, (2) preserving missing values and relying on models that handle them natively, and (3) removing observations with missing values. Model performance was assessed using cross-validation and multiple evaluation metrics. In addition to predictive performance, feature importance and subgroup error analyses were conducted to explore model behaviour. The results show that predictive performance varies considerably depending on the handling of missing data. CatBoost achieved the most stable results when missing values were preserved, reaching an AUC of 0.8327 and an F1 score of 0.7143. In the complete-case setup, TabNet achieved the highest performance, reaching an AUC of 0.9196 and an F1 score of 0.6667 in the best cross-validation fold. Overall, the findings demonstrate that preprocessing strategies can substantially influence model performance in small psychiatric datasets. These results highlight the importance of carefully evaluating modelling and preprocessing strategies when applying machine learning to clinical cohort data.
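The three missing-data setups named in the abstract can be illustrated with a minimal sketch. It is not the thesis pipeline: the column names and values are hypothetical toy data rather than METADAP variables, and a simple column-median fill stands in for MissForest, which is an iterative random-forest imputer. The sketch only shows how each setup changes the number of usable rows and remaining gaps.

```python
from statistics import median

# Toy records with hypothetical column names; None marks a missing value.
rows = [
    {"bmi": 24.1, "triglycerides": 1.2, "hdl": 1.4},
    {"bmi": 31.0, "triglycerides": None, "hdl": 0.9},
    {"bmi": None, "triglycerides": 2.3, "hdl": 1.0},
    {"bmi": 27.5, "triglycerides": 1.8, "hdl": None},
    {"bmi": 22.9, "triglycerides": 0.9, "hdl": 1.6},
]


def impute_median(rows):
    """Setup 1 stand-in: fill each gap with the column median (NOT MissForest)."""
    columns = list(rows[0].keys())
    fill = {c: median(r[c] for r in rows if r[c] is not None) for c in columns}
    return [{c: (r[c] if r[c] is not None else fill[c]) for c in columns} for r in rows]


def preserve_missing(rows):
    """Setup 2: keep gaps as they are, for a model that handles them natively."""
    return [dict(r) for r in rows]


def complete_cases(rows):
    """Setup 3: drop every record that has at least one gap."""
    return [dict(r) for r in rows if all(v is not None for v in r.values())]


setups = {
    "imputed": impute_median(rows),
    "preserved": preserve_missing(rows),
    "complete_case": complete_cases(rows),
}

for name, data in setups.items():
    n_missing = sum(v is None for r in data for v in r.values())
    print(f"{name}: {len(data)} rows, {n_missing} missing values")
```

In this toy example the complete-case setup keeps only two of five records, which is the kind of sample-size loss that can make results in small cohorts sensitive to the preprocessing choice.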
