Machine Learning for Clinical Decision Support Under Limited Data

Abstract

Machine learning is increasingly used to support clinical decision-making in tasks such as screening, diagnosis, risk stratification, prognosis, and treatment planning. However, many clinical problems lack large, balanced, and representative datasets, which makes models more vulnerable to overfitting, unstable feature associations, and poor generalization. This thesis investigates machine learning for clinical decision support under limited data through two case studies: Autism Spectrum Disorder (ASD) screening from children's speech transcripts, and \textit{BRCA1/2} variant interpretation in data-limited populations. In the first study, linguistic features extracted from small public TalkBank transcript datasets, including Mean Length of Utterance, Mean Length of Turn Ratio, and part-of-speech patterns, were combined with demographic variables to train Logistic Regression, Random Forest, and TabNet models for ASD screening. The models achieved accuracy exceeding 86\% on binary classification tasks, and feature analysis showed that a compact set of interpretable linguistic features retained meaningful predictive signal. In the second study, supervised models were trained on expert-curated global \textit{BRCA1/2} variant datasets using biologically grounded features; population-frequency variables and externally trained meta-predictors were excluded to reduce bias and circularity. The best-performing global model was then combined with population-specific anomaly detection in a Lebanese cohort under a conservative agreement rule: a classification was supported only when the global supervised model and the local anomaly detector agreed, and uncertainty was retained when their predictions diverged. Together, these studies show that limited data should not prevent the use of machine learning in clinically important problems, but should shape how models are designed, evaluated, and interpreted.
Across both case studies, the thesis emphasizes clinically meaningful feature engineering, interpretable model behavior, and cautious decision support. Machine learning can provide useful clinical support under limited data when its outputs are treated as evidence that guides, rather than replaces, clinical interpretation.
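
The conservative agreement rule described for the second study can be illustrated with a minimal sketch. This is not the thesis implementation. The function name `agreement_call`, the label strings, and in particular the mapping from the local anomaly flag to a class (anomalous read as pathogenic-like, non-anomalous as benign-like) are assumptions made only for illustration; the abstract does not specify how the two outputs are compared.

```python
def agreement_call(global_label: str, local_is_anomaly: bool) -> str:
    """Return a class only when both sources agree; otherwise 'uncertain'.

    global_label: output of the global supervised model,
        either "pathogenic" or "benign" (hypothetical label names).
    local_is_anomaly: flag from the population-specific anomaly detector.
    """
    if global_label not in ("pathogenic", "benign"):
        raise ValueError(f"unexpected label: {global_label!r}")

    # ASSUMPTION (not stated in the abstract): an anomalous variant in the
    # local cohort is read as pathogenic-like, a non-anomalous one as benign-like.
    local_label = "pathogenic" if local_is_anomaly else "benign"

    # Conservative rule: support a classification only on agreement,
    # and retain uncertainty when the two sources diverge.
    return global_label if global_label == local_label else "uncertain"


if __name__ == "__main__":
    for label, flag in [("pathogenic", True), ("benign", False), ("benign", True)]:
        print(label, flag, "->", agreement_call(label, flag))
```

Under these assumptions, a disagreement never yields a class label, so the rule trades coverage for caution: fewer variants receive a call, but each call is backed by both the global and the local evidence.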
