AUB ScholarWorks

Predicting Birth Defects Using Cost Sensitive Machine Learning


dc.contributor.advisor Abu Salem, Fatima
dc.contributor.author Hamandi, Ahmad
dc.date.accessioned 2021-02-07T12:21:44Z
dc.date.available 2021-02-07T12:21:44Z
dc.date.issued 2/7/2021
dc.identifier.uri http://hdl.handle.net/10938/22220
dc.description.abstract Many studies have addressed birth defects. Most focus only on medical factors such as consanguinity degree, folic acid intake, and diabetes. Others examine the effect of ambient air pollution on the health of newborns. Studies that use artificial intelligence to predict birth defects from ambient air pollution are rare. In this thesis, we use data science and machine learning to build a tool that predicts birth defects from ambient air pollution and medical data. We used several techniques to produce trustworthy and interpretable predictions from imbalanced data. The first is data sampling, in which the data are balanced before any learning takes place. Sampling benefited many models; for instance, oversampling improved the F2 score of logistic regression by 5%. The second is cost-sensitive learning, which uses algorithms that can model imbalanced data directly. We also performed feature selection, a process that reduces the number of features to improve modeling performance, to identify the most important features for our study. Several of the selected features were confirmed by SHAP, a technique that highlights the contribution of each feature to a specific prediction. Our main focus was to predict the probability of a birth defect and to deliver trustworthy and explainable results to the end user. After comparing several models under several configurations, we found that cost-sensitive logistic regression and support vector machines performed best.
Cost-sensitive logistic regression was the best model for both classification and probability prediction, and support vector machines came closest to it in performance. For classification, cost-sensitive logistic regression recorded an F2 score of 93.46% on the training data. For probability prediction, cost-sensitive logistic regression recorded a Brier Skill Score of 74.23% and support vector machines recorded 69.21%. As baselines, the Brier Skill Score is 5%, and the F2 score is 0% when all instances are uniformly assigned to the majority class. Cost-sensitive logistic regression was also the best in terms of training time and ease of use: it is faster and less biased than the other models, and it tends to make fewer mistakes on frequent patterns in the data. Ease of use is defined here as the ability to predict birth defects from a smaller number of features and to do so early, during the first few weeks of pregnancy. These features include consanguinity degree, mother's age, folic acid consumption before pregnancy, chronic disease, BMI, and exposure to air pollutants before and during the window of risk. All models in our study showed the same or better performance when run on the selected features, so they are all aligned in terms of ease of use. SHAP also indicated the same level of trustworthiness for both selected models.
SHAP highlights the contribution of each feature to the prediction. Some of these contributions agree with the literature, which increases trust in the models. We found that consanguinity degree, mother's age, folic acid consumption before pregnancy, and chronic disease are top contributors to the models' decisions, along with several air pollutants. SHAP also shows the direction of each contribution, that is, whether a given feature raises or lowers the predicted probability of a specific birth defect. For example, SHAP showed that folic acid intake before pregnancy lowers the predicted risk of a birth defect, while higher mother's age and lower mother's education raise the predicted probability, which agrees with the literature.
dc.language.iso en
dc.subject birth defects
dc.subject data science
dc.subject machine learning
dc.subject environmental health
dc.subject medicine
dc.subject public health
dc.subject imbalanced data
dc.subject cost sensitive machine learning
dc.title Predicting Birth Defects Using Cost Sensitive Machine Learning
dc.type Thesis
dc.contributor.department Department of Computer Science
dc.contributor.faculty Faculty of Arts and Sciences
dc.contributor.institution American University of Beirut
dc.contributor.commembers Nassar, Mohamed El Baker
dc.contributor.commembers Yunis, Khalid
dc.contributor.commembers Dhaini, Hassan
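
The abstract reports results as an F2 score and a Brier Skill Score and contrasts them with a majority-class baseline. The following is a minimal sketch of how these quantities are commonly computed, together with a class-weighted log loss of the kind that cost-sensitive logistic regression typically minimizes. It is not the thesis code. The function names, the toy labels, and the choice of the observed base rate as the Brier reference forecast are all assumptions made for illustration; the abstract does not state which reference forecast or class weights were used.

```python
import math


def fbeta_score(y_true, y_pred, beta=2.0):
    """F-beta for the positive class (label 1). beta=2 weights recall above precision."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = (1 + beta ** 2) * tp + beta ** 2 * fn + fp
    return 0.0 if denom == 0 else (1 + beta ** 2) * tp / denom


def brier_score(y_true, probs):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - t) ** 2 for t, p in zip(y_true, probs)) / len(y_true)


def brier_skill_score(y_true, probs):
    """1 - BS / BS_ref. Assumption: the reference always predicts the base rate."""
    base_rate = sum(y_true) / len(y_true)
    reference = brier_score(y_true, [base_rate] * len(y_true))
    return 1.0 - brier_score(y_true, probs) / reference


def weighted_log_loss(y_true, probs, w_pos, w_neg):
    """Log loss in which errors on positives cost w_pos and errors on negatives w_neg."""
    total = 0.0
    for t, p in zip(y_true, probs):
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # keep log() finite
        total -= w_pos * t * math.log(p) + w_neg * (1 - t) * math.log(1.0 - p)
    return total / len(y_true)


# Toy imbalanced labels (hypothetical): 8 negatives, 2 positives.
y_true = [0] * 8 + [1] * 2

# Assigning every instance to the majority class finds no positives, so F2 is 0.
majority_f2 = fbeta_score(y_true, [0] * 10)

# Hypothetical predicted probabilities from some model.
probs = [0.1] * 8 + [0.8, 0.7]
bss = brier_skill_score(y_true, probs)

# With w_pos > w_neg, a confident miss on a positive case is penalized more heavily.
plain = weighted_log_loss([1], [0.1], w_pos=1.0, w_neg=1.0)
costly = weighted_log_loss([1], [0.1], w_pos=5.0, w_neg=1.0)
```

Raising `w_pos` relative to `w_neg` is one common way to make a learner cost-sensitive: the optimizer is pushed toward recalling the rare class, which is also what the F2 score rewards.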

