Predicting Birth Defects Using Cost Sensitive Machine Learning

Hamandi, Ahmad

dc.contributor.advisor	Abu Salem, Fatima
dc.contributor.author	Hamandi, Ahmad
dc.date.accessioned	2021-02-07T12:21:44Z
dc.date.available	2021-02-07T12:21:44Z
dc.date.issued	2021-02-07
dc.identifier.uri	http://hdl.handle.net/10938/22220
dc.description.abstract	Many studies have addressed birth defects, but most focus only on medical causes such as consanguinity degree, folic acid intake, and diabetes. Another set of studies has examined the effect of ambient air pollution on the health of newborns. Studies that use artificial intelligence to detect birth defects from ambient air pollution are rare. In this thesis, we use data science and machine learning to build a tool that predicts birth defects from ambient air pollution and medical data, and we apply several techniques to produce trustworthy and interpretable predictions on imbalanced data. The first technique is data sampling, in which the data are balanced before any learning takes place. Sampling benefited many models; for instance, oversampling improved the F2 score of logistic regression by 5%. The second technique is cost-sensitive learning, which relies on algorithms designed to model imbalanced data directly. We also performed feature selection, a process that reduces the number of features in order to enhance modeling performance, to identify the most important features for our study. Several of the features it identified were confirmed by SHAP, a technique that highlights the contribution of each feature to a specific prediction. Our main focus was to predict the probability of a birth defect and to deliver trustworthy and explainable results to the end user. After comparing several models under several configurations, we found that cost-sensitive logistic regression and support vector machines performed best.
Cost-sensitive logistic regression was the best model for both classification and probability prediction, and support vector machines came closest to it in performance. For classification, cost-sensitive logistic regression recorded an F2 score of 93.46% on the training data. For probability prediction, cost-sensitive logistic regression recorded a Brier Skill Score of 74.23% and support vector machines recorded 69.21%. As baselines, the Brier Skill Score is 5%, and the F2 score is 0% when all instances are uniformly assigned to the majority class. Cost-sensitive logistic regression is also the best in terms of training time and ease of use: it is faster and less biased than the other models, and it tends to make fewer mistakes on frequent patterns in the data. Ease of use is defined here as the ability to predict birth defects from a smaller number of features and to do so early, during the first few weeks of pregnancy. These features include consanguinity degree, mother's age, folic acid consumption before pregnancy, chronic disease, BMI, and exposure to air pollutants before and during the window of risk. All models in our study showed the same or better performance when run on the selected features, so they are all aligned in terms of ease of use. SHAP also revealed the same trustworthiness for both selected models.
SHAP highlights the contribution of each feature to the prediction process. Some of these contributions were aligned with the literature, which increases trust in the models we use: consanguinity degree, mother's age, folic acid consumption before pregnancy, and chronic disease are top contributors to the models' decisions, along with air pollutants. SHAP also shows the direction of each contribution, which allowed us to detect whether a specific feature raises or lowers the probability of a specific birth defect. For example, SHAP showed that folic acid intake before pregnancy lowers the risk of a birth defect, while higher mother's age and lower mother's education raise it, which is aligned with the literature.
dc.language.iso	en
dc.subject	birth defects
dc.subject	data science
dc.subject	machine learning
dc.subject	environmental health
dc.subject	medicine
dc.subject	public health
dc.subject	imbalanced data
dc.subject	cost sensitive machine learning
dc.title	Predicting Birth Defects Using Cost Sensitive Machine Learning
dc.type	Thesis
dc.contributor.department	Department of Computer Science
dc.contributor.faculty	Faculty of Arts and Sciences
dc.contributor.institution	American University of Beirut
dc.contributor.commembers	Nassar, Mohamed El Baker
dc.contributor.commembers	Yunis, Khalid
dc.contributor.commembers	Dhaini, Hassan
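
Illustrative note on the metrics named in the abstract. The sketch below is not taken from the thesis; it is a minimal, standard-library Python illustration of how an F2 score and a Brier Skill Score are commonly computed. The function names, the toy labels, and the choice of reference forecast (a constant prediction equal to the positive-class prevalence) are assumptions made for this example. The abstract does not state which reference the thesis used, so the baseline of 5% it reports need not match the reference chosen here.

```python
def fbeta_score(y_true, y_pred, beta=2.0):
    """F-beta for the positive class (label 1).

    beta=2 gives the F2 score, which weights recall above precision.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return 0.0 if denom == 0 else (1 + b2) * tp / denom


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)


def brier_skill_score(y_true, y_prob):
    """1 - BS_model / BS_reference.

    Assumption: the reference forecast is the constant positive-class
    prevalence. Requires both classes to be present in y_true.
    """
    prevalence = sum(y_true) / len(y_true)
    reference = brier_score(y_true, [prevalence] * len(y_true))
    if reference == 0:
        raise ValueError("y_true must contain both classes")
    return 1.0 - brier_score(y_true, y_prob) / reference


# Hypothetical imbalanced toy data: 8 negatives, 2 positives.
y_true = [0] * 8 + [1] * 2

# Assigning every instance to the majority class yields F2 = 0.
majority = [0] * 10
print(fbeta_score(y_true, majority))  # 0.0

# One true positive, one false positive, one false negative.
y_pred = [0] * 7 + [1] + [1, 0]
print(fbeta_score(y_true, y_pred))  # 0.5

# Predicted probabilities that separate the classes reasonably well.
y_prob = [0.1] * 8 + [0.8] * 2
print(round(brier_skill_score(y_true, y_prob), 3))  # 0.9
```

The first printed value mirrors the abstract's remark that a classifier assigning all instances to the majority class records an F2 score of 0%: with no true positives, the numerator of the F-beta formula is zero.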

Files in this item

Name: HamandiA_2021.pdf

Size: 20.07Mb

Format: PDF

This item appears in the following Collection(s)

AUB Students' Theses, Dissertations, and Projects [12709]

Copyright Statement

All materials included in the institutional repository are protected by copyright laws and are the property of their respective copyright holders. Materials may be used for non-commercial, educational, or research purposes only, and must be cited or attributed to the original source. Permission for any other use must be obtained from the copyright holder(s) directly. The American University of Beirut Libraries does not assume responsibility for any infringement of copyright laws that may occur as a result of the use of materials in the repository. If you believe that your copyright has been infringed upon in the repository, please contact the AUB Libraries immediately.

For further information, please contact us at scholarworks@aub.edu.lb