TY - JOUR
T1 - Data Imbalance in Autism Pre-Diagnosis Classification Systems
T2 - An Experimental Study
AU - Abdelhamid, Neda
AU - Padmavathy, Arun
AU - Peebles, David
AU - Thabtah, Fadi
AU - Goulder-Horobin, Daymond
PY - 2020/3/11
Y1 - 2020/3/11
N2 - Machine learning (ML) is a branch of computer science that is rapidly gaining popularity within the healthcare arena due to its ability to explore large datasets and discover useful patterns that can be interpreted for decision-making and prediction. ML techniques are used to analyse clinical parameters and their combinations for prognosis, therapy planning and support, and patient management and wellbeing. In this research, we investigate a crucial problem associated with medical applications such as autism spectrum disorder (ASD): data imbalance, in which cases far outnumber controls in the dataset. In autism diagnosis data, most instances are linked with one class, i.e. the no-ASD class is larger than the ASD class, and this may cause performance issues such as models favouring the majority class and undermining the minority class. This research experimentally measures the impact of the class imbalance issue on the performance of different classifiers on real autism datasets when various data imbalance approaches are utilised in the pre-processing phase. We employ oversampling techniques, such as the Synthetic Minority Oversampling Technique (SMOTE), and undersampling with different classifiers, including Naive Bayes, RIPPER, C4.5 and Random Forest, to measure their impact on the performance of the derived models in terms of area under the curve and other metrics. Results indicate that oversampling techniques are superior to undersampling techniques, at least for the toddlers’ autism dataset that we consider, and suggest that further work should look at combining sampling techniques with feature selection to generate models that do not overfit the dataset.
AB - Machine learning (ML) is a branch of computer science that is rapidly gaining popularity within the healthcare arena due to its ability to explore large datasets and discover useful patterns that can be interpreted for decision-making and prediction. ML techniques are used to analyse clinical parameters and their combinations for prognosis, therapy planning and support, and patient management and wellbeing. In this research, we investigate a crucial problem associated with medical applications such as autism spectrum disorder (ASD): data imbalance, in which cases far outnumber controls in the dataset. In autism diagnosis data, most instances are linked with one class, i.e. the no-ASD class is larger than the ASD class, and this may cause performance issues such as models favouring the majority class and undermining the minority class. This research experimentally measures the impact of the class imbalance issue on the performance of different classifiers on real autism datasets when various data imbalance approaches are utilised in the pre-processing phase. We employ oversampling techniques, such as the Synthetic Minority Oversampling Technique (SMOTE), and undersampling with different classifiers, including Naive Bayes, RIPPER, C4.5 and Random Forest, to measure their impact on the performance of the derived models in terms of area under the curve and other metrics. Results indicate that oversampling techniques are superior to undersampling techniques, at least for the toddlers’ autism dataset that we consider, and suggest that further work should look at combining sampling techniques with feature selection to generate models that do not overfit the dataset.
KW - autism spectrum disorder
KW - ASD screening
KW - data imbalance
KW - machine learning
KW - undersampling
KW - oversampling
KW - SMOTE
UR - http://www.scopus.com/inward/record.url?scp=85082406049&partnerID=8YFLogxK
U2 - 10.1142/S0219649220400146
DO - 10.1142/S0219649220400146
M3 - Article
VL - 19
JO - Journal of Information and Knowledge Management
JF - Journal of Information and Knowledge Management
SN - 0219-6492
IS - 1
M1 - 2040014
ER -