Imputation of Missing Clinical Covariates for Downstream Classification Problems

Benjamin Agbo, Hussain Al-Aqrabi, Tariq Alsboui, Muhammad Hussain, Richard Hill

Research output: Contribution to journal › Article › peer-review


The growing use of intelligent devices has resulted in vast amounts of data being generated by sensors. Databases of this size commonly contain large numbers of missing values. This is a challenge for data miners because many data analysis methods only work well on complete databases. A traditional approach to handling missing data is to discard instances with missing values and use only complete cases for analysis. However, research has shown that this approach is not practical, especially when large amounts of data are missing. This has increased the need for strategies that replace missing values with plausible values through imputation. This study presents an imputation strategy called med.BFMVI for recovering missing values before training downstream classification models. Experiments simulated missingness from 10% to 40% using the missing completely at random (MCAR) and missing at random (MAR) mechanisms, and the performance of the proposed technique was measured against state-of-the-art techniques. Overall, the proposed algorithm recorded the best imputation accuracy compared with the benchmark techniques and showed significant improvements in downstream learning.
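The evaluation protocol described in the abstract (hide known values at a fixed rate, impute them, and compare against the truth) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the synthetic Gaussian data, the helper names `mask_mcar`, `mask_mar` and `impute_median`, the particular MAR rule, and the RMSE score. The column-median imputer is a simple stand-in baseline and is not the med.BFMVI algorithm, whose details the abstract does not give.

```python
import numpy as np

def mask_mcar(X, rate, rng):
    """MCAR: hide every entry independently with probability `rate`."""
    X_miss = X.astype(float).copy()
    X_miss[rng.random(X.shape) < rate] = np.nan
    return X_miss

def mask_mar(X, rate, rng, driver=0, target=1):
    """MAR (one hypothetical rule): hide values in column `target` with a
    probability that depends only on the fully observed column `driver`."""
    X_miss = X.astype(float).copy()
    above = X[:, driver] > np.median(X[:, driver])
    # Rows above the driver's median are three times as likely to lose the
    # target value; the two probabilities average out to roughly `rate`.
    p = np.where(above, 1.5 * rate, 0.5 * rate)
    X_miss[rng.random(X.shape[0]) < p, target] = np.nan
    return X_miss

def impute_median(X_miss):
    """Stand-in baseline: fill each missing entry with its column median."""
    medians = np.nanmedian(X_miss, axis=0)
    return np.where(np.isnan(X_miss), medians, X_miss)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))  # synthetic complete data (assumption)

for rate in (0.1, 0.2, 0.3, 0.4):
    X_miss = mask_mcar(X, rate, rng)
    hidden = np.isnan(X_miss)
    X_hat = impute_median(X_miss)
    # Score the imputer only on the entries that were hidden.
    rmse = np.sqrt(np.mean((X_hat[hidden] - X[hidden]) ** 2))
    print(f"MCAR rate={rate:.0%}  RMSE on hidden entries={rmse:.3f}")
```

Swapping `mask_mcar` for `mask_mar` in the loop gives the corresponding MAR experiment; in that case only the target column contains hidden entries, so the score reflects that column alone.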

Original language: English
Article number: 10256187
Pages (from-to): 102935-102943
Number of pages: 9
Journal: IEEE Access
Early online date: 20 Sep 2023
Publication status: Published - 26 Sep 2023
