Data Analytics and Machine Learning to Understand and Predict Student Performance

  • Azhan Rashid

Student thesis: Doctoral Thesis


Predicting student performance has attracted significant research interest in recent years, owing primarily to its potential benefits to both students, in terms of improving outcomes and post-graduation prospects, and educational institutions, in terms of addressing issues such as differential attainment and targeted proactive support of students at risk of lower performance. Substantial research effort has been devoted to exploring data analysis and machine learning techniques in this context. One of the main challenges is the availability of large and high-quality datasets and associated issues such as data imbalance and limited scope of data analysis. Additionally, most researchers focus on predicting performance in the form of a single predicted score, as opposed to a range of potential outcomes.
In this thesis, the aforementioned research gaps are addressed through a computational framework that predicts student performance ranges using data analysis and machine learning. The framework contains a unique combination of layers, ranging from data preprocessing to statistical analysis and predictive learning models, with each layer carefully positioned to avoid biased outcomes and thereby increase confidence in the results.
The proposed framework is validated using a rich, anonymised dataset provided by the University of Huddersfield that contains significantly more samples and relevant variables than are commonly reported in the literature. Experiments focus on predicting the performance of students based on data available at the point of enrolment. This includes students who are still completing their entrance pre-qualifications (e.g. A Levels), which allows the widest possible group of students in the dataset to be explored. Each prediction produced by the experiments is a range (grade boundaries) for the overall grade a student achieves at the end of their course.
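As a rough illustration of what predicting a range rather than a single score can involve, the sketch below maps final marks to grade bands and derives inverse-frequency class weights, one common way of countering class imbalance. It is not the thesis's implementation: the band boundaries (UK-style degree classes), the sample marks and the function names `to_band` and `balanced_class_weights` are all assumptions made for this example.

```python
from collections import Counter

# Hypothetical grade boundaries (UK-style degree classes), listed from
# highest to lowest; these are NOT taken from the thesis.
BANDS = [(70, "First"), (60, "Upper second"), (50, "Lower second"), (40, "Third")]


def to_band(mark):
    """Return the label of the first band whose lower boundary the mark reaches."""
    for lower, label in BANDS:
        if mark >= lower:
            return label
    return "Fail"


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count),
    so that rarer classes receive larger weights."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Made-up final marks, for illustration only.
marks = [72, 65, 68, 61, 55, 58, 45, 38, 63, 66]
bands = [to_band(m) for m in marks]
weights = balanced_class_weights(bands)

for label, weight in sorted(weights.items()):
    print(f"{label}: {weight:.2f}")
```

A classifier trained on the band labels could use such weights to penalise errors on under-represented bands more heavily; resampling is another common option.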
Results show an accuracy of 84%/86% (worst/common case scenario). Baseline comparison shows an improvement of 3%/5% (worst/common case scenario) compared to existing literature. In most cases, improvement is seen in both the best and the worst performing models. This robustness of the framework can be partly attributed to including means of tackling data imbalance, as well as exploring a wide range of data analysis and machine learning models.
The main contributions of this thesis and its framework are: predicting students' performance in the form of a range; integrating approaches to tackle imbalanced data; performing in-depth data analysis using a range of statistical methods; and considering both supervised and unsupervised learning algorithms. It is envisioned that the framework can be integrated into existing student performance dashboard systems, allowing academics and administrators to harness its predictive capabilities and drive decision-making that improves outcomes across the student body or supports targeted efforts, such as reducing differential attainment.
Date of Award: 31 May 2023
Original language: English
Supervisor: George Bargiannis (Main Supervisor), Jarek Bryk (Co-Supervisor) & Andrew Crampton (Co-Supervisor)
