Automatic grading of short answers is a specific task within automated assessment, in which a system evaluates and grades student responses to short-answer questions. Manual grading of short answers is time-consuming and subjective, because each response must be assessed and marked by a human, and results can be inconsistent across graders. Sustaining student learning outcomes in the rapidly changing field of technology-driven education requires efficient systems for evaluation and feedback. Auto-grading addresses these challenges by delivering faster, more consistent, and fairer grading. This thesis first presents a detailed literature review of existing methods for auto-grading short answers and identifies gaps and conflicts in this area. It then proposes and develops a new auto-grading model that combines transformers with artificial neural networks. The answer grading systems considered employ transformer-based designs such as BERT (Bidirectional Encoder Representations from Transformers), RoBERTa (Robustly Optimized BERT Pretraining Approach), and DistilBERT (Distilled BERT). A hybrid architecture based on a Long Short-Term Memory (LSTM) Recurrent Neural Network (RNN) is developed, which outperforms state-of-the-art techniques in terms of accuracy, precision, recall and F1 score. The model is compared with classical machine learning techniques such as K-Nearest Neighbor (KNN), random forest, decision tree, logistic regression, Gaussian Naïve Bayes (Gaussian NB) and Support Vector Classification (SVC), and a statistical basis is provided for the model's performance. The study seeks to establish whether this hybrid strategy, which fuses a bidirectional LSTM (Bi-LSTM) neural network with the transformer-based BERT, RoBERTa and DistilBERT models, is better than standard machine learning techniques.
A small dataset was first built to develop the model. Later, two suitably large datasets, the Mohler/Texas dataset and the Extended Mohler/Extended Texas dataset, were used to evaluate it. Once implemented, the proposed model was validated on a real-time pilot dataset and produced effective results. The algorithm achieves an accuracy of 99.86%, which outperforms all the state-of-the-art models and architectures discussed. To evaluate the system's reliability, accuracy, and efficiency, the language model ChatGPT is used to simulate human judgement. The model is also evaluated with real-time data collected from a private institution and with the FB HATE dataset, to assess the complexity of the grading algorithm. The thesis also compares cross-validation results across samples to confirm the algorithm's performance. Finally, a pilot case study at an educational institution demonstrates the efficiency, accuracy, and reliability of the system and its benefits for teachers and students.

| Field | Value |
|---|---|
| Date of Award | 2 Jul 2024 |
| Original language | English |
| Supervisor | Joan Lu (Main Supervisor), George Bargiannis (Co-Supervisor) & Qiang Xu (Co-Supervisor) |