Abstract
Educational organisations have, for some time, taken advantage of the benefits of VR to provide team-based learning for disparate groups of students. With future VR consumer devices offering the possibility of built-in facial tracking and visualisation of emotions, a number of opportunities and hurdles arise for students who have difficulty recognising the emotions of others. Many such students have, to some extent, been historically attracted to VR due to its lack of emotional interaction, yet the future may see new VR worlds (termed the metaverse) where emotion is embodied within 3D avatars, in an effort to recreate real-world social interaction more realistically. This thesis charts the design and creation of an emotional framework, EMPACT-VR, aimed at helping these students recognise emotions during conversations within VR learning environments. The framework utilises both facial and audio emotion recognition technologies, offering the possibility for emotions to be recorded for later analysis.
Until this thesis, no research has explored bi-modal emotion recognition on the 3D avatar and audio data that are already held in some existing VR and game platforms. A key problem addressed in the thesis is whether emotion recognition is possible within the sparse data constraints of a VR environment, specifically predicting emotion from the facial data of 3D avatars rather than concentrating solely on image or video data, as traditional methods do. An additional problem is whether bi-modal emotion recognition can work effectively within a VR environment, providing quick and useful predictive feedback. Given the growing popularity of multi-player online games and metaverse platforms, addressing this research problem fills a key gap in knowledge, which may benefit these platforms in the future.
A key part of this research is the design of a novel feature extractor that takes input from a webcam or video feed and provides facial data to visualise the 3D avatars within an environment. This facial data is also used to train an ML model that performs facial emotion prediction. Additionally, an audio ML model is created that is designed to remain accurate over a range of different languages and at prediction intervals as short as 1 second. Lastly, the research demonstrates a form of late fusion in which the facial and audio ML models can be deployed together or separately.
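As a rough illustration of the late-fusion idea described above, the following minimal sketch combines per-class probabilities from a facial model and an audio model by weighted averaging, and falls back to a single modality when the other is unavailable. The emotion labels, the `fuse_predictions` function and the `face_weight` parameter are hypothetical and chosen only for illustration; the abstract does not state which fusion rule or label set EMPACT-VR uses.

```python
from typing import List, Optional, Sequence, Tuple

# Hypothetical label set; the abstract does not list the emotions used.
EMOTIONS = ("angry", "happy", "neutral", "sad")


def fuse_predictions(
    face_probs: Optional[Sequence[float]],
    audio_probs: Optional[Sequence[float]],
    face_weight: float = 0.5,
) -> Tuple[str, List[float]]:
    """Late fusion by weighted averaging of per-class probabilities.

    Either modality may be None, in which case the other is used alone,
    mirroring the "together or separately" deployment described above.
    """
    if face_probs is None and audio_probs is None:
        raise ValueError("at least one modality is required")

    if face_probs is None:
        fused = list(audio_probs)
    elif audio_probs is None:
        fused = list(face_probs)
    else:
        if len(face_probs) != len(audio_probs):
            raise ValueError("modalities must share the same label set")
        # Weighted average of the two models' class probabilities.
        fused = [
            face_weight * f + (1.0 - face_weight) * a
            for f, a in zip(face_probs, audio_probs)
        ]

    if len(fused) != len(EMOTIONS):
        raise ValueError("expected one probability per emotion label")

    # Pick the label with the highest fused probability.
    best = max(range(len(fused)), key=fused.__getitem__)
    return EMOTIONS[best], fused


# Example: both modalities available, equal weighting.
label, fused = fuse_predictions([0.1, 0.6, 0.2, 0.1], [0.5, 0.3, 0.1, 0.1])
```

Because each model is run independently and only the output probabilities are combined, either model can be swapped out or disabled without retraining the other, which is the main practical appeal of late fusion over feature-level fusion.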
The experimental results show that this theoretical framework allows for bi-modal emotion recognition that is quick enough to be utilised within an existing VR environment via a visual feedback indicator. Furthermore, the results show that, even in the relatively sparse data environment that is VR, the framework can achieve promising accuracy compared to some recent academic validation studies. As more advanced 3D avatars become available in the future, with better and more precise facial movement, the pathway to further improving the framework's accuracy is clear.
A key contribution of this thesis is to show that VR, and in particular the metaverse, is an online virtual environment where emotion recognition can become a proposition that helps improve people's lives.
| Date of Award | 3 Nov 2025 |
|---|---|
| Original language | English |
| Supervisor | Minsi Chen (Main Supervisor) & Duke Gledhill (Co-Supervisor) |