Making Hansard accessible to non-expert users through data visualisation

Jeffries, L. (Speaker), Hugo Sanjurjo Gonzalez (Speaker)

Activity: Talk or presentation types: Oral presentation


One of the main outputs of the SAMUELS project (grant reference AH/L010062/1) was a corpus composed of UK House of Commons and House of Lords debates from 1803 to 2005, including a rich annotation of meaning categories produced with the Historical Thesaurus Semantic Tagger (Alexander et al. 2015; Piao et al. 2014). However, this corpus has a peculiar dual structure designed specifically for use with CQP software, such as CQPweb (Hardie 2012), and is closely tied to the corpus linguistics research field (see, for instance, Demmen, Jeffries & Walker 2018). In addition, the level of expertise required to carry out searches that include semantic features is a barrier to accessibility for non-expert users.

As a continuation of this work, this project aims to convert the corpus into an accessible resource and to enable non-academic professionals and the public to access and make use of the semantically annotated Hansard corpus by producing a user-friendly web-based front end with intuitive searches and associated visualisations. Through visualisations such as timelines, word clouds and sunbursts, users will be able to find out much more about how parliament debates. Whilst such tools have been tried out in other applications (see Sykora et al. 2015), the proposed use of visualisation is an innovation for the parliamentary record.

Technically, the former Hansard corpus was transformed into a relational data model. This transformation facilitates future integration with other front ends, makes data retrieval more efficient and offers a straightforward way to link to other databases that provide information not yet present in the Hansard data, for instance MP information from Wikidata. To achieve this, the PostgreSQL database management system is used as the back end, whereas Bootstrap v4 is used for front-end design. Most visualisations employ d3.js, and Python libraries such as gensim are used for NLP processes such as word similarity. Finally, PHP is used for retrieving, transforming and feeding data into the execution pipeline.
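
To illustrate how a relational model can feed a timeline visualisation, the following is a minimal sketch only, not the project's actual schema. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL so that it runs without a database server. The table names (speech, token), the column names, the semantic tag labels and the toy rows are all assumptions invented for the example.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE speech (
            speech_id          INTEGER PRIMARY KEY,
            house              TEXT NOT NULL,
            year               INTEGER NOT NULL,
            member_wikidata_id TEXT  -- hypothetical link to an external source
        );
        CREATE TABLE token (
            token_id     INTEGER PRIMARY KEY,
            speech_id    INTEGER NOT NULL REFERENCES speech(speech_id),
            word         TEXT NOT NULL,
            semantic_tag TEXT NOT NULL  -- placeholder labels, not real tagger codes
        );
        """
    )
    # Toy rows, invented purely for illustration.
    conn.executemany(
        "INSERT INTO speech VALUES (?, ?, ?, ?)",
        [
            (1, "Commons", 1850, "Q_PLACEHOLDER_1"),
            (2, "Lords", 1850, None),
            (3, "Commons", 1950, "Q_PLACEHOLDER_2"),
        ],
    )
    conn.executemany(
        "INSERT INTO token VALUES (?, ?, ?, ?)",
        [
            (1, 1, "war", "TAG_CONFLICT"),
            (2, 1, "peace", "TAG_PEACE"),
            (3, 2, "battle", "TAG_CONFLICT"),
            (4, 3, "war", "TAG_CONFLICT"),
            (5, 3, "treaty", "TAG_PEACE"),
        ],
    )
    return conn


def tag_counts_by_year(conn, semantic_tag):
    """Return (year, count) pairs for one semantic tag, ordered by year."""
    return conn.execute(
        """
        SELECT s.year, COUNT(*)
        FROM token AS t
        JOIN speech AS s ON s.speech_id = t.speech_id
        WHERE t.semantic_tag = ?
        GROUP BY s.year
        ORDER BY s.year
        """,
        (semantic_tag,),
    ).fetchall()


if __name__ == "__main__":
    demo = build_demo_db()
    print(tag_counts_by_year(demo, "TAG_CONFLICT"))
```

A front end could serialise the returned (year, count) pairs as JSON and pass them to a d3.js timeline. Under this assumed design, a join on an identifier column such as member_wikidata_id would be one way to attach externally sourced MP information.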

Period: 1 Jun 2019
Event title: ICAME 40: Language in Time, Time in Language
Event type: Conference
Conference number: 40
Location: Neuchâtel, Switzerland
Degree of Recognition: International