Making corpus-based searching accessible for non-expert users: the case of Hansard

  • Lesley Jeffries (Speaker)
  • Hugo Sanjurjo Gonzalez (Speaker)

Activity: Talk or presentation types › Oral presentation


The SAMUELS project (grant reference AH/L010062/1) released a corpus composed of UK House of Commons and House of Lords debates from 1803 to 2005, with grammatical and semantic annotation based on the Historical Thesaurus Semantic Tagger (Piao, Dallachy, Baron, Rayson & Alexander, 2014; Alexander, Dallachy, Piao, Baron & Rayson, 2015). The corpus and its interface are designed for a specialist audience in corpus linguistics, so a degree of expertise is required to carry out complex queries that include semantic annotation. Parliamentary debates are also available on the official UK Parliament site, which allows users to search for specific debates and members of parliament but does not support searches based on linguistic parameters. Such searches could reveal patterns of debate of interest to users who lack the expertise needed to search the SAMUELS corpus.

One of the main goals of the Hansard at Huddersfield project is to bring corpus linguistic methods to the general public in an easy and straightforward manner. We therefore aim to overcome the weaknesses, from the general end-user's point of view, of the two sites mentioned above: the expertise required by the Hansard Corpus and the lack of corpus linguistic methods on the official Hansard website. To achieve this, we first carried out a series of consultations with potential users, such as politicians, journalists and historians, to establish the set of searches and functionalities they want as an alternative to the current Hansard website. To make corpus linguistic methods easy to use and understand, we use visualisations that present the results of common corpus linguistic functions, such as keywords, frequency lists and collocations, in a visually appealing way.

From a technical perspective, we have developed a new, accessible, web-based front end that enables the general public to access and use the corpus, including its semantic annotation. We also transformed the Hansard corpus into a relational data model to facilitate future development, compatibility with other front ends, and the linking of new data with external databases. For visualisations, we employ D3.js, a JavaScript library for manipulating documents based on data using common Web technologies.
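
As a rough illustration of what a relational representation of an annotated corpus makes possible, the sketch below stores tokens with their grammatical and semantic tags and counts occurrences of a semantic tag per year. It is a minimal sketch under stated assumptions: the table names, column names, tag labels (`CONFLICT`, `PEACE`) and sample rows are all hypothetical and are not taken from the project's actual data model or from the Historical Thesaurus Semantic Tagger tag set.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative assumptions,
# not the project's actual relational model.
SCHEMA = """
CREATE TABLE speech (
    id      INTEGER PRIMARY KEY,
    house   TEXT NOT NULL,      -- 'Commons' or 'Lords'
    year    INTEGER NOT NULL,
    speaker TEXT NOT NULL
);
CREATE TABLE token (
    id           INTEGER PRIMARY KEY,
    speech_id    INTEGER NOT NULL REFERENCES speech(id),
    position     INTEGER NOT NULL,  -- index of the token within the speech
    word         TEXT NOT NULL,
    pos_tag      TEXT,              -- grammatical annotation
    semantic_tag TEXT               -- semantic annotation (made-up labels here)
);
"""


def build_demo_db():
    """Create an in-memory database with a few invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO speech (id, house, year, speaker) VALUES (?, ?, ?, ?)",
        [
            (1, "Commons", 1803, "Speaker A"),
            (2, "Lords", 2005, "Speaker B"),
        ],
    )
    conn.executemany(
        "INSERT INTO token (speech_id, position, word, pos_tag, semantic_tag) "
        "VALUES (?, ?, ?, ?, ?)",
        [
            (1, 0, "war", "NN", "CONFLICT"),
            (1, 1, "and", "CC", None),
            (1, 2, "peace", "NN", "PEACE"),
            (2, 0, "battle", "NN", "CONFLICT"),
            (2, 1, "war", "NN", "CONFLICT"),
        ],
    )
    conn.commit()
    return conn


def tag_frequency_by_year(conn, semantic_tag):
    """Return (year, count) pairs for tokens carrying the given semantic tag."""
    cursor = conn.execute(
        """
        SELECT s.year, COUNT(*)
        FROM token AS t
        JOIN speech AS s ON s.id = t.speech_id
        WHERE t.semantic_tag = ?
        GROUP BY s.year
        ORDER BY s.year
        """,
        (semantic_tag,),
    )
    return cursor.fetchall()


if __name__ == "__main__":
    demo = build_demo_db()
    for year, count in tag_frequency_by_year(demo, "CONFLICT"):
        print(year, count)
```

A result of this shape, one count per year, is the kind of table a front end could pass to a charting library to draw a frequency timeline. Keeping the annotation in ordinary columns also means that other front ends or external databases could join against the same tables.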

Period: 15 May 2019
Event title: XI International Conference on Corpus Linguistics: Corpus Approaches to Discourse Analysis
Event type: Conference
Conference number: 11
Location: Valencia, Spain
Degree of Recognition: International