Semantic Annotation and Mark Up for Enhancing Lexical Searches

  • Jeffries, Lesley (CoI)
  • Alexander, Marc Gabriel (PI)
  • Baron, Alistair (CoI)
  • Archer, Dawn (CoI)
  • Hope, Jonathan (CoI)
  • Rayson, Paul Edward (CoI)
  • Kay, Christian Janet (CoI)

Project: Research, Research Council - AHRC


As humanities datasets get ever larger, researchers have a pressing need for more sophisticated techniques of analysis. The most significant issue in big data research into textual datasets is that our primary methodology for searching, aggregating and analysing them relies not on concepts or meanings, but rather on word forms. These forms are imperfect and evasive proxies for the meanings they refer to, and with 60% of word forms in English referring to more than one meaning, and some word forms referring to close to two hundred meanings, the irrelevant "noise" which appears when searching using word forms grows with the size of the texts being searched.
In big data contexts, this problem cripples research, making any sort of detailed analysis entirely intractable and requiring impossible amounts of manual intervention. In this project, we will deliver a system for automatically annotating words in texts with their precise meanings, enabling a step-change in the way we deal with large textual data. The system is based around the unparalleled Historical Thesaurus of English, which contains 797,000 words from across the history of English arranged into 236,000 hierarchical categories of meanings alongside each word's dates of known use. The annotation software will take a text and provide for each word it contains an XML annotation giving the word meaning's Historical Thesaurus category code. The system will automatically disambiguate word meanings using a range of state-of-the-art computational techniques alongside new context-dependent methods unlocked by the Thesaurus's dating codes and its uniquely detailed and fine-grained hierarchical structure.
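As a rough illustration of the annotation step described above, the following Python sketch tags each word with candidate category codes and uses dates of known use to discard senses not attested at the time the text was written. It is a minimal sketch under stated assumptions, not the project's software: the toy lexicon, the category codes, the dates, the XML element and attribute names (text, w, cat, year) and the function names are all hypothetical. Date filtering only narrows the candidates; choosing one precise meaning would need the further context-dependent methods mentioned above.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy lexicon: word form -> candidate senses.
# Codes, glosses and dates are invented for illustration only;
# they are NOT real Historical Thesaurus data.
TOY_LEXICON = {
    "union": [
        {"code": "03.01.02", "gloss": "trade union", "first": 1830, "last": None},
        {"code": "01.05.07", "gloss": "act of joining", "first": 1400, "last": None},
    ],
}


def candidate_senses(word, text_year):
    """Return the senses of `word` whose dates of use include `text_year`."""
    senses = TOY_LEXICON.get(word.lower(), [])
    return [
        s for s in senses
        if s["first"] <= text_year and (s["last"] is None or text_year <= s["last"])
    ]


def annotate(tokens, text_year):
    """Wrap each token in a <w> element carrying its candidate category codes."""
    root = ET.Element("text", {"year": str(text_year)})
    for token in tokens:
        w = ET.SubElement(root, "w")
        w.text = token
        senses = candidate_senses(token, text_year)
        if senses:
            # Several codes remain when the date alone cannot disambiguate.
            w.set("cat", " ".join(s["code"] for s in senses))
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(annotate(["the", "union"], 1650))
    print(annotate(["the", "union"], 1900))
```

In this toy data, a text dated 1650 keeps only the older sense of "union", while a text dated 1900 keeps both candidates and would still need contextual disambiguation.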
Textual data tagged in this way can then be accurately searched and precisely investigated, with any results also able to be aggregated at a range of levels of precision, without the need for manual intervention. A major part of the project is also the development of new techniques for working with semantically-aggregated and disambiguated data. Project partners will conduct research on resources including the Hansard Corpus, consisting of over 2.3 billion words of text, the Oxford English Corpus, the world's largest stratified corpus of modern English, and the EEBO-TCP corpus of 40,000 early modern books. As part of our work on changing the nature of how we deal with data on this scale, we will mine these text collections for frequently-occurring or statistically unusual concepts, will take advantage of our ability to search large datasets for terms realised by ambiguous word forms (such as "union" in the particular context of industrial relations rather than any of the other 33 possible meanings of this word), and will examine the data as a whole from a distant-reading perspective in order to look for striking or significant patterns of meaning changes across time.
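Aggregation "at a range of levels of precision" can be pictured as truncating hierarchical category codes to a chosen depth and counting the results. The Python sketch below assumes, purely for illustration, that codes are dot-separated strings in which each extra component names a more specific category; the code values and function names are hypothetical and do not reproduce the Thesaurus's actual coding scheme.

```python
from collections import Counter


def truncate_code(code, depth):
    """Keep only the first `depth` components of a dot-separated code."""
    return ".".join(code.split(".")[:depth])


def aggregate(codes, depth):
    """Count tagged words per category at the requested level of the hierarchy."""
    return Counter(truncate_code(code, depth) for code in codes)


if __name__ == "__main__":
    # Hypothetical category codes taken from an already-tagged text.
    tagged = ["03.01.02", "03.01.05", "03.04.01", "01.05.07"]
    print(aggregate(tagged, 1))  # coarse: top-level categories only
    print(aggregate(tagged, 2))  # finer: two levels of the hierarchy
```

The same tagged data can therefore be summarised coarsely or finely without re-annotating the text, simply by changing the depth.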
These research projects based on tagged data will also drive the development of our tools for using this data, with teams of researchers across the UK and abroad providing a range of different demands on the data, ensuring a variety of needs and use-cases are catered for in the development of the project. In this way, we are committed to producing a set of compelling, fruitful, and practical research outcomes using semantically-tagged data during the lifetime of the project, in order to demonstrate the value of our approach and to help ensure the work of the project is as widely utilised and exploited as possible.
By doing all of this, we will enable new and transformative techniques of exploring, searching and investigating large-scale cultural, literary, historical and linguistic phenomena in big humanities datasets; through this project, it will be possible to place meaning - rather than word forms - at the heart of digital humanities research into text.
Effective start/end date: 1/01/14 - 31/03/15