Creating a Dataset for Domain Bilingual Semantic Annotation based on the USAS Framework

  • Hugo Sanjurjo Gonzalez (Speaker)
  • Rosa Rabadán (Speaker)
  • César Gutiérrez-Pérez (Speaker)

Activity: Talk or presentation types › Oral presentation


Semantic annotation is crucial for many linguistic and NLP tasks, including information extraction (Acosta 2011), text mining (Rayson 2010), language learning (Brooke et al. 2015), lexicography and lexicology (Torner and Bernal 2017), and the design of controlled natural languages (Davis et al. 2009). This paper reports on the construction of a dataset for semi-supervised semantic annotation in English and Spanish in the domain ‘Food and Drink’ (USAS F1 and F2). The aim is twofold: to create an error-free Food and Drink lexicon that complements Spanish USAS (Jiménez-Yáñez et al. 2017), and to build a corpus that will serve as the training set for a future model based on fastText (Bojanowski et al. 2017).

Starting from a small comparable corpus (842,516 words) comprising six subcorpora corresponding to different genres in each of the languages, we extracted frequency lists, which were screened manually for domain-relevant terms. These terms were manually annotated following the USAS model and revised jointly by all participants to minimize possible inter-coder discrepancies. The resulting initial master lexicon contained nearly 2,000 entries, comprising both single items and multiword expressions (MWEs). MWEs were a challenge, as neither linguistic nor standard NLP approaches (Sag et al. 2002; Baldwin and Kim 2010; Ramisch 2015; Monti et al. 2018) were adequate to account for cross-linguistic MWE semantic patterns. Unlike recent approaches, which focus on i) morphologically defined taxonomies (Escartín et al. 2018), ii) exclusively multiword translation equivalents (Clematide et al. 2018), or iii) retrieval and translation systems (Mendoza et al. 2018), we decided to extract recurrent, domain-productive semantic patterns (e.g. En F1/L3 + F2 orange juice / Sp F2 + F1/L3 zumo de naranja), including those typically associated with culture-bound, opaque meanings (e.g. Z2 + F1 Black forest gateau / F1 + Z2 pastel Selva Negra; Z5 + Z2 a la gallega / Z2 + F1 Galician octopus; Z5 + Z5 + O2 a la cazuela, al horno / F1 casserole; F1/O4.6 roast, etc.).
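The frequency-list step can be illustrated with a minimal Python sketch. The abstract does not specify the tools used, so the tokenization rule (lowercased runs of word characters) and the function name `frequency_list` are assumptions made purely for illustration; the manual screening for domain relevance is not modelled.

```python
import re
from collections import Counter


def frequency_list(texts):
    """Return (token, count) pairs sorted by descending frequency.

    Hypothetical helper: tokens are lowercased runs of word characters,
    which is an assumption, not the tokenization used by the authors.
    """
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"\w+", text.lower()))
    return counts.most_common()


# Toy subcorpus standing in for one genre of the comparable corpus.
sample = ["Zumo de naranja", "zumo de manzana"]
for token, count in frequency_list(sample):
    print(token, count)
```

Candidates from such a list would then be screened and annotated by hand, as described above.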

The comparable corpus has also been used to train a simple word-vector model, specifically a fastText model. This model, together with a custom algorithm, has helped us predict the annotation of words that are not contained in the master lexicon. Even with such a small training set (842,516 words), successful detection rose from 52.63% with the original USAS Spanish lexicon alone, to 70.76% with the USAS Spanish lexicon plus our master lexicon, and to an encouraging 81% with our algorithm. These percentages take into account only domain-specific (food and drink) terms.
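The abstract does not describe the custom algorithm, so the following Python sketch shows only one plausible way of using word vectors to propose tags for out-of-lexicon words: a majority vote over the nearest annotated neighbours by cosine similarity. The function names, the toy three-dimensional vectors, and the choice of k = 3 are all assumptions; in practice the vectors would come from the trained fastText model.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def guess_tag(word, vectors, lexicon, k=3):
    """Propose a semantic tag for `word` (hypothetical helper).

    Words already in the lexicon keep their tag. Otherwise the tag is
    the majority tag among the k most similar lexicon words.
    """
    if word in lexicon:
        return lexicon[word]
    if word not in vectors:
        return None
    target = vectors[word]
    nearest = sorted(
        ((cosine(target, vectors[w]), w) for w in lexicon if w in vectors),
        reverse=True,
    )[:k]
    if not nearest:
        return None
    votes = Counter(lexicon[w] for _, w in nearest)
    return votes.most_common(1)[0][0]


# Toy data: invented vectors, not output of a real fastText model.
TOY_VECTORS = {
    "zumo": [0.90, 0.10, 0.00],
    "vino": [0.80, 0.20, 0.10],
    "agua": [0.85, 0.15, 0.05],
    "pan": [0.10, 0.90, 0.00],
    "queso": [0.20, 0.80, 0.10],
    "pastel": [0.15, 0.85, 0.05],
    "cerveza": [0.88, 0.12, 0.02],  # not in the lexicon
}
TOY_LEXICON = {
    "zumo": "F2", "vino": "F2", "agua": "F2",
    "pan": "F1", "queso": "F1", "pastel": "F1",
}

print(guess_tag("cerveza", TOY_VECTORS, TOY_LEXICON))
```

A voting scheme of this kind is cheap to run over a whole frequency list, and its proposals can then be checked by hand before being added to the master lexicon.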

Adjustments to the algorithm and the model, together with additions to the master lexicon, will hopefully improve our results, making the creation of new multilayer corpora, or the enlargement of existing ones, easier and quicker. Additional goals and applications are also put forward. The procedure can be replicated for other specific domains.
Period: 16 May 2019
Event title: XI International Conference on Corpus Linguistics: Corpus Approaches to Discourse Analysis
Event type: Conference
Conference number: 11
Location: Valencia, Spain
Degree of Recognition: International