Building a Spanish Lexicon for Corpus Analysis

  • Ricardo Jiménez-Yáñez (Speaker)
  • Hugo Sanjurjo Gonzalez (Speaker)
  • Paul Rayson (Speaker)
  • Scott Piao (Speaker)

    Activity: Talk or presentation types > Poster presentation

    Description

    The last two decades have seen the development of various semantic lexical resources such as WordNet and the USAS semantic lexicon (Rayson et al., 2004), which have played an important role in natural language processing and corpus-based studies. Semantic tagging has applications in areas such as metaphor analysis and critical discourse analysis. However, most of this research has focused on English. Developing an extended Spanish lexicon will allow Spanish to be studied more accurately and will improve the results of natural language processing and corpus-based studies of the language.

    In this paper, we report on the construction of a Spanish semantic lexicon (Piao et al., 2016), which employs the unified Lancaster semantic taxonomy and provides a lexical knowledge base for the automatic UCREL semantic annotation system (USAS).

    Initially, the Spanish tagger had only a semantic lexicon of 2,005 single-word entries, put together by translating entries from the English semantic lexicon using a Spanish-English dictionary compiled by Mark Davies (Davies, 2006). Because this lexicon was generated automatically, checking whether the English entries have been transferred correctly to the Spanish lexicon is a very labour-intensive exercise. We faced several issues, which we discuss in this paper.
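    The dictionary-based translation step described above can be sketched roughly as follows. This is only a minimal illustration, not the authors' actual procedure: the function name `project_lexicon`, the toy word lists, and the tag values are all assumptions made for the example and are not taken from the real USAS lexicon or from the Davies dictionary.

```python
from collections import defaultdict

# Toy English semantic lexicon: word -> semantic tags.
# Illustrative values only, not copied from the real USAS lexicon.
english_lexicon = {
    "church": ["S9", "H1"],
    "house": ["H1"],
    "prayer": ["S9"],
}

# Toy bilingual dictionary: English word -> Spanish translations (hypothetical).
en_es_dictionary = {
    "church": ["iglesia"],
    "house": ["casa", "hogar"],
    "prayer": ["oración", "rezo"],
}


def project_lexicon(source_lexicon, bilingual_dict):
    """Copy each source word's tags onto every one of its translations."""
    target = defaultdict(list)
    for word, tags in source_lexicon.items():
        # Words without a dictionary entry are silently skipped.
        for translation in bilingual_dict.get(word, []):
            for tag in tags:
                # Keep tag order, avoid duplicates when several
                # source words map to the same translation.
                if tag not in target[translation]:
                    target[translation].append(tag)
    return dict(target)


spanish_lexicon = project_lexicon(english_lexicon, en_es_dictionary)
```

    A projection of this kind copies tags blindly, so a translation inherits every sense of its source word whether or not it applies in Spanish, which is one reason the output needs manual checking.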

    The Spanish lexicon was created from two main sources: the 1,000 most frequent words in the CORDE corpus and the 660 most frequent words of religious scope retrieved from Jiménez (2017). In addition, the words from the field of religion were manually matched and contrasted with keywords from Casares (1989, xxxvi-xxxvii). The resulting word list was analysed grammatically and, after filtering, the words were semantically tagged by hand.

    A fully developed Spanish lexicon can be employed in several tasks related to applied linguistics, for instance to analyse and categorise the religious stance of an election manifesto or, more generally, to carry out sentiment analysis or obtain information about writing style. We will give examples in the paper. The lexicon is available for academic use from http://ucrel.lancs.ac.uk/usas/.
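
    As a rough illustration of how such a lexicon might be applied to a text, the sketch below looks up each token and counts how often each semantic tag occurs. It is a simplified assumption-laden example, not the USAS tagger: the function `tag_frequencies`, the three-entry toy lexicon, and the use of a single fallback tag for unknown words are all hypothetical, and the real system also handles disambiguation and multi-word expressions.

```python
import re
from collections import Counter

# Toy Spanish lexicon: word -> semantic tags (illustrative values only).
toy_lexicon = {
    "iglesia": ["S9"],
    "oración": ["S9"],
    "casa": ["H1"],
}


def tag_frequencies(text, lexicon, unknown_tag="Z99"):
    """Count semantic tags in a text using the first tag listed for each word.

    Tokens not found in the lexicon are counted under `unknown_tag`
    (an assumed placeholder for unmatched words).
    """
    # \w matches accented letters in Python 3 string patterns.
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter()
    for token in tokens:
        tags = lexicon.get(token)
        counts[tags[0] if tags else unknown_tag] += 1
    return counts


freqs = tag_frequencies("La oración en la iglesia y en la casa", toy_lexicon)
```

    Comparing such tag counts across documents is one simple way to see which semantic fields, such as religion, are over-represented in a given text.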
    Period: 4 May 2017 to 6 May 2017
    Event title: International Conference of the Spanish Association of Applied Linguistics
    Event type: Conference
    Location: Andalucia, Spain