Creación de un framework para el tratamiento de corpus lingüísticos

Translated title of the contribution: Development of a framework for corpus linguistic analysis

H. Sanjurjo-González

Research output: Book/Report (Book)


Despite the unquestionable advances in corpus linguistics software in recent
times, including support for larger corpora and more complex statistics, it still fails to take into account the usability needs and profile of its end users. This is more evident when the working language is not English, or is a combination of languages, since their typology and linguistic idiosyncrasies affect software requirements and make resource availability less reliable, in both quantity and quality.

A review of the state of the art shows that creating monolingual,
bi/multilingual parallel and comparable corpora, as well as incorporating
linguistic annotation layers, in current frameworks requires programming skills from the user. These include knowing how to run computer programs with limited usability and/or how to write custom scripts to adapt the corpus to the specifications of a particular corpus analysis tool. Otherwise, technical advice from staff with programming and NLP skills is necessary.

The objective of this study is, therefore, the development of a framework, the
ACTRES Corpus Manager, which allows users to create their own corpora (monolingual, bi/multilingual parallel and comparable) with several annotation
layers (grammatical, semantic and rhetorical), make linguistic queries and obtain
the most common statistics without technical assistance during the process and
regardless of the technical skills of the user.

ACTRES Corpus Manager has been designed as a web-accessible framework
composed of several interconnected components. Each corpus-creation task is
assigned to one component, a strategy which enables easy reuse and modification. ACTRES Corpus Manager combines existing software whose efficiency and validity are well known (e.g. the IMS Corpus Workbench, TreeTagger, hunalign) with additional custom-built software for those processes that our state-of-the-art analysis has shown to be immature and/or difficult to integrate (e.g. rhetorical tagger, semantic tagger).
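The one-task-per-component strategy can be illustrated with a minimal sketch. It is not the ACTRES Corpus Manager implementation: the `Pipeline` class, the dictionary-based document model and both example components are hypothetical, and the tagger is a placeholder rather than a call to any real external tool.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical document model: a plain dict passed from component to component.
Document = Dict[str, object]
Component = Callable[[Document], Document]


@dataclass
class Pipeline:
    """Runs independent components in order; each handles one task."""
    components: List[Component] = field(default_factory=list)

    def add(self, component: Component) -> "Pipeline":
        self.components.append(component)
        return self  # allow chained calls

    def run(self, doc: Document) -> Document:
        for component in self.components:
            doc = component(doc)
        return doc


def tokenize(doc: Document) -> Document:
    # Naive whitespace tokenizer standing in for a real tokenizer.
    return {**doc, "tokens": str(doc["text"]).split()}


def tag_placeholder(doc: Document) -> Document:
    # Placeholder for an external tagger: labels every token "UNK".
    return {**doc, "pos": ["UNK" for _ in doc["tokens"]]}


pipeline = Pipeline().add(tokenize).add(tag_placeholder)
result = pipeline.run({"text": "El corpus es grande"})
```

Because each component only reads and extends the shared document, one step (for instance the placeholder tagger) could be swapped for another without touching the rest of the chain.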

Finally, the ACTRES Corpus Manager interface is based on P-ACTRES 2.0. It allows the user to build complex queries with assistance, using regular expressions, and to obtain the most common corpus statistics without expert knowledge of the corpus query language syntax.
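As a rough illustration of what assisted querying might involve, the sketch below turns form-style input into a CQP-style token query string. It is an assumption-laden example, not the P-ACTRES 2.0 interface: the function names are hypothetical, and the attribute names `lemma` and `pos` depend on how a given corpus was encoded.

```python
from typing import Dict, List


def token_pattern(constraints: Dict[str, str]) -> str:
    """Build one token expression, e.g. [lemma="make" & pos="V.*"]."""
    if not constraints:
        return "[]"  # matches any single token
    parts = []
    for attr, value in constraints.items():
        if '"' in value:
            # Keep the sketch simple: refuse values that would need escaping.
            raise ValueError(f"double quote not allowed in value: {value!r}")
        parts.append(f'{attr}="{value}"')
    return "[" + " & ".join(parts) + "]"


def build_query(tokens: List[Dict[str, str]]) -> str:
    """Join token expressions into a sequence query."""
    return " ".join(token_pattern(t) for t in tokens)


# A form with three slots: a lemma, any token, then a noun by regex on the tag.
query = build_query([{"lemma": "make"}, {}, {"pos": "N.*"}])
```

Here the user supplies only field values, and the bracketed syntax is generated for them. A real interface would also need to validate the regular expressions and add the terminating punctuation the query processor expects.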
Original language: Spanish
Publisher: Área de Publicaciones - Universidad de León
Number of pages: 116
ISBN (Print): 9788497739405
Publication status: Published - Dec 2018
Externally published: Yes

