Creación de un framework para el tratamiento de corpus lingüísticos

Translated title of the contribution: Development of a framework for corpus linguistic analysis

Research output: Book/Report › Book

Abstract

Despite the unquestionable advances in corpus linguistics software in recent times, including support for larger corpora and more complex statistics, it still fails to take into account the usability and profile of its end users. This is more evident when the working language is not English or when a combination of languages is involved, as their typology and linguistic idiosyncrasies affect software requirements and make resource availability less reliable, both in quantity and quality.

A review of the state of the art shows that, in current frameworks, creating monolingual, bi/multilingual parallel and comparable corpora and incorporating linguistic annotation layers require programming skills from the user. These skills include knowing how to run computer programs with limited usability and/or how to write custom scripts that adapt the corpus to the specifications of a particular corpus analysis program. Users who lack these skills need technical advice from staff with programming and NLP expertise.

The objective of this study is, therefore, the development of a framework, the
ACTRES Corpus Manager, which allows users to create their own corpora (monolingual, bi/multilingual parallel and comparable) with several annotation
layers (grammatical, semantic and rhetorical), make linguistic queries and obtain
the most common statistics without technical assistance during the process and
regardless of the technical skills of the user.

ACTRES Corpus Manager has been designed as a web-accessible framework composed of several interconnected components. Each corpus-creation task is assigned to one component, a strategy that enables easy reuse and modification. ACTRES Corpus Manager combines existing software whose efficiency and validity are well known (e.g. the IMS Corpus Workbench, TreeTagger and hunalign) with additional custom-built software for those processes that our review of the state of the art has shown to be immature and/or difficult to integrate (e.g. the rhetorical tagger and the semantic tagger).

Finally, the ACTRES Corpus Manager interface is based on P-ACTRES 2.0. It allows the user to build complex queries with assistance, using regular expressions, and to obtain the most common corpus statistics without expert knowledge of the corpus query language syntax.
Original language: Spanish
Publisher: Área de Publicaciones - Universidad de León
Number of pages: 116
ISBN (Print): 9788497739405
Publication status: Published - Dec 2018
Externally published: Yes

Fingerprint

Linguistics
Managers
Statistics
Semantics
Query languages
Computer programming
Computer program listings
Availability
Specifications

Cite this

Sanjurjo-González, H. / Creación de un framework para el tratamiento de corpus lingüísticos. Área de Publicaciones - Universidad de León, 2018. 116 p.
@book{08a08557ba00485ab8d2920eaafb22ce,
title = "Creaci{\'o}n de un framework para el tratamiento de corpus ling{\"u}{\'i}sticos",
keywords = "Corpus linguistics, Computational linguistics, Framework",
author = "H. Sanjurjo-Gonz{\'a}lez",
year = "2018",
month = "12",
language = "Spanish",
isbn = "9788497739405",
publisher = "{\'A}rea de Publicaciones - Universidad de Le{\'o}n",

}

TY - BOOK

T1 - Creación de un framework para el tratamiento de corpus lingüísticos

AU - Sanjurjo-González, H.

PY - 2018/12

Y1 - 2018/12

KW - Corpus linguistics

KW - Computational linguistics

KW - Framework

UR - https://dialnet.unirioja.es/servlet/libro?codigo=727065

M3 - Book

SN - 9788497739405

BT - Creación de un framework para el tratamiento de corpus lingüísticos

PB - Área de Publicaciones - Universidad de León

ER -
