Abstract
Despite the unquestionable advances in corpus linguistics software in recent times, including support for larger corpora and more complex statistics, it still fails to take into account the usability needs and profile of its end users. This is more evident when the working language is not English, or when a combination of languages is involved, as their typology and linguistic idiosyncrasies affect software requirements and make resource availability less reliable, in both quantity and quality.
A review of the state of the art shows that creating monolingual, bi/multilingual parallel and comparable corpora, as well as incorporating linguistic annotation layers, requires programming skills from the user in current frameworks. These include knowing how to run computer programs with limited usability and/or custom scripts in order to adapt the corpus to the specifications of a particular corpus analysis tool. Otherwise, technical advice from staff with programming and NLP skills is necessary.
The objective of this study is, therefore, the development of a framework, the
ACTRES Corpus Manager, which allows users to create their own corpora (monolingual, bi/multilingual parallel and comparable) with several annotation
layers (grammatical, semantic and rhetorical), make linguistic queries and obtain
the most common statistics without technical assistance during the process and
regardless of the technical skills of the user.
ACTRES Corpus Manager has been designed as a web-accessible framework
composed of several interconnected components. Each corpus-creating task is
assigned to one component, a strategy which enables easy reuse and modification. ACTRES Corpus Manager combines existing software whose efficiency and validity are well known (e.g. the IMS Corpus Workbench, TreeTagger and hunalign) with custom-built software for those processes that our state-of-the-art analysis has shown to be immature and/or difficult to integrate (e.g. the rhetorical tagger and the semantic tagger).
Finally, the ACTRES Corpus Manager interface is based on P-ACTRES 2.0. It allows the user to build complex queries with assistance, using regular expressions, and to obtain the most common corpus statistics without expert knowledge of the corpus query language syntax.
Translated title of the contribution | Development of a framework for corpus linguistic analysis |
---|---|
Original language | Spanish |
Publisher | Área de Publicaciones - Universidad de León |
Number of pages | 116 |
ISBN (Print) | 9788497739405 |
Publication status | Published - Dec 2018 |
Externally published | Yes |