An efficient framework of utilizing the latent semantic analysis in text extraction

Ahmad Hussein Ababneh, Joan Lu, Qiang Xu

Research output: Contribution to journal › Article

Abstract

Applying latent semantic analysis (LSA) to text mining demands considerable space and time. This paper proposes a new text extraction method that sets out a framework for employing statistical semantic analysis efficiently in text extraction. The method uses the centrality feature and omits text segments that have a high verbatim, statistical, or semantic similarity to previously processed segments. Similarity is identified by a new multi-layer method that computes similarity in three statistical layers: the first layer uses the Jaccard similarity, the second uses the vector space model, and the third uses LSA. The multi-layer method restricts the third layer to segments whose similarity the first and second layers failed to estimate. The ROUGE tool is used in the evaluation, but because ROUGE does not consider the extract’s size, we supplemented it with a new evaluation strategy based on the compression rate and the ratio of sentence intersections between the automatic and reference extracts. Our comparisons with classical LSA and traditional statistical extraction showed that we reduced the use of the LSA procedure by 52% and obtained a 65% reduction in the original matrix dimensions, and we also obtained remarkable accuracy results. We conclude that combining the centrality feature with the proposed multi-layer framework yields a significant solution in terms of efficiency and accuracy in the field of text extraction.
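
The layered similarity check described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the tokenizer, the threshold values, the function names, and the rule "stop at the first layer that reports high similarity" are all assumptions made for illustration. The third layer is passed in as a hypothetical callable `lsa_similarity`, because the abstract does not say how the LSA space is built.

```python
import math
import re
from collections import Counter
from typing import Callable, List


def tokenize(text: str) -> List[str]:
    """Lowercase word tokenizer (an assumption; the paper's preprocessing is not given)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def jaccard(a: List[str], b: List[str]) -> float:
    """Layer 1: Jaccard similarity of the two token sets."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def cosine(a: List[str], b: List[str]) -> float:
    """Layer 2: cosine similarity of raw term-frequency vectors (vector space model)."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def similar_layer(
    a: str,
    b: str,
    lsa_similarity: Callable[[str, str], float],  # hypothetical layer-3 scorer
    t_jaccard: float = 0.7,  # illustrative thresholds, not taken from the paper
    t_cosine: float = 0.7,
    t_lsa: float = 0.7,
) -> int:
    """Return the layer (1, 2, or 3) that flags the segments as similar, or 0 if none does.

    The LSA scorer is called only when the two cheaper layers fail to flag the pair.
    """
    ta, tb = tokenize(a), tokenize(b)
    if jaccard(ta, tb) >= t_jaccard:
        return 1
    if cosine(ta, tb) >= t_cosine:
        return 2
    if lsa_similarity(a, b) >= t_lsa:
        return 3
    return 0


if __name__ == "__main__":
    # Placeholder scorer standing in for a real LSA model.
    def no_lsa(_a: str, _b: str) -> float:
        return 0.0

    print(similar_layer("the cat sat on the mat", "the cat sat on the mat", no_lsa))
    print(similar_layer("data data data data model", "data data data data graph", no_lsa))
```

Because the expensive scorer sits behind the two cheap checks, any pair resolved in the first or second layer never reaches LSA. That is the kind of saving the abstract reports, although the 52% figure comes from the authors' own experiments and not from this sketch.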

Language: English
Pages: 785-815
Number of pages: 31
Journal: International Journal of Speech Technology
Volume: 22
Issue number: 3
Early online date: 21 Aug 2019
DOI: 10.1007/s10772-019-09623-8
Publication status: Published - 1 Sep 2019

Fingerprint

Semantics
analysis procedure
Vector spaces
evaluation
Layer
Latent Semantic Analysis
efficiency

Cite this

@article{56391b99397a426d81846d49aecfc263,
title = "An efficient framework of utilizing the latent semantic analysis in text extraction",
abstract = "Applying latent semantic analysis (LSA) to text mining demands considerable space and time. This paper proposes a new text extraction method that sets out a framework for employing statistical semantic analysis efficiently in text extraction. The method uses the centrality feature and omits text segments that have a high verbatim, statistical, or semantic similarity to previously processed segments. Similarity is identified by a new multi-layer method that computes similarity in three statistical layers: the first layer uses the Jaccard similarity, the second uses the vector space model, and the third uses LSA. The multi-layer method restricts the third layer to segments whose similarity the first and second layers failed to estimate. The ROUGE tool is used in the evaluation, but because ROUGE does not consider the extract’s size, we supplemented it with a new evaluation strategy based on the compression rate and the ratio of sentence intersections between the automatic and reference extracts. Our comparisons with classical LSA and traditional statistical extraction showed that we reduced the use of the LSA procedure by 52{\%} and obtained a 65{\%} reduction in the original matrix dimensions, and we also obtained remarkable accuracy results. We conclude that combining the centrality feature with the proposed multi-layer framework yields a significant solution in terms of efficiency and accuracy in the field of text extraction.",
keywords = "Automatic text extraction, Latent semantic analysis, Multi-layer similarity, Vector space model",
author = "Ababneh, Ahmad Hussein and Lu, Joan and Xu, Qiang",
year = "2019",
month = "9",
day = "1",
doi = "10.1007/s10772-019-09623-8",
language = "English",
volume = "22",
pages = "785--815",
journal = "International Journal of Speech Technology",
issn = "1381-2416",
publisher = "Springer Netherlands",
number = "3"
}

An efficient framework of utilizing the latent semantic analysis in text extraction. / Ababneh, Ahmad Hussein; Lu, Joan; Xu, Qiang.

In: International Journal of Speech Technology, Vol. 22, No. 3, 01.09.2019, p. 785-815.

TY - JOUR

T1 - An efficient framework of utilizing the latent semantic analysis in text extraction

AU - Ababneh, Ahmad Hussein

AU - Lu, Joan

AU - Xu, Qiang

PY - 2019/9/1

Y1 - 2019/9/1

N2 - Applying latent semantic analysis (LSA) to text mining demands considerable space and time. This paper proposes a new text extraction method that sets out a framework for employing statistical semantic analysis efficiently in text extraction. The method uses the centrality feature and omits text segments that have a high verbatim, statistical, or semantic similarity to previously processed segments. Similarity is identified by a new multi-layer method that computes similarity in three statistical layers: the first layer uses the Jaccard similarity, the second uses the vector space model, and the third uses LSA. The multi-layer method restricts the third layer to segments whose similarity the first and second layers failed to estimate. The ROUGE tool is used in the evaluation, but because ROUGE does not consider the extract’s size, we supplemented it with a new evaluation strategy based on the compression rate and the ratio of sentence intersections between the automatic and reference extracts. Our comparisons with classical LSA and traditional statistical extraction showed that we reduced the use of the LSA procedure by 52% and obtained a 65% reduction in the original matrix dimensions, and we also obtained remarkable accuracy results. We conclude that combining the centrality feature with the proposed multi-layer framework yields a significant solution in terms of efficiency and accuracy in the field of text extraction.

KW - Automatic text extraction

KW - Latent semantic analysis

KW - Multi-layer similarity

KW - Vector space model

UR - http://www.scopus.com/inward/record.url?scp=85071299019&partnerID=8YFLogxK

U2 - 10.1007/s10772-019-09623-8

DO - 10.1007/s10772-019-09623-8

M3 - Article

VL - 22

SP - 785

EP - 815

JO - International Journal of Speech Technology

T2 - International Journal of Speech Technology

JF - International Journal of Speech Technology

SN - 1381-2416

IS - 3

ER -