Identifying Domains and Concept in Short Texts Via Partial Taxonomy and Unlabeled Data

Yongrui Qin, Yihong Zhang, Claudia Szabo, Quan Z. Sheng, Wei Emma Zhang

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

1 Citation (Scopus)

Abstract

Accurate, real-time identification of the domains and concepts discussed in microblogging texts is crucial for applications such as earthquake monitoring, influenza surveillance, and disaster management. Existing techniques, such as machine learning and keyword generation, are application-specific and require a significant amount of training to achieve high accuracy. In this paper, we propose using a multiple domain taxonomy (MDT) to capture general user knowledge. We formally define the problems of domain classification and concept tagging. Using the MDT, we devise domain-independent, pure frequency-count methods that require neither training data nor annotations and are not sensitive to misspellings or shortened word forms. Our extensive experimental analysis on real Twitter data shows that, on large datasets, both methods achieve significantly better identification accuracy than existing methods while keeping runtime low.
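
As a rough illustration only, the idea of counting taxonomy-term frequencies to pick a domain and tag concepts might be sketched as follows. This is not the authors' algorithm: the toy taxonomy, the `classify` and `tokenize` helpers, and the exact-match tokenization are all assumptions made for this example, and the sketch does not attempt the tolerance to misspellings or shortened word forms that the abstract describes.

```python
import re
from collections import Counter

# Hypothetical toy taxonomy (an assumption for this sketch, not the paper's MDT):
# domain -> concept -> set of indicative single-word terms.
TAXONOMY = {
    "earthquake": {
        "magnitude": {"magnitude", "richter"},
        "damage": {"collapsed", "aftershock"},
    },
    "influenza": {
        "symptom": {"fever", "cough"},
        "treatment": {"vaccine", "antiviral"},
    },
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def classify(text, taxonomy=TAXONOMY):
    """Return (domain, concepts) chosen purely by term frequency counts.

    The domain with the most matching terms wins; the concepts returned are
    those of the winning domain that matched at least one term. Returns
    (None, []) when no taxonomy term occurs in the text.
    """
    token_counts = Counter(tokenize(text))
    domain_hits = Counter()
    concept_hits = Counter()
    for domain, concepts in taxonomy.items():
        for concept, terms in concepts.items():
            hits = sum(token_counts[term] for term in terms)
            if hits:
                domain_hits[domain] += hits
                concept_hits[(domain, concept)] += hits
    if not domain_hits:
        return None, []
    best_domain = domain_hits.most_common(1)[0][0]
    tags = sorted(c for (d, c) in concept_hits if d == best_domain)
    return best_domain, tags
```

For instance, `classify("Magnitude 6.1 quake, aftershock reported, buildings collapsed")` matches three earthquake terms and no influenza terms, so it selects the hypothetical "earthquake" domain with the "damage" and "magnitude" concepts. No training data is involved; all knowledge sits in the taxonomy.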

Language: English
Title of host publication: Advanced Information Systems Engineering - 29th International Conference, CAiSE 2017
Publisher: Springer Verlag
Pages: 127-143
Number of pages: 17
ISBN (Print): 9783319595351
DOI: https://doi.org/10.1007/978-3-319-59536-8_9
Publication status: Published - 1 Jan 2017
Event: 29th International Conference on Advanced Information Systems Engineering - Essen, Germany
Duration: 12 Jun 2017 to 16 Jun 2017
Conference number: 29
http://caise2017.paluno.de/welcome/ (Link to Conference Website)

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 10253 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 29th International Conference on Advanced Information Systems Engineering
Abbreviated title: CAISE'17
Country: Germany
City: Essen
Period: 12/06/17 to 16/06/17
Internet address: http://caise2017.paluno.de/welcome/

Fingerprint

Taxonomies
Taxonomy
Partial
Disasters
Learning systems
Earthquakes
Monitoring
Disaster Management
Influenza
Tagging
Experimental Analysis
Earthquake
Large Data Sets
Surveillance
Annotation
Machine Learning
Count
High Accuracy
Text
Concepts

Cite this

Qin, Y., Zhang, Y., Szabo, C., Sheng, Q. Z., & Zhang, W. E. (2017). Identifying Domains and Concept in Short Texts Via Partial Taxonomy and Unlabeled Data. In Advanced Information Systems Engineering - 29th International Conference, CAiSE 2017 (pp. 127-143). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 10253 LNCS). Springer Verlag. https://doi.org/10.1007/978-3-319-59536-8_9
Qin, Yongrui; Zhang, Yihong; Szabo, Claudia; Sheng, Quan Z.; Zhang, Wei Emma. / Identifying Domains and Concept in Short Texts Via Partial Taxonomy and Unlabeled Data. Advanced Information Systems Engineering - 29th International Conference, CAiSE 2017. Springer Verlag, 2017. pp. 127-143 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).
@inproceedings{a064750870d849e1baa65fd0af9c114c,
title = "Identifying Domains and Concept in Short Texts Via Partial Taxonomy and Unlabeled Data",
abstract = "Accurate and real-time identification of domains and concepts discussed in microblogging texts is crucial for many important applications such as earthquake monitoring, influenza surveillance and disaster management. Existing techniques such as machine learning and keyword generation are application specific and require significant amount of training in order to achieve high accuracy. In this paper, we propose to use a multiple domain taxonomy (MDT) to capture general user knowledge. We formally define the problems of domain classification and concept tagging. Using the MDT, we devise domain-independent pure frequency count methods that do not require any training data nor annotations and that are not sensitive to misspellings or shortened word forms. Our extensive experimental analysis on real Twitter data shows that both methods have significantly better identification accuracy with low runtime than existing methods for large datasets.",
keywords = "Concept Extraction, Text Classification, Twitter, Unsupervised Method",
author = "Yongrui Qin and Yihong Zhang and Claudia Szabo and Sheng, {Quan Z.} and Zhang, {Wei Emma}",
year = "2017",
month = "1",
day = "1",
doi = "10.1007/978-3-319-59536-8_9",
language = "English",
isbn = "9783319595351",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
publisher = "Springer Verlag",
pages = "127--143",
booktitle = "Advanced Information Systems Engineering - 29th International Conference, CAiSE 2017",
}

Qin, Y, Zhang, Y, Szabo, C, Sheng, QZ & Zhang, WE 2017, Identifying Domains and Concept in Short Texts Via Partial Taxonomy and Unlabeled Data. in Advanced Information Systems Engineering - 29th International Conference, CAiSE 2017. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 10253 LNCS, Springer Verlag, pp. 127-143, 29th International Conference on Advanced Information Systems Engineering, Essen, Germany, 12/06/17. https://doi.org/10.1007/978-3-319-59536-8_9

Identifying Domains and Concept in Short Texts Via Partial Taxonomy and Unlabeled Data. / Qin, Yongrui; Zhang, Yihong; Szabo, Claudia; Sheng, Quan Z.; Zhang, Wei Emma.

Advanced Information Systems Engineering - 29th International Conference, CAiSE 2017. Springer Verlag, 2017. p. 127-143 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 10253 LNCS).

TY - GEN

T1 - Identifying Domains and Concept in Short Texts Via Partial Taxonomy and Unlabeled Data

AU - Qin, Yongrui

AU - Zhang, Yihong

AU - Szabo, Claudia

AU - Sheng, Quan Z.

AU - Zhang, Wei Emma

PY - 2017/1/1

Y1 - 2017/1/1

AB - Accurate and real-time identification of domains and concepts discussed in microblogging texts is crucial for many important applications such as earthquake monitoring, influenza surveillance and disaster management. Existing techniques such as machine learning and keyword generation are application specific and require significant amount of training in order to achieve high accuracy. In this paper, we propose to use a multiple domain taxonomy (MDT) to capture general user knowledge. We formally define the problems of domain classification and concept tagging. Using the MDT, we devise domain-independent pure frequency count methods that do not require any training data nor annotations and that are not sensitive to misspellings or shortened word forms. Our extensive experimental analysis on real Twitter data shows that both methods have significantly better identification accuracy with low runtime than existing methods for large datasets.

KW - Concept Extraction

KW - Text Classification

KW - Twitter

KW - Unsupervised Method

UR - http://www.scopus.com/inward/record.url?scp=85021242048&partnerID=8YFLogxK

U2 - 10.1007/978-3-319-59536-8_9

DO - 10.1007/978-3-319-59536-8_9

M3 - Conference contribution

SN - 9783319595351

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 127

EP - 143

BT - Advanced Information Systems Engineering - 29th International Conference, CAiSE 2017

PB - Springer Verlag

ER -

Qin Y, Zhang Y, Szabo C, Sheng QZ, Zhang WE. Identifying Domains and Concept in Short Texts Via Partial Taxonomy and Unlabeled Data. In Advanced Information Systems Engineering - 29th International Conference, CAiSE 2017. Springer Verlag. 2017. p. 127-143. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). https://doi.org/10.1007/978-3-319-59536-8_9