Identifying Domains and Concepts in Short Texts via Partial Taxonomy and Unlabeled Data

Yongrui Qin, Yihong Zhang, Claudia Szabo, Quan Z. Sheng, Wei Emma Zhang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

3 Citations (Scopus)


Accurate, real-time identification of the domains and concepts discussed in microblogging texts is crucial for applications such as earthquake monitoring, influenza surveillance, and disaster management. Existing techniques, such as machine learning and keyword generation, are application-specific and require a significant amount of training to achieve high accuracy. In this paper, we propose using a multiple domain taxonomy (MDT) to capture general user knowledge. We formally define the problems of domain classification and concept tagging. Using the MDT, we devise domain-independent, pure frequency-count methods that require neither training data nor annotations and that are not sensitive to misspellings or shortened word forms. Our extensive experimental analysis on real Twitter data shows that, on large datasets, both methods achieve significantly better identification accuracy than existing methods while keeping runtime low.
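
As a rough illustration of what a training-free, frequency-count approach over a taxonomy could look like, the following minimal Python sketch counts taxonomy term occurrences in a short text, picks the domain with the most hits, and tags the matching concepts in that domain. It is not the authors' method: the taxonomy contents, its domain → concept → terms layout, and the names `TOY_TAXONOMY`, `tokenize`, and `classify_and_tag` are all assumptions made for this example. It also uses exact token matching, so it does not reproduce the robustness to misspellings or shortened word forms that the abstract reports.

```python
import re
from collections import Counter

# Hypothetical toy taxonomy (domain -> concept -> terms); not taken from the paper.
TOY_TAXONOMY = {
    "health": {
        "influenza": {"flu", "influenza", "fever"},
        "vaccine": {"vaccine", "shot"},
    },
    "disaster": {
        "earthquake": {"earthquake", "quake", "aftershock"},
        "flood": {"flood", "flooding"},
    },
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def classify_and_tag(text, taxonomy=TOY_TAXONOMY):
    """Return (best_domain, concept_tags) using plain term-frequency counts.

    No training data is involved: the only knowledge source is the taxonomy.
    Returns (None, []) when no taxonomy term occurs in the text.
    """
    token_counts = Counter(tokenize(text))
    domain_counts = Counter()
    concept_counts = Counter()

    for domain, concepts in taxonomy.items():
        for concept, terms in concepts.items():
            # Number of tokens in the text that match this concept's terms.
            hits = sum(token_counts[term] for term in terms)
            if hits:
                domain_counts[domain] += hits
                concept_counts[(domain, concept)] += hits

    if not domain_counts:
        return None, []

    # Domain classification: the domain with the highest total hit count.
    best_domain = domain_counts.most_common(1)[0][0]
    # Concept tagging: every concept of that domain with at least one hit.
    tags = sorted(c for (d, c) in concept_counts if d == best_domain)
    return best_domain, tags


if __name__ == "__main__":
    print(classify_and_tag("Strong quake then aftershock, no flood yet"))
```

A real system would need a far larger taxonomy and some form of approximate matching to cope with noisy microblog spelling; the sketch only shows the counting idea.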

Original language: English
Title of host publication: Advanced Information Systems Engineering - 29th International Conference, CAiSE 2017
Publisher: Springer Verlag
Number of pages: 17
ISBN (Print): 9783319595351
Publication status: Published - 1 Jan 2017
Event: 29th International Conference on Advanced Information Systems Engineering - Essen, Germany
Duration: 12 Jun 2017 to 16 Jun 2017
Conference number: 29

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 10253 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349


Conference: 29th International Conference on Advanced Information Systems Engineering
Abbreviated title: CAISE'17

