TY - JOUR
T1 - Big data directed acyclic graph model for real-time COVID-19 Twitter stream detection
AU - Amen, Bakhtiar
AU - Faiz, Syahirul
AU - Do, Thanh Toan
N1 - Funding Information:
The research was undertaken by Syahirul Faiz, sponsored by the Indonesia Endowment Fund for Education (LPDP) for his study at the University of Liverpool and supervised by Dr Bakhtiar Amen and Dr Thanh-Toan Do.
Publisher Copyright:
© 2021
Copyright:
Copyright 2021 Elsevier B.V., All rights reserved.
PY - 2022/3/1
Y1 - 2022/3/1
AB - Every day, social media platforms such as Twitter continuously generate large-scale data streams that report on events around the world in real time. Notably, Twitter has been one of the effective platforms for updating countries' leaders and scientists during the coronavirus (COVID-19) pandemic. Other people have also used this platform to post their concerns about the spread of the virus and the rapid global increase in deaths. The aim of this work is to detect anomalous events associated with COVID-19 from Twitter. To this end, we propose a distributed Directed Acyclic Graph topology framework to aggregate and process large-scale real-time tweets related to COVID-19. The core of our system is a novel lightweight algorithm that can automatically detect anomalous events. In addition, our system can identify, cluster, and visualize important keywords in tweets. Our model detected the highest anomaly on 18 August 2020, when many tweets mentioned casualty updates and debates on the pandemic. The three most frequent terms on Twitter were “covid”, “death”, and “Trump” (21,566, 11,779, and 4,761 occurrences, respectively), while the highest TF-IDF scores were obtained for “people” (0.63637), “school” (0.5921407), and “virus” (0.57385). In our clustering results, the words “death”, “corona”, and “case” are grouped into one cluster, whereas “pandemic”, “school”, and “president” are grouped into another. Terms within each cluster lie close to one another in vector space, indicating the topics of greatest concern to people on Twitter.
KW - Anomaly detection
KW - Big data
KW - COVID-19
KW - Directed acyclic graph
KW - Event stream
UR - http://www.scopus.com/inward/record.url?scp=85119054993&partnerID=8YFLogxK
U2 - 10.1016/j.patcog.2021.108404
DO - 10.1016/j.patcog.2021.108404
M3 - Article
AN - SCOPUS:85119054993
VL - 123
JO - Pattern Recognition
JF - Pattern Recognition
SN - 0031-3203
M1 - 108404
ER -