Data quality assessment and anomaly detection via map/reduce and linked data

A case study in the medical domain

Stephen Bonner, Andrew Stephen McGough, Ibad Kureshi, John Brennan, Georgios Theodoropoulos, Laura Moss, David Corsar, Grigoris Antoniou

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

5 Citations (Scopus)

Abstract

Recent technological advances in modern healthcare have led to the ability to collect a vast wealth of patient monitoring data. This data can be utilised for patient diagnosis, but it also holds potential for use within medical research. However, these datasets often contain errors which limit their value to medical research, with one study finding error rates ranging from 2.3% to 26.9% in a selection of medical databases. Previous methods for automatically assessing data quality normally rely on threshold rules, which are often unable to correctly identify errors because further complex domain knowledge is required. To combat this, a semantic-web-based framework has previously been developed to assess the quality of medical data. However, early work revealed that traditional semantic web technologies alone either cannot scale, or scale inefficiently, to the vast volumes of medical data. In this paper we present a new method for storing and querying medical RDF datasets using Hadoop Map/Reduce. This approach exploits the inherent parallelism found within RDF datasets and queries, allowing us to scale with both dataset and system size. Unlike previous solutions, this framework uses highly optimised SPARQL join strategies, intelligent data caching and a super-query, enabling eight distinct SPARQL lookups, comprising over eighty distinct joins, to be completed in only two Map/Reduce iterations. Results are presented comparing the new method against both Jena and a previous Hadoop implementation, demonstrating the superior performance of the new methodology. The new method is shown to be five times faster than Jena and twice as fast as the previous approach.
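The abstract's core technique is evaluating SPARQL-style joins over RDF triples as Map/Reduce passes. The record contains no code, so what follows is a minimal Python sketch of the basic unit being optimised, a single reduce-side equi-join between two triple patterns, written in the style of a Hadoop Streaming mapper and reducer run in-process. The predicates (hasReading, value) and the toy data are hypothetical, invented for illustration; this is not the authors' implementation.

from collections import defaultdict

# Hypothetical predicates, invented for illustration; the paper's actual
# vocabulary is not given in this record.
HAS_READING = "<http://example.org/hasReading>"  # pattern A: ?patient hasReading ?reading
VALUE = "<http://example.org/value>"             # pattern B: ?reading value ?v

def map_phase(triples):
    # Tag each matching triple with its join key (the shared ?reading
    # variable). In Hadoop Streaming this would be a mapper process
    # writing "key<TAB>tag<TAB>value" lines to stdout.
    for s, p, o in triples:
        if p == HAS_READING:
            yield o, ("A", s)   # join key is the object (?reading)
        elif p == VALUE:
            yield s, ("B", o)   # join key is the subject (?reading)

def reduce_phase(tagged):
    # Group tagged tuples by join key, then emit the cross product of the
    # two sides: a reduce-side equi-join on ?reading. Hadoop's shuffle
    # would do the grouping; here a dict stands in for it.
    groups = defaultdict(lambda: {"A": [], "B": []})
    for key, (tag, payload) in tagged:
        groups[key][tag].append(payload)
    for reading, sides in groups.items():
        for patient in sides["A"]:
            for value in sides["B"]:
                yield patient, reading, value

if __name__ == "__main__":
    # Toy data: two joinable readings plus one orphan with no value triple.
    triples = [
        ("<p1>", HAS_READING, "<r1>"),
        ("<p2>", HAS_READING, "<r2>"),
        ("<p3>", HAS_READING, "<r3>"),   # orphan: dropped by the join
        ("<r1>", VALUE, '"98.6"'),
        ("<r2>", VALUE, '"240.0"'),
    ]
    for row in reduce_phase(map_phase(triples)):
        print(row)

The paper's contribution goes beyond this basic unit: per the abstract, eight such lookups (over eighty joins) are batched into a single super-query, with optimised join strategies and data caching, so the whole workload completes in two Map/Reduce iterations rather than one iteration per join.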

Original language: English
Title of host publication: Proceedings - 2015 IEEE International Conference on Big Data, IEEE Big Data 2015
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 737-746
Number of pages: 10
ISBN (Electronic): 9781479999255
DOI: https://doi.org/10.1109/BigData.2015.7363818
Publication status: Published - 22 Dec 2015
Event: 3rd IEEE International Conference on Big Data - Santa Clara, United States
Duration: 29 Oct 2015 - 1 Nov 2015

Conference

Conference: 3rd IEEE International Conference on Big Data
Abbreviated title: IEEE 2015
Country: United States
City: Santa Clara
Period: 29/10/15 - 1/11/15


Cite this

Bonner, S., McGough, A. S., Kureshi, I., Brennan, J., Theodoropoulos, G., Moss, L., ... Antoniou, G. (2015). Data quality assessment and anomaly detection via map/reduce and linked data: A case study in the medical domain. In Proceedings - 2015 IEEE International Conference on Big Data, IEEE Big Data 2015 (pp. 737-746). [7363818] Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/BigData.2015.7363818
Bonner, Stephen ; McGough, Andrew Stephen ; Kureshi, Ibad ; Brennan, John ; Theodoropoulos, Georgios ; Moss, Laura ; Corsar, David ; Antoniou, Grigoris. / Data quality assessment and anomaly detection via map/reduce and linked data: A case study in the medical domain. Proceedings - 2015 IEEE International Conference on Big Data, IEEE Big Data 2015. Institute of Electrical and Electronics Engineers Inc., 2015. pp. 737-746
@inproceedings{0ba4095b3fe7406ba1730cbe08a068bf,
title = "Data quality assessment and anomaly detection via map/reduce and linked data: A case study in the medical domain",
abstract = "Recent technological advances in modern healthcare have lead to the ability to collect a vast wealth of patient monitoring data. This data can be utilised for patient diagnosis but it also holds the potential for use within medical research. However, these datasets often contain errors which limit their value to medical research, with one study finding error rates ranging from 2.3{\%}-26.9{\%} in a selection of medical databases. Previous methods for automatically assessing data quality normally rely on threshold rules, which are often unable to correctly identify errors, as further complex domain knowledge is required. To combat this, a semantic web based framework has previously been developed to assess the quality of medical data. However, early work, based solely on traditional semantic web technologies, revealed they are either unable or inefficient at scaling to the vast volumes of medical data. In this paper we present a new method for storing and querying medical RDF datasets using Hadoop Map / Reduce. This approach exploits the inherent parallelism found within RDF datasets and queries, allowing us to scale with both dataset and system size. Unlike previous solutions, this framework uses highly optimised (SPARQL) joining strategies, intelligent data caching and the use of a super-query to enable the completion of eight distinct SPARQL lookups, comprising over eighty distinct joins, in only two Map / Reduce iterations. Results are presented comparing both the Jena and a previous Hadoop implementation demonstrating the superior performance of the new methodology. The new method is shown to be five times faster than Jena and twice as fast as the previous approach.",
keywords = "Joins, Map / Reduce, Medical Data, RDF",
author = "Stephen Bonner and McGough, {Andrew Stephen} and Ibad Kureshi and John Brennan and Georgios Theodoropoulos and Laura Moss and David Corsar and Grigoris Antoniou",
year = "2015",
month = "12",
day = "22",
doi = "10.1109/BigData.2015.7363818",
language = "English",
pages = "737--746",
booktitle = "Proceedings - 2015 IEEE International Conference on Big Data, IEEE Big Data 2015",
publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

Bonner, S, McGough, AS, Kureshi, I, Brennan, J, Theodoropoulos, G, Moss, L, Corsar, D & Antoniou, G 2015, Data quality assessment and anomaly detection via map/reduce and linked data: A case study in the medical domain. in Proceedings - 2015 IEEE International Conference on Big Data, IEEE Big Data 2015., 7363818, Institute of Electrical and Electronics Engineers Inc., pp. 737-746, 3rd IEEE International Conference on Big Data, Santa Clara, United States, 29/10/15. https://doi.org/10.1109/BigData.2015.7363818

Data quality assessment and anomaly detection via map/reduce and linked data: A case study in the medical domain. / Bonner, Stephen; McGough, Andrew Stephen; Kureshi, Ibad; Brennan, John; Theodoropoulos, Georgios; Moss, Laura; Corsar, David; Antoniou, Grigoris.

Proceedings - 2015 IEEE International Conference on Big Data, IEEE Big Data 2015. Institute of Electrical and Electronics Engineers Inc., 2015. pp. 737-746, 7363818.

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

TY - GEN

T1 - Data quality assessment and anomaly detection via map/reduce and linked data

T2 - A case study in the medical domain

AU - Bonner, Stephen

AU - McGough, Andrew Stephen

AU - Kureshi, Ibad

AU - Brennan, John

AU - Theodoropoulos, Georgios

AU - Moss, Laura

AU - Corsar, David

AU - Antoniou, Grigoris

PY - 2015/12/22

Y1 - 2015/12/22

N2 - Recent technological advances in modern healthcare have led to the ability to collect a vast wealth of patient monitoring data. This data can be utilised for patient diagnosis, but it also holds potential for use within medical research. However, these datasets often contain errors which limit their value to medical research, with one study finding error rates ranging from 2.3% to 26.9% in a selection of medical databases. Previous methods for automatically assessing data quality normally rely on threshold rules, which are often unable to correctly identify errors because further complex domain knowledge is required. To combat this, a semantic-web-based framework has previously been developed to assess the quality of medical data. However, early work revealed that traditional semantic web technologies alone either cannot scale, or scale inefficiently, to the vast volumes of medical data. In this paper we present a new method for storing and querying medical RDF datasets using Hadoop Map/Reduce. This approach exploits the inherent parallelism found within RDF datasets and queries, allowing us to scale with both dataset and system size. Unlike previous solutions, this framework uses highly optimised SPARQL join strategies, intelligent data caching and a super-query, enabling eight distinct SPARQL lookups, comprising over eighty distinct joins, to be completed in only two Map/Reduce iterations. Results are presented comparing the new method against both Jena and a previous Hadoop implementation, demonstrating the superior performance of the new methodology. The new method is shown to be five times faster than Jena and twice as fast as the previous approach.

AB - Recent technological advances in modern healthcare have led to the ability to collect a vast wealth of patient monitoring data. This data can be utilised for patient diagnosis, but it also holds potential for use within medical research. However, these datasets often contain errors which limit their value to medical research, with one study finding error rates ranging from 2.3% to 26.9% in a selection of medical databases. Previous methods for automatically assessing data quality normally rely on threshold rules, which are often unable to correctly identify errors because further complex domain knowledge is required. To combat this, a semantic-web-based framework has previously been developed to assess the quality of medical data. However, early work revealed that traditional semantic web technologies alone either cannot scale, or scale inefficiently, to the vast volumes of medical data. In this paper we present a new method for storing and querying medical RDF datasets using Hadoop Map/Reduce. This approach exploits the inherent parallelism found within RDF datasets and queries, allowing us to scale with both dataset and system size. Unlike previous solutions, this framework uses highly optimised SPARQL join strategies, intelligent data caching and a super-query, enabling eight distinct SPARQL lookups, comprising over eighty distinct joins, to be completed in only two Map/Reduce iterations. Results are presented comparing the new method against both Jena and a previous Hadoop implementation, demonstrating the superior performance of the new methodology. The new method is shown to be five times faster than Jena and twice as fast as the previous approach.

KW - Joins

KW - Map / Reduce

KW - Medical Data

KW - RDF

UR - http://www.scopus.com/inward/record.url?scp=84963747866&partnerID=8YFLogxK

U2 - 10.1109/BigData.2015.7363818

DO - 10.1109/BigData.2015.7363818

M3 - Conference contribution

SP - 737

EP - 746

BT - Proceedings - 2015 IEEE International Conference on Big Data, IEEE Big Data 2015

PB - Institute of Electrical and Electronics Engineers Inc.

ER -

Bonner S, McGough AS, Kureshi I, Brennan J, Theodoropoulos G, Moss L et al. Data quality assessment and anomaly detection via map/reduce and linked data: A case study in the medical domain. In Proceedings - 2015 IEEE International Conference on Big Data, IEEE Big Data 2015. Institute of Electrical and Electronics Engineers Inc. 2015. p. 737-746. 7363818 https://doi.org/10.1109/BigData.2015.7363818