Design and evaluation of small-large outer joins in cloud computing environments

Long Cheng, Ilias Tachmazidis, Spyros Kotoulas, Grigoris Antoniou

Research output: Contribution to journal › Article

12 Citations (Scopus)

Abstract

Large-scale analytics is a key application area for data processing and parallel computing research. One of the most common (and challenging) operations in this domain is the join. Though inner join approaches have been extensively evaluated in parallel and distributed systems, there is little published work providing analysis of outer joins, especially in the extremely popular cloud computing environments. A common type of outer join is the small-large outer join, where one relation is relatively small and the other is large. Conventional implementations for this case, such as those based on hash redistribution, often incur significant network communication, while duplication-based approaches are complex and inefficient. In this work, we present a new method called DDR (duplication and direct redistribution), which aims to enable efficient small-large outer joins in cloud computing environments while being easy to implement using existing predicates in data processing frameworks. We present the detailed implementation of our approach and evaluate its performance through extensive experiments over the widely used MapReduce and Spark platforms. We show that the proposed method is scalable and can achieve significant performance improvements over the conventional approaches. Compared to the state-of-the-art method, the DDR algorithm is shown to be easier to implement and can achieve very similar or better performance under different outer join workloads, and thus can be considered a new option for current data analysis applications. Moreover, our detailed experimental results also provide insights into current small-large outer join implementations, thereby allowing system developers to make a more informed choice for their data analysis applications.
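
The abstract does not describe the DDR algorithm itself, so the following is not DDR. As a rough illustration of the duplication idea that the abstract contrasts with hash redistribution, here is a minimal single-process Python sketch. It assumes the large relation is already split into partitions (one per node), that the small relation is copied to every partition, and that the join key is the first field of each row. The function name and the row layout are hypothetical. It covers only the simple direction in which the large relation is the outer (preserved) side.

```python
from collections import defaultdict


def broadcast_left_outer_join(large_partitions, small):
    """Sketch of: large LEFT OUTER JOIN small, joining on the first field.

    The small relation is duplicated to every partition, so rows of the
    large relation never leave the partition they start in.
    """
    # Hash table over the small relation; on a real cluster every node
    # would build (or receive) its own copy of this table.
    table = defaultdict(list)
    for row in small:
        table[row[0]].append(row)

    results = []
    for partition in large_partitions:
        local = []
        for row in partition:
            matches = table.get(row[0])  # .get() does not create new keys
            if matches:
                # One output row per matching small-relation row.
                local.extend((row, m) for m in matches)
            else:
                # Outer-join semantics: keep the unmatched large row.
                local.append((row, None))
        results.append(local)
    return results


if __name__ == "__main__":
    large = [[(1, "a"), (2, "b")], [(3, "c"), (1, "d")]]
    small = [(1, "x"), (3, "y")]
    for part in broadcast_left_outer_join(large, small):
        print(part)
```

When the small relation is instead the outer side, no single partition can decide locally whether a small row is unmatched everywhere. This is one reason duplication-based outer joins become more involved than this sketch suggests.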

Language: English
Pages: 2-15
Number of pages: 14
Journal: Journal of Parallel and Distributed Computing
Volume: 110
Early online date: 6 Mar 2017
DOIs: 10.1016/j.jpdc.2017.02.007
Publication status: Published - Dec 2017

Fingerprint

Cloud computing
Join
Evaluation
Redistribution
Duplication
Parallel processing systems
Telecommunication networks
Data analysis
Network Communication
MapReduce
Design
Parallel Systems
Parallel Computing
Predicate
Workload
Distributed Systems
Experiments
Evaluate

Cite this

@article{0a16093aa64b4d1f927a589c94d621b6,
title = "Design and evaluation of small-large outer joins in cloud computing environments",
abstract = "Large-scale analytics is a key application area for data processing and parallel computing research. One of the most common (and challenging) operations in this domain is the join. Though inner join approaches have been extensively evaluated in parallel and distributed systems, there is little published work providing analysis of outer joins, especially in the extremely popular cloud computing environments. A common type of outer join is the small-large outer join, where one relation is relatively small and the other is large. Conventional implementations for this case, such as those based on hash redistribution, often incur significant network communication, while duplication-based approaches are complex and inefficient. In this work, we present a new method called DDR (duplication and direct redistribution), which aims to enable efficient small-large outer joins in cloud computing environments while being easy to implement using existing predicates in data processing frameworks. We present the detailed implementation of our approach and evaluate its performance through extensive experiments over the widely used MapReduce and Spark platforms. We show that the proposed method is scalable and can achieve significant performance improvements over the conventional approaches. Compared to the state-of-the-art method, the DDR algorithm is shown to be easier to implement and can achieve very similar or better performance under different outer join workloads, and thus can be considered a new option for current data analysis applications. Moreover, our detailed experimental results also provide insights into current small-large outer join implementations, thereby allowing system developers to make a more informed choice for their data analysis applications.",
keywords = "Cloud computing, Outer joins, Parallel joins, Performance evaluation, Small-large joins",
author = "Long Cheng and Ilias Tachmazidis and Spyros Kotoulas and Grigoris Antoniou",
year = "2017",
month = "12",
doi = "10.1016/j.jpdc.2017.02.007",
language = "English",
volume = "110",
pages = "2--15",
journal = "Journal of Parallel and Distributed Computing",
issn = "0743-7315",
publisher = "Academic Press Inc.",
}

Design and evaluation of small-large outer joins in cloud computing environments. / Cheng, Long; Tachmazidis, Ilias; Kotoulas, Spyros; Antoniou, Grigoris.

In: Journal of Parallel and Distributed Computing, Vol. 110, 12.2017, p. 2-15.


TY - JOUR

T1 - Design and evaluation of small-large outer joins in cloud computing environments

AU - Cheng, Long

AU - Tachmazidis, Ilias

AU - Kotoulas, Spyros

AU - Antoniou, Grigoris

PY - 2017/12

Y1 - 2017/12

N2 - Large-scale analytics is a key application area for data processing and parallel computing research. One of the most common (and challenging) operations in this domain is the join. Though inner join approaches have been extensively evaluated in parallel and distributed systems, there is little published work providing analysis of outer joins, especially in the extremely popular cloud computing environments. A common type of outer join is the small-large outer join, where one relation is relatively small and the other is large. Conventional implementations for this case, such as those based on hash redistribution, often incur significant network communication, while duplication-based approaches are complex and inefficient. In this work, we present a new method called DDR (duplication and direct redistribution), which aims to enable efficient small-large outer joins in cloud computing environments while being easy to implement using existing predicates in data processing frameworks. We present the detailed implementation of our approach and evaluate its performance through extensive experiments over the widely used MapReduce and Spark platforms. We show that the proposed method is scalable and can achieve significant performance improvements over the conventional approaches. Compared to the state-of-the-art method, the DDR algorithm is shown to be easier to implement and can achieve very similar or better performance under different outer join workloads, and thus can be considered a new option for current data analysis applications. Moreover, our detailed experimental results also provide insights into current small-large outer join implementations, thereby allowing system developers to make a more informed choice for their data analysis applications.


KW - Cloud computing

KW - Outer joins

KW - Parallel joins

KW - Performance evaluation

KW - Small-large joins

UR - http://www.scopus.com/inward/record.url?scp=85015637447&partnerID=8YFLogxK

U2 - 10.1016/j.jpdc.2017.02.007

DO - 10.1016/j.jpdc.2017.02.007

M3 - Article

VL - 110

SP - 2

EP - 15

JO - Journal of Parallel and Distributed Computing

T2 - Journal of Parallel and Distributed Computing

JF - Journal of Parallel and Distributed Computing

SN - 0743-7315

ER -