Optimizing Data Locality by Executor Allocation in Spark Computing Environment

Zhongming Fu1, Mengsi He1, Zhuo Tang2 and Yang Zhang3

  1. Computer School, University of South China, and Hunan Provincial Base for Scientific and Technological Innovation Cooperation
    Hengyang, Hunan, China, 421001
    fuzhongming@hnu.edu.cn, mengsih@163.com
  2. College of Information Science and Engineering, Hunan University, and National Supercomputing Center
    Changsha, Hunan, China, 410082
    ztang@hnu.edu.cn
  3. Science and Technology on Parallel and Distributed Laboratory (PDL), National University of Defense Technology
    Changsha, Hunan, China, 410073
    yangzhang15@nudt.edu.cn

Abstract

Data locality is an important concept in big data processing. Most existing research optimizes data locality through task scheduling. However, because executors are the containers in which tasks run, the nodes on which executors are started directly affect the locality level that tasks can achieve. This paper aims to improve data locality for the reduce stage in the Spark computing environment through executor allocation. First, we calculate the network distance matrix of executors and formulate an optimal executor allocation problem that minimizes the total communication distance. Then, we propose an approximation algorithm for the case where the network distance between executors satisfies the triangle inequality, and a greedy algorithm for the case where it does not. Finally, we evaluate the performance of our algorithms on a practical Spark cluster using several representative micro-benchmarks (Sort and Join) and macro-benchmarks (PageRank and LDA). Experimental results show that the proposed algorithms decrease the execution time of tasks by reducing data communication.
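
To make the idea of distance-driven executor allocation concrete, the following is a minimal sketch and not the algorithm from the paper. It assumes a simplified formulation in which k executors are chosen from n candidates so that the summed pairwise network distance among them is small. The function names `total_distance` and `greedy_allocate`, and the seeding rule (start from the closest pair), are hypothetical choices made for illustration only.

```python
from itertools import combinations


def total_distance(dist, chosen):
    """Sum of pairwise distances among the chosen executors."""
    return sum(dist[a][b] for a, b in combinations(chosen, 2))


def greedy_allocate(dist, k):
    """Pick k executor indices from a symmetric distance matrix.

    Assumed heuristic (not taken from the paper): seed with the closest
    pair, then repeatedly add the candidate whose summed distance to the
    already chosen executors is smallest.
    """
    n = len(dist)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of candidates")
    if k == 1:
        # Any single executor has zero pairwise distance; pick the first.
        return [0]
    # Seed with the closest pair of candidates.
    chosen = list(min(combinations(range(n), 2),
                      key=lambda p: dist[p[0]][p[1]]))
    while len(chosen) < k:
        remaining = [c for c in range(n) if c not in chosen]
        # Add the candidate that increases the total distance the least.
        best = min(remaining, key=lambda c: sum(dist[c][s] for s in chosen))
        chosen.append(best)
    return sorted(chosen)
```

A greedy rule of this kind does not require the triangle inequality to hold, which is why it is a natural fallback in that case, but it carries no optimality guarantee in general.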

Key words

communication distance, data locality, executor allocation, Spark framework

Digital Object Identifier (DOI)

https://doi.org/10.2298/CSIS220131065F

Publication information

Volume 14, Issue 3 (September 2017)
Advances in Information Technology, Distributed and Model Driven Systems
Year of Publication: 2017
ISSN: 2406-1018 (Online)
Publisher: ComSIS Consortium

How to cite

Fu, Z., He, M., Tang, Z., Zhang, Y.: Optimizing Data Locality by Executor Allocation in Spark Computing Environment. Computer Science and Information Systems, Vol. 14, No. 3, 491–512. (2017), https://doi.org/10.2298/CSIS220131065F