Abstract
Background: With the rapid development of science, more data is available to human beings. Therefore, the storage and calculation of big data have become the focus of scientific research. MapReduce performs well in the big data processing. However, it is prone to data skew, which affects the overall efficiency of the data processing cluster.
Objective: Aiming at the low efficiency of MapReduce data join, this paper proposes an intelligent data join load balancing algorithm based on dynamic programming. The algorithm introduces data sampling and partition algorithms. Due to the high performance of dynamic programming in the data constraint problem, it is used to solve the data skew problem intelligently.
Methods: Firstly, the causes of data skew are analyzed and the data partition method is improved. The algorithm introduces a data sampling method. In the task allocation stage, the multidimensional knapsack algorithm is used. Different key values are evenly divided to each computing node through the load cost. Finally, The performance of the improved algorithm is verified by experiments.
Results: The experimental results show that compared with the traditional load balancing algorithm and the existing improved algorithm, the new algorithm improves the data processing efficiency, reduces the data skew problem and better solves the problem of data load imbalance.
Conclusion: A two-table equivalent join load balancing algorithm based on key cost has been proposed. The algorithm creatively combines dynamic programming with intelligent data sampling, which greatly improves the efficiency and quality of data processing. The algorithm is worthy of popularization and application.
Keywords: Load balancing, data skew, equivalent join, intelligent sampling, multidimensional knapsack, greedy algorithm.
Graphical Abstract
[http://dx.doi.org/10.3233/JIFS-201220]
[http://dx.doi.org/10.1007/s40031-021-00651-0]
[http://dx.doi.org/10.2174/2213275912666190405141745]
[http://dx.doi.org/10.1007/s11227-018-2578-0]
[http://dx.doi.org/10.1016/j.ins.2018.11.007]
[http://dx.doi.org/10.1016/j.future.2019.09.040]
[http://dx.doi.org/10.1515/jisys-2018-0117]
[http://dx.doi.org/10.1109/TASE.2020.3014907]
[http://dx.doi.org/10.1016/j.ins.2019.10.062]
[http://dx.doi.org/10.1109/TBDATA.2020.2983650]
[http://dx.doi.org/10.1007/s12652-020-01707-7]
[http://dx.doi.org/10.1016/j.future.2020.09.009]
[http://dx.doi.org/10.1007/s00500-021-05607-6]
[http://dx.doi.org/10.1016/j.future.2019.09.041]
[http://dx.doi.org/10.1007/s10707-020-00414-x]
[http://dx.doi.org/10.1109/TCBB.2020.3000661] [PMID: 32750860]
[http://dx.doi.org/10.14569/IJACSA.2021.0121231]
[http://dx.doi.org/10.1109/ACCESS.2020.3007592]
[http://dx.doi.org/10.32604/cmc.2021.016081]
[http://dx.doi.org/10.1109/TBDATA.2019.2907624]
[http://dx.doi.org/10.1109/TPDS.2017.2772241]
[http://dx.doi.org/10.1080/1206212X.2019.1624314]
[http://dx.doi.org/10.1007/s11277-020-07050-6]
[http://dx.doi.org/10.1016/j.asej.2019.08.009]
[http://dx.doi.org/10.1109/TSMC.2019.2951789]
[http://dx.doi.org/10.1109/TPDS.2020.3029724]