Generic placeholder image

Recent Advances in Electrical & Electronic Engineering

Editor-in-Chief

ISSN (Print): 2352-0965
ISSN (Online): 2352-0973

Research Article

Research on Load Balancing MapReduce Equivalent Join Based on Intelligent Sampling and Multi Knapsack Algorithm

Author(s): Cai Yang, Jizheng Yang, Songhao Jia*, Xing Chen and Yan Liu

Volume 15, Issue 4, 2022

Published on: 01 August, 2022

Page: [335 - 346] Pages: 12

DOI: 10.2174/2352096515666220603164248

Price: $65

Abstract

Background: With the rapid development of science, more data is available to human beings. Therefore, the storage and calculation of big data have become the focus of scientific research. MapReduce performs well in the big data processing. However, it is prone to data skew, which affects the overall efficiency of the data processing cluster.

Objective: Aiming at the low efficiency of MapReduce data join, this paper proposes an intelligent data join load balancing algorithm based on dynamic programming. The algorithm introduces data sampling and partition algorithms. Due to the high performance of dynamic programming in the data constraint problem, it is used to solve the data skew problem intelligently.

Methods: Firstly, the causes of data skew are analyzed and the data partition method is improved. The algorithm introduces a data sampling method. In the task allocation stage, the multidimensional knapsack algorithm is used. Different key values are evenly divided to each computing node through the load cost. Finally, The performance of the improved algorithm is verified by experiments.

Results: The experimental results show that compared with the traditional load balancing algorithm and the existing improved algorithm, the new algorithm improves the data processing efficiency, reduces the data skew problem and better solves the problem of data load imbalance.

Conclusion: A two-table equivalent join load balancing algorithm based on key cost has been proposed. The algorithm creatively combines dynamic programming with intelligent data sampling, which greatly improves the efficiency and quality of data processing. The algorithm is worthy of popularization and application.

Keywords: Load balancing, data skew, equivalent join, intelligent sampling, multidimensional knapsack, greedy algorithm.

« Previous
Graphical Abstract

[1]
S. Rababa, and A. Al-Badarneh, "Optimizations for filter-based join algorithms in MapReduce", J. Intell. Fuzzy Syst., vol. 40, pp. 8963-8980, 2021.
[http://dx.doi.org/10.3233/JIFS-201220]
[2]
T.H. Sardar, and Z. Ansari, "MapReduce-based fuzzy c-means algorithm for distributed document clustering", J. Instit. Eng. (India), vol. 103, pp. 131-142, 2022.
[http://dx.doi.org/10.1007/s40031-021-00651-0]
[3]
B. Singh, and H.K. Verma, "IMSM: An interval migration based approach for skew mitigation in Mapreduce", Recent Adv. Comput. Sci. Commun., vol. 14, pp. 71-81, 2021.
[http://dx.doi.org/10.2174/2213275912666190405141745]
[4]
G. Tian, N. Hao, M. Zhou, W. Pedrycz, C. Zhang, F. Ma, and Z. Li, "Fuzzy grey choquet integral for evaluation of multicriteria decision making problems with interactive and qualitative indices", IEEE Trans. Syst. Man Cybern. Syst., vol. 51, pp. 1855-1868, 2021.
[5]
E. Gavagsaz, A. Rezaee, and H.H.S. Javadi, "Load balancing in join algorithms for skewed data in MapReduce systems", J. Supercomput., vol. 75, pp. 228-254, 2019.
[http://dx.doi.org/10.1007/s11227-018-2578-0]
[6]
M.A. Irandoost, A.M. Rahmani, and S. Setayeshi, "A novel algorithm for handling reducer side data skew in MapReduce based on a learn-ing automata game", Inf. Sci., vol. 501, pp. 662-679, 2019.
[http://dx.doi.org/10.1016/j.ins.2018.11.007]
[7]
M.H.M. Pericini, L.G.M. Leite, F.H. Carvalho-Junior, and J.C. Machado, "M-Curves path planning model for mobile anchor node and localization of sensor nodes using dolphin swarm algorithm", Algorithms, vol. 12, pp. 1-14, 2020.
[8]
D.K. Tayal, and K. Meena, "A new MapReduce solution for associative classification to handle scalability and skewness in vertical data structure", Future Gener. Comput. Syst., vol. 103, pp. 44-57, 2020.
[http://dx.doi.org/10.1016/j.future.2019.09.040]
[9]
A.M. Fathollahi-Fard, M.N. Azari, and M. Hajiaghaei-Keshteli, "An improved red deer algorithm for addressing a direct current brushless motor design problem", Sci. Iran., vol. 28, pp. 1750-1764, 2021.
[10]
O. Kulkarni, S. Jena, and C.H. Sanjay, "Fractional fuzzy clustering and particle whale optimization-based mapreduce framework for big data clustering", J. Intelligent Systems, vol. 29, pp. 1496-1513, 2020.
[http://dx.doi.org/10.1515/jisys-2018-0117]
[11]
H.B. Abdalla, A.M. Ahmed, and M.A.A. Sibahee, "Optimization driven mapreduce framework for indexing and retrieval of big data", Trans. Internet Inf. Syst., vol. 14, pp. 1886-1908, 2020.
[12]
C. Zhang, G. Tian, A.M. Fathollahi-Fard, W. Wang, P. Wu, and Z. Li, "Interval-Valued intuitionistic uncertain linguistic cloud petri net and its application to risk assessment for subway fire accident", IEEE Trans. Autom. Sci. Eng., vol. 19, pp. 163-177, 2022.
[http://dx.doi.org/10.1109/TASE.2020.3014907]
[13]
A.M. Fathollahi-Fard, G. Tian, and Z. Li, "An adaptive Lagrangian relaxation-based algorithm for a coordinated water supply and wastewater collection network design problem", Inf. Sci., vol. 512, pp. 1335-1359, 2020.
[http://dx.doi.org/10.1016/j.ins.2019.10.062]
[14]
M. Shukla, D. Dharme, P. Ramnarain, R.D. Santos, and C.T. Lu, "DIGDUG: Scalable separable dense graph pruning and join operations in mapreduce", IEEE Trans. Big Data, vol. 7, pp. 930-951, 2021.
[http://dx.doi.org/10.1109/TBDATA.2020.2983650]
[15]
R. Jeyaraj, V.S. Ananthanarayana, and A. Paul, "Fine-grained data-locality aware MapReduce job scheduler in a virtualized environment", J. Ambient Intell. Humaniz. Comput., vol. 11, pp. 4261-4272, 2020.
[http://dx.doi.org/10.1007/s12652-020-01707-7]
[16]
G. Francisco, A. Corral, L. Iribarne, and M. Vassilakopoulos, "Improving distance-join query processing with voronoi-diagram based partitioning in spatialhadoop", Future Gener. Comput. Syst., vol. 11, pp. 723-740, 2020.
[17]
S.M. Marzuni, A. Savadi, A.N. Toosi, and M. Naghibzadeh, "Cross-MapReduce: Data transfer reduction in geo-distributed MapReduce", Future Gener. Comput. Syst., vol. 115, pp. 188-200, 2021.
[http://dx.doi.org/10.1016/j.future.2020.09.009]
[18]
A. Zahedi, A. Salehi-Amiri, M. Hajiaghaei-Keshteli, and A. Diabat, "Designing a closed-loop supply chain network considering multi-task sales agencies and multi-mode transportation", Soft Comput., vol. 25, pp. 6203-6235, 2021.
[http://dx.doi.org/10.1007/s00500-021-05607-6]
[19]
N. Akbarpour, M. Hajiaghaei-Keshteli, and R. Tavakkoli-Moghaddam, "New approaches in meta-heuristics to schedule purposeful inspec-tions of workshops in manufacturing supply chains", Int. J. Eng. Trans. B. Appl., vol. 33, pp. 833-840, 2020.
[20]
S. Dolev, P. Gupta, Y. Li, S. Mehrotra, and S. Sharma, "Privacy-Preserving secret shared computations using mapreduce", IEEE Trans. Depend. Secure Comput., vol. 18, pp. 1645-1666, 2021.
[21]
J. Ragaventhiran, and M.K. Kavithadevi, "Map-optimize-reduce: CAN tree assisted FP-growth algorithm for clusters based FP mining on Hadoop", Future Gener. Comput. Syst., vol. 103, pp. 111-122, 2020.
[http://dx.doi.org/10.1016/j.future.2019.09.041]
[22]
A. Belussi, S. Migliorini, and A. Eldawy, "Cost estimation of spatial join in spatialhadoop", GeoInformatica, vol. 24, pp. 1021-1059, 2020.
[http://dx.doi.org/10.1007/s10707-020-00414-x]
[23]
R. Sinha, R.K. Pal, and R.K. De, "“GenSeg and MR-GenSeg: A novel segmentation algorithm and its parallel mapreduce based approach for identifying genomic regions with copy number variations”, IEEE/ACM Trans", Comput. Biol. Bioinform., vol. 19, no. 1, pp. 443-454, 2022.
[http://dx.doi.org/10.1109/TCBB.2020.3000661] [PMID: 32750860]
[24]
E. Rslan, R.M. Badry, M.H. Khafagy, and K. Munir, "English semantic similarity based on map reduce classification for agricultural com-plaints", Int. J. Adv. Comput. Sci. Appl., vol. 12, pp. 235-242, 2021.
[http://dx.doi.org/10.14569/IJACSA.2021.0121231]
[25]
J. Ahn, and D.H. Im, "Efficient access control of large scale rdf data using prefix-based labeling", IEEE Access, vol. 8, pp. 122405-122412, 2020.
[http://dx.doi.org/10.1109/ACCESS.2020.3007592]
[26]
M. Aksa, J. Rashid, M.W. Nisar, T. Mahmood, H.Y. Kwon, and A. Hussain, "Bitmapaligner: Bit-parallelism string matching withmapre-duce and hadoop", Comput. Mater. Continua, vol. 68, pp. 3931-3946, 2021.
[http://dx.doi.org/10.32604/cmc.2021.016081]
[27]
J.A.D. Santos, T.I. Syed, M.C. Naldi, R.J.G.B. Campello, and J. Sander, "Hierarchical density-based clustering using mapreduce", IEEE Trans. Big Data, vol. 7, pp. 102-114, 2021.
[http://dx.doi.org/10.1109/TBDATA.2019.2907624]
[28]
C.D. Briggs, Systems and methods for voice network control and optimization. U.S.20220086679, 2017.
[29]
H. Sawada, Intelligent power module, electric vehicle or hybrid vehicle, and method of assembling intelligent power module.US, . US 11273818, 2015.
[30]
R. Fang, J.C. Gaiteri, B.H. Lord, Y. Chiang, and R.P. Chiarello, Sothermal amplification on with electrical detection.U.S., . U.S. 20220073975, 2022.
[31]
Y.C. Chang, J. Crawford, L.L. Fong, and W. Tan, Index maintenance based on a comparison of rebuild vs. update. U.S.20200142879, 2018.
[32]
C.J. Aversano, and T.J. Stachowiak, Graphical user interface driven programming development environment. U.S.11269889, 2022.
[33]
T.R. Kyaw, J. Ji, S. Mufti, S. Achuthan, and S.C. Song, Systems and methods for dynamic partitioning in distributed environments. U.S.20200159594, 2020.
[34]
G. Stefano, and D. Rossin, Improvements in the welding of pipes. U.S.20220063019, 2019.
[35]
A. Akita, and M. Shimohatsubo, Slotter apparatus and machine for manufacture of carton. U.S.20220048269, 2022.
[36]
G.A. Murray, and L.E. Rivas, Techniques for facilitating the joining of datasets. U.S.20200210417, 2020.
[37]
T.L. Harris, and M.C. Maas, Coordinated garbage collection in distributed systems. U.S.20200257573, 2020.
[38]
S. Vlaskalic, and D. Boljevic, Pocket-size folding device with integrated electrodes for recording, processing and transmission with three ecg leads. U.S.20220039726, 2022.
[39]
U. Suthakar, L. Magnoni, D.R. Smith, and A. Khan, "Optimised lambda architecture for monitoring scientific infrastructure", IEEE Trans. Parallel Distrib. Syst., vol. 32, pp. 1395-1408, 2021.
[http://dx.doi.org/10.1109/TPDS.2017.2772241]
[40]
M. Nakamura, D. Kinjo, and T. Yoshida, "A MapReduce algorithm for minimum vertex cover problems and its randomization", Comput. Inf., vol. 39, pp. 952-972, 2021.
[41]
S. Badri, "A novel Map-Scan-Reduce based density peaks clustering and privacy protection approach for large datasets", Int. J. Comput. Appl., vol. 43, pp. 663-673, 2021.
[http://dx.doi.org/10.1080/1206212X.2019.1624314]
[42]
S. Nithyanantham, and G. Singaravel, "Resource and cost aware glowworm mapreduce optimization based big data processing in geo dis-tributed data center", Int. J. Mobile Netw. Des. Innov. Wirel. Pers. Commun., vol. 117, pp. 2831-2852, 2021.
[http://dx.doi.org/10.1007/s11277-020-07050-6]
[43]
D. Medhat, A.H. Yousef, and C. Salama, "Cost-aware load balancing for multilingual record linkage using MapReduce", Ain Shams Eng. J., vol. 11, pp. 419-433, 2020.
[http://dx.doi.org/10.1016/j.asej.2019.08.009]
[44]
T.Y. Mu, A. Ala, and K. Salah, "Automating the configuration of mapreduce: A reinforcement learning scheme", IEEE Trans. Syst. Man Cybern. Syst., vol. 50, pp. 4183-4196, 2020.
[http://dx.doi.org/10.1109/TSMC.2019.2951789]
[45]
T.R. Toha, A.S.M. Rizvi, J. Noor, M.A. Adnan, and A.B.M.A. Islam, "Towards greening mapreduce clusters considering both computation energy and cooling energy", IEEE Trans. Parallel Distrib. Syst., vol. 32, pp. 931-942, 2021.
[http://dx.doi.org/10.1109/TPDS.2020.3029724]

Rights & Permissions Print Cite
© 2025 Bentham Science Publishers | Privacy Policy