Abstract
With inexpensive storage devices and data sensors widely available, collecting and storing data is now simpler than ever. Sources of data include biotechnology, pharmacy, business, online marketing websites, Twitter, Facebook, and blogs. Understanding this data is crucial because every activity, private or public, from hospitals to mega marts, benefits from it. However, the explosive volume of data makes manual analysis almost impossible. As of 2022, about 2.5 quintillion bytes of data are created per day; one quintillion bytes is one billion gigabytes. Approximately 90% of all data was created in the last two years. An automatic technique for analyzing data is therefore a necessity, and data mining is performed with the help of machine learning tools to analyze and understand the data. Data mining and machine learning depend heavily on statistical tools and techniques, which is why machine learning is sometimes called “statistical learning”. Many machine learning techniques exist in the literature, and improving them is a continuous process because no model is perfect. This paper examines the influence of variance, a statistical concept, on various machine learning approaches and considers how this concept can be used to improve performance.
Keywords: Statistical learning, machine learning, data mining, variance, k-distance, KNN.