题名

基於Apache Spark異構叢集系統任務排程優化之研究

并列篇名

The Study of Task Scheduling Optimization on Apache Spark Heterogeneous Cluster System

作者

吳秉諭

关键词

Apache Spark ; 雲端運算 ; 巨量資料 ; 調度策略 ; 任務排程調度策略 ; 異構叢集 ; Apache Spark ; Cloud Computing ; Big Data ; Dispatching Strategy ; Task Scheduling and Dispatching strategy ; Heterogenic Clusters

期刊名称

臺中科技大學資訊工程系碩士班學位論文

卷期/出版年月

2017年

学位类别

碩士

导师

陳弘明

内容语文

繁體中文

中文摘要

隨著硬體設備的升級,企業的數據中心的硬體資源不斷的更新換代,新加入的節點使得叢集產生異構性,由於異構的叢集各個節點計算能力不一致,因此當同一個任務分配到不同節點上運算,將會對節點的負載造成不同的影響。另外,對於任務也存在著異構性的問題,如CPU密集型和I/O密集型任務對於計算資源需求不一致,在分配節點時也應考慮,以Apache Spark大數據計算框架為例,在分配任務時,並不考慮到叢集節點的異構性以及節點資源利用情況,因此,導致叢集中各個節點在任務執行時出現負載平衡不平均的問題,導致一部分的資源消耗過載,使得整體效率受限於弱節點,導致整體任務計算效能下降。針對上述問題,本研究提出了一種新的調度策略以優化Spark在異構叢集的表現,提出了新的分層排程調度方法,先透過分群的方式,將相近計算能力的節點組成叢集,而在調度時運用測試任務來進行初步任務執行時間的推估,而後利用歷史數據與機器學習方法更準確的預估任務執行時間,以實現更高效率的任務調度算法。

英文摘要

As the hardware equipment is continue to be upgraded, the hardware resources of the company data center are constantly renewed and replaced. Consequently, newly added nodes cause the cluster nodes to express heterogeneity. However, due to cluster heterogeneity, the processing abilities of the each node are different. As a result, when a task is assigned to different nodes, it affects the loading of each node differently. In addition, heterogeneity causes issues in tasks themselves as well. For example: When assigning nodes, one should consider the need for different resources of CPU intensives and I/O intensives. Take the Apache Spark data analytic framework as an example, the current Spark does not take heterogeneity nor resource utilization of cluster nodes into consideration. Therefore, the nodes of each cluster demonstrate uneven loading when they are performing tasks. This causes partial system overloading and resource depletion, and limits the overall efficiency to the lesser capable nodes. As a result, overall computational performance drops. In order to counteract the problems discussed above, this study suggests a new scheduling strategy that can optimize Spark’s performance in relation to heterogeneos clusters. This study proposes a new hierarchical scheduling strategy that first divides nodes with similar calculating abilities into groups. During this process, test assignments are used to assess preliminary executing time. Then, historical data and machine learning techniques are used to further accurately estimate the execution time. Finally, with the strategy explained above, a more efficient task scheduling algorithm are proposed and implemented.

主题分类 基礎與應用科學 > 資訊科學
資訊與流通學院 > 資訊工程系碩士班
参考文献
  1. [1] Jo, M., Maksymyuk, T., Strykhalyuk, B., & Cho, C. H. (2015). Device-to-device-based heterogeneous radio access network architecture for mobile cloud computing. IEEE Wireless Communications, 22(3), 50-58.
    連結:
  2. [2] Beloglazov, A., Buyya, R., Lee, Y. C., & Zomaya, A. (2011). A taxonomy and survey of energy-efficient data centers and cloud computing systems. Advances in computers, 82(2), 47-111.
    連結:
  3. [5] Vasile, M. A., Pop, F., Tutueanu, R. I., Cristea, V., & Kołodziej, J. (2015). Resource-aware hybrid scheduling algorithm in heterogeneous distributed computing. Future Generation Computer Systems, 51, 61-71.
    連結:
  4. [6] Di Nitto, E., Ghezzi, C., Metzger, A., Papazoglou, M., & Pohl, K. (2008). A journey to highly dynamic, self-adaptive service-based applications. Automated Software Engineering, 15(3), 313-341.
    連結:
  5. [7] Xu, X., Cao, L., & Wang, X. (2016). Adaptive task scheduling strategy based on dynamic workload adjustment for heterogeneous Hadoop clusters. IEEE Systems Journal, 10(2), 471-482.
    連結:
  6. [8] Nightingale, E. B., Chen, P. M., & Flinn, J. (2006). Speculative execution in a distributed file system. ACM Transactions on Computer Systems (TOCS), 24(4), 361-392.
    連結:
  7. [11] Wang, W., Zhao, W., Cai, C., Huang, J., Xu, X., & Li, L. (2015). An efficient image aesthetic analysis system using Hadoop.Signal Processing: Image Communication,39, 499-508.
    連結:
  8. [12] Ghazi, M. R., & Gangodkar, D. (2015). Hadoop, MapReduce and HDFS: A Developers Perspective.Procedia Computer Science,48, 45-50.
    連結:
  9. [19] Friedl, M. A., & Brodley, C. E. (1997). Decision tree classification of land cover from remotely sensed data. Remote sensing of environment, 61(3), 399-409.
    連結:
  10. [20] Chen, C. W., Luo, J., & Parker, K. J. (1998). Image segmentation via adaptive K-mean clustering and knowledge-based morphological operations with biomedical applications. Image Processing, IEEE Transactions on, 7(12), 1673-1683.
    連結:
  11. [31] Srikanth, B. V. S., & Reddy, V. K. (2016). Efficiency of stream processing engines for processing BIGDATA Streams. Indian Journal of Science and Technology, 9(14).
    連結:
  12. [34] Low, Y., Bickson, D., Gonzalez, J., Guestrin, C., Kyrola, A., & Hellerstein, J. M. (2012). Distributed GraphLab: a framework for machine learning and data mining in the cloud. Proceedings of the VLDB Endowment, 5(8), 716-727.
    連結:
  13. [36] Li, Y., Wang, S., & Ding, X. (2010, September). Person-independent head pose estimation based on random forest regression. In Image Processing (ICIP), 2010 17th IEEE International Conference on (pp. 1521-1524). IEEE.
    連結:
  14. [38] Linden, G., Smith, B., & York, J. (2003). Amazon. com recommendations: Item-to-item collaborative filtering. IEEE Internet computing, 7(1), 76-80.
    連結:
  15. [43] Lin, C. Y., Tsai, C. H., Lee, C. P., & Lin, C. J. (2014, October). Large-scale logistic regression and linear support vector machines using Spark. In Big Data (Big Data), 2014 IEEE International Conference on (pp. 519-528). IEEE.
    連結:
  16. [44] Watson, G. S. (1967). Linear least squares regression
    連結:
  17. [45] . The Annals of Mathematical Statistics, 1679-1699.
    連結:
  18. [46] Golub, G. (1965). Numerical methods for solving linear least squares problems. Numerische Mathematik, 7(3), 206-216.
    連結:
  19. [48] Ohsowski, B. M., Dunfield, K. E., Klironomos, J. N., & Hart, M. M. (2016). Improving Plant Biomass Estimation in the Field Using Partial Least Squares Regression and Ridge Regression. Botany, (ja).
    連結:
  20. [49] Ohsowski, B. M., Dunfield, K. E., Klironomos, J. N., & Hart, M. M. (2016). Improving Plant Biomass Estimation in the Field Using Partial Least Squares Regression and Ridge Regression. Botany, (ja).
    連結:
  21. [50] Li, Y., Wang, S., & Ding, X. (2010, September). Person-independent head pose estimation based on random forest regression. In Image Processing (ICIP), 2010 17th IEEE International Conference on (pp. 1521-1524). IEEE.
    連結:
  22. [52] Prasad, A. M., Iverson, L. R., & Liaw, A. (2006). Newer classification and regression tree techniques: bagging and random forests for ecological prediction. Ecosystems, 9(2), 181-199.
    連結:
  23. [53] Li, Y., Wang, S., & Ding, X. (2010, September). Person-independent head pose estimation based on random forest regression. In Image Processing (ICIP), 2010 17th IEEE International Conference on (pp. 1521-1524). IEEE.
    連結:
  24. [54] Alsheikh, M. A., Niyato, D., Lin, S., Tan, H. P., & Han, Z. (2016). Mobile Big Data Analytics Using Deep Learning and Apache Spark. arXiv preprint arXiv:1602.07031.
    連結:
  25. [56] Shi, W., Zhu, Y., Huang, T., Sheng, G., Lian, Y., Wang, G., & Chen, Y. (2016). An Integrated Data Preprocessing Framework Based on Apache Spark for Fault Diagnosis of Power Grid Equipment. Journal of Signal Processing Systems, 1-16.
    連結:
  26. [58] Ramirez-Gallego, S., Garcia, S., Mourino-Talin, H., & Martinez-Rego, D. (2015, August). Distributed Entropy Minimization Discretizer for Big Data Analysis under Apache Spark. In Trustcom/BigDataSE/ISPA, 2015 IEEE (Vol. 2, pp. 33-40). IEEE.
    連結:
  27. [60] Domoney, W. F., Ramli, N., Alarefi, S., & Walker, S. D. (2015, December). Smart city solutions to water management using self-powered, low-cost, water sensors and apache spark data aggregation. In 2015 3rd International Renewable and Sustainable Energy Conference (IRSEC) (pp. 1-4). IEEE.
    連結:
  28. [63] Vavilapalli, V. K., Murthy, A. C., Douglas, C., Agarwal, S., Konar, M., Evans, R., ... & Saha, B. (2013, October). Apache hadoop yarn: Yet another resource negotiator. In Proceedings of the 4th annual Symposium on Cloud Computing(p. 5). ACM.
    連結:
  29. [64] White, T. (2012). Hadoop: The definitive guide. " O'Reilly Media, Inc.".
    連結:
  30. [66] Shan, M. C., & Murphy, M. C. (1994). U.S. Patent No. 5,325,525. Washington, DC: U.S. Patent and Trademark Office.
    連結:
  31. [70] Kokilavani, T., & Amalarethinam, D. G. (2011). Load balanced min-min algorithm for static meta-task scheduling in grid computing. International Journal of Computer Applications, 20(2), 43-49.
    連結:
  32. [71] Lee, Y. H., Leu, S., & Chang, R. S. (2011). Improving job scheduling algorithms in a grid environment. Future generation computer systems, 27(8), 991-998.
    連結:
  33. [73] Yu Jun, Xiang Hai et al., Spark Core Technology and Advanced Applications, China Machine Press, pp. 238-253, 2015.
    連結:
  34. [74] Maniezzo, A. C. M. D. V. (1992). Distributed optimization by ant colonies. In Toward a practice of autonomous systems: proceedings of the First European Conference on Artificial Life (p. 134). Mit Press.
    連結:
  35. [78] Chaiken, R., Jenkins, B., Larson, P. Å., Ramsey, B., Shakib, D., Weaver, S., & Zhou, J. (2008). SCOPE: easy and efficient parallel processing of massive data sets. Proceedings of the VLDB Endowment, 1(2), 1265-1276.
    連結:
  36. [79] Lama, P., & Zhou, X. (2012, September). Aroma: Automated resource allocation and configuration of mapreduce environment in the cloud. In Proceedings of the 9th international conference on Autonomic computing (pp. 63-72). ACM.
    連結:
  37. [80] Gunn, S. R. (1998). Support vector machines for classification and regression. ISIS technical report, 14, 85-86.
    連結:
  38. [83] 徐佳俊, 刘功申, 苏波, & 孟魁. (2016). 基于 Spark 的异构叢集调度策略研究 Adaptive Scheduling Strategy for Heterogeneous Spark Cluster. Computer Science and Application, 6(11), 692.
    連結:
  39. [85] J. MacQueen, "Some Methods for Classification and Analysis of Multivariate Observations," in Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability, 1967, pp. 281-297.
    連結:
  40. [90] Massie, M. L., Chun, B. N., & Culler, D. E. (2004). The ganglia distributed monitoring system: design, implementation, and experience. Parallel Computing, 30(7), 817-840.
    連結:
  41. [92] Taieb, S. B., & Hyndman, R. J. (2014). A gradient boosting approach to the Kaggle load forecasting competition. International journal of forecasting, 30(2), 382-394.
    連結:
  42. [3] Buyya, R., Abramson, D., & Giddy, J. (2000, May). Nimrod/G: An architecture for a resource management and scheduling system in a global computational grid. In High Performance Computing in the Asia-Pacific Region, 2000. Proceedings. The Fourth International Conference/Exhibition on (Vol. 1, pp. 283-289). IEEE.
  43. [4] Qureshi, K. (2001). An empirical study of task scheduling strategies for image processing application on heterogeneous distributed computing system. Parallel and Distributed Computer Graphics, 103.
  44. [9] Apache Spark, https://zh.wikipedia.org/wiki/Apache_Spark, June 2017 (last accessed:2017/06/11)
  45. [10] Khare, R., Cutting, D., Sitaker, K., & Rifkin, A. (2004). Nutch: A flexible and scalable open-source web search engine. Oregon State University, 1, 32-32.
  46. [13] Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S., & Stoica, I. (2010). Spark: cluster computing with working sets. HotCloud, 10, 10-10.
  47. [14] Qiu, H., Gu, R., Yuan, C., & Huang, Y. (2014, May). YAFIM: A Parallel Frequent Itemset Mining Algorithm with Spark. In Parallel & Distributed Processing Symposium Workshops (IPDPSW), 2014 IEEE International (pp. 1664-1671). IEEE.
  48. [15] Gu, L., & Li, H. (2013, November). Memory or time: Performance evaluation for iterative operation on hadoop and spark. In High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing (HPCC_EUC), 2013 IEEE 10th International Conference on (pp. 721-727). IEEE.
  49. [16] Shvachko, K., Kuang, H., Radia, S., & Chansler, R. (2010, May). The hadoop distributed file system. In Mass storage systems and technologies (MSST), 2010 IEEE 26th symposium on (pp. 1-10). IEEE.
  50. [17] Vora, M. N. (2011, December). Hadoop-HBase for large-scale data. In Computer science and network technology (ICCSNT), 2011 international conference on (Vol. 1, pp. 601-605). IEEE.
  51. [18] Palankar, M. R., Iamnitchi, A., Ripeanu, M., & Garfinkel, S. (2008, June). Amazon S3 for science grids: a viable solution?. In Proceedings of the 2008 international workshop on Data-aware distributed computing (pp. 55-64). ACM.
  52. [21] Everitt, B. S., & Dunn, G. (1993). Principal components analysis. Applied Multivariate Data Analysis, Second Edition, 48-73.
  53. [22] Li, H., Wang, Y., Zhang, D., Zhang, M., & Chang, E. Y. (2008, October). Pfp: parallel fp-growth for query recommendation. In Proceedings of the 2008 ACM conference on Recommender systems (pp. 107-114). ACM.
  54. [23] Kutner, M. H., Nachtsheim, C., & Neter, J. (2004). Applied linear regression models. McGraw-Hill/Irwin.
  55. [24] Armbrust, M., Xin, R. S., Lian, C., Huai, Y., Liu, D., Bradley, J. K., ... & Zaharia, M. (2015, May). Spark sql: Relational data processing in spark. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data(pp. 1383-1394). ACM.
  56. [25] Zaharia, M., Das, T., Li, H., Shenker, S., & Stoica, I. (2012, June). Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters. In Presented as part of the.
  57. [26] Thusoo, A., Sarma, J. S., Jain, N., Shao, Z., Chakka, P., Anthony, S., ... & Murthy, R. (2009). Hive: a warehousing solution over a map-reduce framework. Proceedings of the VLDB Endowment, 2(2), 1626-1629.
  58. [27] Hamilton, G., Cattell, R., & Fisher, M. (1997). JDBC Database Access with Java(Vol. 7). Addison Wesley.
  59. [28] Thein, K. M. M. (2014). Apache kafka: Next generation distributed messaging system. International Journal of Scientific Engineering and Technology Research, 3(47), 9478-9483.
  60. [29] Hoffman, S. (2013). Apache Flume: Distributed Log Collection for Hadoop. Packt Publishing Ltd.
  61. [30] Sakaki, T., Okazaki, M., & Matsuo, Y. (2010, April). Earthquake shakes Twitter users: real-time event detection by social sensors. In Proceedings of the 19th international conference on World wide web (pp. 851-860). ACM.
  62. [32] Xin, R. S., Gonzalez, J. E., Franklin, M. J., & Stoica, I. (2013, June). Graphx: A resilient distributed graph system on spark. In First International Workshop on Graph Data Management Experiences and Systems (p. 2). ACM.
  63. [33] Malewicz, G., Austern, M. H., Bik, A. J., Dehnert, J. C., Horn, I., Leiser, N., & Czajkowski, G. (2010, June). Pregel: a system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data (pp. 135-146). ACM.
  64. [35] Meng, X., Bradley, J., Yavuz, B., Sparks, E., Venkataraman, S., Liu, D., ... & Xin, D. (2016). Mllib: Machine learning in apache spark. The Journal of Machine Learning Research, 17(1), 1235-1241.
  65. [37] Son, J., Jung, I., Park, K., & Han, B. (2015). Tracking-by-Segmentation With Online Gradient Boosting Decision Tree. In Proceedings of the IEEE International Conference on Computer Vision (pp. 3056-3064).
  66. [39] http://spark.apache.org/docs/latest/ml-guide.html#announcement-dataframe-based-api-is-primary-api
  67. [40] Salperwyck, C., Maby, S., Cubillé, J., & Lagacherie, M. (2015, September). CourboSpark: Decision Tree for Time-series on Spark. In AALTD@ PKDD/ECML.
  68. [41] Kaveh, M. (2015). ETL and Analysis of IoT data using OpenTSDB, Kafka, and Spark (Master's thesis, University of Stavanger, Norway).
  69. [42] Arora, S. (2017). Analyzing mobile phone usage using clustering in Spark MLLib and Pig. International Journal, 8(1).
  70. [47] Rajaratnam, B., Roberts, S., Sparks, D., & Dalal, O. (2016). Lasso regression: estimation and shrinkage via the limit of Gibbs sampling. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(1), 153-174.
  71. [51] Son, J., Jung, I., Park, K., & Han, B. (2015). Tracking-by-Segmentation With Online Gradient Boosting Decision Tree. In Proceedings of the IEEE International Conference on Computer Vision (pp. 3056-3064).
  72. [55] Maarala, A. I., Rautiainen, M., Salmi, M., Pirttikangas, S., & Riekki, J. (2015, October). Low latency analytics for streaming traffic data with Apache Spark. In Big Data (Big Data), 2015 IEEE International Conference on (pp. 2855-2858). IEEE.
  73. [57] Awan, A. J., Brorsson, M., Vlassov, V., & Ayguade, E. (2016). Architectural Impact on Performance of In-memory Data Analytics: Apache Spark Case Study. arXiv preprint arXiv:1604.08484.
  74. [59] Kulkarni, S. (2015). A Recommendation Engine Using Apache Spark.
  75. [61] Hindman, B., Konwinski, A., Zaharia, M., Ghodsi, A., Joseph, A. D., Katz, R. H., ... & Stoica, I. (2011, March). Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center. In NSDI (Vol. 11, No. 2011, pp. 22-22).
  76. [62] Ghodsi, A., Zaharia, M., Hindman, B., Konwinski, A., Shenker, S., & Stoica, I. (2011, March). Dominant Resource Fairness: Fair Allocation of Multiple Resource Types. In NSDI (Vol. 11, No. 2011, pp. 24-24).
  77. [65] Wang, S. C., Yan, K. Q., Liao, W. P., & Wang, S. S. (2010, July). Towards a load balancing in a three-level cloud computing network. In Computer Science and Information Technology (ICCSIT), 2010 3rd IEEE International Conference on(Vol. 1, pp. 108-113). IEEE.
  78. [67] He, X., Sun, X., & Von Laszewski, G. (2003). QoS guided min-min heuristic for grid task scheduling. Journal of Computer Science and Technology, 18(4), 442-451.
  79. [68] He, X., Sun, X., & Von Laszewski, G. (2003). QoS guided min-min heuristic for grid task scheduling. Journal of Computer Science and Technology, 18(4), 442-451.
  80. [69] Bhoi, U., & Ramanuj, P. N. (2013). Enhanced max-min task scheduling algorithm in cloud computing. International Journal of Application or Innovation in Engineering and Management (IJAIEM), 2(4), 259-264.
  81. [72] Parsa, S., & Entezari-Maleki, R. (2009). RASA: A new task scheduling algorithm in grid environment. World Applied sciences journal, 7(Special issue of Computer & IT), 152-160.
  82. [75] Manjaly, J. S., & Chooralil, V. S. (2013, August). Tasktracker aware scheduling for hadoop mapreduce. In Advances in Computing and Communications (ICACC), 2013 Third International Conference on (pp. 278-281). IEEE.
  83. [76] Hadoop's Capacity Scheduler http://hadoop.apache.org/core/docs/current/ capacity-scheduler.html.
  84. [77] Ferguson, A. D., Bodik, P., Kandula, S., Boutin, E., & Fonseca, R. (2012, April). Jockey: guaranteed job latency in data parallel clusters. In Proceedings of the 7th ACM european conference on Computer Systems (pp. 99-112). ACM.
  85. [81] Assunção, M. D., Calheiros, R. N., Bianchi, S., Netto, M. A., & Buyya, R. (2015). Big Data computing and clouds: Trends and future directions. Journal of Parallel and Distributed Computing, 79, 3-15.
  86. [82] Zhang, Q., Zhani, M. F., Zhang, S., Zhu, Q., Boutaba, R., & Hellerstein, J. L. (2012, September). Dynamic energy-aware capacity provisioning for cloud computing environments. In Proceedings of the 9th international conference on Autonomic computing (pp. 145-154). ACM.
  87. [84] Kopytov, A. (2004). SysBench: a system performance benchmark. URL: http://sysbench. sourceforge. net.
  88. [86] Leonard Kaufman and J. Peter Rousseeuw, "Clustering by means of Medoids," in Statistical Data Analysis Based on the L_1–Norm and Related Methods, 1987, pp. 405–416.
  89. [87] Leonard Kaufman and Peter J. Rousseuw, Finding Groups in Data: An Introduction to Cluster Analysis.: Wiley, 1990.
  90. [88] Wen-Hao Cheng, "The development of an intelligent, cloud-basedremote monitoring management system," 國立中山大學電機工程學系研究所, 碩士論文 2012.
  91. [89] Sachdeva, H. K., & Jain, A. (2017). The Survey of Enhanced Min-Max Approach For Resource Aware Job Scheduling In Cloud Computing. International Journal, 8(4).
  92. [91] Poess, M., Nambiar, R. O., & Walrath, D. (2007, September). Why you should run TPC-DS: a workload analysis. In Proceedings of the 33rd international conference on Very large data bases (pp. 1138-1149). VLDB Endowment.
  93. [93] Puurula, A., Read, J., & Bifet, A. (2014). Kaggle LSHTC4 winning solution. arXiv preprint arXiv:1405.0546.
  94. [94] Bholowalia, P., & Kumar, A. (2014). EBK-means: A clustering technique based on elbow method and k-means in WSN. International Journal of Computer Applications, 105(9).