Title

Apache Spark 運用於虛擬化技術之效益研究

Parallel Title

The benefit research of virtualization for Apache Spark.

DOI

10.6840/cycu201700737

Author

徐思縯

Keywords

Big Data ; Virtualization ; VMware ; Hadoop ; HDFS ; MapReduce ; Spark ; RDD

Publication

Chung Yuan Christian University, Department of Information Management, Degree Theses

Volume/Issue, Publication Date

2017

Degree

Master's

Advisor

洪智力

Language

Traditional Chinese

Chinese Abstract

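Building a Spark distributed architecture on virtualized infrastructure not only allows a Spark distributed environment to be set up quickly, but also makes effective use of hardware performance, allows hardware resources to be allocated flexibly, and reduces hardware budget costs. This study uses VMware virtualization technology to build a Spark distributed system, uses the Hadoop HDFS distributed file system to store and access data, and uses the Spark RDD in-memory computing framework for the data analysis whose performance is evaluated. The experiments apply two methods, secondary sort and WordCount combined with Top-K, to 300 GB of data for performance testing and mutual cross-validation. CPU, memory size, and the number of compute nodes are adjusted at each experimental stage to find the best hardware configuration. The experimental results confirm that the more nodes a Spark distributed system has, the faster data analysis runs. However, when a small data volume of 30 GB is processed and each node has sufficient hardware resources, analysis performance reaches a bottleneck beyond which it cannot be improved further.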
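The abstract names WordCount combined with Top-K as one of the two benchmark workloads but does not show its code. The following is a minimal plain-Python sketch of that logic, not the Spark job used in the thesis. The function names `word_count` and `top_k`, the whitespace tokenization, the alphabetical tie-breaking, and the sample lines are all assumptions made for illustration.

```python
import heapq
from collections import Counter
from typing import Iterable, List, Tuple


def word_count(lines: Iterable[str]) -> Counter:
    """Count whitespace-separated words across all lines."""
    counts: Counter = Counter()
    for line in lines:
        # Roughly the role of a flatMap over words followed by a per-key sum.
        counts.update(line.split())
    return counts


def top_k(counts: Counter, k: int) -> List[Tuple[str, int]]:
    """Return the k most frequent words; ties are broken alphabetically."""
    # Taking the k smallest (-count, word) pairs yields the highest counts first.
    return heapq.nsmallest(k, counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical sample input, standing in for lines read from HDFS.
sample = ["spark hadoop spark", "hdfs spark hadoop", "rdd"]
result = top_k(word_count(sample), 2)
print(result)
```

In a distributed setting the counting step is what gets spread across nodes, while the final Top-K selection only needs the per-word totals, which is one reason the two steps are kept separate here.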

English Abstract

In this study, a Spark distributed system architecture is deployed through virtualisation. In addition to being quick to deploy, this architecture makes effective use of hardware capacity and allows hardware resources to be allocated flexibly, which reduces hardware costs. The study used VMware virtualisation technology to deploy a Spark distributed system and the Hadoop Distributed File System (HDFS) to store and access data. Data analysis was carried out with the in-memory computing framework of Spark resilient distributed datasets (RDDs), and its performance was evaluated. Two methods, secondary sorting and WordCount combined with Top-K, were used to test performance on a data volume of 300 GB. The results of the two methods were cross-validated against each other, and the CPUs, memory, and number of computing nodes were adjusted at each experimental phase to determine the optimal hardware configuration. The experimental results verified that using more nodes resulted in faster data analysis in a Spark distributed system. However, when a small data volume such as 30 GB was processed and each node had sufficient hardware resources, data analysis performance could not be improved further once it had reached a bottleneck.
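The abstract does not describe the record layout used in the secondary-sorting benchmark. As a rough illustration of the general technique only, the plain-Python sketch below orders records by a composite (key, value) pair so that the values for each key come out already sorted. The function name `secondary_sort`, the `(str, int)` record shape, and the sample records are assumptions, and this is not the Spark implementation used in the thesis.

```python
from itertools import groupby
from operator import itemgetter
from typing import Dict, Iterable, List, Tuple


def secondary_sort(records: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group values by key, with keys and each key's values in ascending order."""
    # Sort on the composite (key, value) so no per-group sort is needed afterwards.
    ordered = sorted(records, key=itemgetter(0, 1))
    return {
        key: [value for _, value in group]
        for key, group in groupby(ordered, key=itemgetter(0))
    }


# Hypothetical sample records.
records = [("b", 3), ("a", 2), ("b", 1), ("a", 5), ("a", 1)]
grouped = secondary_sort(records)
print(grouped)
```

Sorting once on the composite key, rather than grouping first and sorting each group in memory, is the design choice that lets this pattern scale to groups too large for one machine's memory.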

Subject Classification: College of Business > Department of Information Management
Social Sciences > Management