Title

以Apache Spark框架設計OpenStack雲端系統日誌平行化分析叢集系統之研究

Parallel title

Design of A Parallel Log Analysis System in OpenStack Cloud System with Apache Spark Framework

Author

范植翔

Keywords

ELK ; Spark ; OpenStack ; 雲端運算 (Cloud computing) ; 日誌分析 (Log analysis) ; 巨量資料 (Big data)

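Journal title

Master's thesis, Department of Computer Science and Information Engineering, Taichung University of Science and Technology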

Volume and issue / Publication date

2016

Degree type

Master's

Advisors

陳弘明;陳世穎

Content language

Traditional Chinese

Chinese abstract

隨著現今雲端技術的蓬勃發展以及物聯網的時代來臨,相關的雲端軟硬體設備不斷升級,生活中的雲端相關應用也漸漸普及,因此雲端系統與服務中如何提供高可靠性的雲端環境將顯得額外的重要,而對於IT人員來說在維運雲端系統平台上也面臨了極大的挑戰,有鑑於此,藉由雲端系統平台下的日誌資料來進行動態的收集與合併來監控雲端系統平台的維運狀況是有其必要性。本論文提出一基於開放原始碼OpenStack作業系統上的集中式日誌管理與分析系統,針對OpenStack系統上分散式的日誌資料進行動態的資料收集、儲存與視覺化統計分析,並搭配了開放式原始碼Apache Spark分散式運算框架進行日誌資料探勘分析,提供高效能資料分析的解決方案。本研究進一步針對Spark分散式運算框架進行探討與評估,包括了Spark串流分析與批次分析分別運行在Mesos模式下的粗細粒度排程和Yarn模式下的粗粒度排程之效能差異,以及基於SparkMlib實作Streaming k-means演算法與迴歸演算法預測模型分析,並且透過不同演算法參數設定、不同叢集節點數量與不同記憶體大小等可能影響模型的效能與模型精準度之相關參數進行實驗,藉此評估出最佳的參數設定與平行化方式。

English abstract

With the rapid development of cloud technology and the arrival of the Internet of Things, cloud-related software and hardware are continuously being upgraded, and cloud applications are becoming widespread in daily life. Providing a highly reliable cloud environment for cloud systems and services is therefore especially important, while IT staff face great challenges in operating and maintaining cloud platforms. In view of this, it is necessary to dynamically collect and merge the log data of a cloud platform in order to monitor its operational status. This thesis proposes a centralized log management and analysis system built on the open-source OpenStack operating system. The system dynamically collects, stores, and visually analyzes the distributed log data of an OpenStack deployment, and uses the open-source Apache Spark distributed computing framework to mine the log data, providing a high-performance data analysis solution. The study further examines and evaluates the Spark framework itself, including the performance differences of Spark streaming analysis and batch analysis under coarse-grained and fine-grained scheduling in Mesos mode and under coarse-grained scheduling in YARN mode. It also implements a Streaming k-means algorithm and regression prediction models based on Spark MLlib, and runs experiments over parameters that may affect model performance and accuracy, such as algorithm parameter settings, the number of cluster nodes, and memory size, in order to identify the best parameter settings and parallelization approach.
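As an illustration of the Streaming k-means idea mentioned in the abstract, the following is a minimal pure-Python sketch of a decayed, weighted-average center update applied to one mini-batch. It is not the thesis's implementation and does not use the Spark API; the function names, the `decay` parameter, and the choice to carry the decayed weight forward are assumptions made for this example.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(centers: List[List[float]], p: Point) -> int:
    """Index of the center closest to point p."""
    return min(range(len(centers)), key=lambda i: squared_distance(centers[i], p))


def update_centers(
    centers: List[List[float]],
    weights: List[float],
    batch: List[Point],
    decay: float = 1.0,
) -> Tuple[List[List[float]], List[float]]:
    """Apply one mini-batch update (hypothetical helper, not a Spark call).

    Each center moves to the weighted average of its old position
    (weight = old weight * decay) and the points assigned to it in this batch.
    decay=1.0 keeps all history; decay=0.0 uses only the newest batch.
    """
    k, dim = len(centers), len(centers[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k

    # Assign every point in the batch to its nearest current center.
    for p in batch:
        i = nearest(centers, p)
        counts[i] += 1
        for d in range(dim):
            sums[i][d] += p[d]

    new_centers: List[List[float]] = []
    new_weights: List[float] = []
    for i in range(k):
        old_w = weights[i] * decay
        total = old_w + counts[i]
        if total == 0:
            # No history and no new points: leave the center where it is.
            new_centers.append(list(centers[i]))
            new_weights.append(0.0)
            continue
        new_centers.append(
            [(centers[i][d] * old_w + sums[i][d]) / total for d in range(dim)]
        )
        new_weights.append(total)
    return new_centers, new_weights
```

Calling `update_centers` once per arriving batch and feeding the returned centers and weights into the next call mimics how a streaming model adapts over time. A smaller `decay` makes the centers follow recent log data more closely.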

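Subject classification: Basic and Applied Sciences > Information Science
College of Information and Distribution Science > Department of Computer Science and Information Engineering, Master's Program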