Title

以容器化技術提升網頁擷取效能

Parallel Title

Improving Webpage Capture Performance with Containerization Technology

Authors

陳志達(Andy Chen);郭柏均(Po-Chuing Kuo)

Keywords

容器化技術 ; 網路爬蟲 ; 分散式爬蟲 ; 負載平衡 ; 任務排程 ; Container ; Containerization ; Web Crawler ; Distributed Crawler ; Load Balance ; Task Scheduler

Journal

資訊與管理科學

Volume and Issue / Publication Date

Vol. 13, No. 1 (2020/07/30)

Pages

37 - 55

Language

Traditional Chinese

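Chinese Abstract (English translation)

Data is an important asset in the modern era, and one way to obtain large amounts of it is through a fast, efficient distributed crawler architecture. In recent years containerization technology has become a popular topic. It is lightweight, conserves system resources, allows execution environments to be set up quickly, and reduces maintenance costs, and many large international companies are using it or moving toward it. This study therefore builds a distributed crawler system based on containerization technology. The target is book e-commerce platforms: the crawlers run in containers and extract book data from web pages. A distributed crawler differs from the traditional architecture in that it can manage multiple crawling tasks at the same time, so a distributed architecture is faster and more efficient than a traditional single-threaded or multi-threaded crawler system. In addition, this study applies the book data collected by the distributed crawler system to a book price comparison platform, so that users can find the site offering the largest discount on a given book.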

English Abstract

Obtaining useful information is key to the success of an information system, and web crawlers are one way to collect it in large quantities. A fast and efficient distributed crawler architecture is a common solution. In recent years, containerization technology has become a hot topic. Containers are lightweight, conserve system resources, allow execution environments to be set up quickly, and reduce maintenance costs; as a result, many large international companies are using or moving toward containerization technology. This study therefore builds a distributed crawler system based on containerization technology. The system targets book e-commerce platforms: the crawlers run in containers and retrieve book information from web pages. A distributed crawler architecture differs from a traditional one in that it must manage multiple crawling tasks that execute at the same time. The system decides which node executes each task through container scheduling and load balancing. As a result, a distributed crawler system is faster and more efficient than a traditional single-threaded or multi-threaded crawler system. In addition, this study applies the book data collected by the distributed crawler system to a book price comparison platform, so that users can find the site offering the largest discount on a given book.

Subject Classification

Basic and Applied Sciences > Information Science
Social Sciences > Management

References

  1. Selenium WebDriver, Retrieved from Selenium WebDriver: https://www.seleniumhq.org/projects/webdriver/.
  2. Nightwatch, Retrieved from Nightwatch: http://nightwatchjs.org/.
  3. curl, "curl.1 the man page", Retrieved from curl Documentation: https://curl.haxx.se/docs/manpage.html.
  4. Achsan, H., Wibowo, W. C. (2014). A Fast Distributed Focused-web Crawling. Procedia Engineering, 69, 492-499.
  5. Cardellini, V., Colajanni, M., Yu, P. S. (1999). Dynamic Load Balancing on Web-server Systems. IEEE Internet Computing, 3(3), 28-39.
  6. Deepika, Dixit, A. URL ordering policies for distributed crawlers: a review. International Conference on Recent Trends in Computer and Information Technology.
  7. Jain, A., Singh, A., Liu, L. (2000). Unpublished.
  8. Najork, M., Heydon, A. (2001). High-Performance Web Crawling. Handbook of Massive Data Sets.
  9. Puppeteer, Retrieved from Puppeteer: https://pptr.dev/.
  10. Shkapenyuk, V., Suel, T. (2002). Design and Implementation of a High-Performance Distributed Web Crawler. Proceedings 18th International Conference on Data Engineering.
  11. 李昆達 (2009). Department (Graduate Institute) of Computer Science and Engineering, Tatung University.