Title |
A Mechanism for Improving Load Balancing in Hadoop MapReduce (改善Hadoop MapReduce負載平衡之機制) |
DOI |
10.29428/9789860544169.201801.0152 |
Authors |
薛富泓;羅壽之 |
Keywords |
Hadoop ; MapReduce ; Load Balancing ; Data Skew |
Journal Title |
NCS 2017 National Computer Symposium (全國計算機會議) |
Volume and Issue / Publication Date |
2017 (2018/01/01) |
Pages |
808 - 813 |
Content Language |
Traditional Chinese |
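Chinese Abstract (English translation) |
MapReduce is a parallel processing framework that is simple to use, highly fault tolerant, and highly scalable, and in recent years it has been widely applied to big data processing. However, MapReduce often encounters data skew when processing data-intensive applications, and on such datasets Hadoop's default hash partitioning function usually cannot distribute the workload evenly among the reducers. To reduce the negative impact of data skew on MapReduce performance, this paper proposes a priority-based load balancing mechanism. The mechanism combines reservoir sampling, a two-phase greedy algorithm, and an algorithm for splitting reduce keys, and it builds the concepts of data locality and priority into the method. Experimental results show that the proposed mechanism not only effectively reduces the amount of input data that each reducer must transfer over the network but, more importantly, achieves load balance among the reducers. |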
English Abstract |
MapReduce is a parallel processing framework valued for its ease of use, high fault tolerance, and high scalability, and it has been widely used for big data processing in recent years. However, MapReduce frequently runs into data skew when handling data-intensive applications. On such datasets, Hadoop's default hash partitioning function in most cases fails to distribute the workload evenly across reducers. To mitigate the negative effects of data skew on MapReduce performance, this paper proposes a priority-based load balancing mechanism that combines reservoir sampling, a two-phase greedy algorithm, and an algorithm for splitting reduce keys, and that incorporates the concepts of data locality and priority. Experimental results show that the proposed mechanism effectively reduces the amount of input data each reducer must fetch over the network and, more importantly, balances the workload across reducers. |
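The abstract names its building blocks without detailing them, so the following is only a minimal sketch of two of the underlying ideas, not the authors' mechanism. It assumes non-negative integer keys (so that `key % num_reducers` stands in for hashing a key's hash code modulo the reducer count), uses the textbook reservoir sampling algorithm (Algorithm R), and replaces the paper's two-phase greedy algorithm with a plain single-pass greedy assignment. The function names `reservoir_sample`, `greedy_assign`, and `reducer_loads` are hypothetical; data locality, priority, and key splitting are not modeled.

```python
import random
from collections import Counter


def reservoir_sample(stream, k, rng):
    """Algorithm R: uniform random sample of up to k items from a stream."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)          # fill the reservoir first
        else:
            j = rng.randint(0, i)        # inclusive on both ends
            if j < k:
                sample[j] = item         # replace with probability k / (i + 1)
    return sample


def greedy_assign(key_weights, num_reducers):
    """Assign keys, heaviest first, to the currently least-loaded reducer.

    Illustrative single-pass greedy only; not the paper's two-phase algorithm.
    """
    loads = [0] * num_reducers
    assignment = {}
    for key, weight in sorted(key_weights.items(), key=lambda kv: -kv[1]):
        target = loads.index(min(loads))
        assignment[key] = target
        loads[target] += weight
    return assignment


def reducer_loads(key_counts, num_reducers, assignment=None):
    """Records per reducer; keys without an assignment fall back to hashing."""
    assignment = assignment or {}
    loads = [0] * num_reducers
    for key, count in key_counts.items():
        reducer = assignment.get(key, key % num_reducers)
        loads[reducer] += count
    return loads


if __name__ == "__main__":
    # Hypothetical skewed workload: keys 0 and 4 dominate and collide mod 4.
    true_counts = {0: 400, 1: 50, 2: 50, 3: 50, 4: 300, 5: 50, 6: 50, 7: 50}
    stream = [k for k, c in true_counts.items() for _ in range(c)]
    rng = random.Random(0)
    rng.shuffle(stream)

    # Estimate key frequencies from a sample instead of scanning everything.
    estimated = Counter(reservoir_sample(stream, 200, rng))
    plan = greedy_assign(estimated, 4)

    print("hash partitioning:", reducer_loads(true_counts, 4))
    print("sampled + greedy :", reducer_loads(true_counts, 4, plan))
```

In this toy workload a single key still accounts for 400 of the 1000 records, so no assignment of whole keys can bring the maximum load below 400. That lower bound is the kind of situation where splitting a reduce key across reducers becomes relevant.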
Subject Classification |
Basic and Applied Sciences > Information Science |