题名

高效率探勘關聯規則之演算法-EFI

并列篇名

An Efficient Algorithm for Mining Association Rules-EFI

DOI

10.6382/JIM.200704.0139

作者

黃仁鵬(Jen-Peng Huang);藍國誠(Guo-Cheng Lan)

关键词

資料探勘 ; 關聯法則 ; 二階段過濾 ; data mining ; association rules ; two phase filtrations

期刊名称

資訊管理學報

卷期/出版年月

14卷2期(2007 / 04 / 01)

页次

139 - 167

内容语文

繁體中文

中文摘要

資料探勘的技術變得日益重要,也廣泛的應用在商業上的預測以及決策的支援。關聯法則在資料探勘的領域中也扮演相當重要的地位,許多關聯法則演算法不斷被提出、改進,以增進效能或節省記憶體空間;本研究也朝著這個目標,試著改進關聯規則演算法為主要方向。 本研究主要是針對探勘關聯規則QDI演算法的特性及缺點來加以改進,雖然QDI演算法已是很有效率的演算法之一,不過,它還是有兩個最主要的問題。第一,QDI演算法無法探勘交易長度太長的交易資料庫。第二,QDI演算法對記憶體使用率不佳;因此,QDI演算法的實用性大打折扣。 基本於上述理由,本研究提出一個改良QDI演算法產生項目集的核心概念新演算法EFI (An Efficient Approach for Filtering Infrequent Itemsets)。EFI演算法的特色就是二階段過濾的方式,因為該過濾方式可大量減少非高頻項目集的數量,將更能適用於探勘交易長度較長的資料庫,僅需掃描資料庫四次且不需要產生任何候選項目集,即可快速找出關聯規則。另外,EFI演算法也改進ICI-like演算法因儲存大量項目集須耗用龐大記憶體空間的缺點,每筆交易經過二階段過濾機制的處理後,僅會產生最有可能成為高頻的項目集,因此,EFI能大量降低項目表須耗用的記憶體空間,以提升記憶體的使用率。在現實生活中的資料庫容量通常都是大於記憶體容量,為了解決此問題,EFI演算法將選擇採用資料庫分割方式繼續執行探勘任務,每個子資料庫僅需四次I/O動作,不隨著高頻項目集的長度增長而增加I/O次數,以避免耗費過多的I/O時間,也可有效提高執行效率與實用性。

英文摘要

The technology of data mining is more important in recent years, and it is generally applied to commercial forecast and decision supports. Association rules mining algorithms in the field of data mining play the important role. Many of association rules mining algorithms were proposed to improve the efficiency of data mining or save the utility rate of memory. So, our major study tries to improve the efficiency of association rules mining algorithms. In this paper, our major study is to improve the defects of the QDI algorithm. Although QDI algorithm was one of the most efficient algorithms, but it still has two serious problems; in the first place, QDI algorithm can't mine the transactions of databases whose record length is very long; in the second place, QDI algorithm isn't very efficient at utility rate of the memory. Therefore, the QDI algorithm is not very practical. Based on above reasons we propose a new algorithm-EFI (An Efficient Approach for Filtering Infrequent Itemsets) that is improved from QDI algorithm. The one of the characters of EFI algorithm is the two phrase filtrations which can reduce lots of non-frequent itemsets and is very suitable to mine the transactions of databases whose record length is very long. To find association rules quickly the EFI algorithm only scans database four times and doesn't generate any candidate itemset in mining process. Besides, the EFI algorithm also improves the defect of the ICI-like algorithms that need lots of memory spaces to store lots of sub-itemsets which are decomposed from the transaction records of database. However, the EFI algorithm uses the two phrase filtrations to filter out lots of non-frequent itemsets; it only generates the itemsets which are the most possible to be the frequent itemsets. So, the EFI algorithm can decrease a large number of non-frequent itemsets and increase the utility rate of memory. The size of the databases in the real world is always greater than the size of the memory. In order to solve this problem, the EFI algorithm divides a large database into many sub-databases and mines association rules from those sub-databases. The EFI algorithm only scans database four times and will not be affect by the length of frequent itemsets. The EFI algorithm avoids wasting a lot of I/O time and increases the efficiency and the practicability in application.

主题分类 基礎與應用科學 > 資訊科學
社會科學 > 管理學
参考文献
  1. F. C. Tseng,C. C. Hsu(2001).Creating frequent patterns with the frequent pattern list.376-386.
  2. J. Han,J. Pei,Y. Yin(2000).Mining Frequent Patterns without Candidate Generation.1-12.
  3. J. S. Park,M. S. Chen,P. S. Yu(1995).An Effective Hash-based Algorithm for Mining Association Rules.175-186.
  4. R. Agrawal,R. Srikant(1994).Fast Algorithm for Mining Association Rules in Large Databases.487-499.
  5. S Brin,R. Motwani,J. D. Ullman,S. Tsur(1997).Dynamic Itemset Counting and Implication Rules for Market Basket Data.ACM SIGMOD Conference on Management of Data,255-264.
  6. 黃仁鵬、陳秀如(2004)。碩士論文(碩士論文)。南台科技大學資訊管理研究所碩士論文。
  7. 黃仁鵬、黃南傑(2004)。碩士論文(碩士論文)。南台科技大學資訊管理研究所碩士論文。
  8. 黃仁鵬、熊浩志(2005)。碩士論文(碩士論文)。南台科技大學資訊管理研究所碩士論文。
  9. 黃仁鵬、熊浩志、郭煌政(2004)。直覺拆解之關聯法則演算法-IDA。第十屆責訊管理暨實務研討會,1857-1866。
  10. 黃仁鵬、錢依佩、吳聲弘(2003)。高效率之關聯規則探勘演算法-ICI。第十四屆國際資訊管理學術研討會
  11. 微軟公司(2000)。Microsoft Analysis services的範本倉儲資料庫FoodMart。
被引用次数
  1. 胡宜中、林震岩、林雅惠(2011)。運用關聯規則和序列型樣探討投資地區之關聯性與遷移─以印刷電路板產業為例。明新學報,37(1),217-230。
  2. 楊亞澄、翁政雄、胡雅涵(2016)。運用關聯規則及改變探勘技術於防火牆政策規則優化。資訊管理學報,23(3),277-304。