Title

基於高頻項目集結合近似樣式匹配之文件分群

Parallel Title

Document Clustering Based on Frequent Itemset Integrated with Approximate Pattern Matching

DOI

10.6382/JIM.200901.0019

Authors

楊燕珠(Yen-Ju Yang);陳志豐(Chih-Feng Chen)

Keywords

Frequent Itemset ; Pattern Matching ; Feature Extraction ; Document Clustering

Journal Title

資訊管理學報 (Journal of Information Management)

Volume and Issue / Publication Date

Vol. 16, Issue S (2009/01/01)

Pages

165 - 184

Language

Traditional Chinese

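Chinese Abstract (English translation)

With the spread of the Internet, more and more users search the web for relevant material to read. This study aims to cluster large document collections by topic so that users can quickly see which topics a collection covers and promptly select documents on the topics they need. The study combines the frequent itemsets of association rules with approximate pattern matching to mine "approximate frequent patterns" as document features, and it incorporates the distance (similarity) of approximate matching into the measurement of feature weights. In addition, the study proposes a "two-phase density- and similarity-based clustering algorithm"; this method does not require the number of clusters to be set in advance and is suited to clustering large numbers of documents. Experimental results show that the number of "approximate frequent pattern" features is 1.42 times that of flexible word pairs and 0.84 times that of single terms. Clustering with these features yields higher average recall, precision, and accuracy than clustering with flexible word pairs, adjacent word pairs, or single terms, which demonstrates that "approximate frequent patterns" do extract more meaningful and discriminative features. Combined with the proposed clustering algorithm, they speed up clustering, make it easy to determine an appropriate number of clusters, and improve the quality and accuracy of document clustering.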

English Abstract

With the spread of the Internet, more and more users find reading material by searching online. This research aims to group a large number of texts by thematic document clustering so that users can quickly see which topics the texts cover and pick the topics of interest to read. To extract more meaningful features, we propose an approach that integrates frequent itemsets with approximate pattern matching to mine "Approximate Frequent Patterns". The distance (similarity) of approximate matching is used in measuring feature weights, which differs from the traditional support count (frequency) of itemsets. In addition, we present the "Two-Phase Density and Similarity-Based Clustering Algorithm". This method does not require the number of clusters to be set in advance, which makes it suitable for thematic document clustering. The experimental results show that the number of "Approximate Frequent Patterns" is 1.42 times that of flexible word pairs and 0.84 times that of single terms. With these features, the average recall, precision, and accuracy of the clustering results are all higher than with flexible word pairs, bigrams, or single words. This shows that "Approximate Frequent Patterns" can extract more meaningful and discriminative features. Moreover, the presented clustering algorithm speeds up clustering, makes it easy to decide an appropriate number of clusters, and improves the quality and accuracy of document clustering.
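
This record gives no algorithmic detail, so the following is only an illustrative sketch of the general idea in the abstract: weighting a co-occurrence pattern by how closely its terms match rather than by a raw support count. The function name `approximate_pattern_weights`, the `1/d` similarity, and the `max_gap` and `min_support` parameters are all assumptions made for illustration and are not taken from the paper.

```python
from collections import defaultdict


def approximate_pattern_weights(docs, max_gap=3, min_support=2.0):
    """Weight ordered term pairs by how closely their terms co-occur.

    Hypothetical scoring (not the paper's formula): a pair whose terms are
    d positions apart contributes 1/d, so adjacent terms count 1.0 and
    looser (more approximate) matches count less.
    """
    weights = defaultdict(float)
    for tokens in docs:
        best = {}  # highest similarity seen for each pair in this document
        for i, first in enumerate(tokens):
            # Look ahead at most max_gap positions for the second term.
            for j in range(i + 1, min(i + 1 + max_gap, len(tokens))):
                pair = (first, tokens[j])
                similarity = 1.0 / (j - i)
                if similarity > best.get(pair, 0.0):
                    best[pair] = similarity
        # Each document adds its best match per pair to the pair's weight.
        for pair, similarity in best.items():
            weights[pair] += similarity
    # Keep only pairs whose similarity-weighted support reaches the threshold.
    return {p: w for p, w in weights.items() if w >= min_support}


docs = [
    ["document", "clustering", "method"],
    ["document", "topic", "clustering"],
    ["clustering", "document"],
]
print(approximate_pattern_weights(docs, max_gap=3, min_support=1.5))
```

In this toy input, the pair ("document", "clustering") is adjacent in the first document and one term apart in the second, so under the assumed scoring it accumulates a weight of 1.0 + 0.5, while an exact-adjacency count would see it only once.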

Subject Classification

Basic and Applied Sciences > Information Science
Social Sciences > Management