Title

探勘中文新聞文件

Parallel title

Data Mining in Chinese News Articles

DOI

10.6382/JIM.200101.0103

Authors

許中川(Chung-Chian Hsu);陳景揆(Jing-Kuei Chen)

Keywords

文件資料探勘 ; 知識發覺 ; 關鍵詞彙擷取 ; 關聯法則 ; 趨勢分析 ; text data mining ; knowledge discovery ; keyword extraction ; association rules ; trend analysis

Journal

資訊管理學報

Volume/issue and publication date

Vol. 7, No. 2 (2001/01/01)

Pages

103 - 122

Language

Traditional Chinese

Chinese abstract

新聞報導每天發生的重要事件,大量的新聞文件中,往往蘊含重要的資訊。文件資料探勘技術用來發覺隱藏在大量文件中的特徵。然而,目前的文件探勘研究集中在歐美語系文件,且代表文件的關鍵詞彙的擷取,都是人工處理。本研究以中文新聞文件為探勘對象,試圖發覺其中隱含的知識。針對新聞文件的特殊結構,在收集關鍵詞彙方面,以混合式斷詞法進行中文斷詞,經過關鍵既有詞彙擷取與關鍵新生詞彙擷取步驟,獲得每篇新聞文件的關鍵詞彙,代表該文件重要概念,供後續探勘之用。在資料探勘方面,首先為切合新聞文件知識開採需求,使用概念階層樹建構背景知識與關鍵詞彙。然後以關聯法則為基礎,我們提出三個改良式關聯模式:第一個是新生詞彙關聯法則,第二個是結構化資料與高頻詞彙關聯,第三個是結構化資料與某同類詞彙關聯;另外,以線性迴歸及卡方分配技術,分別探勘關鍵詞彙的報導趨勢與分佈情況。最後並以實驗驗證此探勘架構的可行性。

English abstract

News articles report important daily events, and large collections of news articles often contain important implicit information. Text data mining aims to discover knowledge hidden in large collections of texts. However, previously reported research focuses on English texts, and keywords are assigned manually. This paper studies text data mining in Chinese news articles. Exploiting the special structure of news articles, existing keywords and new keywords that represent the content of each article are extracted automatically using a hybrid segmentation technique. The mining process then proceeds, guided by domain knowledge. We propose three types of extended association rules: association rules for new keywords, association rules between structured data and high-frequency keywords, and association rules between structured data and homogeneous keywords. In addition, linear regression and the chi-square test are used to analyze the reporting trend of keywords and the distribution of important concepts. Experiments are conducted to verify the feasibility of the proposed architecture.
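As a rough illustration of two ideas named in the abstract, the sketch below computes support and confidence for a keyword association rule and fits a least-squares slope to a keyword's per-period report counts. It is a minimal sketch under stated assumptions, not the authors' method: the function names, the toy keyword sets, and the toy counts are all hypothetical, and the paper's extended rule types and concept hierarchy are not modeled.

```python
from typing import Sequence, Set, Tuple


def rule_support_confidence(
    docs: Sequence[Set[str]], lhs: Set[str], rhs: Set[str]
) -> Tuple[float, float]:
    """Return (support, confidence) of the rule lhs -> rhs.

    Each element of docs is the keyword set of one article (hypothetical data).
    """
    n_lhs = sum(1 for d in docs if lhs <= d)            # articles containing lhs
    n_both = sum(1 for d in docs if (lhs | rhs) <= d)   # articles containing lhs and rhs
    support = n_both / len(docs)
    confidence = n_both / n_lhs if n_lhs else 0.0
    return support, confidence


def trend_slope(counts: Sequence[float]) -> float:
    """Least-squares slope of counts against the period index 0..n-1."""
    n = len(counts)
    if n < 2:
        raise ValueError("need at least two periods to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(counts) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(counts))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    return sxy / sxx


# Toy keyword sets, one per article (assumed for illustration only).
docs = [
    {"election", "candidate"},
    {"election", "poll"},
    {"election", "candidate", "poll"},
    {"typhoon"},
]
support, confidence = rule_support_confidence(docs, {"election"}, {"candidate"})

# Toy per-period report counts for one keyword; a positive slope suggests rising coverage.
slope = trend_slope([2, 4, 6, 8])
```

A rule would typically be kept only when both values exceed user-chosen thresholds, and the sign of the slope gives a first indication of whether reporting on a keyword is rising or falling.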

Subject classification: Basic and Applied Sciences > Information Science
Social Sciences > Management
References
  1. Agrawal, R.,Imielinski, T.,Swami, A.(1993).Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data.Washington:
  2. Aumann, Y.(1999).Proceedings of Third European Conference on KDD (PKDD-99).
  3. Brachman, R. J.,Khabaza, T.,Kloesgen, W.,Piatetsky-Shapiro, G.,Simoudis, E.(1996).Mining Business Databases.Communications of the ACM,39(11)
  4. Brin, S.,Motwani, R.,Ullman, J. D.,Tsur, S.(1997).SIGMOD 1997, Proceedings of the ACM-SIGMOD International Conference on Management of Data.Tucson, Arizona:ACM Press.
  5. Chen, K. J.,Liu, S. H.(1992).Fifth International Conference on Computational Linguistics.
  6. Chien, L.-F.(1997).Proceedings of The 20th Annual ACM SIGIR Conference on Research and Development in Information Retrieval.
  7. Cho, V.,Wuthrich, B.(1999).Proceedings of 3rd Pacific-Asia Conference on KDD (PAKDD-99).
  8. Dhar, V.,Tuzhilin, A.(1993).Abstract-Driven Pattern Discovery in Databases.IEEE Transactions on Knowledge and Data Engineering,5(6)
  9. Dorre, J.,Gerstl, P.,Seiffert, R.(1999).Proceedings of The 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
  10. Fan, C. K.,Tsai, W. H.(1998).Automatic Word Identification in Chinese Sentences by the Relaxation Technique.Computer Processing of Chinese and Oriental Languages
  11. Fayyad, U.,Piatetsky-Shapiro, G.,Smyth, P.(1996).Advances in Knowledge Discovery and Data Mining.
  12. Fayyad, U.,Piatetsky-Shapiro, G.,Smyth, P.(1996).The KDD Process for Extracting Useful Knowledge from Volumes of Data.Communications of the ACM,39(11)
  13. Fayyad, U.,Uthurusamy, R.(1996).Data mining and knowledge discovery in databases.Communications of the ACM,39(11)
  14. Feldman, R.(1998).Text Mining at the Term Level.Journal of Intelligent Information Systems
  15. Feldman, R.(1997).Proceedings of First European Symposium on Principles of Data Mining and Knowledge Discovery.
  16. Feldman, R.(1998).2nd European Conference on KDD.
  17. Feldman, R.,Dagan, I.(1995).Proceedings of The first ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
  18. Feldman, R.,Dagan, I.(1998).Mining Text Using Keyword Distribution.Journal of Intelligent Information Systems,10
  19. Feldman, R.,Hirsh, H.(1996).Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining.
  20. Feldman, R.,Hirsh, H.(1997).Exploiting Background Information in Knowledge Discovery from Text.Journal of Intelligent Information Systems,9
  21. Feldman, R.,Klosgen, W.,Zilberstein, A.(1997).Proceedings of The Third International Conference on Knowledge Discovery & Data Mining.
  22. Han, J.,Cai, Y.,Cercone, Nick(1993).Data-Driven Discovery of Quantitative Rules in Relational Databases.IEEE Transactions on Knowledge and Data Engineering,5(1)
  23. Keller, G.,Warrack, B.,Bartel, H.(1994).Statistics for Management and Economics.Belmont, California:Duxbury Press.
  24. Lagus, K.,Honkela, T.,Kaski, S.,Kohonen, T.(1996).Proceedings of Conf. on Knowledge Discovery and Data Mining.
  25. Lent, B.,Agrawal, R.,Srikant, R.(1997).Proceedings of Conference on Knowledge Discovery and Data Mining.
  26. Li, B.-I.(1991).R. O. C. Computational Linguistics Conference.Taiwan:
  27. Nie, J.,Brisebois, M.,Ren, X.(1996).Conference Proceedings of SIGIR.
  28. Shewhart, M.,Wasson, M.(1999).Proceedings of The 5th Int'l Conf. on Knowledge Discovery and Data Mining.
  29. Singh, L.(1999).An Algorithm for Constrained Association Rule Mining in Semi-structured Data.
  30. Singh, L.,Scheuermann, P.,Chen, B.(1997).Generating Association Rules from Semi-Structured Documents Using an Extended Concept Hierarchy.
  31. Sproat, R.,Shih, C.(1990).A Statistical Method for Finding Word Boundaries in Chinese Text.Computer Processing of Chinese and Oriental Languages
  32. Webb, G. I.(1995).OPUS: An Efficient Admissible Algorithm for Unordered Search.Journal of Artificial Intelligence Research,3
  33. Wuthrich, B.(1998).IEEE International Conference on SMC.
  34. 中文詞知識庫小組(1993)。新聞語料詞頻統計表。南港:中央研究院。
  35. 中文詞知識庫小組(1995)。中央研究院平衡語料庫。南港:中央研究院。
  36. 陳克健 Chen, Keh-Jiann、陳正佳 、林隆基(1986)。中文語句的研究-斷詞與構詞。南港:中央研究院。
Cited by
  1. 陳滄堯、戚玉樑(2013)。參與式競爭智慧知識系統的企業決策應用。電子商務學報,15(4),541-566。
  2. 陳世榮(2015)。社會科學研究中的文字探勘應用:以文意為基礎的文件分類及其問題。人文及社會科學集刊,27(4),683-718。
  3. 陳文華、徐聖訓、施人英、吳壽山(2003)。應用主題地圖於知識整理。圖書資訊學刊,1(1),37-58。
  4. 賴美惠、鄭麗珍(2011)。結合知識地圖之公部門陳訴文件自動化分案系統。資訊管理學報,18(4),1-20。
  5. 施百俊、施如齡(2006)。「中草藥用藥」之主題地圖式數位學習教材建構與應用。教育資料與圖書館學,44(2),215-233。
  6. 蘇建源、邱宏彬(2004)。一個可彈性支援顧客關係管理與資料庫行銷之模糊RFM Model。電子商務學報,6(2),149-174。
  7. 王琳(2016)。中國當代藝術設計理論研究的歷史回顧與特徵分析—基於對1956 年~2015 年《裝飾》期刊文本探勘技術的分析。東亞研究,47(1),81-112。
  8. 楊誌欽、黃清俊、陶幼慧(2006)。網路論壇FAQ知識之自動轉換設計。資訊管理學報,13(2),89-112。
  9. 張海青、高淑珍、林清河(2003)。基於資料探勘之圖書館介購預算分配決策模式。資訊管理學報,9(2),129-145。
  10. (2006)。以文字探勘技術探究部落格之網路媒體特性。淡江人文社會學刊,28,95-121。