Title

以文件倉儲概念實現動態群聚與多重文件摘要之研究-以中文電子新聞為例

Parallel Title

A Study on Multi-Document Summarization Based on Document Warehousing and Dynamic Clustering-Using Internet News as Examples

DOI

10.6382/JIM.200607.0153

Authors

魏玲玉 (Ling-Yu Wei); 曾守正 (Frank S.C. Tseng)

Keywords

資訊檢索 ; 文件倉儲 ; 多文件摘要 ; 文件群聚 ; Information Retrieval ; Document Warehouse ; Multi-Document Summarization ; Document Clustering

Journal

Journal of Information Management (資訊管理學報)

Volume/Issue and Publication Date

Vol. 13, No. 3 (2006/07/01)

Pages

153 - 176

Language

Traditional Chinese

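Chinese Abstract (English translation)

With the explosive growth in the number of electronic documents, organizing documents efficiently so that they can later be browsed and queried quickly has become an urgent issue in knowledge management. Traditional full-text retrieval techniques based on inverted index files often return a large and disorderly set of documents, which must be filtered further before the truly useful documents are found. This mode of use can no longer meet users' need for rapid browsing and querying. In this paper, we apply the concept of document warehousing to store documents in a structured form and, together with a multi-dimensional query mechanism, find related documents for the study of multi-document summarization and dynamic clustering. The overall concept is validated by implementing the DNCSS system (Dynamic News Clustering and Summarization System). We apply the way data warehouses handle numerical data to document data, building a document warehouse that uses the structured information contained in documents for document storage, search, and integration, and that supports multi-dimensional queries. We further use dynamic clustering to help users organize the results returned by queries against the document warehouse. Finally, a multi-document summarization system generates one multi-document summary for each resulting cluster, so that users can browse the essential content of a document collection and obtain useful information more efficiently. We use online news articles from major news sites in Taiwan as examples to verify the system's effectiveness. Manual evaluation gave quite positive results, showing that this research does enable users to obtain documents that meet their needs quickly and effectively.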

English Abstract

As electronic documents proliferate, contemporary knowledge management needs a mechanism for integrating and sorting huge volumes of documents to support quick browsing and efficient query processing. Traditional full-text search systems are usually based on inverted indexes and often return result sets that are large and unsorted, which makes it hard for users to determine what information the collection contains. For document searching over the Internet, such systems no longer satisfy users' needs. In this paper, we propose a general framework for document clustering and multi-document summarization based on the concept of document warehousing. Based on this framework, we have implemented a prototype system, named DNCSS (Dynamic News Clustering and Summarization System), as the test bed for our approach. The system adopts the concept of document warehousing, which models text-oriented documents from multi-dimensional viewpoints. The constructed document warehouse serves as the main repository for the system and flexibly organizes document structure information for searching and querying. The documents retrieved from the document warehouse are then clustered to provide a more organized structure. Finally, the system generates a multi-document summary for each cluster to help users find distilled information more efficiently. We collected online news from major news sites in Taiwan as test examples to verify the effectiveness of the system. The evaluation results show that our approach relieves users of reading large amounts of related news and helps them reach the necessary conclusions effectively.
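The three stages the abstract describes (a multi-dimensional query against the warehouse, clustering of the retrieved documents, and one summary per cluster) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the DNCSS implementation: the `NewsDoc` fields, the greedy single-pass clustering, the frequency-based sentence scoring, and the whitespace tokenization (a stand-in for Chinese word segmentation) are all hypothetical choices made for the example.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class NewsDoc:
    source: str      # hypothetical dimension: news outlet
    date: str        # hypothetical dimension: ISO date such as "2006-07-01"
    sentences: list  # article body, already split into sentences


def term_counts(sentences):
    """Bag-of-words counts; whitespace splitting stands in for word segmentation."""
    return Counter(word for s in sentences for word in s.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def query(warehouse, source=None, date_from=None, date_to=None):
    """Filter documents along the (assumed) source and date dimensions."""
    return [d for d in warehouse
            if (source is None or d.source == source)
            and (date_from is None or d.date >= date_from)
            and (date_to is None or d.date <= date_to)]


def cluster(docs, threshold=0.3):
    """Greedy single-pass clustering: join the first similar cluster, else start one."""
    clusters = []  # each entry: (summed term counts, member documents)
    for doc in docs:
        vec = term_counts(doc.sentences)
        for centroid, members in clusters:
            if cosine(vec, centroid) >= threshold:
                centroid.update(vec)  # fold the new document into the centroid
                members.append(doc)
                break
        else:
            clusters.append((vec, [doc]))
    return [members for _, members in clusters]


def summarize(members, max_sentences=2):
    """Extractive summary: keep the sentences whose words are most frequent in the cluster."""
    sentences = list(dict.fromkeys(s for d in members for s in d.sentences))
    freq = term_counts(sentences)

    def score(sentence):
        words = sentence.lower().split()
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    return sorted(sentences, key=score, reverse=True)[:max_sentences]


# Made-up sample data for illustration only.
warehouse = [
    NewsDoc("A", "2006-07-01", ["typhoon hits northern taiwan",
                                "schools close as typhoon nears"]),
    NewsDoc("B", "2006-07-01", ["typhoon brings heavy rain to taiwan"]),
    NewsDoc("A", "2006-07-02", ["election campaign begins in taipei"]),
    NewsDoc("B", "2006-07-02", ["candidates debate as election campaign heats up"]),
    NewsDoc("A", "2006-06-01", ["typhoon season forecast released"]),
]
hits = query(warehouse, date_from="2006-07-01")
clusters = cluster(hits)
summaries = [summarize(members, max_sentences=1) for members in clusters]
```

Because the clusters are built from the query result rather than precomputed over the whole warehouse, a different query yields a different grouping, which is one plausible reading of "dynamic" clustering in the abstract.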

Subject Classification

Basic and Applied Sciences > Information Science
Social Sciences > Management
Cited By
  1. 黃仁鵬、張貞瑩(2014)。運用詞彙權重技術於自動文件摘要之研究。資訊管理學報,21(4),391-416。
  2. (2009)。如何提升翻譯記憶體的資料重複使用之方法—以Trados軟體為例。淡江人文社會學刊,37,119-140。