Title

個人化的網頁摘要文件分群系統 (A Personalized Web-Snippet Document Clustering System)

Parallel title

A Personal Search System with the Clustering Ability

Authors

Lin-Chih Chen (陳林志); Yu-Ren Lin (林育任)

Keywords

Web-Snippet Clustering ; Personal Search Engine ; Hierarchical Clustering ; Clustering Label ; Metasearch Technique

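Journal

資訊管理學報 (Journal of Information Management)

Volume and issue / publication date

Vol. 20, No. 1 (2013 / 01 / 01)

Pages

97 - 129

Language of content

Traditional Chinese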

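Chinese abstract (translated)

This paper develops a personalized system with clustering ability, the Personalization Web-Snippet Clustering System (PWSC), built on metasearch technology. In the first stage, the system gathers relevant web snippets from different search engines according to the user's query. In the second stage, it re-ranks the snippets with a Mean Reciprocal Rank (MRR) computation model. In the third stage, it generates clustering labels from the collected snippets with a word N-gram language model. In the fourth stage, it builds a hierarchical clustering from those labels. The final stage builds the personalized system, which produces different search results according to the labels and operations the user selects, helping the user find the desired information quickly. According to the experimental results, the system outperforms commercial and academic systems.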

English abstract

In this paper, we develop a personal search system with clustering ability, called the Personalization Web-Snippet Clustering System (PWSC), which is based on a metasearch technique. The first stage of the system collects the relevant snippets from different search engines based on the user's query. The second stage re-weights the collected snippets using a Mean Reciprocal Rank (MRR) measure. The third stage uses a word N-gram language model to generate clustering labels from the collected snippets. The fourth stage builds a hierarchical tree from all the clustering labels. The final stage builds a personal search system in which the user selects the most interesting labels and operations in order to locate information of interest quickly. According to the experimental results, the performance of our system is superior to that of the commercial and academic systems compared.
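
The second and third stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the record does not give the paper's formulas, so the scoring below assumes the common definition of a reciprocal-rank score (the mean of 1/rank over all engines, with 0 for an engine that did not return the snippet), and the label step simply counts word N-grams that recur across snippets. The function names, the `max_n` and `min_count` parameters, and the example URLs are all hypothetical.

```python
from collections import Counter, defaultdict


def mean_reciprocal_rank(rankings):
    """Score each URL by the mean of 1/rank over all engines.

    rankings: dict mapping an engine name to its ordered list of URLs
    (best result first, each URL assumed to appear at most once per engine).
    An engine that did not return a URL contributes 0 for that URL.
    """
    totals = defaultdict(float)
    for urls in rankings.values():
        for position, url in enumerate(urls, start=1):
            totals[url] += 1.0 / position
    engine_count = len(rankings)
    return {url: total / engine_count for url, total in totals.items()}


def ngram_label_candidates(snippets, max_n=3, min_count=2):
    """Count word N-grams (n = 1..max_n) over all snippets.

    N-grams that occur at least min_count times are returned as
    candidate cluster labels, mapped to their counts.
    """
    counts = Counter()
    for text in snippets:
        words = text.lower().split()
        for n in range(1, max_n + 1):
            for start in range(len(words) - n + 1):
                counts[" ".join(words[start:start + n])] += 1
    return {gram: count for gram, count in counts.items() if count >= min_count}


if __name__ == "__main__":
    # Hypothetical result lists from two engines.
    rankings = {
        "engine_a": ["http://example.com/1", "http://example.com/2"],
        "engine_b": ["http://example.com/2", "http://example.com/3"],
    }
    scores = mean_reciprocal_rank(rankings)
    for url in sorted(scores, key=scores.get, reverse=True):
        print(url, round(scores[url], 3))

    snippets = ["web snippet clustering", "snippet clustering engine"]
    print(ngram_label_candidates(snippets))
```

A snippet returned near the top by several engines ends up with a higher merged score than one returned by a single engine, which is the intuition behind re-ranking metasearch results this way. A real label generator would also need stop-word removal and stemming, which this sketch omits.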

Subject classification

Basic and Applied Sciences > Information Science
Social Sciences > Management
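Cited by
  1. 葉國暉, 陳林志, 陳大仁 (2017). 基於時間參數提昇谷歌部落格搜尋引擎效能 [Improving the performance of the Google blog search engine based on time parameters]. 資訊管理學報 (Journal of Information Management), 24(2), 155-184.
  2. 葉國暉, 陳林志, 陳大仁, 吳忠澄 (2015). 使用語意模型分析線上部落格文件 [Analyzing online blog documents using a semantic model]. 資訊管理學報 (Journal of Information Management), 22(3), 273-316.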