Title

社會科學研究中的文字探勘應用:以文意為基礎的文件分類及其問題

Parallel Title

Text Mining for Social Studies: Meaning-based Document Classification and Its Problems

Author

陳世榮(Roger S. Chen)

Keywords

文字探勘 ; 文意區別 ; 文件分類 ; 機器學習 ; 共詞網絡分析 ; text mining ; meaning differentiation ; document classification ; machine learning ; co-word network analysis

Journal

人文及社會科學集刊

Volume and Issue / Publication Date

Vol. 27, No. 4 (2015/12/01)

Pages

683 - 718

Language of Content

Traditional Chinese

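Chinese Abstract (translated)

As electronic archiving technology has advanced, text mining has drawn growing attention. Motivated by the need for meaning differentiation in social science research, this article evaluates how well supervised machine learning classifies unstructured, complex texts, and offers analysis and recommendations on the problems observed. It starts from the differences and commonalities between text mining and content analysis in differentiating meaning. It then takes news reports as data and, targeting a specific document stance, follows a standard text mining procedure to evaluate document classification with support vector machine and Naïve Bayes classifiers. The results indicate that text mining reads complex meaning well, but a co-word network analysis also shows that a document's editorial style affects classification performance. Researchers are advised, in the early stage of data processing, to assess repeatedly the fit among research purpose, data characteristics, and classifier model.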

English Abstract

With the growth of electronic information storage, text mining has gained increasing attention from scholars and practitioners across disciplines. Responding to the need for meaning differentiation in social studies, this study evaluates the document classification performance of supervised machine learning classifiers. Starting from a comparison between traditional content analysis and text mining, the evaluation follows a standard text mining procedure and applies Support Vector Machine and Naïve Bayes classifiers to unstructured, complex social texts drawn from news media. The results indicate that text mining classifies documents with complex meaning well. However, a further co-word network analysis finds that the editing style of the data may affect classifier performance. The study therefore suggests that, in the early stage of data processing, researchers pay close attention to the fit among research problems, editing styles, and classifiers.
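
As a rough illustration of the kind of supervised classification the abstract describes, the following is a minimal sketch of a multinomial Naïve Bayes classifier in plain Python. It is not the study's implementation: the function names, the add-one smoothing choice, and the toy English sentences are all assumptions made for illustration, and the documents are assumed to be already segmented into tokens (the study's Chinese news texts would first need word segmentation).

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels):
    """Fit a multinomial Naive Bayes model on pre-tokenized documents."""
    class_doc_counts = Counter(labels)
    term_counts = defaultdict(Counter)  # label -> term -> frequency
    vocab = set()
    for tokens, label in zip(docs, labels):
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return class_doc_counts, term_counts, vocab


def predict(model, tokens):
    """Return the label with the highest log posterior for one document."""
    class_doc_counts, term_counts, vocab = model
    n_docs = sum(class_doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_doc_counts.items():
        total_terms = sum(term_counts[label].values())
        score = math.log(n / n_docs)  # log prior of the class
        for t in tokens:
            if t not in vocab:
                continue  # ignore terms never seen in training
            # Laplace (add-one) smoothing avoids zero probabilities
            score += math.log(
                (term_counts[label][t] + 1) / (total_terms + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy, invented data (hypothetical; not from the study's news corpus)
train_docs = [
    "government supports the new policy".split(),
    "minister praises policy reform".split(),
    "citizens protest against the policy".split(),
    "opposition criticizes reform plan".split(),
]
train_labels = ["support", "support", "oppose", "oppose"]

model = train_naive_bayes(train_docs, train_labels)
prediction = predict(model, "minister supports reform".split())
print(prediction)
```

In practice a classifier like this would be assessed on held-out documents rather than on a single example, and compared against a support vector machine trained on the same term features.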

Subject Classification
Humanities > General Humanities
Social Sciences > General Social Sciences
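Cited By
  1. 江東美 (2017). 財經訊息對匯率的影響-以歐元為例 [The Impact of Financial News on Exchange Rates: The Case of the Euro]. Master's thesis, Graduate Institute of Finance, 臺中科技大學, 1-46.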