Title

The Feasibility of Automated Topic Analysis: An Empirical Evaluation of Deep Learning Techniques Applied to Skew-Distributed Chinese Text Classification

Parallel title

主題分析自動化的可行性:深度學習技術應用於偏態分佈之中文文件分類的實證評估

DOI

10.6120/JoEMLS.202003_57(1).0047.RS.CE

Author

曾元顯(Yuen-Hsien Tseng)

Keywords

Text categorization ; Real-world corpus ; Deep learning ; Performance evaluation ; 文本分類 ; 語料庫 ; 深度學習 ; 績效評估

Journal

教育資料與圖書館學 (Journal of Educational Media & Library Sciences)

Volume/issue and publication date

Vol. 57, No. 1 (2020/03/01)

Pages

121-144

Language

English

English abstract

Text classification (TC) is the task of assigning predefined categories (or labels) to texts for information organization, knowledge management, and many other applications. In library science applications the categories are normally topical, although they can be any labels suitable for an application. TC therefore often requires topical analysis, which relies on human knowledge. In recent decades, machine learning (ML) techniques have been applied to TC for efficiency, provided that a sufficient number of training texts is available for each category. In real-world cases, however, the number of texts (documents) per category is often highly skewed for a given TC task. This leads to the problem of predicting labels for small categories, which is viable for humans but challenging for machines. Deep learning (DL) is an emerging class of ML that was inspired by human neural networks. This study evaluates whether DL techniques are feasible for this problem by comparing the performance of four off-the-shelf DL methods (CNN, RCNN, fastText, and BERT) with four traditional ML techniques on five skew-distributed datasets (four in Chinese, and one in English for comparison). Our results show that BERT is effective for moderately skewed datasets but is still not feasible for highly skewed TC tasks. The other three DL methods (CNN, RCNN, fastText) show no advantage over traditional methods such as SVM on the five TC tasks, although they capture extra language knowledge in their pretrained word representations. To facilitate verification and future study, all of the Chinese datasets used in this study have been released publicly, together with all of the adapted machine learning and evaluation source code, at https://github.com/SamTseng/Chinese_Skewed_TxtClf.
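The difficulty with small categories can be illustrated with a minimal sketch. It assumes a hypothetical three-category dataset split 90/8/2 and a classifier that always predicts the largest category; the labels, the split, and the choice of micro- and macro-averaged F1 are illustrative assumptions, since the abstract does not state which metrics the study used.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1} for every label seen in y_true or y_pred."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class that is never predicted and never correct scores 0.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def micro_f1(y_true, y_pred):
    """Micro-averaged F1; for single-label tasks this equals accuracy."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so small classes count equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Hypothetical skewed label distribution (not data from the study).
y_true = ["A"] * 90 + ["B"] * 8 + ["C"] * 2

# Trivial baseline: always predict the most frequent category.
majority_label = Counter(y_true).most_common(1)[0][0]
y_pred = [majority_label] * len(y_true)

micro = micro_f1(y_true, y_pred)
macro = macro_f1(y_true, y_pred)
print(f"micro-F1 = {micro:.3f}, macro-F1 = {macro:.3f}")
```

Under these assumptions the trivial baseline looks strong when every document counts equally, yet it never labels a single document from the two small categories, which is what the macro average exposes.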

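Chinese abstract (English translation)

Document classification is a topic analysis problem in library and information science, and deep learning (DL) is a semantic understanding technique of recent years that draws on a large amount of language knowledge. This study evaluates the feasibility of DL for topic analysis by comparing the performance of four off-the-shelf DL methods (CNN, RCNN, fastText, and BERT) with four traditional machine learning techniques on five skew-distributed corpora (four Chinese and one English). The results show that BERT is effective on moderately skewed corpora but still performs poorly on highly skewed automatic document classification tasks. Compared with traditional methods (such as SVM), the other three DL methods (CNN, RCNN, fastText) showed no advantage on the five document classification tasks, even though they acquire extensive additional language knowledge in their pretrained word representations. To facilitate future research, the Chinese corpora used in this study and all of the adapted machine learning and evaluation code have been released publicly.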

Subject classification: Humanities > Library and Information Science
Cited by
  1. 林巧敏,吳承恩(2023)。檔案內容主題自動分類及其成效評估之研究。檔案半年刊,22(2),34-53。
  2. 顏瑞宏,傅文成(2022)。外交新常態?以主題及網絡建模技術探索中共Twitter外交的戰狼溝通策略。資訊社會研究,43,67-113。
  3. (2024).A Study on the Automatic Classification of Tweets Related to Mental Health Literacy.教育資料與圖書館學,61(1),5-27.