Title |
The Feasibility of Automated Topic Analysis: An Empirical Evaluation of Deep Learning Techniques Applied to Skew-Distributed Chinese Text Classification |
Parallel title |
主題分析自動化的可行性:深度學習技術應用於偏態分佈之中文文件分類的實證評估 |
DOI |
10.6120/JoEMLS.202003_57(1).0047.RS.CE |
Author |
曾元顯(Yuen-Hsien Tseng) |
Keywords |
Text categorization ; Real-world corpus ; Deep learning ; Performance evaluation ; 文本分類 ; 語料庫 ; 深度學習 ; 績效評估 |
Journal title |
教育資料與圖書館學 (Journal of Educational Media & Library Sciences) |
Volume and issue / Publication date |
Vol. 57, No. 1 (2020/03/01) |
Pages |
121 - 144 |
Language |
English |
English abstract |
Text classification (TC) is the task of assigning predefined categories (or labels) to texts for information organization, knowledge management, and many other applications. In library science applications the categories are normally topical, although they can be any labels suitable for an application. Thus, TC often requires topical analysis, which relies on human knowledge. In recent decades, however, machine learning (ML) techniques have been applied to TC for efficiency, provided that a sufficient number of training texts is available for each category. Nevertheless, in real-world cases, the number of texts (documents) per category is often highly skewed for a given TC task. This leads to the problem of predicting labels for small categories, which is viable for humans but challenging for machines. Deep learning (DL) is an emerging class of ML inspired by human neural networks. This study aims to evaluate whether DL techniques are feasible for this problem by comparing the performance of four off-the-shelf DL methods (CNN, RCNN, fastText, and BERT) with four traditional ML techniques on five skew-distributed datasets (four in Chinese, and one in English for comparison). Our results show that BERT is effective for moderately skewed datasets, but is still not feasible for highly skewed TC tasks. The other three DL methods (CNN, RCNN, fastText) do not show any advantage over traditional methods such as SVM on the five TC tasks, although they capture extra language knowledge in their pretrained word representations. To facilitate future study, all of the Chinese datasets used in this study have been released publicly, together with all of the adapted machine learning and evaluation source code, for verification and further study at https://github.com/SamTseng/Chinese_Skewed_TxtClf. |
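Chinese abstract (English translation) |
Text classification is a topic analysis problem in library and information science, and deep learning (DL) is a recent semantic understanding technique that draws on a large amount of language knowledge. This study aims to evaluate the feasibility of DL for topic analysis by comparing the performance of four off-the-shelf DL methods (CNN, RCNN, fastText, and BERT) with four traditional machine learning techniques on five skew-distributed corpora (four in Chinese and one in English). The results show that BERT is effective for moderately skewed corpora, but still performs poorly on highly skewed automatic text classification tasks. Compared with traditional methods (for example, SVM), the other three DL methods (CNN, RCNN, fastText) show no advantage on the five text classification tasks, even though they capture extensive additional language knowledge in their pretrained word representations. To facilitate future research, the Chinese corpora used in this study and all of the adapted machine learning and evaluation code have been released publicly. |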
Subject classification |
Humanities > Library and Information Science |