Title

基於詞頻、位置及類別關係的特徵選擇方法

Parallel Title

FEATURE SELECTION METHOD BASED ON TERM FREQUENCY, LOCATION AND CATEGORY RELATIONSHIPS

DOI

10.6338/JDA.201910_14(5).0004

Authors

顏秀珍(Show-Jane Yen);鄭開元(Kai-Yuan Zheng);李御璽(Yue-Shi Lee);丁明勇(Ming-Yung Di)

Keywords

Text Classification ; Feature Selection ; Text Preprocessing ; Classification Algorithms ; Text Mining ; Classification

Journal

Journal of Data Analysis

Volume and Issue / Publication Date

Vol. 14, No. 5 (2019 / 10 / 01)

Pages

73 - 97

Language

Traditional Chinese

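Chinese Abstract (English translation)

Text classification is an important branch of text mining. Given a classification scheme, it is the process of assigning a text of unknown category to a specific category by means of a classification algorithm. It is widely applied to the rapid classification of news publications, web page classification, personalized intelligent news recommendation, spam filtering, user analysis, and similar tasks. Text classification generally consists of text preprocessing, feature selection and construction of the word vector matrix, classifier construction, and finally classifier performance evaluation. A good feature selection method directly affects the subsequent classification results, so improving existing feature selection methods merits further study. To address the shortcomings of existing feature selection, this paper introduces several factors: the importance of a feature word's position in the text, inter-class term frequency, intra-class term frequency, inter-class concentration, and intra-class dispersion. With these factors, it improves two existing feature selection methods that are regarded as relatively good, the chi-square test and expected cross entropy, and proposes a Chinese text feature selection method that considers multiple factors jointly. Experimental results show that when texts are classified using the features selected by the improved method, classification accuracy is higher than with other methods, and performance on imbalanced document sets is also more stable than with other methods. The results also show that, for both balanced and imbalanced document sets, the proposed feature selection method significantly improves classification accuracy compared with traditional methods and other methods.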

English Abstract

Text classification is an important topic in text mining: given a classification system, it is the process of assigning a text to a specific category by means of a classification algorithm. It is widely used in the rapid classification of news publications, web page classification, personalized news recommendation, spam filtering, user analysis, and other application scenarios. Text classification generally consists of text preprocessing, feature selection, construction of the word vector matrix, construction of the classifier, and classifier performance evaluation. The feature selection method directly affects the subsequent classification performance, so improving existing feature selection methods is worthy of further study. Therefore, in view of the shortcomings of existing feature selection methods, this paper introduces the importance of the location of feature words in the text, inter-class term frequency, intra-class term frequency, inter-class concentration, and intra-class dispersion. With these factors, we improve two well-regarded feature selection methods, the chi-square test and expected cross entropy, and propose a text feature selection method based on multiple factors. The experimental results show that the proposed feature selection method improves classification accuracy and that it is more stable than other methods on imbalanced document classification. The experiments also show that, for both balanced and imbalanced document sets, the proposed method significantly improves classification accuracy compared with traditional methods and other methods.
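
For readers unfamiliar with the baseline that the abstract refers to, the following is a minimal sketch of the standard (unmodified) chi-square score of a term for one class, computed from document counts. It is not the improved method proposed in the paper: the position, term frequency, concentration, and dispersion factors are not reproduced here because the abstract does not give their formulas. The function names, the toy corpus, and the class labels are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Standard chi-square score of a term for one class.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that do not contain the term
    d: documents outside the class that do not contain the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: the term or the class has no variation
    return n * (a * d - c * b) ** 2 / denom


def chi_square_scores(docs, labels, target):
    """Score every term against the `target` class.

    docs is a list of token lists (already segmented and preprocessed);
    labels is the parallel list of class labels.
    """
    n_target = sum(1 for y in labels if y == target)
    n_other = len(labels) - n_target
    in_class, out_class = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        for term in set(tokens):  # document frequency, not raw term frequency
            if y == target:
                in_class[term] += 1
            else:
                out_class[term] += 1
    scores = {}
    for term in set(in_class) | set(out_class):
        a, b = in_class[term], out_class[term]
        scores[term] = chi_square(a, b, n_target - a, n_other - b)
    return scores


# Hypothetical toy corpus: two "sports" and two "finance" documents.
docs = [
    ["goal", "match", "team"],
    ["team", "coach", "goal"],
    ["stock", "market", "team"],
    ["market", "price", "stock"],
]
labels = ["sports", "sports", "finance", "finance"]

scores = chi_square_scores(docs, labels, "sports")
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{term}\t{score:.3f}")
```

Because this score uses only document counts, it ignores how often a term occurs inside a document and where it occurs, which are among the factors the abstract says the proposed method adds.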

Subject Classification
Basic and Applied Sciences > Information Science
Basic and Applied Sciences > Statistics
Social Sciences > Management