Title |
基於詞頻、位置及類別關係的特徵選擇方法 |
Parallel title |
FEATURE SELECTION METHOD BASED ON TERM FREQUENCY, LOCATION AND CATEGORY RELATIONSHIPS |
DOI |
10.6338/JDA.201910_14(5).0004 |
Authors |
顏秀珍(Show-Jane Yen);鄭開元(Kai-Yuan Zheng);李御璽(Yue-Shi Lee);丁明勇(Ming-Yung Di) |
Keywords |
文本分類 (text classification) ; 特徵選取 (feature selection) ; 文本預處理 (text preprocessing) ; 分類演算法 (classification algorithm) ; Text Mining ; Feature Selection ; Text Preprocessing ; Classification |
Journal title |
Journal of Data Analysis |
Volume and issue / publication date |
Vol. 14, No. 5 (2019/10/01) |
Pages |
73 - 97 |
Language of content |
Traditional Chinese |
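Abstract (translated from the Chinese) |
Text classification is an important branch of text mining: given a classification scheme, it is the process of assigning texts of unknown category to a specific category by means of a classification algorithm. It is widely used for rapid classification of news publications, web page classification, personalized intelligent news recommendation, spam filtering, user analysis, and similar applications. Text classification generally proceeds through text preprocessing, feature selection and construction of the word vector matrix, classifier construction, and finally classifier performance evaluation. A good feature selection method directly affects subsequent classification performance, so improving existing feature selection methods merits further study. To address the shortcomings of feature selection, this paper introduces several factors: the importance of a feature word's position in the text, inter-class term frequency, intra-class term frequency, inter-class concentration, and intra-class dispersion. With these factors it improves two existing methods regarded as comparatively good, the chi-square test and expected cross entropy, and proposes a feature selection method for Chinese text that considers multiple factors jointly. Experimental results show that when texts are classified using the features selected by the improved method, classification accuracy is higher than with other methods, and performance on imbalanced document sets is also more stable. The results also show that, for both balanced and imbalanced document sets, the proposed feature selection method significantly improves classification accuracy compared with traditional and other methods. |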
English abstract |
Text classification is an important topic in text mining: given a classification system, it is the process of assigning a text to a specific category by means of a classification algorithm. It is widely used in the rapid classification of news publications, web page classification, personalized news recommendation, spam filtering, user analysis, and other application scenarios. Text classification is generally divided into text preprocessing, feature selection, construction of the word vector matrix, construction of the classifier, and classifier performance evaluation. The feature selection method directly affects subsequent classification performance, so improving existing feature selection methods is worthy of further study. Therefore, in view of the shortcomings of feature selection, this paper introduces the importance of the location of feature words in the text, the inter-class term frequency, the intra-class term frequency, the degree of concentration between classes, and the intra-class dispersion. Using these factors, we improve two of the better existing feature selection methods, the chi-square test and expected cross entropy, and propose a text feature selection method based on multiple factors. The experimental results show that the proposed feature selection method improves classification accuracy and is more stable than other methods in imbalanced document classification. The experiments also show that, for both balanced and imbalanced document sets, the proposed feature selection method significantly improves classification accuracy compared with traditional methods and other methods. |
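For orientation, the chi-square baseline named in the abstract can be sketched as below. This is the textbook chi-square statistic over document counts, not the improved method proposed in the paper, which additionally weights term position, term frequency, concentration, and dispersion. The function names `chi_square_scores` and `select_features` are hypothetical, and documents are assumed to be already tokenized.

```python
from collections import Counter


def chi_square_scores(docs, labels):
    """Return {(term, label): chi-square score} for tokenized documents.

    Textbook chi-square over document counts (baseline only; the paper's
    position and frequency factors are NOT included here).
    """
    n = len(docs)
    class_sizes = Counter(labels)
    df = Counter()           # documents containing the term
    df_in_class = Counter()  # documents of a given class containing the term
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            df[term] += 1
            df_in_class[(term, label)] += 1

    scores = {}
    for term in df:
        for label, size in class_sizes.items():
            a = df_in_class[(term, label)]  # in class, contains term
            b = df[term] - a                # other classes, contains term
            c = size - a                    # in class, lacks term
            d = n - size - b                # other classes, lacks term
            denom = (a + c) * (b + d) * (a + b) * (c + d)
            scores[(term, label)] = (
                n * (a * d - c * b) ** 2 / denom if denom else 0.0
            )
    return scores


def select_features(docs, labels, k):
    """Keep the k terms with the highest per-class chi-square score."""
    best = {}
    for (term, _), score in chi_square_scores(docs, labels).items():
        best[term] = max(best.get(term, 0.0), score)
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(best, key=lambda t: (-best[t], t))[:k]
```

Taking the maximum score over classes is one common way to reduce per-class scores to a single ranking; averaging weighted by class size is another, and the abstract does not say which the authors use.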
Subject classification |
Basic and Applied Sciences > Information Science ; Basic and Applied Sciences > Statistics ; Social Sciences > Management |