


Exploring Automatic Document Classification of Environmental Education Research Papers Using Text Mining Manners: An Analysis on Abstracts from the International Conference on Environmental Education between 2013-2018




張益誠(I-Cheng Chang);張育傑(Yu-Jie Chang);余泰毅(Tai-Yi Yu)


two-step cluster analysis ; text mining ; word cloud ; co-word analysis ; association rules




Vol. 17, No. 1 (2021/06/01)


85 - 128




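This study collects abstracts of papers from the environmental education conferences held over the years by the Chinese Society for Environmental Education and uses automatic document classification to examine the vocabulary characteristics of articles in the field and the consistency of their classification. The techniques applied include natural language processing, two-step cluster analysis, word clouds, co-word analysis, and association rule analysis. The abstracts were fed into the natural language processing algorithm of the Academia Sinica Chinese Lexical Knowledge Base (CKIP) for word segmentation, with opinions from environmental education experts used to assist the segmentation, and the corpus was converted into a quantified, structured TF-IDF (Term Frequency-Inverse Document Frequency) form. Two-step cluster analysis was applied to the TF-IDF weight matrix to classify the articles automatically, while word clouds, co-word analysis, and association rule analysis were used to show the vocabulary characteristics of each category and to cross-check the consistency of the classified articles. Results from 561 conference abstracts from 2013 to 2018 show that segmentation yielded 4,980 original keywords and that the top 500 words (10%) account for 74.1% of the cumulative word frequency; using TF-IDF weights to screen professional environmental education vocabulary is therefore consistent with the "vital few" (Pareto) principle. The number of K-means clusters was set to six by analyzing the decline in the total residual of hierarchical cluster analysis. After comparison with the environmental education themes in the earlier literature, the clusters were labeled: (1) environmental policy and regulation; (2) sustainable development; (3) environmental ethics and sustainable use of energy and resources; (4) disaster prevention and response, and sustainable use of energy and resources; (5) climate change; (6) environmental ethics. This study uses word clouds to list the words with high TF-IDF weights, the number of articles, and their proportion for each category. It cross-checks the consistency of the topic classification by listing, for each category, the titles, keywords, and distances of the three articles with the smallest distance, and finds that the article topics within each category are indeed consistent. In addition, Web graphs were drawn from the classification results to screen important keywords and their association rules, leading to recommended keywords for each environmental education topic category. For both natural-language word segmentation and the cross-checking of automatic document classification in environmental education, the assistance of domain experts is necessary to obtain correct and consistent segmentation and classification results.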
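As a rough illustration of the TF-IDF weighting and the cumulative word-frequency check mentioned in the abstract, the following minimal sketch works on documents that are already segmented into tokens. The function names (`tf_idf`, `cumulative_share`), the specific weighting variant (relative term frequency times the natural log of N over document frequency), and the toy documents are assumptions made for illustration; the abstract does not state which TF-IDF variant or software the authors used.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    docs: list of token lists (already segmented).
    Weight = (term count / document length) * ln(N / document frequency).
    This is one common TF-IDF variant, assumed here for illustration.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

def cumulative_share(docs, top_k):
    """Share of all token occurrences covered by the top_k most frequent terms."""
    totals = Counter(term for doc in docs for term in doc)
    covered = sum(count for _, count in totals.most_common(top_k))
    return covered / sum(totals.values())

# Toy, hypothetical segmented abstracts (not taken from the conference corpus).
docs = [
    ["climate", "change", "education", "climate"],
    ["energy", "education", "policy"],
    ["disaster", "prevention", "education"],
]
weights = tf_idf(docs)
share = cumulative_share(docs, top_k=2)
```

In this toy corpus, "education" appears in every document and so receives a weight of zero, while "climate" is concentrated in one document and is weighted highly. `cumulative_share` is the same kind of quantity as the reported 74.1% for the top 500 words.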


This research collects abstracts from the International Conference on Environmental Education Academia and Practices held by the Chinese Society for Environmental Education (CSEE) between 2013 and 2018. Using automatic topic classification techniques, it explores the vocabulary characteristics of classified articles in the field of environmental education and the consistency of the classification. The techniques applied include natural language processing, two-step cluster analysis, word clouds, co-word analysis, and association rule analysis. The conference abstracts were imported into the natural language processing algorithm of the CKIP Chinese Lexical Knowledge Base of Academia Sinica for word segmentation. The opinions of environmental education experts were used to assist the word segmentation, and the corpus of abstracts was converted into quantitative Term Frequency-Inverse Document Frequency (TF-IDF) weights. Two-step cluster analysis was then performed to classify the articles into clusters automatically, and word clouds, co-word analysis, and association rule analysis were used to show the vocabulary characteristics of the distinct clusters and the consistency of the classified articles. Based on 561 abstracts of conference papers from 2013 to 2018, word segmentation yields 4,980 original keywords. The top 500 words (10%) account for 74.1% of the cumulative word frequency, so the selection of professional vocabulary is consistent with the Pareto principle. The two-step cluster analysis sets the number of K-means clusters at six, labeled as follows: (1) environmental policy and regulation; (2) sustainable development; (3) environmental ethics and sustainable use of energy and resources; (4) disaster prevention and response, and sustainable use of energy and resources; (5) climate change; (6) environmental ethics.
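The clustering step can be sketched in a few lines. The example below is a simplified stand-in and not the authors' procedure: it runs plain K-means (Lloyd's algorithm) for several values of k and records the within-cluster sum of squared errors, so that the drop in error across k can be inspected. The paper instead derives the cluster count from a hierarchical analysis before running K-means. The helper names, the deterministic farthest-first seeding, and the 2-D toy points (standing in for rows of a TF-IDF matrix) are all assumptions.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def farthest_first(points, k):
    """Deterministic seeding: an assumption of this sketch, not the paper's method."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Pick the point whose nearest chosen centroid is farthest away.
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    return [list(c) for c in centroids]

def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            for p in points]

def k_means(points, k, iterations=50):
    """Plain Lloyd's algorithm; returns (centroids, labels, sse)."""
    centroids = farthest_first(points, k)
    for _ in range(iterations):
        labels = assign(points, centroids)
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Keep the old centroid if a cluster ends up empty.
            updated.append([sum(col) / len(members) for col in zip(*members)]
                           if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    labels = assign(points, centroids)
    sse = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return centroids, labels, sse

# Hypothetical 2-D points forming three well-separated groups.
points = [[0, 0], [0, 1], [1, 0],
          [10, 10], [10, 11], [11, 10],
          [0, 10], [1, 10], [0, 11]]
sse_by_k = {k: k_means(points, k)[2] for k in range(1, 5)}
centroids, labels, sse = k_means(points, 3)
```

On this toy data the error falls sharply up to k = 3, which matches the three groups built into the points. On real TF-IDF vectors the curve is usually less clear-cut, which is one reason a separate step for choosing k is useful.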
This study uses word clouds to list the dominant words with high TF-IDF weights, along with word frequency and proportions, for each cluster. It uses a cross-check method to assess the consistency of the topic classification, listing the titles and keywords of the three articles with the smallest distance in each category. In addition, a web graph is drawn from the classification results, dominant keywords and their association rules are screened, and dominant keywords are suggested for each theme. For natural-language word segmentation and automatic document classification in topic modeling, the assistance of domain experts in environmental education plays a crucial role in ensuring correct and consistent results.
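The co-word and association rule screening described above can be illustrated with the standard support, confidence, and lift measures over word pairs. This is a minimal sketch under stated assumptions: the function name `pair_rules`, the restriction to single-word antecedents and consequents, the `min_support` threshold, and the toy documents are all hypothetical, and the abstract does not say which rule-mining settings or software the authors used.

```python
from itertools import combinations

def pair_rules(docs, min_support=0.5):
    """Return (antecedent, consequent, support, confidence, lift) for word pairs.

    docs: list of token lists. A word counts once per document.
    support    = share of documents containing both words
    confidence = P(consequent | antecedent)
    lift       = confidence / P(consequent)
    """
    n = len(docs)
    sets = [set(d) for d in docs]
    vocab = sorted(set().union(*sets))
    count = {w: sum(w in s for s in sets) for w in vocab}
    rules = []
    for a, b in combinations(vocab, 2):
        # Co-occurrence count: the basic quantity behind co-word analysis.
        both = sum(a in s and b in s for s in sets)
        support = both / n
        if support < min_support:
            continue
        for x, y in ((a, b), (b, a)):
            confidence = both / count[x]
            lift = confidence / (count[y] / n)
            rules.append((x, y, support, confidence, lift))
    return rules

# Toy, hypothetical segmented abstracts (not taken from the conference corpus).
docs = [
    ["climate", "carbon", "education"],
    ["climate", "carbon"],
    ["climate", "education"],
    ["energy", "education"],
]
rules = pair_rules(docs, min_support=0.5)
```

Here the rule from "carbon" to "climate" has confidence 1.0 and lift above 1, because every toy document containing "carbon" also contains "climate". A lift below 1, as for "climate" and "education" in this data, indicates that the pair co-occurs less often than independence would predict.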

Subject classification: Engineering > Municipal and Environmental Engineering
Social Sciences > Education
