Title

以維基百科為基礎之中文縮寫詞與同義詞庫建構

Parallel Title

Wikipedia-based Chinese Abbreviation and Synonym Construction

Authors

黃純敏(Chuen-Min Huang);李亞哲(Ya-Che Li);陳柏宏(Po-Hung Chen)

Keywords

synonym ; abbreviation ; generalized term ; Wikipedia ; polysemy

Journal

資訊管理學報

Volume/Issue and Publication Date

Vol. 22, No. 2 (2015/04/01)

Pages

117 - 140

Language

Traditional Chinese

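Chinese Abstract (English translation)

Although much prior research has addressed the identification of abbreviations, its scope has not included generalized terms. Moreover, the continual growth and change of vocabulary has become the greatest problem for information retrieval and lexicon maintenance. Unlike earlier statistical approaches, this study builds on the structure of Wikipedia article text and proposes several novel, lightweight methods for identifying synonym pairs. Because synonyms have no absolutely objective standard answer to check against, we validated the proposed methods with a two-stage evaluation combining subjective and objective measures. The experimental results show that, besides effectively extracting abbreviations, variant-form synonyms, and homographs, the proposed methods can also identify generalized terms, which earlier research could not resolve. In the first-stage evaluation, average precision was 72% and recall was 82%; precision for abbreviations reached 92%, and recall for generalized terms was 90%. In the second-stage evaluation, user acceptance reached 91%. In terms of efficiency, finding one set of synonyms takes only 0.01 seconds on average.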

English Abstract

Purpose: A synonym is a word of any part of speech with the same or a similar meaning as another word. Broadly speaking, synonyms include abbreviations. By convention, authors often use numerous synonyms to lend their writing artistic quality. Because synonyms are interchangeable and new usages grow rapidly, they increase the difficulty of Natural Language Processing (NLP) and vocabulary maintenance. Traditional approaches rely on statistical methods to determine synonyms and tend to produce erroneous results; in contrast, this study aims to construct a comprehensive synonym database using lightweight methods that also take the updating problem seriously.
Design/methodology/approach: The study proposes a research framework based on analysis of the contextual structure of Wikipedia. Because no recognized gold-standard corpus exists for assessing synonyms, we adopted a two-stage evaluation combining subjective and objective measures. By drawing on continuous user involvement and suggestions, the constructed synonym database is updated accordingly.
Findings: The proposed methods not only correctly identify abbreviations, synonyms, and homographs but also successfully extract generalized terms together with their multinomial sub-terms, which had not been done before. This finding indicates that the comma algorithm could be deployed more widely in other customized applications. The precision and recall of the first-stage evaluation are 72% and 82%, respectively. The user acceptance rate in the second-stage evaluation reached 91%, which is very promising. As for efficiency, it took only 0.01 seconds to extract one set of synonyms from the system.
Research limitations/implications: This study focused mainly on formal descriptions extracted from Wikipedia. Future research may consider applying the approach to confusion word sets or social media to fill the gap.
Practical implications: This paper contributes to automatic synonym construction research in several ways, with a few practical implications. First, it demonstrates that a statistics-free, lightweight method can effectively generate comprehensive coverage of synonyms. Second, the method can work with search engines to conduct big data analysis. Third, the study shows that synonym construction can be framed in terms of an ontology architecture to support the sustainability of knowledge and the growth of users' literacy competencies.
Originality/value: Although there has been much research on synonyms, none has proposed a solution for identifying a generalized term with its multinomial sub-terms. This study is the first of its kind to solve this problem. In addition, words are labeled with their named-entity type, such as names of people, places, and organizations. Search results are displayed according to the ontology architecture, in which word associations can be clearly visualized.

Subject Classification

Basic and Applied Sciences > Information Science
Social Sciences > Management
References
  1. Lin, C.J.,Zhan, J.C.,Chen, Y.H.,Pao, C.W.(2012).Strategies of processing Japanese names and variant characters in traditional Chinese text.Computational Linguistics and Chinese Language Processing,17(3),87-108.
  2. Chang, J.S.,Lai, Y.T.(2004).A preliminary study on probabilistic models for Chinese abbreviations.Proceedings of the Third SIGHAN Workshop on Chinese Language Processing (SIGHAN 2004),Barcelona, Spain:
  3. Chang, J.S.,Teng, W.L.(2006).Mining atomic Chinese abbreviation pairs with a probabilistic single character word recovery model.Proceedings of the fifth SIGHAN Workshop on Chinese Language Processing (SIGHAN 2006),Sydney, Australia:
  4. Fu, M.H.,Peng, C.H.,Kuo, Y.H.,Lee, K.R.(2012).Hidden community detection based on microblog by opinion-consistent analysis.Proceedings of 2012 International Conference on Information Society (i-Society),London, UK:
  5. Hong, C.M.,Chen, C.M.,Chiu, C.Y.(2009).Automatic extraction of new words based on Google News corpora for supporting lexicon-based Chinese word segmentation systems.Expert Systems with Applications,36(2),3641-3651.
  6. Huang, C.M.,Yang, C.P.(2005).Chinese Abbreviations and Expansion.Proceedings of the National Computer Symposium (NCS 2005),Tainan, Taiwan:
  7. Kang, S.S.,Hwang, K.B.(2006).A language independent n-gram model for word segmentation.Proceedings of the 19th Australian joint conference on Artificial Intelligence: advances in Artificial Intelligence (AUS-AI 2006),Hobart, Australia:
  8. Kit, C.,Xu, Z.,Webster, J.J.(2003).Integrating ngram model and case-based learning for Chinese word segmentation.Proceedings of the second SIGHAN workshop on Chinese language processing (SIGHAN 2003),Sapporo, Japan:
  9. Li, Z.,Yarowsky, D.(2008).Unsupervised translation induction for Chinese abbreviations using monolingual corpora.Proceedings of the Association for Computational Linguistics (ACL-2008),Columbus, Ohio:
  10. Lu, Y.,Zhang, C.,Hou, H.(2009).Using Multiple Hybrid Strategies to Extract Chinese Synonyms from Encyclopedia Resource.Proceedings of the 2009 Fourth International Conference on Innovative Computing, Information and Control (ICICIC 2009),Kaohsiung, Taiwan:
  11. Nguyen, H.T.,Cao, T.H.(2008).Named entity disambiguation on an ontology enriched by Wikipedia.Proceedings of the IEEE International Conference on Research, Innovation and Vision for the Future (RIVF 2008),Ho Chi Minh City, Vietnam:
  12. Tsai, R.T.H.(2010).Chinese text segmentation: A hybrid approach using transductive learning and statistical association measures.Expert Systems with Applications,37(5),3553-3560.
  13. Zhang, J.,Nie, J.Y.,Gao, J.,Ming, Z.(2000).On the use of words and N-grams for Chinese information retrieval.Proceedings of the fifth International Workshop on Information Retrieval with Asian languages (IRAL 2000),Hong Kong, China:
  14. Zhang, K.,Liu, Q.,Zhang, H.,Cheng, X.Q.(2002).Automatic recognition of Chinese unknown words based on roles tagging.Proceedings of the first SIGHAN workshop on Chinese language processing (SIGHAN 2002),Taipei, Taiwan:
  15. 梅家駒、竺一鳴、高蘊琦、殷鴻翔(1982)。同義詞詞林。上海辭書出版社。
  16. 陳良駒、陳日鑫(2010)。植基於詞彙數量關係探討軍事新聞主題─以青年日報為例。資訊管理展望,12(1),21-42。
  17. 馮志偉(2009)。語義互聯網與辭書編纂。暨南大學華文學院學報,4(4),88-94。
  18. 黃純敏、石朝元、張精哲(2007)。中文縮寫詞延伸研究。第十八屆國際資訊管理學術研討會論文集(ICIM 2007),台灣:
  19. 黃純敏、蕭明華(2012)。改進中文縮寫詞與原形詞配對率。第十八屆兩岸資訊發展高峰論壇(CSIM 2012),臺灣:
Cited by
  1. 何篤光,石偉源(2020)。應用文字探勘法分析大數據時代體育研究發展趨勢。體育學報,53(4),439-451。
  2. (2017)。基於詞性組合規則結合維基百科進行中文命名實體辨識與消歧義。圖書資訊學研究,11(2),139-179。