Title

詞嵌入應用於佛學研究—兼論詞嵌入模型評估

Parallel Title

Word Embedding in Buddhist Studies: On the Basis of Evaluation of Word Embedding Models

DOI

10.6853/DADH.202310_(12).0003

Authors

黃淑齡(Shu-Ling Huang);王昱鈞(Yu-Chun Wang)

Keywords

詞嵌入 ; 漢文大藏經 ; 佛學研究 ; 語義關係 ; 語義類比 ; word embedding ; Chinese Tripitaka (CBETA) ; Buddhist studies ; word relations ; word analogy

Journal

數位典藏與數位人文

Issue / Publication Date

No. 12 (2023/10/01)

Pages

43-82

Language

Traditional Chinese; English

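Abstract (translated from the Chinese)

Word embedding is a method that automatically generates semantic vectors from a corpus. This paper aims to explore possible applications of word embedding to the Chinese Buddhist canon in the Comprehensive Buddhist Electronic Text Archive (CBETA). To obtain the best word embedding model for Buddhist studies, we build an experimental dataset from Chunjiang Zhuang's dictionary, Fubao Ding's dictionary, and the Digital Dictionary of Buddhism, and design two evaluation experiments, synonym detection and outlier-word detection, to obtain a baseline for model optimization. The results show that Word2Vec CBOW (continuous bag-of-words) with Dimension 400, Window 10, and Epoch 10 is the best hyperparameter combination, with a validation accuracy of 0.87 and a test accuracy of 0.86. On this basis, we partition the CBETA corpus to train separate word embedding models, generate comparative word lists for sub-corpora defined by period, translator, and canonical division, and analyze practical applications. The paper makes three main contributions: (1) building a word embedding synonym dataset suited to research on Chinese Buddhist texts; (2) identifying word embedding hyperparameters suited to Chinese Buddhist texts; and (3) discussing and analyzing case studies of word embedding in research on Chinese Buddhist texts, showing that it can be used to trace the evolution of the semantic core of translated terms, to delimit unclear meanings, to find related concepts through semantic analogy, to identify the core concepts of each canonical division, to broaden and deepen research, and to verify the results of traditional research.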

English Abstract

Word embedding is a method for automatically generating semantic vectors from corpora. This paper aims to explore the possible applications of word embedding in the Chinese Buddhist database (Comprehensive Buddhist Electronic Text Archive, CBETA). In order to obtain the best word embedding model for Buddhist studies, we compile an experimental dataset using Chunjiang Zhuang's dictionary, Fubao Ding's dictionary, and the Digital Dictionary of Buddhism, and design two evaluation experiments, one detecting synonyms and one detecting outlier words, to obtain a baseline for model optimization. We find that Word2Vec CBOW with Dimension 400, Window 10, and Epoch 10 is the best set of hyperparameters, with a validation score of 0.87 and a test score of 0.86. Accordingly, we partition the CBETA corpus to train different models, generate comparative word lists for different chronologies, translators, and schools of Buddhism, and demonstrate the applications in real cases. The main contribution of this paper is threefold: 1. to build a synonym collection for word embedding used in the study of Chinese Buddhism; 2. to identify the hyperparameters of word embedding for the study of Chinese Buddhism; 3. to explore and demonstrate the results of word embedding in Chinese Buddhist studies, including the ability to determine the semantic core evolution of translated words, to define new words, to identify related concepts through semantic analogy, to identify the core concepts of each school, and to expand the scope of research. In addition, it can be used to verify the results of traditional research.
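
The two evaluation experiments named in the abstract, synonym detection and outlier-word detection, can be illustrated with a minimal sketch. The abstract does not give the exact protocol, so the scoring rules below (nearest candidate by cosine similarity; lowest mean similarity to the rest of the group) are assumptions, as are the function names and the three-dimensional toy vectors with placeholder tokens. In practice the vectors would come from a trained model such as the Word2Vec CBOW configuration reported above.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pick_synonym(vectors, target, candidates):
    """Assumed rule: the candidate closest to the target is the synonym."""
    return max(candidates, key=lambda w: cosine(vectors[target], vectors[w]))

def pick_outlier(vectors, group):
    """Assumed rule: the word least similar on average to the others is the outlier."""
    def mean_similarity(word):
        others = [w for w in group if w != word]
        return sum(cosine(vectors[word], vectors[w]) for w in others) / len(others)
    return min(group, key=mean_similarity)

def accuracy(predictions, gold):
    """Fraction of predictions that match the gold answers."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

# Hypothetical toy vectors: "a*" tokens form one sense cluster, "b*" another.
TOY_VECTORS = {
    "a1": [1.0, 0.1, 0.0],
    "a2": [0.9, 0.2, 0.1],
    "a3": [0.8, 0.0, 0.1],
    "b1": [0.0, 1.0, 0.2],
    "b2": [0.1, 0.9, 0.1],
}

# Each synonym item: (target, candidates, gold synonym).
SYNONYM_ITEMS = [("a1", ["a2", "b1"], "a2"), ("b1", ["a3", "b2"], "b2")]
# Each outlier item: (word group, gold outlier).
OUTLIER_ITEMS = [(["a1", "a2", "a3", "b1"], "b1"), (["b1", "b2", "a1"], "a1")]

synonym_accuracy = accuracy(
    [pick_synonym(TOY_VECTORS, t, c) for t, c, _ in SYNONYM_ITEMS],
    [gold for _, _, gold in SYNONYM_ITEMS],
)
outlier_accuracy = accuracy(
    [pick_outlier(TOY_VECTORS, g) for g, _ in OUTLIER_ITEMS],
    [gold for _, gold in OUTLIER_ITEMS],
)
print(synonym_accuracy, outlier_accuracy)
```

A score computed this way over a held-out item set is one plausible reading of the validation and test scores quoted in the abstract; the paper itself should be consulted for the actual procedure.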

Subject Classification

Humanities > General Humanities
Basic and Applied Sciences > Information Science
Cited By
  1. (2024)。深度學習方法在中國佛教經典目錄分類中的應用。圖書資訊學刊,22(1),133-164。