English Abstract
|
Purpose - This study investigates the feasibility of applying Latent Dirichlet Allocation (LDA) to a large corpus of classical Chinese poems. It explores word usage, the connotations of poems, and the topical associations between poems, and observes how vocabulary changed between dynasties.
Design/methodology/approach - Because word segmentation techniques designed for vernacular Chinese are often inadequate for classical Chinese poetry, this study proposes two segmentation methods: Chinese Syntactic Chain Processing (CSCP) and the Chinese Classic Poetic Formula (CCPF). The experimental material was collected from "The Complete Tang Poetry" and "The Complete Song Poems", totaling 204,633 poems. The segmented text was used to construct the bag of words for LDA, and CSCP-LDA and CCPF-LDA were then implemented, producing four Tang and Song dynasty topic models. All topics were estimated and inferred using Gibbs sampling with preset parameters α = 0.5 and β = 0.1. Based on the calculated perplexity, the number of topics was set to 110 and the number of iterations to 600.
Findings - Although the Tang collection contains far fewer poems than the Song collection, more unique words were identified in Tang poetry than in Song poetry, indicating that Tang poetry is more pluralistic, lively and diversified, whereas Song poetry tends to be conservative and cautious. The segmentation accuracy of CSCP is lower than that of CCPF, but both UMass topic coherence and expert evaluation indicate that the poetic themes generated by CSCP-LDA are better than those generated by CCPF-LDA.
Research limitations/implications - Although CCPF segments words accurately, it cannot be applied to non-regulated verse, and CCPF-LDA classifies poems less well than CSCP-LDA. Future research is recommended to explore the classification of ancient poetry using other approaches, such as deep neural networks.
Practical implications - Although literati distinguish poets and poems by style, the rules behind these distinctions are neither obvious nor generally agreed upon; it is therefore difficult to derive rules for classifying poetry from critics' comments or from the poems alone. To the best of our knowledge, CSCP is the first method of its kind to analyze ancient poetry without relying on the rules of classical Chinese regulated verse. This study is also the only one to apply LDA to analyze the meaning of verses. The promising topic modeling results suggest that traditional vernacular word segmentation methods and the removal of single characters are not suitable for processing the words of ancient poetry.
Originality/value - We propose a new poetry segmentation method. The fundamental idea of CSCP is a bottom-up concatenation process that extracts meaningful descriptors from a string, based on the intensity and significance of the distribution rate, by processing the direct link and the inverted link in parallel. The process iterates until no further concatenation can be found.
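As an illustration of the topic estimation step mentioned in the abstract, the following is a minimal sketch of a textbook collapsed Gibbs sampler for LDA, using the stated hyperparameters α = 0.5 and β = 0.1 as defaults. It is not the authors' implementation: the function name `lda_gibbs`, the toy documents, and the small topic and iteration counts are assumptions chosen for illustration, and the segmentation step (CSCP or CCPF) is assumed to have already produced the token lists.

```python
import random


def lda_gibbs(docs, num_topics, alpha=0.5, beta=0.1, iterations=600, seed=0):
    """Collapsed Gibbs sampling for LDA over pre-segmented documents.

    docs: list of token lists (assumed to be already segmented).
    Returns (doc_topic_counts, topic_word_counts, vocabulary).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V, K = len(vocab), num_topics

    n_dk = [[0] * K for _ in docs]      # tokens in document d assigned to topic k
    n_kw = [[0] * V for _ in range(K)]  # times word w is assigned to topic k
    n_k = [0] * K                       # total tokens assigned to topic k
    z = []                              # topic assignment of every token

    # Random initialisation of the topic assignments.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(K)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][word_id[w]] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = word_id[w], z[d][i]
                # Remove the current token from all counts.
                n_dk[d][k] -= 1
                n_kw[k][v] -= 1
                n_k[k] -= 1
                # Unnormalised full conditional p(z = t | all other assignments).
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][v] + beta) / (n_k[t] + V * beta)
                    for t in range(K)
                ]
                k = rng.choices(range(K), weights=weights)[0]
                # Record the newly sampled topic.
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][v] += 1
                n_k[k] += 1
    return n_dk, n_kw, vocab


# Hypothetical toy corpus (made-up tokens, not taken from the study's data).
toy_docs = [
    ["moon", "frost", "homeland", "moon"],
    ["river", "boat", "moon", "frost"],
    ["wine", "sword", "frontier", "horse"],
    ["frontier", "horse", "wine", "sword"],
]
doc_topic, topic_word, vocab = lda_gibbs(toy_docs, num_topics=2, iterations=50)
```

In the setting the abstract describes, `num_topics` would be 110 and `iterations` 600; the toy call uses smaller values only so that it runs instantly. The returned count tables can be normalised with the same α and β to obtain per-document topic proportions and per-topic word distributions.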
|