Title

人工智慧在中文歷史文獻判讀領域應用初探:以國立故宮博物院典藏為例

Parallel title

A Preliminary Study on the Application of Artificial Intelligence in the Interpretation of Chinese Historical Documents: A Case Study of National Palace Museum Collection

Authors

黃宇暘(Yu-Yang Huang);郭鎮武(Chen-Wo Kuo);周維強(Wei-Qiang Zhou);林國平(Quo-Ping Lin);蔡瑞煌(Rua-Huan Tsaih)

Keywords

清史研究 ; 人工智慧 ; 字符識別 ; 文獻數位化 ; 國立故宮博物院 ; Studies in Qing History ; Artificial intelligence ; Character Recognition ; Digitalizing Historical Documents ; National Palace Museum

Journal

科技博物

Volume and issue / publication date

Vol. 25, No. 3 (2021/09/01)

Pages

5 - 23

Language

Traditional Chinese

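Abstract (translated from the Chinese)

The National Palace Museum (hereafter NPM) holds nearly 700,000 artifacts. This sheer volume is not only an enormous challenge for digitization work; the subsequent interpretation and use of the material is also a difficult threshold for researchers. Since 2017, the Digital Archives Section of the Department of Rare Books and Historical Documents has undertaken the "Subordinate Program of Digitalizing Crucial Historical Documents in High Resolutions" (圖書文獻高解析重點項目數位化子計畫), which plans to complete nearly 400,000 pages of digital files; together with the files completed in previous years, the total is already close to 2.4 million pages. The long work schedule and the large body of digital assets prompted the authors to consider how new technology could improve value-added applications of documents that have already been scanned. The important first step in document digitization is establishing full-text search. Producing the scanned images is already time-consuming, and identifying their content manually consumes even more resources. The program therefore introduces artificial intelligence to perform text recognition and metadata-assisted classification while the images are being scanned, in order to speed up digitization. This also leaves room for a range of later value-added applications, such as linking geographic information in the documents to a GIS (Geographic Information System) so that all Qing archives and palace memorials can be retrieved by place name, or automatically linking persons mentioned in the documents to the Qing archives name authority file database, labeling the official titles they held at the time, and automatically building networks of shared regional ties or social association, which would greatly ease the work of researchers in Qing history. Although artificial intelligence cannot yet directly and perfectly recognize and punctuate these documents, academia has already discussed a number of cases; this paper offers some reflections on that basis, in the hope of prompting further discussion and advancing the digitization of the NPM's Qing history documents.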

English abstract

The National Palace Museum (NPM) holds a collection of nearly 700,000 objects. Its sheer size is not only a great challenge for digitization but also a high threshold for researchers in subsequent interpretation and application. Since 2017, the Department of Rare Books and Historical Documents has been carrying out the "Subordinate Program of Digitalizing Crucial Historical Documents in High Resolutions", which it submitted as a bid under the Executive Yuan's Forward-looking Infrastructure Development Program. The program's main goal is to digitize at least 400,000 pages, which, together with the files completed in earlier years, brings the total to nearly 2.4 million pages of digital files. The long work schedule and the large volume of digital assets have prompted us to consider how new technologies can improve value-added applications of the completed digital scans. An important first step in digitizing documents is building full-text search. Producing the scanned images is already time-consuming, and transcribing their content manually is even more resource-intensive. The program therefore introduces artificial intelligence to perform text recognition and metadata-assisted classification as documents are scanned, speeding up digitization. This also opens possibilities for later value-added applications, such as connecting geographic information in the documents to a GIS (Geographic Information System) so that Qing dynasty archives and memorials can be retrieved by place name, or automatically linking persons named in the documents to the Qing archives name authority database, tagging the official titles they held at the time, and automatically building networks of regional ties or social connections, all of which would make research on Qing history far more convenient. Although artificial intelligence still cannot recognize and punctuate these documents perfectly on its own, a number of case studies exist in the academic literature, and this paper offers some observations on that basis in the hope of advancing the digitization of the NPM's Qing history documents.

Subject classification: Humanities > Arts
Social Sciences > Education
Social Sciences > Management