Title

一種多模型融合的中文古籍OCR後處理方法

Parallel title

A Post-OCR Method of Multi-Model Ensemble for Chinese Ancient Scriptures

DOI

10.6853/DADH.202304_(11).0003

Author

釋賢超(Xianchao Shi)

Keywords

post-OCR ; 古籍 ; 模型融合 ; 版面分析 ; 深度學習 ; post-OCR ; ancient scriptures ; model ensemble ; layout analysis ; deep learning

Journal

數位典藏與數位人文

Issue / publication date

No. 11 (2023/04/01)

Pages

83 - 104

Language

Traditional Chinese; English

Chinese abstract

本文提出一種多模型融合的OCR後處理方法,採用獨特的版面分析和對齊算法,整合了整頁檢測模型、字識別模型、列識別模型與語言預訓練模型等深度學習模型,實現了超越單一模型的效果。全文錯誤率達到1.64%,僅為單一模型平均錯誤率的23%。在各類常規古籍版式場景中,該方法具有較好的泛用性。

English abstract

This paper proposes a multi-model ensemble method for OCR post-processing. It uses a distinctive layout analysis and alignment algorithm to integrate several types of deep learning models, including a full-page character detection model, a character recognition model, a line recognition model, and pre-trained language models, and it achieves results beyond those of any single model. The full-text error rate is 1.64%, only 23% of the average error rate of the single models. The method generalizes well across a variety of conventional ancient-book layouts.

Subject classification

Humanities > General Humanities
Basic and Applied Sciences > Information Science

References

  1. ethanyt. (n.d.-a). guwenbert-base. Retrieved from https://huggingface.co/ethanyt/guwenbert-base
  2. ethanyt. (n.d.-b). guwenbert-large. Retrieved from https://huggingface.co/ethanyt/guwenbert-large
  3. 司馬光 (Sima Guang, Song dynasty). (2023). 司馬氏書儀. Retrieved from https://new.shuge.org/wp-content/uploads/2023/02/sima_shi_shu_yi018.jpg
  4. Dai, J., Li, Y., He, K., Sun, J. (2016). R-FCN: Object detection via region-based fully convolutional networks. Advances in Neural Information Processing Systems 29, Red Hook, NY:
  5. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human language technologies, volume 1 (long and short papers), Minneapolis, MN:
  6. D'hondt, E., Grouin, C., Grau, B. (2017). Generating a training corpus for OCR post-correction using encoder-decoder model. Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Taipei:
  7. KoichiYasuoka. (n.d.-a). roberta-classical-chinese-base-char. Retrieved from https://huggingface.co/KoichiYasuoka/roberta-classical-chinese-base-char
  8. KoichiYasuoka. (n.d.-b). roberta-classical-chinese-large-char. Retrieved from https://huggingface.co/KoichiYasuoka/roberta-classical-chinese-large-char
  9. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Stoyanov, V. (2019). Unpublished.
  10. Ma, W., Zhang, H., Jin, L., Wu, S., Wang, J., Wang, Y. (2020). Joint layout analysis, character detection and recognition for historical document digitization. 2020 17th International Conference on Frontiers in Handwriting Recognition, Dortmund, Germany:
  11. Magallon, T., Béchet, F., Favre, B. (2018). Détection d’erreurs dans des transcriptions OCR de documents historiques par réseaux de neurones récurrents multi-niveau. Actes de la Conférence TALN. Volume 1 - Articles longs, articles courts de TALN, Rennes, France:
  12. Nguyen, T.-T. H., Jatowt, A., Nguyen, N.-V., Coustaty, M., Doucet, A. (2020). Neural machine translation with BERT for post-OCR error detection and correction. Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020, New York, NY:
  13. Redmon, J., Divvala, S., Girshick, R., Farhadi, A. (2016). You only look once: Unified, real-time object detection. 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV:
  14. Redmon, J., Farhadi, A. (2018). Unpublished.
  15. Ren, S., He, K., Girshick, R., Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems 28, Red Hook, NY:
  16. Wu, S., Wang, J., Ma, W., Jin, L. (2020). Precise detection of Chinese characters in historical documents with deep reinforcement learning. Pattern Recognition, 107, 107503.
  17. Xie, Z., Huang, Y., Jin, L., Liu, Y., Zhu, Y., Gao, L., Zhang, X. (2019). Weakly supervised precise segmentation for historical document images. Neurocomputing, 350, 271-281.
  18. Yang, H., Jin, L., Huang, W., Yang, Z., Lai, S., Sun, J. (2018). Dense and tight detection of Chinese characters in historical documents: Datasets and a recognition guided detector. IEEE Access, 6, 30174-30183.
  19. Yang, H., Jin, L., Sun, J. (2018). Recognition of Chinese text in historical documents with page-level annotations. 2018 16th International Conference on Frontiers in Handwriting Recognition, Niagara Falls, NY:
  20. Zhang, S., Huang, H., Liu, J., Li, H. (2020). Spelling error correction with soft-masked BERT. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
  21. 中國人工智能產業發展聯盟 (2020). OCR 服務智能化分級技術要求和評估方法. Retrieved from http://aiiaorg.cn/uploadfile/2020/0330/20200330042334199.pdf