Title

從語料庫建構探討臺灣客語難字、缺字與異體字議題

Parallel Title

Rare Characters, Missing Characters and Character Variants in Taiwan Hakka: An Exploration from Corpus Construction

DOI

10.6710/JTLL.202304_18(1).0003

Authors

葉秋杏(Chiou-Shing YEH);賴惠玲(Huei-Ling LAI)

Keywords

難字 ; 缺字 ; 異體字 ; 一字多碼 ; 臺灣客語語料庫 ; rare character ; missing character ; character variant ; multiple codes for the same character ; Taiwan Hakka Corpus

Journal

臺灣語文研究 (Journal of Taiwanese Languages and Literature)

Volume/Issue and Publication Date

Vol. 18, No. 1 (2023/04/01)

Pages

135 - 183

Language

Traditional Chinese; English

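Chinese Abstract (English translation)

Taiwan Hakka texts contain many rare characters, missing characters, and character variants, all of which make character processing during corpus construction laborious and complicated. This paper first outlines the current state of character usage in Taiwan Hakka, covering representative non-governmental Hakka dictionaries and the official standards. Next, drawing on the experience of building the Taiwan Hakka Corpus (《臺灣客語語料庫》), it introduces the Corpus's character conventions and, based on text data cleaning, analyzes the types of character correction applied to texts: converting Hakka romanization into Hakka Han characters, unifying Hakka character usage, deleting superfluous characters, supplying missing characters, reordering transposed characters, and correcting characters confused because of their similar shapes. It then examines four situations that arise when rare characters in Hakka texts cannot be displayed properly, namely romanization, phonetic or semantic loan characters, blanks or symbols (missing characters), and decomposition into character components, and demonstrates the corresponding treatment for each. The paper concludes by discussing how to overcome inconsistent character encoding and character variants.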

English Abstract

The digitization of Taiwan Hakka data is immensely complicated due to the many rare characters, missing characters, or character variants found in Taiwan Hakka texts, and is further hindered by inconsistency between non-governmental Hakka dictionaries' writing practice and governmental standards for the Hakka writing system. This study describes how the Taiwan Hakka Corpus Project carried out character correction to ensure the Corpus's usefulness and robustness. First, the study demonstrates the various types of character correction that take place in our text cleaning process, including converting Hakka spellings into characters, unifying different forms of the same word, deleting redundant or repeated characters, filling in missing characters, swapping reversed characters, and correcting characters similar in shape but dissimilar in meaning. Second, we investigate situations in which rare characters cannot be shown properly, and we provide solutions to each situation. These situations include rare characters in Hakka texts being substituted with (1) Hakka spellings, (2) phonetic or semantic loan characters, (3) unintended glyphs such as squares or symbols (i.e., missing characters), and (4) character decomposition. Finally, issues related to multiple codes for the same character and character variants in Hakka texts are tackled.
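
The abstract mentions missing characters shown as squares or symbols, character decomposition, and multiple codes for the same character. The following is a minimal sketch of how such cases might be flagged in a text, not the procedure used by the Taiwan Hakka Corpus Project. The function names and category labels are hypothetical, and the choice of U+FFFD and U+25A1 as placeholder glyphs is an assumption. Only Python's standard `unicodedata` module is used.

```python
import unicodedata

# Hypothetical placeholder glyphs: the replacement character and a white square.
PLACEHOLDERS = {"\uFFFD", "\u25A1"}


def classify_char(ch: str) -> str:
    """Return a hypothetical label describing why a character may need review."""
    cp = ord(ch)
    if unicodedata.category(ch) == "Co":
        # Private-use code points display correctly only with a specific font.
        return "private_use"
    if ch in PLACEHOLDERS:
        return "placeholder"
    if 0x2FF0 <= cp <= 0x2FFB:
        # Ideographic description characters signal a decomposed character.
        return "ids_operator"
    name = unicodedata.name(ch, "")
    if name.startswith("CJK COMPATIBILITY IDEOGRAPH") and unicodedata.decomposition(ch):
        # A second code point for a character that also has a unified code.
        return "compatibility_ideograph"
    return "ok"


def audit(text: str) -> list[tuple[int, str]]:
    """List (index, label) for every character that is not labelled 'ok'."""
    return [(i, label) for i, ch in enumerate(text)
            if (label := classify_char(ch)) != "ok"]


def unify_codes(text: str) -> str:
    """Map compatibility ideographs to their unified code points using NFC."""
    return unicodedata.normalize("NFC", text)
```

Normalization of this kind addresses only duplicates that Unicode itself defines as canonically equivalent. Variants encoded as distinct unified ideographs would still need a manually curated mapping table.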

Subject Classification
Humanities > General Humanities
Humanities > Linguistics