Title

利用文字探勘建立醫學主題詞與基因名稱之關聯性

Parallel Title

Associations between medical subject headings and gene names based on text-mining in PubMed

DOI

10.6288/TJPH.201802_37(1).106086

Authors

林宜歆(Yi-Hsin Lin);林嶔(Chin Lin);葉釋仁(Shih-Jen Yeh);蘇遂龍(Sui-Lung Su)

Keywords

文字探勘 ; 醫學主題詞 ; 基因名稱 ; 常一起出現在摘要中的相關性 ; text-mining ; medical subject headings (MeSH) ; gene names ; often appearing in abstracts

Journal

Taiwan Journal of Public Health (台灣公共衛生雜誌)

Volume/Issue and Publication Date

Vol. 37, No. 1 (2018/02/15)

Pages

12-23

Language

Traditional Chinese

Chinese Abstract

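Objectives: The volume of published biomedical literature has grown steadily in recent years, making it necessary to use automated computing to organize the literature and provide useful information. Well-known websites that automatically organize biological information include Coremine, STRING, and DisGeNet, but none of them shows the indirect associations between terms. This study examines, in the unstructured abstracts indexed in PubMed, how often medical subject headings (MeSH) and gene names were used in different years and how strongly the terms are associated with one another. Methods: This study used a text-mining design. The sample consisted of 26,295,751 articles retrieved and downloaded from PubMed on July 8, 2016. Using the medical subject headings compiled by the U.S. National Library of Medicine and the official gene names compiled by the HUGO Gene Nomenclature Committee, and after retrieving their synonyms, we built a dictionary of 27,883 medical subject headings and a dictionary of 39,903 human genes. Text matching was used to extract the medical subject headings and gene names contained in each abstract, the number of occurrences of each term in abstracts was counted by year, and word2vec was used to analyze the strength of association between terms. Results: We built an interactive website for querying the number and frequency of occurrences of medical subject headings and gene names in abstracts by year, as well as the related terms that most often appear together with them in abstracts (https://yihsin.shinyapps.io/meshgeneterm_relation/). We found that from 2012 onward many articles mentioned China in their abstracts, and that its rank by count broke into the top 8 in 2016, signaling the rise of China in academia; Health rose from 7th to 3rd place, which may indicate growing attention to health issues. Taking osteoarthritis as an example, the terms that most often appeared together with it in abstracts included amputation surgery, patella, knee joint, and paralysis, and the indirect associations among these terms are also visible. Conclusions: The website built in this study shows how often and how frequently each medical subject heading and gene name was used in abstracts in different years, and how strongly it is associated with the terms that most often appear with it. This lets researchers exploring a new field quickly gain an overview and obtain suggested research directions, which should benefit future interdisciplinary research.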
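The text-matching and per-year counting steps described in the Methods can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the three-entry dictionary, the sample records, and the function names (`build_matcher`, `extract_terms`, `count_by_year`) are all hypothetical, and the sketch assumes that a term is counted once per abstract in which it appears.

```python
import re
from collections import Counter, defaultdict

# Hypothetical miniature dictionary: canonical term -> synonyms.
# The study's dictionaries hold 27,883 MeSH terms and 39,903 human genes.
TERM_SYNONYMS = {
    "Osteoarthritis": ["osteoarthritis", "degenerative arthritis"],
    "Patella": ["patella", "kneecap"],
    "LMX1B": ["LMX1B"],
}

# Hypothetical (year, abstract) records standing in for PubMed abstracts.
SAMPLE_RECORDS = [
    (2015, "Osteoarthritis of the knee often involves the patella."),
    (2015, "Degenerative arthritis and the kneecap."),
    (2016, "LMX1B mutations cause nail-patella syndrome."),
]


def build_matcher(term_synonyms):
    """Return a case-insensitive regex and a synonym -> canonical lookup."""
    lookup = {}
    for canonical, synonyms in term_synonyms.items():
        for synonym in synonyms:
            lookup[synonym.lower()] = canonical
    # Try longer synonyms first so multi-word synonyms are not cut short.
    alternatives = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(a) for a in alternatives) + r")\b",
        re.IGNORECASE,
    )
    return pattern, lookup


def extract_terms(abstract, pattern, lookup):
    """Return the set of canonical terms mentioned in one abstract."""
    return {lookup[m.group(0).lower()] for m in pattern.finditer(abstract)}


def count_by_year(records, pattern, lookup):
    """Count, per year, the number of abstracts mentioning each term."""
    counts = defaultdict(Counter)
    for year, abstract in records:
        counts[year].update(extract_terms(abstract, pattern, lookup))
    return counts


pattern, lookup = build_matcher(TERM_SYNONYMS)
yearly = count_by_year(SAMPLE_RECORDS, pattern, lookup)
for year in sorted(yearly):
    print(year, dict(yearly[year]))
```

Case-insensitive matching is a simplification made here for brevity; short gene symbols can collide with ordinary words, so a real pipeline would likely need stricter rules for them.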

English Abstract

Objectives: The biomedical literature has expanded rapidly in recent years, making it difficult for users to sort through the volume of publications and organize the information they contain. Well-known websites such as Coremine, STRING, and DisGeNet exist, but they return only words directly related to the search terms and offer no indirectly related suggestions. In this study, we investigated how often medical subject headings (MeSH) and gene names were used in each year, and the associations among them, in the unstructured abstracts in PubMed. Methods: The study used a text-mining design. The sample comprised the 26,295,751 articles in PubMed as of July 8, 2016. Using the MeSH from the U.S. National Library of Medicine in the MeSH Browser, we identified 27,883 terms to establish the MeSH dictionary. Using the official gene names assigned by the HUGO Gene Nomenclature Committee and a search of NCBI Gene, we built a dictionary of 39,903 human genes. The medical subject headings and gene names appearing in the abstracts were then extracted and counted by year. We used word2vec to analyze the associations between the MeSH and gene names. Results: We built an interactive website that provides the number of uses in different years and the related words that most often appeared together with MeSH and gene names in abstracts (https://yihsin.shinyapps.io/meshgeneterm_relation/). For example, the words that appeared most often together with Osteoarthritis were Osteotomy, Nails, Patella, Knee joint, Physical examination, and Paralysis. The website also offers indirectly relevant suggestions.
Conclusions: The website developed in this study reports how often MeSH and gene names were used in different years and which words were most strongly associated with them, indicating that these words were often mentioned and discussed together in medical publications. It also shows the indirect associations between them, so that researchers exploring new areas can quickly gain a general understanding of the field.
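To illustrate what an "indirect" association between terms can look like, the sketch below ranks terms by the cosine similarity of their co-occurrence vectors. This is a simple count-based stand-in, not word2vec and not the method used in the study; the term sets and the function names (`cooccurrence_vectors`, `cosine`, `most_associated`) are hypothetical. Each term's vector records how often it shares an abstract with every term, itself included, so two terms can score above zero through a shared neighbor even if they never appear in the same abstract.

```python
import math
from collections import Counter, defaultdict

# Hypothetical sets of terms extracted from three abstracts.
TERM_SETS = [
    {"Osteoarthritis", "Knee Joint"},
    {"Patella", "Knee Joint"},
    {"Osteoarthritis", "Paralysis"},
]


def cooccurrence_vectors(term_sets):
    """Map each term to counts of the terms sharing an abstract with it.

    The diagonal (a term with itself) is included, so direct co-occurrence
    also contributes to the similarity between two terms.
    """
    vectors = defaultdict(Counter)
    for terms in term_sets:
        for a in terms:
            for b in terms:
                vectors[a][b] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def most_associated(term, vectors, top_n=5):
    """Rank the other terms by similarity to `term`, highest first."""
    target = vectors.get(term, Counter())
    scores = [
        (other, cosine(target, vec))
        for other, vec in vectors.items()
        if other != term
    ]
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return scores[:top_n]


vectors = cooccurrence_vectors(TERM_SETS)
for other, score in most_associated("Osteoarthritis", vectors):
    print(f"{other}: {score:.3f}")
```

In this toy data, Osteoarthritis and Patella never share an abstract, yet they receive a nonzero score because both co-occur with Knee Joint. A trained embedding model such as word2vec captures a related notion of similarity from context, though by a different mechanism.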

Subject Classification: Medicine and Health > Preventive Healthcare and Hygiene
Medicine and Health > Social Medicine
References
  1. Cook K. Unstructured data and the 80 percent rule. Available at: http://clarabridge.com/default.aspx?tabid=137&ModuleID=635&ArticleID=551. Accessed December 3, 2012
  2. U.S. National Library of Medicine. Medical subject headings. Available at: https://www.nlm.nih.gov/mesh/introduction.html. Accessed June, 2016.
  3. PubGene. Coremine medical. Available at: https://www.coremine.com/medical/#search. Accessed November, 2016.
  4. HUGO Gene Nomenclature Committee. Website information. Available at: http://www.genenames.org/. Accessed June, 2016.
  5. Mikolov T, Sutskever I, Chen K, Corrado G, Dean J. Distributed representations of words and phrases and their compositionality. Available at: https://arxiv.org/abs/1310.4546. Accessed June, 2016.
  6. Bauer-Mehren, A,Bundschus, M,Rautschka, M,Mayer, MA,Sanz, F,Furlong, LI(2011).Gene-disease network analysis reveals functional modules in mendelian, complex and environmental diseases.PLoS One,6,e20284.
  7. Bengio, Y,Ducharme, R,Vincent, P,Jauvin, C(2003).A neural probabilistic language model.J Mach Learn Res,3,1137-55.
  8. Bongers, EM,Gubler, MC,Knoers, NV(2002).Nail-patella syndrome. Overview on clinical and molecular findings.Pediatr Nephrol,17,703-12.
  9. Chen, H,Lun, Y,Ovchinnikov, D(1998).Limb and kidney defects in Lmx1b mutant mice suggest an involvement of LMX1B in human nail patella syndrome.Nat Genet,19,51-5.
  10. Guidera, KJ,Satterwhite, Y,Ogden, JA,Pugh, L,Ganey, T(1991).Nail patella syndrome: a review of 44 orthopaedic patients.J Pediatr Orthop,11,737-42.
  11. Hamosh, A,Scott, AF,Amberger, JS,Bocchini, CA,McKusick, VA(2005).Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders.Nucleic Acids Res,33,D514-7.
  12. Hiatt, RA,Sulsky, S,Aldrich, MC,Kreiger, N,Rothenberg, R(2013).Promoting innovation and creativity in epidemiology for the 21st century.Ann Epidemiol,23,452-4.
  13. Hilbert, M,López, P(2011).The world's technological capacity to store, communicate, and compute information.Science,332,60-5.
  14. Hilbert, M,López, P(2012).How to measure the world's technological capacity to communicate, store and compute information? Part I: results and scope.Int J Comm,6,956-79.
  15. Jensen, PB,Jensen, LJ,Brunak, S(2012).Mining electronic health records: towards better research applications and clinical care.Nat Rev Genet,13,395-405.
  16. Lin, YC,Wu, YH,Scher, RK(2008).Nail changes and association of osteoarthritis in digital myxoid cyst.Dermatol Surg,34,364-9.
  17. Meaney, C,Moineddin, R,Voruganti, T,O'Brien, MA,Krueger, P,Sullivan, F(2016).Text mining describes the use of statistical and epidemiological methods in published medical research.J Clin Epidemiol,74,124-32.
  18. Murdoch, TB,Detsky, AS(2013).The inevitable application of big data to health care.JAMA,309,1351-2.
  19. Niu, Y,Otasek, D,Jurisica, I(2010).Evaluation of linguistic features useful in extraction of interactions from PubMed; application to annotating known, high-throughput and predicted interactions in I2D.Bioinformatics,26,111-9.
  20. Piñero, J,Bravo, À,Queralt-Rosinach, N(2017).DisGeNET: a comprehensive platform integrating information on human disease-associated genes and variants.Nucleic Acids Res,45,D833-9.
  21. Raghupathi, W,Raghupathi, V(2014).Big data analytics in healthcare: promise and potential.Health Inf Sci Syst,2,3.
  22. Sacchi, L,Holmes, JH(2016).Progress in biomedical knowledge discovery: a 25-year retrospective.Yearb Med Inform,25(Suppl 1),S117-29.
  23. Salerno, J,Knoppers, BM,Lee, LM,Hlaing, WM,Goodman, KW(2017).Ethics, big data and computing in epidemiology and public health.Ann Epidemiol,27,297-301.
  24. Schmitt, T,Ogris, C,Sonnhammer, EL(2014).FunCoup 3.0: database of genome-wide functional coupling networks.Nucleic Acids Res,42,D380-8.
  25. Schultze, V,Pawlitschko, J(2002).The identification of outliers in exponential samples.Statistica Neerlandica,56,41-57.
  26. Singhal, A,Simmons, M,Lu, Z(2016).Text mining for precision medicine: automating disease-mutation relationship extraction from biomedical literature.J Am Med Inform Assoc,23,766-72.
  27. Szklarczyk, D,Franceschini, A,Wyder, S(2015).STRING v10: protein-protein interaction networks, integrated over the tree of life.Nucleic Acids Res,43,D447-52.
  28. Tigchelaar, S,Lenting, A,Bongers, EM,van Kampen, A(2015).Nail patella syndrome: knee symptoms and surgical outcomes. A questionnaire-based survey.Orthop Traumatol Surg Res,101,959-62.
  29. van Driel, MA,Bruggeman, J,Vriend, G,Brunner, HG,Leunissen, JA(2006).A text-mining analysis of the human phenome.Eur J Hum Genet,14,535-42.
  30. Weber, GM,Mandl, KD,Kohane, IS(2014).Finding the missing link for big biomedical data.JAMA,311,2479-80.
  31. Yih, WT,Toutanova, K,Platt, JC,Meek, C(2011).Learning discriminative projections for text similarity measures.Proceedings of the Fifteenth Conference on Computational Natural Language Learning, 2011,Oregon, USA:
  32. Zuberi, K,Franz, M,Rodriguez, H(2013).GeneMANIA prediction server 2013 update.Nucleic Acids Res,41,W115-22.
Cited By
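  1. 張益誠,張育傑,余泰毅(2021).Exploring automatic document classification techniques for environmental education papers: the case of abstracts from the 2013-2018 environmental education conferences.環境教育研究,17(1),85-128.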