Title

高維度資料特徵選取之探討-應用於分類蛋白質質譜儀資料

Parallel Title

On Feature Selection of High Dimensional Data-Application on Classifying Proteomic Spectra Data

DOI

10.6338/JDA.201106_6(3).0004

Authors

郭訓志(Hsun-Chih Kuo);黃仁澤(Jen-Tse Huang);薛慧敏(Huey-Miin Hsueh)

Keywords

特徵選取 ; 蛋白質質譜儀資料 ; 支援向量機 ; 交叉驗證 ; feature selection ; proteomic spectra ; SVM ; cross-validation

Journal

Journal of Data Analysis

Volume and Issue / Publication Date

Vol. 6, No. 3 (2011/06/01)

Pages

72 - 83

Language

Traditional Chinese

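Chinese Abstract (English translation)

The tumor markers used in routine health examinations have low sensitivity and specificity and cannot detect small tumors, so tumors usually cannot be diagnosed early. The data in this study are serum protein mass spectra obtained with protein chips and surface-enhanced laser desorption/ionization time-of-flight mass spectrometry (SELDI); the serum samples come from healthy individuals and from three groups of prostate cancer patients at different stages. The aim is to select protein features that help distinguish the stages of prostate cancer. Using repeated random subsampling cross-validation and a support vector machine (SVM), all protein features are first ranked by the mean p-value of the t test, the mean p-value of the Kruskal-Wallis test, or the mean misclassification rate; forward selection is then used to find the features of the model with the lowest misclassification rate. To simplify the model, the study also considers the classification results obtained after further extracting features using the correlation coefficient and the coefficient of determination. Among the methods compared, minimum-p-value feature selection based on the Kruskal-Wallis test classifies better, and among the auxiliary extraction methods, the maximum-correlation method most effectively reduces the number of features while preserving classification performance.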

English Abstract

Tumor markers used in routine health examinations often have low sensitivity and specificity, so they cannot detect small tumors in time. This research aims to develop a classification tool for early tumor diagnosis by studying proteomic mass spectra of prostate cancer at different stages. The data are Surface-Enhanced Laser Desorption/Ionization Time-of-Flight Mass Spectrometry (SELDI-TOF-MS) spectra generated from 327 serum samples: 81 from unaffected healthy men (HM), 78 from patients diagnosed with benign prostatic hyperplasia (BPH), 84 from patients with organ-confined prostate cancer (PCA, T1/T2), and 84 from patients with non-organ-confined PCA (T3/T4). The goal is to select features (peaks) of the mass spectra that are useful for classifying the stages of prostate cancer via repeated random subsampling cross-validation. Two families of methods are proposed: the forward minimum-p-value method (with p-values derived from the t test or the Kruskal-Wallis test) and the forward minimum-classification-error method, both combined with SVM. In addition, the maximum-correlation method and the maximum-R² method are considered for further feature selection. In the comparison, the forward minimum-p-value method based on the Kruskal-Wallis test often outperforms the other methods in classification rate. Moreover, the maximum-correlation method effectively reduces the number of features while preserving the classification rate.
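The forward minimum-p-value procedure summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the number of splits, the test fraction, the SVM kernel, or the stopping rule, so all of those are assumptions here. The function names (`make_toy_data`, `random_splits`, `rank_by_mean_pvalue`, `mean_error`, `forward_select`) are hypothetical, and the synthetic data merely stand in for the SELDI spectra. The sketch uses NumPy, SciPy's `kruskal`, and scikit-learn's `SVC`.

```python
import numpy as np
from scipy.stats import kruskal
from sklearn.svm import SVC


def make_toy_data(seed=0, n_per_class=20, n_features=30, n_informative=3):
    """Synthetic stand-in for peak intensities: 4 classes, few informative peaks."""
    rng = np.random.default_rng(seed)
    y = np.repeat(np.arange(4), n_per_class)
    X = rng.normal(size=(y.size, n_features))
    X[:, :n_informative] += 2.0 * y[:, None]  # shift class means on informative peaks
    return X, y


def random_splits(n_samples, n_splits, test_fraction, rng):
    """Yield (train_idx, test_idx) pairs for repeated random subsampling."""
    n_test = max(1, int(round(test_fraction * n_samples)))
    for _ in range(n_splits):
        perm = rng.permutation(n_samples)
        yield perm[n_test:], perm[:n_test]


def rank_by_mean_pvalue(X, y, splits):
    """Rank features by Kruskal-Wallis p-value averaged over the training subsets."""
    classes = np.unique(y)
    pvals = np.zeros((len(splits), X.shape[1]))
    for s, (train, _) in enumerate(splits):
        Xt, yt = X[train], y[train]
        for j in range(X.shape[1]):
            groups = [Xt[yt == c, j] for c in classes]
            pvals[s, j] = kruskal(*groups).pvalue
    return np.argsort(pvals.mean(axis=0))  # smallest mean p-value first


def mean_error(X, y, features, splits):
    """Average SVM misclassification rate on the held-out part of each split."""
    errors = []
    for train, test in splits:
        clf = SVC(kernel="linear")  # kernel choice is an assumption
        clf.fit(X[train][:, features], y[train])
        errors.append(np.mean(clf.predict(X[test][:, features]) != y[test]))
    return float(np.mean(errors))


def forward_select(X, y, ranking, splits, max_features):
    """Add features in ranked order; keep the prefix with the lowest mean error."""
    best_k, best_err = 0, np.inf
    for k in range(1, max_features + 1):
        err = mean_error(X, y, ranking[:k], splits)
        if err < best_err:
            best_k, best_err = k, err
    return ranking[:best_k], best_err


if __name__ == "__main__":
    X, y = make_toy_data()
    splits = list(random_splits(len(y), 10, 0.25, np.random.default_rng(1)))
    ranking = rank_by_mean_pvalue(X, y, splits)
    selected, err = forward_select(X, y, ranking, splits, max_features=8)
    print("selected features:", selected, "mean error:", err)
```

Replacing `kruskal` with a two-sample t test would give the t-test variant for two-class comparisons. Because the same splits are used both to choose the number of features and to report the error, the reported error in this sketch is optimistic; an outer validation loop would be needed for an unbiased estimate.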

Subject Classification: Basic and Applied Sciences > Information Science
Basic and Applied Sciences > Statistics
Social Sciences > Management
References
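  1. 賴基銘, "癌症篩檢未來的展望:SELDI血清蛋白指紋圖譜的應用" [Future prospects for cancer screening: applications of SELDI serum protein fingerprinting], 國家衛生研究院電子報 [National Health Research Institutes e-Newsletter], No. 52, 2004. Retrieved from http://enews.nhri.org.tw/enews_list_new3.php?volume_indx=52&enews_dt=2004-06-25
  2. 長庚大學, 台灣蛋白質體學簡介 [Chang Gung University, Introduction to Proteomics in Taiwan] (2002). Retrieved from http://memo.cgu.edu.tw/inscorelab/corelab/Intro.htm
  3. 衛生署, 民國93年死因統計結果摘要 [Department of Health, Summary of Cause-of-Death Statistics for 2004] (2004). Retrieved from http://doh.gov.tw/statistic/index.htm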
  4. Adam, B. L., Qu, Y., Davis, J. W., Ward, M. D., Clements, M. A., Cazares, L. H., Semmes, O. J., Schellhammer, P. F., Yasui, Y., Feng, Z., Wright, G. L., Jr. (2002). Serum protein fingerprinting coupled with a pattern-matching algorithm distinguishes prostate cancer from benign prostate hyperplasia and healthy men. Cancer Research, 62, 3609-3614.
  5. Alpaydin, E. (2004). Introduction to Machine Learning. Cambridge, MA: MIT Press.
  6. Conover, W. J. (1999). Practical Nonparametric Statistics. New York: Wiley.
  7. Fung, E. T., Enderwick, C. (2002). ProteinChip Clinical Proteomics: Computational Challenges and Solutions. Computational Proteomics Supplement, 32, S34-S41.
  8. Qu, Y., Adam, B. L., Thornquist, M., Potter, J. D., Thompson, M. L., Yasui, Y., Davis, J., Schellhammer, P., Cazares, L., Clements, M., Wright, G. L., Jr., Feng, Z. (2003). Data Reduction Using a Discrete Wavelet Transform in Discriminant Analysis of Very High Dimensionality Data. Biometrics, 59, 143-151.
  9. Reddy, G., Dalmasso, E. A. (2003). SELDI ProteinChip Array Technology: Protein-Based Predictive Medicine and Drug Discovery Applications. Journal of Biomedicine and Biotechnology, 4, 237-241.
  10. Sauve, A. C., Speed, T. P. (2004). Normalization, Baseline Correction and Alignment of High-Throughput Mass Spectrometry Data. Proceedings GENSIPS.
  11. Wagner, M., Naik, D., Pothen, A. (2004). Protocols for Disease Classification from Mass Spectrometry Data. Proteomics, 3, 1692-1698.
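  12. 西滿正 (1996). 癌的最新診斷與治療 [The Latest Diagnosis and Treatment of Cancer]. Taipei: 建宏.
  13. 黃建榮 (2004). Master's thesis. Department of Information Management, Chaoyang University of Technology.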