Title

Statistics in Big Data

Parallel title

大數據時代的統計學 (Statistics in the Era of Big Data)

Authors

James J. Chen (陳章榮); Eric Evan Chen (陳逸凡); Wei-Zhong Zhao (趙衛中); Wen Zou (鄒文)

Keywords

data analytics ; data mining ; data science ; Google Flu Trends

Journal

Journal of the Chinese Statistical Association (中國統計學報)

Volume, issue / publication date

Vol. 53, No. 3 (2015/09/01)

Pages

186 - 202

Language

English

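Chinese abstract (translated)

Developments in biomedicine, computer science, and data storage technology have triggered an explosion of data and information, and have also brought challenges in data acquisition, processing, management, transmission, and analysis. The value of big data lies in the new understanding and actions that result from analyzing its information. The goal of big data analytics is to obtain knowledge from data in order to draw conclusions and make decisions. This article presents the applications and prospects of statistics in the era of big data analytics. Statistics is an old science that uses methods based on probability theory for data analysis and inference. Statistical and data mining methods useful for big data analytics include significance testing, classification, regression/prediction, cluster analysis, association rules, anomaly detection, and visualization. Statistical analysis provides scientific justification for the process from data to knowledge to action, and is indispensable to big data analytics. In addition, big data analytics requires good computer skills for processing information, programming ability, and expertise in the various application domains. Statisticians are capable of taking a leading role in the big data wave.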

English abstract

Technological advances in biomedicine, computing, and storage have led to an explosion of digital information and present new challenges in data acquisition, processing, management, transfer, and analysis. The value of big data lies in the analytical use of its information to generate knowledge and action. The goal of big data analytics is to extract knowledge from the data in order to draw conclusions and make decisions. This article presents a view of the prospects for statistics in the context of big data analytics. Statistics is a long-established discipline for data analysis and inference that uses methods based on probability theory. Statistical and data mining techniques that are useful for big data analytics include significance testing, classification, regression/prediction, cluster analysis, association rule learning, anomaly detection, and visualization. Statistical analysis provides a scientific justification for moving from data to knowledge to action, and is essential to big data analytics. In addition, big data analytics requires good computing skills in information processing and programming, as well as domain expertise in the area of application. Statisticians can serve a leadership role in the big data movement.

Subject classification: Basic and Applied Sciences > Statistics
References
  1. Jordan, J. M.,Lin, D. K. J.(2014).Statistics for Big Data: Are Statisticians Ready for Big Data?.ICSA Bulletin,26,58-65.
  2. Blei, D. M.,Ng, A. Y.,Jordan, M. I.(2003).Latent Dirichlet Allocation.Journal of Machine Learning Research,3,993-1022.
  3. Breiman, L.(2001).Random forests.Machine Learning,45,5-32.
  4. Breiman, L.,Friedman, J.,Olshen, R.,Stone, C.,Steinberg, D.(1995).CART: Classification and Regression Trees.Stanford, CA.:
  5. Chen, C. H.(2002).Generalized Association Plots: Information Visualization via Iteratively Generated Correlation Matrices.Statistica Sinica,12,7-29.
  6. Chen, Chun-houh.,Härdle, Wolfgang,Unwin, Antony(2008).Handbook of Data Visualization.Berlin, Germany:Springer.
  7. Cleveland, William S.(1994).The Elements of Graphing Data.Summit, NJ:Hobart Press.
  8. Cox, DR,Oakes, D.(1984).Analysis of survival data.London, UK:Chapman and Hall.
  9. Davidian, M.,Louis, T. A.(2012).Why statistics?.Science,336,12.
  10. Ginsberg, J.,Mohebbi, M. H.,Patel, R. S.,Brammer, L.,Smolinski, M. S.,Brilliant, L.(2009).Detecting influenza epidemics using search engine query data.Nature,457,1012-1014.
  11. Goodnight, G.(2011).Executive Edge: Statistics make the world work better.analytics magazine
  12. Guyon, I.,Weston, J.,Barnhill, S.,Vapnik, V.(2002).Gene selection for cancer classification using support vector machines.Machine Learning,46,389-422.
  13. Hahn, G. J.,Doganaksoy, N.(2011).A Career in Statistics: Beyond the Numbers.John Wiley & Sons.
  14. Hastie, T.,Tibshirani, R.,Friedman, J.(2001).The Elements of Statistical Learning: Data Mining, Inference, and Prediction.Springer.
  15. Jacoby, William G.(1998).Statistical Graphics for Visualizing Multivariate Data.Thousand Oaks, CA:Sage.
  16. Kotsiantis, S. B.(2007).Supervised machine learning: A review of classification techniques.Informatica,31,249-268.
  17. Laney, D.(2001).,Gartner.
  18. Lazer, D.,Kennedy, R.,King, G.,Vespignani, A.(2014).The Parable of Google Flu: Traps in Big Data Analysis.Science,343,1203-1205.
  19. McCullagh, P.,Nelder, J. A.(1989).Generalized Linear Models.London:Chapman and Hall.
  20. Tukey, John W.(1977).Exploratory Data Analysis.Reading, MA:Addison-Wesley Publishing Company.
  21. Uesaka, H.(2007).Sample size allocation to regions in a multiregional trial.Journal of Biopharmaceutical Statistics,19,580-594.
  22. Vapnik, V.(1998).Statistical learning theory.New York:Wiley.
Cited by
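  1. 樊祖燁, 趙麗萍, 楊宗達, 周珮雯, 吳政軒 (2019). Making Friends and Exercising for Health: A Case Analysis of the Marketing Plan for the "Group Fitness" Fitness and Social Platform. 美和學報, 38(1), 17-31.
  2. 徐珮清 (2016). A Study of Taiwanese Working Women's Preferences in Overall Styling Design. 台北海洋技術學院學報, 7(2), 1-9.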