Title

A K-means Clustering Algorithm for Mixed-Type Data Sets

Parallel Title

A k-means Based Clustering Algorithm for Mixed-Attribute Data Sets

DOI

10.6188/JEB.2017.19(1).01

Authors

Yeu-Shiang Huang (黃宇翔); Ping-Chung Wang (王品鈞); Chih-Chiang Fang (方志強)

Keywords

Clustering analysis ; k-means ; ordinal attribute ; distance measure

Journal

電子商務學報 (Journal of e-Business)

Volume and Issue / Publication Date

Vol. 19, No. 1 (2017/06/01)

Pages

1 - 28

Language

Traditional Chinese

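Chinese Abstract

Cluster analysis is one of the grouping techniques used in data mining. With the rapid development of the network environment, the variety and number of data attributes have grown substantially. This sharply degrades the performance of traditional clustering techniques, and the traditional k-means method can hardly cope. Subsequent research has therefore focused on handling numerical, categorical, and ordinal attribute data. Building on the k-means distance definition proposed by Ahmad and Dey (2007), this study performs cluster analysis on data sets in which all three attribute types coexist, using a separate distance definition for each type, and proposes a genetic algorithm to find the combination of cluster centers that gives the best value of the evaluation index. We hope the method can be applied in various fields to resolve the practical difficulty of clustering data with three mixed attribute types.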

English Abstract

Clustering is one of the most important analysis methods in data mining. With the rapid development of network technology, the growing variety of attribute types and the large number of data items make data processing for clustering substantially less efficient. Among clustering approaches, partitioning methods are relatively easy to implement and fast to run. Mixed attribute types, however, complicate clustering. Most of the literature addresses either numerical and categorical attributes together or ordinal attributes alone, and the results are less satisfactory in terms of accuracy and execution time. The proposed approach, based on the k-means method of Ahmad and Dey (2007), handles numerical, categorical, and ordinal attributes simultaneously: Euclidean distance defines numerical similarity, the frequency of each value's rank indicates categorical similarity, and a normalized distance measures ordinal similarity. The effectiveness of the approach is evaluated using an essential criterion of clustering, namely minimizing the ratio of within-cluster error to between-cluster error. A genetic algorithm is also developed to reduce execution time when clustering the three attribute types at once. We hope the proposed method provides a useful clustering technique for practical applications.
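
The abstract names one distance per attribute type but gives no formulas. The following is a minimal illustrative sketch of how three such terms might be combined into one record-to-center distance. It is not the authors' definition: the function name `mixed_distance`, the record layout, the use of `1 - relative frequency` for the categorical term, and the unweighted sum of squared terms are all assumptions made for illustration.

```python
from math import sqrt


def mixed_distance(point, center, ordinal_levels):
    """Distance between one record and one cluster center (illustrative only).

    point:  {"num": [floats], "cat": [labels], "ord": [integer ranks, 1-based]}
    center: {"num": [mean values],
             "cat": [dict mapping label -> relative frequency in the cluster],
             "ord": [mean ranks]}
    ordinal_levels: number of levels of each ordinal attribute (each >= 2)
    """
    # Numerical part: squared Euclidean difference to the cluster mean.
    num = sum((x - c) ** 2 for x, c in zip(point["num"], center["num"]))

    # Categorical part (assumed form): a label that is frequent in the
    # cluster is "close" to it, a label never seen there is at distance 1.
    cat = sum((1.0 - freq.get(v, 0.0)) ** 2
              for v, freq in zip(point["cat"], center["cat"]))

    # Ordinal part: rank difference normalized by the rank range, so every
    # ordinal attribute contributes a value between 0 and 1.
    ordn = sum(((r - c) / (m - 1)) ** 2
               for r, c, m in zip(point["ord"], center["ord"], ordinal_levels))

    return sqrt(num + cat + ordn)


record = {"num": [3.0, 4.0], "cat": ["red"], "ord": [1]}
center = {"num": [0.0, 0.0], "cat": [{"red": 1.0}], "ord": [1.0]}
print(mixed_distance(record, center, ordinal_levels=[5]))
```

In this example only the numerical part differs from the center, so the result reduces to the ordinary Euclidean distance between (3, 4) and (0, 0). A weighted combination of the three terms would be an equally plausible design; the abstract does not say which the paper uses.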

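Subject Classification

Humanities > General Humanities
Basic and Applied Sciences > Information Science
Basic and Applied Sciences > Statistics
Social Sciences > General Social Sciences
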
References

  1. Ahmad, A., Dey, L. (2007). A k-mean clustering algorithm for mixed numeric and categorical data. Data & Knowledge Engineering, 63(2), 503-527.
  2. Angelis, L. D., Dias, J. G. (2014). Mining categorical sequences from data using a hybrid clustering method. European Journal of Operational Research, 234(3), 720-730.
  3. Ankerst, M., Breunig, M. M., Kriegel, H. P., Sander, J. (1999). OPTICS: Ordering points to identify the clustering structure. ACM SIGMOD Record, 28(2), 49-60.
  4. Bagozzi, R. P. (Ed.) (1994). Advanced methods of marketing research. Oxford: Blackwell.
  5. Bezdek, J. C. (1981). Pattern recognition with fuzzy objective function algorithms. Norwell, MA, USA: Kluwer Academic Publishers.
  6. Bolshakova, N., Azuaje, F., Cunningham, P. (2005). An integrated tool for microarray data clustering and cluster validity assessment. Bioinformatics, 21(4), 451-455.
  7. Chan, E. Y., Ching, W. K., Ng, M. K., Huang, J. Z. (2004). An optimization algorithm for clustering using weighted dissimilarity measures. Pattern Recognition, 37(5), 943-952.
  8. Corsini, P., Lazzerini, B., Marcelloni, F. (2006). Combining supervised and unsupervised learning for data clustering. Neural Computing and Applications, 15(3-4), 289-297.
  9. Dale, M. B., Anand, M., Desrochers, R. E. (2007). Measuring information-based complexity across scales using cluster analysis. Ecological Informatics, 2(2), 121-127.
  10. Dunn, J. C. (1973). A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. Journal of Cybernetics, 3(3), 32-57.
  11. Ester, M., Kriegel, H. P., Sander, J., Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD 1996), Portland, Oregon, USA.
  12. Gelbard, R., Goldman, O., Spiegler, I. (2007). Investigating diversity of clustering methods: An empirical comparison. Data & Knowledge Engineering, 63(1), 155-166.
  13. Goldberg, D. E., Deb, K. (1991). A comparative analysis of selection schemes used in genetic algorithms. Foundations of Genetic Algorithms, 1, 69-93.
  14. He, Z., Deng, S., Xu, X. (2005). Improving k-modes algorithm considering frequencies of attribute values in mode. Lecture Notes in Computer Science, 3801, 157-162.
  15. He, Z., Xu, X., Deng, S. (2005). Unpublished.
  16. Hsu, C. C., Chen, C. L., Su, Y. W. (2007). Hierarchical clustering of mixed data based on distance hierarchy. Information Sciences, 177(20), 4474-4492.
  17. Huang, Z., Ng, M. K. (1999). A fuzzy k-modes algorithm for clustering categorical data. IEEE Transactions on Fuzzy Systems, 7(4), 446-452.
  18. Jahirabadkar, S., Kulkarni, P. (2014). Algorithm to determine e-distance parameter in density based clustering. Expert Systems with Applications, 41(6), 2939-2946.
  19. Kannan, S. R., Devi, R., Ramathilagam, S., Takezawa, K. (2013). Effective FCM noise clustering algorithms in medical images. Computers in Biology and Medicine, 43(2), 73-83.
  20. Kim, M., Ramakrishna, R. S. (2005). New indices for cluster validity assessment. Pattern Recognition Letters, 26(15), 2353-2363.
  21. Lee, M., Brouwer, R. K. (2007). Fuzzy clustering and mapping of ordinal values to numerical. Proceedings of the 2007 IEEE Symposium on Foundations of Computational Intelligence (FOCI 2007), Honolulu, Hawaii, USA.
  22. MacQueen, J. B. (1967). Some methods for classification and analysis of multivariate observations. Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA.
  23. Maulik, U., Bandyopadhyay, S. (2002). Performance evaluation of some clustering algorithms and validity indices. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(12), 1650-1654.
  24. Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336), 846-850.
  25. Roubens, M. (2003). Multiple criteria choice, ranking, and sorting in the presence of ordinal data and interactive points of view. Proceedings of the 10th International Fuzzy Systems Association World Congress Conference on Fuzzy Sets and Systems (IFSA 2003), Monterey, CA.
  26. Sato-Ilic, M. (1998). Dynamic clustering model for ordinal similarity. Proceedings of the 1998 Conference of the North American Fuzzy Information Processing Society (NAFIPS 1998), Florida, USA.
  27. Sbai, E. H. (2001). Cluster analysis by adaptive rank-order filters. Pattern Recognition, 34(10), 2015-2027.
  28. Wagstaff, K., Cardie, C., Rogers, S., Schrödl, S. (2001). Constrained k-means clustering with background knowledge. Proceedings of the 18th International Conference on Machine Learning (ICML 2001), Williamstown, MA, USA.
  29. Wang, H., Wang, W., Yang, J., Yu, P. S. (2002). Clustering by pattern similarity in large data sets. Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data (SIGMOD 2002), Madison, USA.
  30. Wang, X., Wang, X. L., Chen, C., Wilkes, D. M. (2013). Enhancing minimum spanning tree-based clustering by removing density-based outliers. Digital Signal Processing, 23(5), 1523-1538.
  31. Xiong, X., Chan, K. L., Tan, K. L. (2004). Similarity-driven cluster merging method for unsupervised fuzzy clustering. Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence (UAI 2004), Banff, Canada.
  32. Zheng, B., Yoon, S. W., Lam, S. S. (2014). Breast cancer diagnosis based on feature extraction using a hybrid of k-means and support vector machine algorithms. Expert Systems with Applications: An International Journal, 41(4), 1476-1482.