Title

DK-means:一個新的使用於資料庫進行資料探勘之高穩定性分群技術

Parallel Title

DK-means: A Robust New Clustering Technique in Data Mining for Databases

DOI

10.29767/ECS.200712.0003

Authors

蔡正發(Cheng-Fa Tsai);李俊璋(Chun-Chang Li)

Keywords

資料探勘 ; 資料分群 ; K均值法 ; Data Mining ; Data Clustering ; K-Means

Journal

Electronic Commerce Studies

Volume and Issue / Publication Date

Vol. 5, No. 4 (2007 / 12 / 31)

Pages

419 - 437

Language

Traditional Chinese

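Chinese Abstract (translated)

As information technology advances, the volume of data stored in databases grows with it. Data mining techniques help uncover useful information hidden in data and are widely applied across many fields; data clustering in particular is the most commonly used mode of data analysis. Data clustering plays an important role in a variety of application domains. It is the process of partitioning data into groups such that data within the same group are highly similar, whereas data in different groups have low similarity. Dissimilarity is usually evaluated with a distance measure (based on the attribute values that describe the objects). Clustering algorithms have been developed continually in recent years; among them, K-means is fast, easy to implement, and able to find a locally optimal solution for data clustering. However, the main drawback of K-means is its difficulty in recognizing patterns of arbitrary shape. This study proposes a modified K-means algorithm, based on the concept of distance, that makes clustering results more stable. Simulation results show that the proposed DK-means clustering method produces good, accurate results.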

English Abstract

With the rapid progress of information technology, ever larger amounts of data are produced and stored in databases. Data mining helps extract useful information from these data and is widely used in many areas; data clustering in particular is the most frequently used mode of analysis. Data clustering plays an important role in various fields. It is the process of grouping data into clusters such that the data in each cluster share a high degree of similarity while being very dissimilar to data in other clusters. Dissimilarities are evaluated from the attribute values that describe the objects, usually with a distance measure. Many data clustering algorithms have been developed in recent years. Among them, K-means is fast, easy to implement, and finds a locally optimal solution for data clustering. However, its crucial shortcoming is the difficulty of recognizing clusters of arbitrary shape. This paper presents a modified K-means based on the concept of distance, and the proposed algorithm may enhance the stability of data clustering results. Simulation results show that the proposed DK-means yields good, accurate clustering results.
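
For orientation, the baseline that the abstract refers to can be sketched as follows. This is a minimal sketch of standard K-means (Lloyd's iteration) with Euclidean distance as the dissimilarity measure; it is not the paper's DK-means, whose details this record does not give. The function names `euclidean`, `kmeans`, and `sse`, the seeded random initialization, and the toy data are all assumptions made for illustration.

```python
import math
import random


def euclidean(p, q):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kmeans(points, k, max_iter=100, seed=0):
    """Standard K-means (Lloyd's iteration); NOT the paper's DK-means.

    Returns (centroids, labels). Initial centroids are k input points
    drawn with a seeded RNG (an illustrative choice), so a given seed
    is reproducible.
    """
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [-1] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing: a local optimum
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


def sse(points, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(euclidean(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))


if __name__ == "__main__":
    # Hypothetical toy data: two compact, well-separated groups.
    data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    centroids, labels = kmeans(data, k=2, seed=1)
    print(labels, sse(data, centroids, labels))
```

On harder data the result of this baseline depends on the initial centroids, so runs with different seeds can end in different local optima. That sensitivity is the kind of instability the abstract says the proposed method is meant to reduce.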

Subject Classification

Basic and Applied Sciences > Information Science
Social Sciences > Economics

References

  1. Bandyopadhyay, S.,Maulik, U.(2002).An evolutionary technique based on K-means algorithm for optimal clustering in R^N.Information Sciences,146,221-237.
  2. Goldberg, D.E.(1989).Genetic Algorithms in Search, Optimization, and Machine Learning.MA:Addison-Wesley.
  3. Guha, S.,Rastogi, R.,Shim, K.(1999).ROCK: A Robust Clustering Algorithm for Categorical Attributes.Proceedings of 15th International Conference on Data Engineering
  4. Guha, S.,Rastogi, R.,Shim, K.(1998).CURE: An Efficient Clustering Algorithm for Large Data Bases.Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data,27(2),73-84.
  5. Holland, J.H.(1992).Adaptation in Natural and Artificial Systems.MA:MIT Press, Boston.
  6. Krishna, K.,Murty, M.N.(1999).Genetic K-means algorithm.IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics,29(3),433-439.
  7. MacQueen, J.B.(1967).Some Methods for Classification and Analysis of Multivariate Observations.Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability
  8. Negnevitsky, M.(2002).Artificial Intelligence-A Guide to Intelligent Systems.Addison Wesley.
  9. Sander, J.,Ester, M.,Kriegel, H.,Xu, X.(1998).Density-based Clustering in Spatial Databases: The Algorithm GDBSCAN and Its Applications.Data Mining and Knowledge Discovery,2(2),169-194.
  10. Su, M.C.,Chang, H.T.(2000).Fast self-organizing feature map algorithm.IEEE Transactions on Neural Networks,11(3),721-733.
  11. Tsai, C. F.,Chen, Z. C.,Tsai, C. W.(2002).MSGKA: An Efficient Clustering Algorithm for Large Databases.2002 IEEE International Conference on Systems, Man, and Cybernetics,Tunisia.
  12. Tsai, C. F.,Liu, C. W.(2006).KIDBSCAN: A New Efficient Data Clustering Algorithm for Data Mining in Large Databases.Lecture Notes in Artificial Intelligence,4029,702-711.
  13. Tsai, C. F.,Tsai, C. W.,Wu, H. C.,Yang, T.(2004).ACODF: A Novel Data Clustering Approach for Data Mining in Large Databases.Journal of Systems and Software,73,133-145.
  14. Tsai, C. F.,Wang, T. P.(2006).GDH: An Effective and Efficient Approach to Detect Arbitrary Patterns in Clusters with Noises in Very Large Databases.Degree of master at National Pingtung University of Science and Technology.
  15. Tsai, C. F.,Wu, H. C.,Tsai, C. W.(2002).A New Data Clustering Approach for Data Mining in Large Databases.The 6th IEEE International Symposium on Parallel Architectures, Algorithms, and Networks (ISPAN'02)
  16. Tsai, C. F.,Yang, T.(2003).An Intuitional Data Clustering Algorithm for Data Mining in Large Databases.2003 IEEE International Conference on Informatics, Cybernetics, and Systems,Taiwan.
  17. Tsai, C. F.,Yen, C. C.(2007).ANGEL: A New Effective and Efficient Hybrid Clustering Technique for Large Databases.Lecture Notes in Artificial Intelligence, LNAI 4426.
  18. Wang, W.,Yang, J.,Muntz, R.(1999).STING+: An Approach to Active Spatial Data Mining.Proceedings of the International Conference on Data Engineering
  19. Wang, W.,Yang, J.,Muntz, R.(1997).STING: A Statistical Information Grid Approach to Spatial Data Mining.Proceedings of 23rd International Conference on Very Large Data Bases
  20. Zhang, H.,Ho, T.B.,Lin, M.S.(2004).An evolutionary k-means algorithm for clustering time series data.Proceedings of the Third International Conference on Machine Learning and Cybernetics,Shanghai.
  21. Zhang, T.,Ramakrishnan, R.,Livny, M.(1996).BIRCH: An Efficient Data Clustering Method for Very Large Data Bases.Proceedings of the ACM SIGMOD International Conference on Management of Data,25(2),103-114.