Title

一個調整不平衡資料以提升分類正確率的新方法

Parallel title

A New Method to Adjust Imbalanced Data to Improve Classification Accuracy

Authors

李維平(Wei-Ping Lee);周賢明(Hsien-Ming Chou);林劭旻(Shao-Min Lin)

Keywords

synthetic minority oversampling technique (SMOTE) ; over-sampling ; under-sampling ; NearMiss ; decision tree

Journal

先進工程學刊

Volume and issue / Publication date

Vol. 18, No. 1 (2023/04/01)

Pages

25 - 32

Language of content

Traditional Chinese; English

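Chinese abstract

Data processing poses different difficulties in every field, and imbalanced data is among the thornier ones. Existing research offers under-sampling of the majority class and over-sampling of the minority class, but if either is handled poorly, under-sampling tends to discard important information carried by the samples, and over-sampling tends to make the classifier overfit. Many studies instead improve or optimize the classifier, yet the quality of the data itself affects the classification result to a larger degree, so improving the classifier alone brings no significant benefit. This study combines SMOTE (Synthetic Minority Oversampling Technique) with NearMiss under-sampling to address data imbalance, and compares the combination against over-sampling and SMOTE by building a decision tree classification model for each. The experiments show that the NMS (NearMiss-2 SMOTE) sampling method is the best sampling method on all four datasets tested, and that it also yields the highest classification accuracy on minority-class samples among the sampling methods compared.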

English abstract

Every field encounters its own problems in data processing, and imbalanced data is one of the more difficult. Current research offers under-sampling for the majority class and over-sampling for the minority class, but if either is handled improperly, under-sampling can discard important information carried by the samples, and over-sampling can cause the classifier to overfit. Many studies also improve and optimize the classifier, but the quality of the data itself has a greater impact on the classification results, and improving the classifier alone does not significantly help. This study combines SMOTE (Synthetic Minority Oversampling Technique) and NearMiss to address data imbalance, and compares the combination with over-sampling and SMOTE by building a decision tree classification model for each. The experiments show that the NMS (NearMiss-2 SMOTE) sampling method is the best sampling method on all four datasets, and that its classification accuracy on minority-class samples is also the highest among the sampling methods compared.
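The abstract does not spell out how NearMiss-2 and SMOTE are chained, so the following is only a minimal sketch under stated assumptions: the majority class is first shrunk to a common target size with NearMiss-2, and the minority class is then grown to the same size with SMOTE. The function names (`nearmiss2`, `smote`, `nms_resample`), the `target` and `k` parameters, and the toy data are all hypothetical and are not taken from the paper. The sketch also assumes numeric features and at least two minority samples.

```python
import math
import random

def dist(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearmiss2(majority, minority, n_keep, k=3):
    """NearMiss-2: keep the n_keep majority samples whose mean distance
    to their k farthest minority samples is smallest."""
    def score(m):
        farthest = sorted((dist(m, p) for p in minority), reverse=True)[:k]
        return sum(farthest) / len(farthest)
    return sorted(majority, key=score)[:n_keep]

def smote(minority, n_new, k=3, rng=None):
    """SMOTE: create n_new synthetic points, each on the segment between a
    random minority sample and one of its k nearest minority neighbours."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: dist(base, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, nb)))
    return synthetic

def nms_resample(majority, minority, target, k=3, seed=0):
    """Assumed combination (not confirmed by the source): under-sample the
    majority class to `target`, then over-sample the minority to `target`."""
    rng = random.Random(seed)
    kept = nearmiss2(majority, minority, min(target, len(majority)), k)
    n_new = max(0, target - len(minority))
    grown = list(minority) + smote(minority, n_new, k, rng)
    return kept, grown

# Toy data: 100 majority points around (0, 0), 10 minority points around (3, 3).
gen = random.Random(42)
majority = [(gen.gauss(0, 1), gen.gauss(0, 1)) for _ in range(100)]
minority = [(gen.gauss(3, 0.5), gen.gauss(3, 0.5)) for _ in range(10)]
kept, grown = nms_resample(majority, minority, target=40)
print(len(kept), len(grown))  # 40 40
```

The balanced sets `kept` and `grown` could then be used to train a decision tree and compared against over-sampling alone or SMOTE alone, as the abstract describes. The intermediate `target` size is a design choice of this sketch. A smaller target discards more majority samples, while a larger one relies more heavily on synthetic minority points.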

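Subject classification

Engineering > Multidisciplinary Engineering
Engineering > General Engineering
Engineering > Civil and Architectural Engineering
Engineering > Mechanical Engineering
Engineering > Chemical Industry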
References
  1. Anzanello, M. J., Borges, D. L., de Souza, R. C. (2020). Ensemble of modified ENN and Tomek Link for imbalanced data classification. Applied Soft Computing, 96, 106617.
  2. Barua, S., Islam, M., Yao, X., Murase, K. (2014). MWMOTE--Majority Weighted Minority Oversampling Technique for Imbalanced Data Set Learning. IEEE Transactions on Knowledge and Data Engineering, 26(2), 405-425.
  3. Chawla, N. V., Bowyer, K. W., Hall, L. O., Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321-357.
  4. Douzas, G., Bacao, F. (2018). Effective imbalanced deep learning through loss-based instance weighting. Neurocomputing, 321, 310-321.
  5. Han, H., Wang, W. Y., Mao, B. H. (2005). Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. Advances in Intelligent Computing.
  6. He, H., Bai, Y., Garcia, E. A., Li, S. (2008). ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning. IEEE World Congress on Computational Intelligence.
  7. Japkowicz, N., Stephen, S. (2002). The class imbalance problem: A systematic study. Intelligent Data Analysis, 6(5), 429-449.
  8. Khan, A. M., Siddiqa, A., Kim, K. H. (2021). A novel adaptive weighted NearMiss technique for imbalanced datasets. Neurocomputing, 438, 297-310.
  9. Khurana, U., Gupta, A., Agarwal, S. (2021). SMT-DT: SMOTE-based decision tree approach for imbalance data classification. Information Processing & Management, 58(2), 102465.
  10. Kwon, S. Y., Choi, S. H., Park, S., Kim, K. (2019). Effective data augmentation using SMOTE and NMS for imbalanced data classification. Neurocomputing, 364, 76-88.
  11. Li, H., Li, T., Li, Y., Zhang, C. (2019). ADOM: anomaly detection with outlier margin. Information Sciences, 478, 62-77.
  12. Li, J., Zhang, C., Li, X., Li, M., Shang, C. (2021). An ensemble method based on ENN and Tomek Link for imbalanced learning. Neural Computing and Applications, 33(6), 2497-2510.
  13. Liu, B., Wang, Y., Wu, J., Wu, Z., He, H. (2019). A novel undersampling method based on near-miss for imbalanced datasets. Knowledge-Based Systems, 168, 107-120.
  14. Lu, Y., Li, M., Li, L. (2021). An Effective Data Augmentation Method for Imbalanced Data Classification Based on Near-Miss and SMOTE. IEEE Access, 9, 16565-16576.
  15. Mu, Y., Liu, L., Zhang, Y., Wu, X. (2019). A novel undersampling algorithm based on the integration of the neighborhood cleaning rule and the Tomek link. Knowledge-Based Systems, 180, 40-51.
  16. Mustafa, G., Niu, Z., Yousif, A., Tarus, J. (2015). Solving the Class Imbalance Problems using RUSMultiBoost Ensemble. Information Systems and Technologies (CISTI), 2015 10th Iberian Conference on.
  17. Ng, M., See, J. (2022). Evaluating the Impact of Over-Sampling Techniques on the Accuracy of Short Text Classification Models. Information Sciences, 581, 54-68.
  18. Niu, B., Wu, Y., Wang, L., Wang, Y., Zhang, Y. (2019). An Efficient Hybrid Algorithm Combining ENN and Tomek Links for Imbalanced Data Classification. IEEE Access, 7, 129098-129112.
  19. Peng, W., Chen, S., Liu, S. (2021). An ensemble learning method with ENN-Tomek links for imbalanced data classification. Neural Computing and Applications, 33(9), 4241-4257.
  20. Sun, L., Wang, Y., Li, B., Wu, J. (2020). An adaptive near-miss undersampling method for imbalanced datasets. Expert Systems with Applications, 139, 112841.
  21. Sun, Z., Song, Q., Zhu, X., Sun, H., Xu, B., Zhou, Y. (2015). A novel ensemble method for classifying imbalanced data. Pattern Recognition, 48, 1623-1637.
  22. Wei, H., Li, X., Chen, L., Zhang, X. (2021). NearMiss-2T: A novel undersampling algorithm for imbalanced classification problems. Knowledge-Based Systems, 225, 107139.
  23. Xu, Q., Zhou, X., Yang, Y., Wang, X. (2020). Imbalanced Deep Learning by Minority Class Incremental Rectification. IEEE Transactions on Neural Networks and Learning Systems, 31(4), 1238-1251.
  24. Yang, C., Zhang, S., Zhang, Z., Huang, J. (2018). A hybrid approach based on NearMiss and SMOTE for imbalanced data classification. Journal of Intelligent & Fuzzy Systems, 34(5), 2837-2847.
  25. Zhou, X., Zhang, J., Hu, F., Zhou, Q. (2020). An improved near-miss algorithm for imbalanced data classification. Applied Soft Computing, 96, 106665.
  26. Zuo, Y., Li, Y., Li, H., Wang, Y. (2020). NearMiss-RUSBoost: A novel ensemble method for imbalanced data classification. Knowledge-Based Systems, 207, 106435.