Title |
A Low Dimensional Categorical Data Transform Based on Feature Combination |
DOI |
10.29428/9789860544169.201801.0016 |
Authors |
Wei-Zhih Lin;Wei-Chung Teng |
Keywords |
encoder ; categorical data ; feature combination ; pre-selection ; low-dimensional |
Journal |
NCS 2017 全國計算機會議 |
Volume and issue / Publication date |
2017(2018 / 01 / 01) |
Pages |
83 - 87 |
Language |
English |
Abstract |
When transforming categorical data into numerical form, current encoders have the drawback of generating high-dimensional output. Reducing the output dimension unavoidably causes loss of information, and the amount of information lost is considered positively correlated with the number of dimensions discarded. This work develops an efficient approach that extracts and preserves more information from the dataset. The numerical output of the proposed approach delivers higher accuracy and requires less computation time because of its limited number of dimensions. The first technique used in this approach, termed feature combination (FC), combines a few columns into a single column of combinations. The second technique, pre-selection, selects important columns according to the information gain metric before FC is executed. The proposed method was evaluated on categorical data from the UCI and CTU datasets. The experimental results show that the features transformed by the proposed method have 1 to 4 dimensions, depending on the number of labels in each dataset. Moreover, the accuracy on all datasets with the proposed method is almost 2 percent higher than with OneHotEncoder. Although the improvement in accuracy is not remarkable, the number of feature dimensions is at least 20 times lower than that of OneHotEncoder. |
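The two techniques named in the abstract, pre-selection by information gain followed by feature combination, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`entropy`, `information_gain`, `preselect`, `combine`), the toy rows, the labels, and the choice of `k=2` are all hypothetical. The abstract does not say how a combination is mapped to its final 1 to 4 numerical dimensions, so the sketch stops at the combined column.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(column, labels):
    """Entropy of the labels minus their entropy after splitting on the column."""
    groups = defaultdict(list)
    for value, label in zip(column, labels):
        groups[value].append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def preselect(rows, labels, k):
    """Pre-selection: indices of the k columns with the highest information gain."""
    n_cols = len(rows[0])
    gains = [information_gain([r[i] for r in rows], labels)
             for i in range(n_cols)]
    top = sorted(range(n_cols), key=lambda i: gains[i], reverse=True)[:k]
    return sorted(top)  # restore the original column order


def combine(rows, cols):
    """Feature combination: merge the chosen columns into one column of tuples."""
    return [tuple(r[i] for i in cols) for r in rows]


# Hypothetical toy data: three categorical columns, one label per row.
rows = [
    ("tcp", "http", "SF"),
    ("tcp", "http", "REJ"),
    ("udp", "http", "SF"),
    ("udp", "http", "SF"),
]
labels = ["normal", "attack", "normal", "normal"]

selected = preselect(rows, labels, k=2)  # the constant middle column is dropped
combined = combine(rows, selected)
print(selected, combined)
```

In this toy data the middle column is constant and therefore has zero information gain, so pre-selection drops it before the remaining two columns are merged into a single column of value pairs.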
Subject classification |
Basic and Applied Sciences >
Information Science |