TY - GEN
T1 - Investigation of term weighting schemes in classification of imbalanced texts
AU - Naderalvojoud, Behzad
AU - Bozkir, Ahmet Selman
AU - Sezer, Ebru Akcapinar
N1 - Publisher Copyright: Copyright © 2014 IADIS Press. All rights reserved.
PY - 2014
Y1 - 2014
AB - The class imbalance problem plays a critical role in the use of machine learning methods for text classification, since both feature selection methods and machine learning methods expect a homogeneous distribution. This study investigates two kinds of feature selection metrics (one-sided and two-sided) as the global component of term weighting schemes (referred to as tffs) in scenarios with different complexities and imbalance ratios. The traditional term weighting approach (tfidf) is employed as a baseline to evaluate the effects of tffs weighting. The study aims to show which kinds of weighting schemes are suitable for which machine learning algorithms in different imbalanced cases. Four classification algorithms are used to show the effects of term weighting schemes on the imbalanced datasets. According to our findings, leaving tfidf aside, term weighting methods based on one-sided feature selection metrics are better approaches for SVM and k-NN, while term weighting methods based on two-sided metrics are the best choice for MultiNB and C4.5 on imbalanced texts. As a result, term weighting methods based on one-sided feature selection metrics are recommended for SVM, and tfidf is a suitable weighting method for k-NN in text classification tasks.
KW - Class imbalance problem
KW - Feature selection
KW - Machine learning
KW - Term weighting
KW - Text classification
UR - https://www.scopus.com/pages/publications/84929301519
M3 - Conference contribution
AN - SCOPUS:84929301519
T3 - Proceedings of the European Conference on Data Mining 2014 and International Conferences on Intelligent Systems and Agents 2014 and Theory and Practice in Modern Computing 2014 - Part of the Multi Conference on Computer Science and Information Systems, MCCSIS 2014
SP - 39
EP - 46
BT - Proceedings of the European Conference on Data Mining 2014 and International Conferences on Intelligent Systems and Agents 2014 and Theory and Practice in Modern Computing 2014 - Part of the Multi Conference on Computer Science and Information Systems, MCCSIS 2014
A2 - Roth, Jorg
A2 - Abraham, Ajith P.
A2 - dos Reis, Antonio Palma
PB - IADIS
T2 - European Conference on Data Mining 2014 and International Conferences on Intelligent Systems and Agents 2014 and Theory and Practice in Modern Computing 2014
Y2 - 15 July 2014 through 17 July 2014
ER -