Clustering based Two-Stage Text Classification Requiring Minimal Training Data

Xue Zhang^{1, 2} and Wang-xin Xiao^{3, 4}

Department of Physics, Shangqiu Normal University
Shangqiu 476000, China
Key Laboratory of High Confidence Software Technologies, Ministry of Education, Peking University
Beijing 100871, China
jane_zhang@pku.edu.cn
School of Traffic and Transportation Engineering, Changsha University of Science and Technology
Changsha 410114, China
Key Laboratory for Road Structure & Material of the Ministry of Transport
Beijing 100088, China
wx.xiao@rioh.cn

Abstract

Clustering has been employed to expand training data in some semi-supervised learning methods. Clustering based methods rest on the assumption that clusters learned under the guidance of the initial training data can, to some extent, characterize the underlying distribution of the data set. However, our experiments show that whether this assumption holds depends on both the separability of the data set and the size of the training data set. It is often violated on data sets with poor separability, especially when the initial training data are too few, and in this case clustering based methods perform worse. In this paper, we propose a clustering based two-stage text classification approach to address this problem. In the first stage, labeled and unlabeled data are clustered under the guidance of the labeled data, and a self-training style clustering strategy is then used to iteratively expand the training data under the guidance of an oracle or expert. In the second stage, discriminative classifiers are trained on the expanded labeled data set. Unlike other clustering based methods, the proposed clustering strategy can effectively cope with data of poor separability. Furthermore, the proposed framework converts the challenging problem of sparsely labeled text classification into a supervised one. Supervised classification models, e.g. SVM, can therefore be applied, and techniques proposed for supervised learning, such as feature selection, sampling methods, and data editing or noise filtering, can be used to further improve classification accuracy. Our experimental results demonstrate the effectiveness of the proposed approach, especially when the training data set is very small.
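
The two stages described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the clustering algorithm, the rule for choosing which documents the oracle labels, or the stopping criterion. The sketch assumes a k-means-like clustering seeded by the labeled points, a hypothetical query rule (ask the oracle about the unlabeled points farthest from their nearest centroid), and a nearest-centroid classifier standing in for the discriminative classifier (e.g. SVM) of the second stage. All function and parameter names are illustrative.

```python
import math

def centroid(points):
    """Mean vector of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def nearest(x, centroids):
    """Return (label, distance) for the centroid closest to x."""
    return min(((label, math.dist(x, c)) for label, c in centroids.items()),
               key=lambda pair: pair[1])

def seeded_centroids(X, labeled, iters=5):
    """K-means-like clustering guided by the labeled data (assumed variant)."""
    groups = {}
    for i, label in labeled.items():
        groups.setdefault(label, []).append(X[i])
    centroids = {label: centroid(pts) for label, pts in groups.items()}
    for _ in range(iters):
        members = {label: [] for label in centroids}
        for i, x in enumerate(X):
            # Labeled points keep their class; the rest join the nearest cluster.
            label = labeled[i] if i in labeled else nearest(x, centroids)[0]
            members[label].append(x)
        centroids = {label: centroid(pts) for label, pts in members.items()}
    return centroids

def expand_training_data(X, labeled, oracle, rounds=3, queries=2):
    """Stage one: iteratively grow the labeled set with oracle feedback."""
    labeled = dict(labeled)  # do not mutate the caller's dictionary
    for _ in range(rounds):
        centroids = seeded_centroids(X, labeled)
        unlabeled = [i for i in range(len(X)) if i not in labeled]
        if not unlabeled:
            break
        # Assumed query rule: ask about points farthest from their nearest centroid.
        unlabeled.sort(key=lambda i: nearest(X[i], centroids)[1], reverse=True)
        for i in unlabeled[:queries]:
            labeled[i] = oracle(i)
    # Points never shown to the oracle take the label of their cluster.
    centroids = seeded_centroids(X, labeled)
    return {i: labeled.get(i, nearest(X[i], centroids)[0])
            for i in range(len(X))}

def train_nearest_centroid(X, labels):
    """Stage two stand-in: a nearest-centroid classifier, not an SVM."""
    groups = {}
    for i, label in labels.items():
        groups.setdefault(label, []).append(X[i])
    centroids = {label: centroid(pts) for label, pts in groups.items()}
    return lambda x: nearest(x, centroids)[0]
```

In this sketch `X` is a list of numeric feature vectors, `labeled` maps a few indices to class labels, and `oracle` is any callable that returns the true label for an index. The expanded label dictionary returned by `expand_training_data` can be passed to any supervised learner, which is the point the abstract makes about converting the sparsely labeled problem into a supervised one.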

Key words

text classification, clustering, active semi-supervised clustering, two-stage classification

Digital Object Identifier (DOI)

https://doi.org/10.2298/CSIS120130044Z

Publication information

Volume 9, Issue 4 (December 2012)
Special Issue on Recent Advances in Systems and Informatics
Year of Publication: 2012
ISSN: 2406-1018 (Online)
Publisher: ComSIS Consortium

How to cite

Zhang, X., Xiao, W.: Clustering based Two-Stage Text Classification Requiring Minimal Training Data. Computer Science and Information Systems, Vol. 9, No. 4, 1627-1644. (2012), https://doi.org/10.2298/CSIS120130044Z