Using unlabeled data in semi-supervised self-training can significantly improve the accuracy of a supervised classifier, but in some cases it can dramatically decrease it. One cause of such degradation is a lack of labeled data for training the initial classifier in the self-training process. In this paper, we propose a method to determine whether the labeled data are sufficient and two methods, active labeling and co-labeling, to augment the labeled dataset where it is insufficient. To determine sufficiency, we apply a semi-supervised clustering technique to estimate the distribution of labeled data over the training set. The results show that the accuracy of the final classifier on clusters containing no labeled data is markedly lower than on clusters containing labeled data. Comparative experiments on UCI and real-world datasets show that the proposed methods form an effective preprocessing step for determining and obtaining sufficient labeled data, which is essential for achieving high accuracy with a semi-supervised self-training classifier.