A Comparative Study on Statistical Machine Learning Algorithms and Thresholding Strategies for Automatic Text Categorization

Kang Hyuk Lee; Judy Kay; Byeong Ho Kang; Uwe Rosebrock

doi:10.1007/3-540-45683-X_48

Source

Lecture Notes in Computer Science, PRICAI 2002: Trends in Artificial Intelligence, Document Analysis and Categorization, pp. 444-453

Abstract

Two main research areas in statistical text categorization are similarity-based learning algorithms and their associated thresholding strategies. The combination of these techniques significantly influences overall categorization performance. After investigating two similarity-based classifiers (k-NN and Rocchio) and three common thresholding techniques (RCut, PCut, and SCut), we describe a new learning algorithm, the keyword association network (KAN), and a new thresholding strategy (RinSCut), both intended to improve on existing techniques. Extensive experiments were conducted on the Reuters-21578 and 20-Newsgroups data sets. The results show that the new approaches achieve better micro-averaged F₁ and macro-averaged F₁ scores.
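
For readers unfamiliar with the terms used in the abstract, the following is a minimal illustrative sketch, not the authors' implementation. It assumes the usual textbook definitions: RCut assigns each document its t top-ranked categories, micro-averaged F₁ pools true positives, false positives, and false negatives over all categories before computing F₁, and macro-averaged F₁ is the unweighted mean of the per-category F₁ values. The function names (`rcut`, `f1`, `micro_macro_f1`) and the score dictionaries are hypothetical; the classifier that produces the scores (for example k-NN or Rocchio) is not shown.

```python
from typing import Dict, Iterable, List, Set, Tuple


def rcut(scores: Dict[str, float], t: int = 1) -> Set[str]:
    """RCut (rank-based cut): keep the t highest-scoring categories for one document."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:t])


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(
    gold: List[Set[str]], pred: List[Set[str]], categories: Iterable[str]
) -> Tuple[float, float]:
    """Return (micro-averaged F1, macro-averaged F1) over the given categories."""
    cats = list(categories)
    counts = {c: [0, 0, 0] for c in cats}  # per category: [tp, fp, fn]
    for g, p in zip(gold, pred):
        for c in cats:
            if c in p and c in g:
                counts[c][0] += 1
            elif c in p:
                counts[c][1] += 1
            elif c in g:
                counts[c][2] += 1
    # Micro: pool the counts across categories, then compute a single F1.
    micro = f1(
        sum(v[0] for v in counts.values()),
        sum(v[1] for v in counts.values()),
        sum(v[2] for v in counts.values()),
    )
    # Macro: compute F1 per category, then take the unweighted mean.
    macro = sum(f1(*v) for v in counts.values()) / len(cats)
    return micro, macro
```

Because macro-averaging weights every category equally, it is more sensitive to performance on rare categories, whereas micro-averaging is dominated by frequent ones; this is why results are commonly reported for both.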