A Hybrid Algorithm for Web Document Clustering Based on Frequent Term Sets and k-Means

Le Wang; Li Tian; Yan Jia; Weihong Han

doi:10.1007/978-3-540-72909-9_20

Source

Lecture Notes in Computer Science > Advances in Web and Network Technologies, and Information Management > Stream Data Management > 198-203

Abstract

To address the major challenges of current web document clustering, namely the huge volume of documents, high-dimensional processing, and the understandability of the resulting clusters, we propose a simple hybrid algorithm (SHDC) based on top-k frequent term sets and k-means. The top-k frequent term sets are used to produce k initial means, which are treated as initial clusters and then refined by k-means. The final clustering is returned by k-means, while an understandable description of each cluster is provided by the k frequent term sets. Experimental results on two public datasets indicate that SHDC outperforms two other representative clustering algorithms (farthest-first k-means and k-means with random initialization) in both efficiency and effectiveness.
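
The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the top-k frequent term sets are mined or selected, so the brute-force enumeration, the `max_size` limit, the tie-breaking rule, the binary term vectors, and the squared Euclidean distance below are all assumptions, and the names `select_seeds` and `shdc_sketch` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def centroid(rows):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def select_seeds(doc_sets, k, max_size=2):
    """Pick up to k frequent term sets, ranked by support (assumed rule).

    Term sets that cover exactly the same documents as an earlier pick are
    skipped so that no two initial means coincide.
    """
    counts = Counter()
    for terms in doc_sets:
        for size in range(1, max_size + 1):
            counts.update(combinations(sorted(terms), size))
    # Assumed ranking: higher support first, then larger sets, then alphabetical.
    ranked = sorted(counts, key=lambda ts: (-counts[ts], -len(ts), ts))
    seeds, covers = [], []
    for ts in ranked:
        cover = frozenset(i for i, terms in enumerate(doc_sets) if set(ts) <= terms)
        if cover not in covers:
            seeds.append(ts)
            covers.append(cover)
        if len(seeds) == k:
            break
    return seeds, covers


def shdc_sketch(docs, k, max_size=2, max_iter=100):
    """Seed k-means from frequent term sets; return (labels, descriptions)."""
    doc_sets = [set(d) for d in docs]
    vocab = sorted(set().union(*doc_sets))
    # Binary bag-of-words vectors (an assumption; any term weighting would do).
    vectors = [[1.0 if t in terms else 0.0 for t in vocab] for terms in doc_sets]

    seeds, covers = select_seeds(doc_sets, k, max_size)
    # Each initial mean is the centroid of the documents containing the term set.
    means = [centroid([vectors[i] for i in sorted(cover)]) for cover in covers]

    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(means)), key=lambda c: sq_dist(v, means[c]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments are stable: converged
            break
        labels = new_labels
        for c in range(len(means)):
            members = [vectors[i] for i, lab in enumerate(labels) if lab == c]
            if members:  # keep the previous mean if a cluster becomes empty
                means[c] = centroid(members)
    return labels, seeds


docs = [
    ["web", "search", "engine"],
    ["web", "search", "query"],
    ["web", "search", "index"],
    ["gene", "protein", "cell"],
    ["gene", "protein", "dna"],
    ["gene", "protein", "rna"],
]
labels, descriptions = shdc_sketch(docs, k=2)
print(labels, descriptions)
```

In this toy example, the term set attached to each cluster index serves as its human-readable description, which mirrors the role the abstract assigns to the k frequent term sets. If fewer than k distinct term sets are found, the sketch simply returns fewer clusters.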