A Parallel Clustering Algorithm Implementation Based on Apache Mahout

Xia Daoping; Zhong Alin; Long Yubo

doi:10.1109/IMCCC.2016.9

Source

2016 Sixth International Conference on Instrumentation & Measurement, Computer, Communication and Control (IMCCC), pp. 790-795

Abstract

K-means clustering is one of the best-known clustering algorithms and is widely used in practical applications. K-means clustering is the task of dividing a set of n data points in d-dimensional space into k clusters, such that the data points in the same cluster are much closer to each other than to those in other clusters according to a chosen criterion. Traditional k-means clustering proceeds by alternately executing two steps: an assignment step and an update step. The assignment step assigns each data point to its nearest cluster center; the Euclidean distance is commonly used to measure the distance. The update step recalculates the center of each cluster. For large-scale datasets, k-means clustering spends most of its execution time calculating distances between each data point and the existing cluster centers. The distance computation for each data point is independent of that for the others, so these distance calculations can be carried out concurrently. In this paper, a simple and efficient implementation of a parallel k-means clustering algorithm is proposed, based on the existing Mahout API, in order to speed up clustering for large-scale datasets. In addition, the implementation was packaged and can be offered as an easy-to-use API, so that developers can accomplish their tasks without any further configuration. Experimental results revealed a significant improvement in clustering speed for large-scale datasets, demonstrating the effectiveness and efficiency of the proposed implementation.
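The two steps described in the abstract, and the observation that distance calculations for different data points are independent, can be illustrated with a minimal sketch. It is not the authors' implementation and does not use the Mahout API: it is plain Python, and the function names (`nearest_center`, `assign_chunk`, `update_centers`, `parallel_kmeans`) and the `workers` parameter are assumptions made for this example. A thread pool is used only to show the structure of the parallel assignment step; in CPython, threads do not give a real speed-up for this kind of computation.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def nearest_center(point, centers):
    """Return the index of the center closest to `point` (Euclidean distance)."""
    return min(range(len(centers)), key=lambda j: math.dist(point, centers[j]))


def assign_chunk(chunk, centers):
    """Assignment step for one chunk of points.

    Chunks share no mutable state, so they can be processed concurrently.
    """
    return [nearest_center(p, centers) for p in chunk]


def update_centers(points, labels, centers):
    """Update step: move each center to the mean of its assigned points."""
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)
    for p, j in zip(points, labels):
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    # An empty cluster keeps its previous center (an assumption of this sketch).
    return [
        tuple(s / counts[j] for s in sums[j]) if counts[j] else tuple(centers[j])
        for j in range(len(centers))
    ]


def parallel_kmeans(points, centers, workers=4, max_iter=100):
    """Alternate assignment and update steps until the centers stop changing."""
    centers = [tuple(c) for c in centers]
    size = max(1, math.ceil(len(points) / workers))
    chunks = [points[i:i + size] for i in range(0, len(points), size)]
    labels = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(max_iter):
            # Each chunk is assigned independently of the others.
            parts = pool.map(assign_chunk, chunks, [centers] * len(chunks))
            labels = [j for part in parts for j in part]
            new_centers = update_centers(points, labels, centers)
            if new_centers == centers:
                break
            centers = new_centers
    return centers, labels
```

For example, `parallel_kmeans([(0, 0), (0, 2), (10, 10), (10, 12)], [(0, 0), (10, 10)], workers=2)` groups the first two points into one cluster and the last two into the other. In a distributed setting such as the one the paper targets, the per-chunk assignment would run on separate workers rather than threads.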

Identifiers

Book e-ISBN: 978-1-5090-1195-7, 978-1-5090-1194-0
DOI: 10.1109/IMCCC.2016.9

Keywords

Clustering algorithms; Algorithm design and analysis; Partitioning algorithms; Classification algorithms; Parallel algorithms; Euclidean distance; Software algorithms; Apache Mahout; Clustering; Parallel k-means clustering

Additional information

Data set: ieee

Publisher

IEEE
