Distance metric learning (DML) is an effective similarity learning tool that learns a distance function from examples to enhance model performance in applications such as classification, regression, and ranking. Most DML algorithms learn a Mahalanobis matrix, a positive semidefinite matrix whose size scales quadratically with the dimensionality of the input data. This incurs a huge computational cost in the learning procedure and makes existing algorithms infeasible for extremely high-dimensional data, even with low-rank approximation. In this paper, we instead exploit the power of parallel computation and propose a novel distributed distance metric learning algorithm based on a state-of-the-art DML algorithm, Information-Theoretic Metric Learning (ITML). More specifically, we utilize the property that every positive semidefinite matrix can be decomposed into a combination of rank-one, trace-one matrices, and convert the original sequential training procedure into a parallel one. In most cases, the communication cost of the proposed method is also reduced from O(d^2) to O(cd), where d is the dimensionality of the data and c is the number of constraints in DML, which can be made smaller than d by appropriate selection. More importantly, we present a rigorous theoretical analysis that upper bounds the Bregman divergence between the sequential and parallel algorithms, guaranteeing the correctness and performance of the proposed algorithm. Our experiments on datasets with O(10^5) features demonstrate competitive scalability and performance compared with the original ITML algorithm.
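
The decomposition property mentioned above can be illustrated with a minimal sketch (an assumption-level example, not the paper's implementation): the eigendecomposition of a positive semidefinite matrix M writes it as a nonnegative combination of rank-one, trace-one matrices, since each unit eigenvector v gives trace(v v^T) = ||v||^2 = 1.

```python
import numpy as np

# Build a random positive semidefinite matrix M = A A^T.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
M = A @ A.T

# Eigendecomposition: M = sum_i lambda_i * (v_i v_i^T),
# where each v_i v_i^T is rank-one and trace-one (v_i is a unit vector).
eigvals, eigvecs = np.linalg.eigh(M)
parts = [lam * np.outer(v, v) for lam, v in zip(eigvals, eigvecs.T)]

# Each component has trace one, and the weighted sum recovers M.
assert all(abs(np.trace(np.outer(v, v)) - 1.0) < 1e-12 for v in eigvecs.T)
assert np.allclose(sum(parts), M)
```

Because the rank-one components are independent of one another, they can be distributed across workers, which is the property the parallelization above relies on.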