Big data refers to large, heterogeneous data sets that arise in many fields. Processing such data sets is time-consuming, not only because of the sheer volume of data but also because the data types and structures can be varied and complex. Many data mining and machine learning techniques are currently applied to big data problems; some of them can build a good learning model given a large number of training examples. However, with respect to data dimensionality, learning is more efficient when the algorithm can select useful features or reduce the feature dimension. Word2Vec, proposed and supported by Google, is not a single algorithm; rather, it comprises two learning models, Continuous Bag of Words (CBOW) and Skip-gram. By feeding text data into one of these models, Word2Vec produces word vectors that can represent a large piece of text or even an entire article. In our work, we first trained on the data with a Word2Vec model and evaluated word similarity. We then clustered similar words together and used the generated clusters as a new, smaller set of features, so that the data dimension is reduced.
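The pipeline described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the toy random vectors stand in for embeddings from a trained Word2Vec model (e.g. CBOW or Skip-gram), the k-means routine is a plain Lloyd's-algorithm stand-in for whatever clustering method is used, and all names and parameters here are assumptions for the sake of the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy embeddings: 8 vocabulary words, 4-dimensional vectors.
# In practice these would come from a trained Word2Vec model.
vocab = ["cat", "dog", "fish", "car", "bus", "train", "red", "blue"]
vectors = rng.normal(size=(len(vocab), 4))

def cosine_similarity(u, v):
    # Word similarity measure used to evaluate the learned vectors.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def kmeans(X, k, iters=50):
    # Plain Lloyd's algorithm; sufficient for a sketch.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each vector to its nearest center.
        labels = np.argmin(
            ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        # Move each center to the mean of its assigned vectors.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Group similar word vectors into k clusters.
k = 3
labels = kmeans(vectors, k)
word_to_cluster = dict(zip(vocab, labels))

def doc_features(tokens):
    # Each document becomes a k-dimensional cluster-count vector
    # instead of a |vocab|-dimensional bag-of-words vector.
    feats = np.zeros(k)
    for t in tokens:
        if t in word_to_cluster:
            feats[word_to_cluster[t]] += 1
    return feats

doc = ["cat", "dog", "bus"]
features = doc_features(doc)
print(features.shape)  # (3,) — reduced from the original vocabulary size 8
```

The key design point is the final step: once similar words share a cluster, a document can be described by cluster counts, so the feature dimension drops from the vocabulary size to the number of clusters k.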