A word-embedding model for text analytics

Word2vec is one of the most popular implementations of word embedding. It creates a distributed representation of words as numerical vectors that capture semantics and relationships among words. An example of such a relationship is that Italy relates to Rome as France relates to Paris, so Italy – Rome + Paris ≈ France.
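The vector arithmetic in this example can be sketched in a few lines of Python. The vectors below are hypothetical, hand-picked 4-dimensional values used only to show the mechanics; vectors from a trained word2vec model are learned from text and are much longer. The helper names `cosine` and `analogy` are also illustrative and not part of any library.

```python
import math

# Hypothetical 4-dimensional vectors, hand-picked for illustration only.
# Real word2vec vectors are learned from a corpus and have many more dimensions.
vectors = {
    "Italy":   [0.9, 0.1, 0.0, 0.8],
    "Rome":    [0.8, 0.0, 0.1, 0.1],
    "France":  [0.1, 0.9, 0.0, 0.9],
    "Paris":   [0.0, 0.8, 0.1, 0.2],
    "Germany": [0.0, 0.1, 0.9, 0.8],
    "Berlin":  [0.1, 0.0, 0.8, 0.1],
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(positive, negative):
    """Return the word closest to sum(positive) - sum(negative).

    The query words themselves are excluded from the candidates.
    """
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + v for t, v in zip(target, vectors[word])]
    for word in negative:
        target = [t - v for t, v in zip(target, vectors[word])]
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

# Italy - Rome + Paris
result = analogy(positive=["Italy", "Paris"], negative=["Rome"])
print(result)
```

With these toy vectors, the nearest remaining word to Italy – Rome + Paris is France. Excluding the query words from the candidates is a common convention, because the result vector often stays close to one of its own inputs.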

Text Analytics Workflow: Using word2vec to Convert Text into Numbers

A typical text analytics workflow includes preprocessing, converting text into numbers, and model building. Word embedding, such as word2vec, is one of the popular approaches for converting text into numbers. Other approaches include count-based representations such as bag-of-words and TF-IDF.

The advantage of word2vec over other methods is its ability to recognize similar words: words that appear in similar contexts receive nearby vectors. Partly for this reason, word embeddings such as word2vec have shown better accuracy in many text analytics applications.
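The three workflow steps can be sketched end to end as follows. This is a minimal illustration under stated assumptions: the 2-dimensional `embedding` table is hypothetical, averaging word vectors is only one simple way to turn a document into numbers, and the nearest-centroid classifier is a stand-in for whatever model a real application would build.

```python
import math
import re

DIM = 2

# Hypothetical 2-dimensional embedding; real vectors would come from a trained model.
embedding = {
    "great": [0.9, 0.1], "good": [0.8, 0.2],
    "awful": [0.1, 0.9], "bad":  [0.2, 0.8],
    "movie": [0.5, 0.5], "film": [0.5, 0.5],
}

def preprocess(text):
    """Step 1: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def document_vector(tokens):
    """Step 2: average the vectors of the tokens the embedding knows."""
    known = [embedding[t] for t in tokens if t in embedding]
    if not known:
        return [0.0] * DIM
    return [sum(col) / len(known) for col in zip(*known)]

def train_centroids(labeled_docs):
    """Step 3: compute one mean document vector per label."""
    grouped = {}
    for text, label in labeled_docs:
        grouped.setdefault(label, []).append(document_vector(preprocess(text)))
    return {label: [sum(col) / len(vecs) for col in zip(*vecs)]
            for label, vecs in grouped.items()}

def predict(centroids, text):
    """Assign the label whose centroid is nearest to the document vector."""
    vec = document_vector(preprocess(text))
    return min(centroids, key=lambda label: math.dist(vec, centroids[label]))

centroids = train_centroids([("Great movie!", "positive"),
                             ("Awful film.", "negative")])
prediction = predict(centroids, "A good film")
print(prediction)
```

The word "good" never appears in the training documents, yet the test document is classified as positive because the toy vector for "good" lies close to the one for "great". A count-based representation would treat the two words as unrelated.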

Word Embedding Alternatives to word2vec

In addition to word2vec, other popular implementations of word embedding are GloVe and FastText. These implementations differ in the training algorithm and in the text corpus used to create the model. Word2vec trains on the corpus with one of two algorithms: continuous bag-of-words (CBOW), which predicts a word from its surrounding context, or skip-gram, which predicts the surrounding context from a word.
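The difference between CBOW and skip-gram is easiest to see in the training examples each one draws from the same sentence. The sketch below only generates those examples; it does not train a model. The function names and the `window` parameter are illustrative choices, not the API of any word2vec implementation.

```python
def context_bounds(index, length, window):
    """Start and end (exclusive) of the context window around a position."""
    return max(0, index - window), min(length, index + window + 1)

def skipgram_pairs(tokens, window):
    """Skip-gram: one (center word, context word) pair per neighbor."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = context_bounds(i, len(tokens), window)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window):
    """CBOW: one (list of context words, center word) example per position."""
    examples = []
    for i, center in enumerate(tokens):
        lo, hi = context_bounds(i, len(tokens), window)
        context = tokens[lo:i] + tokens[i + 1:hi]
        if context:
            examples.append((context, center))
    return examples

tokens = ["rome", "is", "the", "capital", "of", "italy"]
sg = skipgram_pairs(tokens, window=1)
cb = cbow_examples(tokens, window=1)
print(sg[:3])
print(cb[:2])
```

Skip-gram produces more, smaller training examples from the same text, while CBOW pools the whole context into a single prediction per position.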

You can use an existing pretrained word embedding model such as word2vec in your workflow. Alternatively, you can create your own word embedding model. Some things to consider are:

  • Pretrained models, like word2vec, make it easy to get started but may lack domain-specific words needed for a high-accuracy text analytics application.
  • Creating a custom model is more time-consuming, but a custom model may perform better in domain-specific applications.
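One way to weigh these two options is to measure how much of your domain text a pretrained vocabulary actually covers. The sketch below assumes you already have the pretrained model's vocabulary as a set of words; the vocabulary, the sample sentence, and the `coverage_report` helper are all hypothetical.

```python
from collections import Counter

def coverage_report(tokens, vocabulary):
    """Return (fraction of tokens covered, missing words by frequency)."""
    if not tokens:
        return 1.0, []
    counts = Counter(tokens)
    covered = sum(c for word, c in counts.items() if word in vocabulary)
    missing = Counter({word: c for word, c in counts.items()
                       if word not in vocabulary})
    return covered / len(tokens), missing.most_common()

# Hypothetical vocabulary of a general-purpose pretrained model.
pretrained_vocab = {"the", "patient", "was", "given", "and",
                    "dose", "of", "adjusted"}

# Hypothetical domain text, already lowercased and tokenized.
domain_tokens = ("the patient was given apixaban and "
                 "the dose of apixaban was adjusted").split()

coverage, missing = coverage_report(domain_tokens, pretrained_vocab)
print(f"coverage: {coverage:.2f}")
print("missing:", missing)
```

Low coverage, or missing words that carry most of the meaning in your documents, suggests that a custom or further-trained model may be worth the extra time.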

You can also include a pretrained word embedding layer, such as word2vec, in a deep learning network and continue training it for specific applications.
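The idea of continuing to train a pretrained embedding layer can be sketched without a deep learning framework. In the toy example below, the embedding table starts from hypothetical "pretrained" values, a single logistic unit sits on top of the averaged word vectors, and each gradient step updates both the classifier weights and the embedding rows that were used. A real network would have more layers and would rely on a framework's automatic differentiation; everything here is an assumption made for illustration.

```python
import math

# Hypothetical pretrained vectors; copied so the originals stay available.
pretrained = {"good": [0.5, 0.1], "bad": [0.4, 0.2], "movie": [0.3, 0.3]}
embedding = {word: list(vec) for word, vec in pretrained.items()}

# Classifier parameters on top of the embedding layer.
params = {"w": [0.1, -0.1], "b": 0.0}

def forward(tokens):
    """Average the token vectors, then apply a logistic unit."""
    rows = [embedding[t] for t in tokens]
    avg = [sum(col) / len(rows) for col in zip(*rows)]
    z = sum(w * x for w, x in zip(params["w"], avg)) + params["b"]
    return avg, 1.0 / (1.0 + math.exp(-z))

def mean_loss(data):
    """Mean binary cross-entropy over the data set."""
    total = 0.0
    for tokens, label in data:
        _, p = forward(tokens)
        total -= math.log(p) if label == 1 else math.log(1.0 - p)
    return total / len(data)

def train_step(tokens, label, lr):
    """One gradient step that also updates the embedding rows used."""
    avg, p = forward(tokens)
    err = p - label  # derivative of the loss with respect to z
    # Gradient for each token occurrence, computed with the current weights.
    grad_row = [err * w / len(tokens) for w in params["w"]]
    params["w"] = [w - lr * err * x for w, x in zip(params["w"], avg)]
    params["b"] -= lr * err
    for t in tokens:
        embedding[t] = [v - lr * g for v, g in zip(embedding[t], grad_row)]

data = [(["good", "movie"], 1), (["bad", "movie"], 0)]
loss_before = mean_loss(data)
for _ in range(500):
    for tokens, label in data:
        train_step(tokens, label, lr=0.1)
loss_after = mean_loss(data)
print(loss_before, loss_after)
print(pretrained["good"], embedding["good"])
```

Freezing the embedding instead would amount to skipping the final loop in `train_step`, so that only the classifier parameters change.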

Text Analytics Toolbox™, for use with MATLAB®, provides functions that read word embeddings produced by word2vec, GloVe, and FastText into a wordEmbedding object.
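To show what reading such a file involves, here is a minimal Python sketch of a parser for the plain-text embedding format. It is not the toolbox's implementation. It assumes the word2vec text layout, where an optional first line gives the vocabulary size and dimension and every following line holds a word and its vector values separated by spaces; GloVe text files typically omit that first line. The function name `read_text_embedding` and the sample data are hypothetical.

```python
import io

def read_text_embedding(stream):
    """Parse a text embedding file into a dict of word -> list of floats.

    A first line made of exactly two integers is treated as a
    "vocabulary_size dimension" header and skipped.
    """
    vectors = {}
    first = stream.readline().split()
    is_header = len(first) == 2 and all(field.isdigit() for field in first)
    if first and not is_header:
        vectors[first[0]] = [float(x) for x in first[1:]]
    for line in stream:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

# Hypothetical file contents: a header followed by three 2-dimensional vectors.
sample = "3 2\nitaly 0.1 0.2\nrome 0.3 0.4\nparis 0.5 0.6\n"
vectors = read_text_embedding(io.StringIO(sample))
print(len(vectors), vectors["rome"])
```

The header check is a heuristic: a headerless file whose first entry is a number-like word with a one-dimensional integer vector would be misread, which is unlikely for real embeddings but worth knowing.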

To learn more about using word2vec and building models with text data, see Text Analytics Toolbox.

See also: natural language processing, sentiment analysis, text mining with MATLAB, data science, deep learning, Deep Learning Toolbox™, Statistics and Machine Learning Toolbox™, Predictive Maintenance Toolbox™