In NLP, the idea behind vector embeddings (word embeddings) is to represent each word as a dense, low-dimensional vector, rather than a sparse one-hot encoded vector. This lets models capture the meaning of words from the contexts in which they appear: words used in similar contexts end up with similar vectors.
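
As a rough illustration in PyTorch (the vocabulary size, embedding size, and word index below are made up), an nn.Embedding table stores one dense vector per word, and looking a word up is equivalent to multiplying its one-hot vector by the embedding matrix:

import torch
import torch.nn as nn

vocab_size, embed_dim = 10_000, 300             # hypothetical sizes
embedding = nn.Embedding(vocab_size, embed_dim)

word_idx = torch.tensor([42])                   # index of some word in the vocabulary
dense = embedding(word_idx)                     # shape (1, 300): the dense, low-dimensional embedding

# Equivalent view: the word's one-hot row vector times the embedding matrix
one_hot = nn.functional.one_hot(word_idx, vocab_size).float()  # shape (1, 10000), sparse
same_as_dense = one_hot @ embedding.weight                     # identical to `dense`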

As usual, we pass a word through an encoder to create a low-dimensional embedding. Because the meaning of a word depends on its context (i.e., the words that appear nearby), our decoder is trained to predict those nearby words.
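
A minimal sketch of this encoder/decoder view (the class name and sizes are illustrative, not from any particular library): the encoder is an embedding lookup, and the decoder is a linear layer that scores every vocabulary word as a possible nearby word.

import torch
import torch.nn as nn

class WordEncoderDecoder(nn.Module):
    """Encode a word; decode scores over the words likely to appear near it."""
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.encoder = nn.Embedding(vocab_size, embed_dim)  # word index -> low-dimensional embedding
        self.decoder = nn.Linear(embed_dim, vocab_size)     # embedding -> logits over nearby words

    def forward(self, word_idx):
        z = self.encoder(word_idx)   # (batch, embed_dim)
        return self.decoder(z)       # (batch, vocab_size)

model = WordEncoderDecoder(vocab_size=10_000, embed_dim=128)
logits = model(torch.tensor([3, 17]))  # one row of nearby-word scores per input word

Training this with cross-entropy against observed context words is essentially the word2vec skip-gram setup.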

We can use a self-supervised objective (such as predicting the next word/token) to learn embeddings over tokens. Models like word2vec and GloVe learn static embeddings, with a single embedding per word shared across all of its senses. RNN- and transformer-based models learn contextual embeddings, where the embedding of the same word changes according to the sentence it appears in.
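
To see the contextual case concretely, here is a sketch assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (neither is specified above): the same word "bank" gets a different vector in each sentence, whereas a static word2vec/GloVe embedding of "bank" would be identical in both.

import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def contextual_embedding(sentence, word):
    """Return the contextual embedding of `word` within `sentence`."""
    enc = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, hidden_dim)
    idx = enc.input_ids[0].tolist().index(tok.convert_tokens_to_ids(word))
    return hidden[idx]

v1 = contextual_embedding("I sat by the river bank.", "bank")
v2 = contextual_embedding("I deposited money at the bank.", "bank")
print(torch.cosine_similarity(v1, v2, dim=0))  # below 1: the two "bank" vectors differ across contexts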

Some commonly used models are word2vec and GloVe (static embeddings), and ELMo- or BERT-style models (contextual embeddings).

Distance measures

The distance between vectors in the embedding space tells us which words have similar embeddings, and hence which words the model treats as similar in meaning.

The L2 norm of the difference between two embeddings gives the Euclidean distance between them in the embedding space:
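
$$ d(\mathbf{u}, \mathbf{v}) \;=\; \lVert \mathbf{u} - \mathbf{v} \rVert_2 \;=\; \sqrt{\sum_i (u_i - v_i)^2} $$

where $\mathbf{u}$ and $\mathbf{v}$ are the two word embeddings.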

The cosine similarity gives the cosine of the angle between two embeddings. Unlike the Euclidean distance, it is invariant to the magnitudes of the vectors:
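
$$ \operatorname{cos\_sim}(\mathbf{u}, \mathbf{v}) \;=\; \cos(\theta) \;=\; \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert_2 \, \lVert \mathbf{v} \rVert_2} $$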

And in code, with PyTorch (here assuming the pretrained GloVe vectors from torchtext; any word-to-tensor lookup works the same way):

import torch
from torchtext.vocab import GloVe

glove = GloVe(name="6B", dim=100)  # pretrained 100-dimensional GloVe embeddings

# Euclidean distance between two word embeddings (sub in your words of choice)
torch.norm(glove['cat'] - glove['dog'])

# Cosine similarity; unsqueeze(0) adds the batch dimension cosine_similarity expects
torch.cosine_similarity(glove['cat'].unsqueeze(0), glove['dog'].unsqueeze(0))