In NLP, the idea behind vector embeddings (word embeddings) is that words, which start out as one-hot encoded vectors over the vocabulary, can be mapped to dense, low-dimensional vectors. This lets models capture the meaning of words based on the contexts in which they appear.
As usual, we pass a word through an encoder to create a low-dimensional embedding. Because the meaning of a word depends on its context (i.e., the words that appear nearby), the decoder is trained to predict those nearby words.
We can use a self-supervised objective (such as predicting the next word/token, or predicting nearby words) to learn embeddings over tokens. Models like word2vec and GloVe learn static embeddings, with a single embedding per word shared across all of its senses. RNN- and transformer-based models learn contextual embeddings, where the embedding of the same word changes depending on the sentence it appears in.
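To make the encoder/decoder picture concrete, here is a minimal skip-gram-style sketch in PyTorch. The vocabulary size, embedding dimension, and word ids are made-up assumptions, and the linear decoder is just one simple choice of objective, not the setup of any particular library or paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed (made-up) vocabulary size and embedding dimension
vocab_size, embed_dim = 10_000, 128

# Encoder: maps a word id to a dense, low-dimensional embedding
encoder = nn.Embedding(vocab_size, embed_dim)
# Decoder: maps an embedding to scores over the vocabulary,
# used to predict which words appear nearby
decoder = nn.Linear(embed_dim, vocab_size)

center_ids = torch.tensor([42, 7])    # hypothetical center-word ids
context_ids = torch.tensor([13, 99])  # hypothetical nearby-word ids

logits = decoder(encoder(center_ids))        # shape: (batch, vocab_size)
loss = F.cross_entropy(logits, context_ids)  # self-supervised objective
loss.backward()                              # gradients update the embeddings
```

After training on many (center word, nearby word) pairs, the rows of the encoder's weight matrix serve as the learned static embeddings.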
Some commonly used models are word2vec and GloVe for static embeddings, and transformer-based models such as BERT for contextual embeddings.
Distance measures
The distance between vectors in the embedding space tells us which words are likely to be similar in meaning: words whose embeddings lie close together are treated as semantically similar.
The L2 norm of the difference between two embeddings u and v gives their Euclidean distance in the embedding space:
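$$ d(\mathbf{u}, \mathbf{v}) = \lVert \mathbf{u} - \mathbf{v} \rVert_2 = \sqrt{\sum_i (u_i - v_i)^2} $$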
The cosine similarity gives the cosine of the angle between two embeddings, which is invariant to their magnitudes:
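$$ \mathrm{cos\_sim}(\mathbf{u}, \mathbf{v}) = \cos\theta = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert_2 \, \lVert \mathbf{v} \rVert_2} $$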
And in code, with PyTorch:
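A minimal sketch, assuming two made-up example vectors (torch.dist gives the same Euclidean distance as the norm of the difference):

```python
import torch
import torch.nn.functional as F

# Two made-up word embeddings
u = torch.tensor([1.0, 2.0, 3.0])
v = torch.tensor([2.0, 2.0, 1.0])

# Euclidean distance: the L2 norm of the difference vector
euclidean = torch.linalg.norm(u - v)  # equivalently: torch.dist(u, v)

# Cosine similarity: cosine of the angle between the embeddings,
# unaffected by their magnitudes
cosine = F.cosine_similarity(u, v, dim=0)

print(euclidean.item(), cosine.item())
```

Note that a smaller Euclidean distance means more similar embeddings, while a larger cosine similarity (closer to 1) means more similar directions.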