GloVe (Global Vectors) is an NLP machine learning approach for learning word embeddings. Unlike word2vec (which learns primarily from local context windows), GloVe builds global corpus statistics into its embeddings, by:
- Computing co-occurrence counts for each word pair (i.e., the frequency with which words appear together) across the entire corpus. This is represented as a matrix $X$, where element $X_{ij}$ denotes the number of times word $j$ appears in the context of word $i$ (a toy counting sketch follows this list).
- Optimisation: the inner product of two word vectors should be a good predictor of (the log of) their co-occurrence count (the exact objective is given after this list).
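For the first step, here is a toy sketch of window-based co-occurrence counting. The corpus and window size are made up for illustration, and real GloVe additionally down-weights a pair by its distance within the window:

```python
from collections import Counter

corpus = ["the", "cat", "sat", "on", "the", "mat"]
window = 2  # symmetric context window of 2 words on each side

counts = Counter()
for i, word in enumerate(corpus):
    # Count every word within `window` positions of the current word.
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if i != j:
            counts[(word, corpus[j])] += 1

print(counts[("the", "cat")])  # times "cat" appears in the context of "the"
```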
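For the second step, the weighted least-squares objective from the original paper (Pennington et al., 2014) makes the "good predictor" idea precise:

$$
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2
$$

where $V$ is the vocabulary size, $w_i$ and $\tilde{w}_j$ are word and context vectors, $b_i$ and $\tilde{b}_j$ are learned biases, and $f$ is a weighting function that caps the influence of very frequent pairs.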
GloVe embeddings do encode word relationships quite well (see the analogy sketch at the end of this section)! But they also pick up biased relationships from the data they're trained on.
In code
torchtext gives us the ability to load pre-trained GloVe embeddings. The snippet below loads the "6B" embeddings, trained on roughly 6 billion tokens (the 2014 Wikipedia dump plus Gigaword 5); the 6 billion refers to tokens in the training corpus, not parameters.
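A minimal sketch, assuming a recent torchtext version (the `GloVe` loader lives in `torchtext.vocab`; the first call downloads and caches the vectors):

```python
from torchtext.vocab import GloVe

# Download (and cache) the 6B-token GloVe vectors; dim can be 50, 100, 200, or 300.
glove = GloVe(name="6B", dim=100)

# Look up a single word's embedding: a 100-dimensional torch.Tensor.
vec = glove["king"]
print(vec.shape)  # torch.Size([100])
```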
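And a quick check of the word-relationship claim above: the classic king − man + woman ≈ queen analogy, ranking the vocabulary by cosine similarity (exact neighbours vary with the embedding dimension):

```python
import torch
from torchtext.vocab import GloVe

glove = GloVe(name="6B", dim=100)  # same vectors as in the previous snippet

# Vector arithmetic: the offset from "man" to "king" should mirror "woman" to "queen".
target = glove["king"] - glove["man"] + glove["woman"]

# Rank the whole vocabulary by cosine similarity to the target vector.
sims = torch.nn.functional.cosine_similarity(glove.vectors, target.unsqueeze(0))
top = sims.topk(5).indices.tolist()
print([glove.itos[i] for i in top])  # "queen" typically ranks near the top
```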