Bidirectional Encoder Representations from Transformers (BERT) is a transformer-based language model architecture, pre-trained with two self-supervised tasks.

The first self-supervised task is masked word prediction. 15% of the tokens are selected at random and replaced with the [MASK] token; using the context of the non-masked tokens, the model predicts the original value of each masked token. The loss is computed only on the masked tokens.
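As a rough illustration, here is a minimal Python sketch of the masking step, assuming simple whitespace tokenization; the function name and constants are mine, not BERT's actual implementation (which also leaves some selected tokens unchanged or swaps them for random tokens).

```python
import random

MASK_TOKEN = "[MASK]"
MASK_PROB = 0.15  # fraction of tokens selected for prediction

def mask_tokens(tokens):
    """Randomly select ~15% of tokens; return the masked sequence and labels.

    Labels hold the original token at masked positions and None elsewhere,
    so the loss can be computed only on the masked positions.
    """
    masked, labels = [], []
    for tok in tokens:
        if random.random() < MASK_PROB:
            masked.append(MASK_TOKEN)
            labels.append(tok)      # predict the original token here
        else:
            masked.append(tok)
            labels.append(None)     # ignored by the loss
    return masked, labels

tokens = "the cat sat on the mat".split()
print(mask_tokens(tokens))
```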

The second task is next sentence prediction. Given a pair of sentences, the model predicts whether the second sentence actually follows the first in the source text. The training pairs are balanced 50/50 between positive pairs (real consecutive sentences) and negative pairs (a randomly sampled second sentence).
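A minimal sketch of building those pairs, again with made-up names and a simplifying assumption that all sentences come from one list:

```python
import random

def make_nsp_pairs(sentences):
    """Build roughly 50/50 positive/negative pairs for next sentence prediction.

    Positive: (sentence, its actual successor), label 1.
    Negative: (sentence, a randomly chosen sentence), label 0.
    """
    pairs = []
    for i in range(len(sentences) - 1):
        if random.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], 1))  # real next sentence
        else:
            # In practice the negative sentence is drawn from a different document.
            random_next = random.choice(sentences)
            pairs.append((sentences[i], random_next, 0))
    return pairs
```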

Transfer learning

One of BERT's strengths is its transfer learning capability; for example, it has been used in Google Search to represent user queries and documents.

Self-supervised pre-training on large amounts of unlabelled text, via the masked word prediction task, lets the model pick up general patterns of language. From there, we apply the standard transfer learning recipe: fine-tune the pre-trained model with supervised learning on a specific downstream task using a labelled dataset.
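As a sketch of that fine-tuning step, the example below uses the Hugging Face transformers library (my choice here, not something prescribed by BERT itself); the binary sentiment task, texts, and labels are made up for illustration.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical downstream task: binary sentiment classification.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # adds a fresh classification head
)

# A tiny labelled batch, purely illustrative.
texts = ["a great movie", "a terrible plot"]
labels = torch.tensor([1, 0])

inputs = tokenizer(texts, padding=True, return_tensors="pt")
outputs = model(**inputs, labels=labels)

# The loss comes from the labelled task only; backpropagating it fine-tunes
# both the new head and the pre-trained BERT weights.
outputs.loss.backward()
```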