BERT

TLDR

Intro

Downstream Task Strategies

feature-based
- The feature-based approach, such as ELMo (Peters et al., 2018a), uses task-specific architectures that include the pre-trained representations as additional features.
fine-tuning
- The fine-tuning approach, such as the Generative Pre-trained Transformer (OpenAI GPT) (Radford et al., 2018), introduces minimal task-specific parameters and is trained on the downstream tasks by simply fine-tuning all pre-trained parameters.
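
The contrast between the two strategies can be sketched with a toy model. This is only an illustration under stated assumptions, not ELMo or GPT themselves: the scalar "encoder" weight `enc`, the task "head" weight `head`, the helper `sgd_step`, and the squared-error loss are all hypothetical choices made to keep the gradients readable.

```python
def sgd_step(params, x, y, lr, trainable):
    """One SGD step on squared error for the toy model y_hat = head * (enc * x).

    Only the parameters named in `trainable` are updated; the rest stay frozen.
    """
    feature = params["enc"] * x              # output of the "pre-trained" encoder
    error = params["head"] * feature - y     # prediction error of the task head
    grads = {
        "enc": 2 * error * params["head"] * x,   # d(error^2) / d(enc)
        "head": 2 * error * feature,             # d(error^2) / d(head)
    }
    updated = dict(params)
    for name in trainable:
        updated[name] = params[name] - lr * grads[name]
    return updated


# Hypothetical "pre-trained" starting point shared by both strategies.
pretrained = {"enc": 0.5, "head": 1.0}

# Feature-based: the encoder is frozen and only the task-specific part learns.
feature_based = sgd_step(pretrained, x=1.0, y=2.0, lr=0.1, trainable={"head"})

# Fine-tuning: every parameter, including the pre-trained one, is updated.
fine_tuned = sgd_step(pretrained, x=1.0, y=2.0, lr=0.1, trainable={"enc", "head"})

print("feature-based:", feature_based)
print("fine-tuned:   ", fine_tuned)
```

In the feature-based run the encoder weight keeps its pre-trained value, while in the fine-tuning run it moves along with the head.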

Unsupervised Feature-based Approaches

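Word -> Sentence -> Context: pre-trained representations progressed from word embeddings, to sentence and paragraph embeddings, to context-sensitive representations such as ELMo.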

Unsupervised Fine-tuning Approaches

More recently, sentence or document encoders which produce contextual token representations have been pre-trained from unlabeled text and fine-tuned for a supervised downstream task.

There are two steps in our framework: pre-training and fine-tuning.
For fine-tuning, the BERT model is first initialized with the pre-trained parameters, and all the parameters are fine-tuned using labeled data from the downstream tasks.

In this work, we denote the number of layers (i.e., Transformer blocks) as L, the hidden size as H, and the number of self-attention heads as A. We primarily report results on two model sizes: BERT_BASE (L=12, H=768, A=12, Total Parameters=110M) and BERT_LARGE (L=24, H=1024, A=16, Total Parameters=340M).
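
As a rough sanity check on the reported totals, the parameter count can be estimated from L and H alone. The sketch below rests on assumptions that are not stated in these notes: a vocabulary of 30,522 entries, 512 position embeddings, 2 segment embeddings, a feed-forward size of 4H, bias terms and LayerNorm in every sub-layer, and a final pooler layer. The function name `bert_param_count` is hypothetical, and pre-training output heads are left out.

```python
def bert_param_count(num_layers, hidden, vocab_size=30522, max_positions=512,
                     segment_types=2, ffn_mult=4):
    """Estimate the number of encoder parameters from the layer count and hidden size."""
    layer_norm = 2 * hidden  # one scale and one bias vector
    # Token, position, and segment embedding tables, followed by a LayerNorm.
    embeddings = (vocab_size + max_positions + segment_types) * hidden + layer_norm
    # Query, key, value, and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden)
    # Two dense layers: hidden -> ffn_size -> hidden, each with a bias.
    ffn_size = ffn_mult * hidden
    feed_forward = (hidden * ffn_size + ffn_size) + (ffn_size * hidden + hidden)
    # Each Transformer block has one LayerNorm after attention and one after the FFN.
    per_layer = attention + feed_forward + 2 * layer_norm
    # Dense layer applied to the final [CLS] vector.
    pooler = hidden * hidden + hidden
    return embeddings + num_layers * per_layer + pooler


base = bert_param_count(num_layers=12, hidden=768)
large = bert_param_count(num_layers=24, hidden=1024)
print(f"BERT_BASE:  {base / 1e6:.1f}M parameters")
print(f"BERT_LARGE: {large / 1e6:.1f}M parameters")
```

The number of heads A does not appear in the estimate, because the heads split the hidden size H between them rather than adding parameters. Under these assumptions both estimates land close to the reported 110M and 340M.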

Input/Output Representations

To make BERT handle a variety of downstream tasks, our input representation is able to unambiguously represent both a single sentence and a pair of sentences (e.g., 〈 Question, Answer 〉) in one token sequence.

As shown in Figure 1, we denote the input embedding as E, the final hidden vector of the special [CLS] token as $C \in \mathbb{R}^H$, and the final hidden vector for the i-th input token as $T_i \in \mathbb{R}^H$.
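
Packing one or two sentences into a single token sequence can be sketched as follows. The [SEP] separator and the 0/1 segment ids follow the BERT paper rather than these notes, the helper `pack_inputs` is a hypothetical name, and the inputs are assumed to be already tokenized (the real model uses WordPiece tokens).

```python
def pack_inputs(tokens_a, tokens_b=None):
    """Build one token sequence and its segment ids from one or two sentences."""
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    segment_ids = [0] * len(tokens)          # [CLS], sentence A, and its [SEP]
    if tokens_b is not None:
        tokens_b = list(tokens_b)
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)  # sentence B and its [SEP]
    return tokens, segment_ids


# A single sentence and a (question, answer) pair use the same layout.
single = pack_inputs(["the", "movie", "was", "great"])
pair = pack_inputs(["who", "wrote", "it", "?"], ["jane", "austen"])

for tokens, segment_ids in (single, pair):
    print(list(zip(tokens, segment_ids)))
```

The segment ids are what let the model tell the two sentences apart once they share one sequence.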

3.1 Pre-training BERT

Task #1: Masked LM

In order to train a deep bidirectional representation, we simply mask some percentage of the input tokens at random, and then predict those masked tokens.
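
A minimal sketch of the masking step is shown below. The 15% masking rate and the 80/10/10 split between [MASK], a random token, and the unchanged token are taken from the BERT paper, not from these notes; the helper `mask_tokens`, the tiny vocabulary, and the fixed seed are hypothetical choices for illustration.

```python
import random

SPECIAL_TOKENS = {"[CLS]", "[SEP]"}


def mask_tokens(tokens, vocab, mask_rate=0.15, seed=0):
    """Pick random positions to predict and corrupt them for masked-LM training.

    Returns the corrupted sequence and a dict mapping each chosen position
    to the original token the model should predict there.
    """
    rng = random.Random(seed)
    # Special tokens are never selected for prediction.
    candidates = [i for i, tok in enumerate(tokens) if tok not in SPECIAL_TOKENS]
    num_to_predict = min(len(candidates), max(1, round(len(candidates) * mask_rate)))
    chosen = rng.sample(candidates, num_to_predict)

    masked = list(tokens)
    labels = {}
    for i in chosen:
        labels[i] = tokens[i]
        roll = rng.random()
        if roll < 0.8:
            masked[i] = "[MASK]"            # most of the time: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)   # sometimes: replace with a random token
        # otherwise: leave the original token in place
    return masked, labels


tokens = ["[CLS]", "my", "dog", "is", "hairy", "and", "very", "friendly", "[SEP]"]
vocab = ["my", "dog", "is", "hairy", "and", "very", "friendly", "apple", "cat"]
masked, labels = mask_tokens(tokens, vocab)
print(masked)
print(labels)
```

Only the positions recorded in `labels` contribute to the prediction loss; every other position is passed through unchanged.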

Task #2: Next Sentence Prediction (NSP)
