Review

Transformer

Attention mechanism

$\hat{\alpha} = \text{attn}(h_{s}, h_{t-1})$
$\alpha = \text{softmax}(\hat{\alpha})$
$h_{s}$: State of the source sequence at time step $s$
$h_{t-1}$: State of the target sequence at time step $t-1$
$\hat{\alpha}_{s}$: Attention score for source position $s$
$\hat{\alpha} = [\hat{\alpha}_1, \hat{\alpha}_2, \cdots, \hat{\alpha}_L]$

Examples of attn functions

$$\text{attn}(q,k) = \begin{cases} \omega^T \tanh(W[q;k]) & \text{MLP} \\ q^T W k & \text{Bilinear} \\ q^T k & \text{Dot Product} \\ \frac{q^T k}{\sqrt{d}} & \text{Scaled Dot Product (avoids overly large scores when the vector dimension } d \text{ is large)} \end{cases}$$
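
A minimal NumPy sketch of the four scoring variants above; the function names and the shapes assumed for $q$, $k$, $W$, and $\omega$ are illustrative assumptions, not part of the original notes:

```python
import numpy as np

def attn_mlp(q, k, W, w):
    # MLP (additive): w^T tanh(W [q; k]); assumed shapes W: (h, 2d), w: (h,)
    return w @ np.tanh(W @ np.concatenate([q, k]))

def attn_bilinear(q, k, W):
    # Bilinear: q^T W k; assumed shape W: (d, d)
    return q @ W @ k

def attn_dot(q, k):
    # Dot product: q^T k
    return q @ k

def attn_scaled_dot(q, k):
    # Scaled dot product: q^T k / sqrt(d), keeps scores moderate when d is large
    return q @ k / np.sqrt(q.shape[-1])
```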

Position Information

  1. Position Embedding
    Similar to word embeddings:
    assign a continuous, low-dimensional, dense vector to each absolute position in the sequence

  2. Positional Encoding

    $f: \mathbb{N} \rightarrow \mathbb{R}^{d}$
    Maps a position index to a $d$-dimensional vector (see the sketch after this list)
    $$\text{PosEnc}(p, i) = \begin{cases} \sin\left(\frac{p}{10000^{\frac{i}{d}}}\right) & i \text{ is even} \\ \cos\left(\frac{p}{10000^{\frac{i-1}{d}}}\right) & i \text{ is odd} \end{cases}$$

    $p$: Position index

    $0 \leq i < d$: Position vector index
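
A minimal NumPy sketch of the sinusoidal encoding above; the function name pos_enc and the plain double loop are illustrative choices only:

```python
import numpy as np

def pos_enc(n_positions, d):
    """Sinusoidal positional encoding following PosEnc(p, i) above."""
    pe = np.zeros((n_positions, d))
    for p in range(n_positions):      # position index p
        for i in range(d):            # position vector index, 0 <= i < d
            if i % 2 == 0:
                pe[p, i] = np.sin(p / 10000 ** (i / d))
            else:
                pe[p, i] = np.cos(p / 10000 ** ((i - 1) / d))
    return pe
```

For example, pos_enc(128, 512) gives a (128, 512) matrix that can be added to the word embeddings of a length-128 sequence.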

Query, Key, Value

Three different parameter matrices $W^q$, $W^k$, and $W^v$ are used to transform the input vectors $x_i$ into new vectors:

Query: $q_i = W^q x_i$
Key: $k_i = W^k x_i$
Value: $v_i = W^v x_i$

$\hat{\alpha}_{i} = [\hat{\alpha}_{i1}, \hat{\alpha}_{i2}, \cdots, \hat{\alpha}_{iL}]$

New output vectors can be calculated as follows:

$$y_i = \sum\limits_{j=1}^n \alpha_{ij} v_j \\ \alpha_{ij} = \text{Softmax}(\hat{\alpha}_{i})_j \\ \hat{\alpha}_{ij} = \text{attn}(q_i, k_j)$$
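
A minimal single-head sketch of these equations in NumPy, using the scaled dot-product score; the function name and the row-vector convention (the $x_i$ as rows of X) are assumptions:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention over the rows of X (a sketch).

    X:          (L, d_in) matrix whose rows are the input vectors x_i
    Wq, Wk, Wv: (d_in, d) projections (row-vector form of W^q, W^k, W^v)
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv              # q_i, k_i, v_i for every position
    scores = Q @ K.T / np.sqrt(K.shape[-1])       # attn(q_i, k_j), scaled dot product
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax -> alpha_ij
    return weights @ V                              # y_i = sum_j alpha_ij v_j
```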

Additional tricks

  1. Multi-layer self-attention
  2. Multi-head attention
    Combine several self-attention mechanisms with different matrices, which are called attention heads (see the sketch below)
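
A minimal sketch of multi-head attention built on the self_attention function above; the list-of-heads interface and the output projection Wo are illustrative assumptions:

```python
import numpy as np

def multi_head_attention(X, heads, Wo):
    """Run one self-attention per head and mix the concatenated outputs.

    heads: list of (Wq, Wk, Wv) tuples, one set of matrices per attention head
    Wo:    (n_heads * d, d_model) output projection
    """
    outputs = [self_attention(X, Wq, Wk, Wv) for Wq, Wk, Wv in heads]
    return np.concatenate(outputs, axis=-1) @ Wo
```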

To conclude, the Transformer excels at capturing long-range dependencies in sequences, but may require significant computational resources because self-attention scales quadratically with the sequence length

Generative Pre-Training

Unsupervised Pre-training

Use a conventional language model to optimize the maximum likelihood estimate (MLE) of the given text sequence $x_1, x_2, \cdots, x_n$

$$h^{[0]} = e_{x'} W^{e} + W^{p} \\ h^{[l]} = \text{Transformer-Block}(h^{[l-1]}), \forall l \in \{1, 2, \cdots, L\} \\ P(x) = \text{Softmax}(h^{[L]} {W^{e}}^{T}) \\ \mathcal{L}^{\mathrm{PT}}(x) = \sum_i \log P\left(x_i \mid x_{i-k} \cdots x_{i-1}; \boldsymbol{\theta}\right)$$
$e_{x'}$: One-hot vector of $x'$
$W^{e} \in \mathbb{R}^{|\mathbb{V}| \times d}$: Word vector matrix
$W^{p} \in \mathbb{R}^{n \times d}$: Position vector matrix
$L$: Number of layers
$\boldsymbol{\theta}$: Model parameters, which can be optimized using SGD

Predict the next word based on the previous $k$ (window size) words
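
A minimal sketch of the pre-training objective $\mathcal{L}^{\mathrm{PT}}$; the layout of log_probs and the skipping of the first $k$ positions are assumptions made for illustration:

```python
def pretrain_objective(log_probs, x, k):
    """L^PT(x) = sum_i log P(x_i | x_{i-k} ... x_{i-1}; theta), to be maximized.

    log_probs: (n, |V|) array; row i holds log P(. | x_{i-k} ... x_{i-1}),
               i.e. the log of the model's Softmax output above
    x:         length-n sequence of token ids
    k:         window size; positions without a full window are skipped here
    """
    return sum(log_probs[i, x[i]] for i in range(k, len(x)))
```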

Supervised Fine-tuning

$D$: annotated data of the downstream task

$$P(y \mid x) = \operatorname{Softmax}\left(h^{[L]} W^{y}\right) \\ \mathcal{L}^{\mathrm{FT}}(D) = \sum\limits_{(x,y)} \log P(y \mid x_1 \cdots x_n) \\ \mathcal{L}(D) = \mathcal{L}^{\mathrm{PT}}(D) + \lambda \mathcal{L}^{\mathrm{FT}}(D)$$
$\lambda$ is usually set to 0.5
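
A minimal sketch of the combined fine-tuning objective; the layout of label_log_probs is an assumption, only the $\mathcal{L}(D) = \mathcal{L}^{\mathrm{PT}}(D) + \lambda \mathcal{L}^{\mathrm{FT}}(D)$ combination comes from the formulas above:

```python
def finetune_objective(label_log_probs, y, l_pt, lam=0.5):
    """L(D) = L^PT(D) + lambda * L^FT(D), to be maximized.

    label_log_probs: (|D|, n_labels) array of log Softmax(h^[L] W^y) per example
    y:               gold labels of the |D| examples
    l_pt:            auxiliary language-modeling objective on the same data
    lam:             the lambda weight, usually 0.5 as noted above
    """
    l_ft = sum(label_log_probs[i, y[i]] for i in range(len(y)))
    return l_pt + lam * l_ft
```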

Prompt Learning

Transform the data of downstream tasks into natural language form
Prompt Engineering; Answer Engineering
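
A hypothetical sentiment-classification example of this transformation; the template text and the label words are invented purely for illustration:

```python
# Prompt engineering: wrap the task input in a natural-language template.
template = "Review: {text} Overall, the movie was [MASK]."
prompt = template.format(text="The plot was gripping and the cast superb.")

# Answer engineering: map the LM's answer word for [MASK] back to a task label.
verbalizer = {"great": "positive", "terrible": "negative"}
```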

Instruction Learning

From “Finetuned Language Models Are Zero-Shot Learners”
Code: FLAN
The LM learns to follow instructions even on unseen tasks

  1. Generate hard (discrete) instruction prompts for each task
  2. Fine-tune on several tasks with their full training data (full-shot)
  3. Evaluate on unseen tasks in a zero-shot setting

InstructGPT

Suggests three goals for LMs:

  • helpful
  • honest
  • harmless

Train the reward model (RM) on training samples constructed through instruction learning

Guide the training of the RL model through RM scoring

Methodology

  1. Collect demonstration data and SFT
    A pre-trained model is fine-tuned on the data using supervised learning

  2. Reward Model Training
    Labeling is ranking-based or discriminative
    The RM learns to evaluate LM outputs based on human feedback

  3. RL via PPO
    Use the outputs of the SFT model as inputs to the RM for scoring
    Optimize the policy parameters by RL

  4. Human Feedback and Labeling

  5. Mitigation of performance regressions

Models

Reward Model

Uses a pairwise ranking loss

$$\text{loss}(\theta) = -\frac{1}{C_k^2} E_{(x, y_w, y_l) \sim D}\left[\log \sigma\left(r_{\theta}(x, y_w) - r_{\theta}(x, y_l)\right)\right]$$
$r_{\theta}(x,y)$: Scalar output of the RM for prompt $x$ and completion $y$, with parameters $\theta$
$y_w$: Preferred completion out of the pair $y_w$ and $y_l$
$D$: Comparison dataset
Choose $k = 9$, so one batch contains 36 ($C_9^2$) pairs
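
A minimal NumPy sketch of this loss for a single prompt; taking the completions in labeler-preference order is an assumption about the input format:

```python
import numpy as np
from itertools import combinations

def rm_pairwise_loss(ranked_rewards):
    """Average -log sigma(r(x, y_w) - r(x, y_l)) over all C(k, 2) pairs.

    ranked_rewards: r_theta(x, y) for the k completions of one prompt x,
                    ordered from most to least preferred; k = 9 gives 36 pairs.
    """
    pairs = list(combinations(ranked_rewards, 2))  # (r_w, r_l): w ranked above l
    losses = [np.log1p(np.exp(rl - rw)) for rw, rl in pairs]  # -log sigmoid(r_w - r_l)
    return sum(losses) / len(pairs)
```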

RL, PPO

The SFT model is further fine-tuned using Proximal Policy Optimization (PPO) in a bandit environment
The RM scores the outputs of the SFT model
A per-token KL penalty from the SFT model is used to mitigate over-optimization of the reward model

$$\text{objective}(\phi) = E_{(x,y) \sim D_{\pi_{\phi}^{RL}}}\left[r_{\theta}(x,y) - \beta \log\left(\pi_{\phi}^{RL}(y \mid x) / \pi^{SFT}(y \mid x)\right)\right] + \gamma E_{x \sim D_{PT}}\left[\log\left(\pi_{\phi}^{RL}(x)\right)\right]$$
$D_{PT}$: Pretraining distribution
$\pi^{SFT}$: Supervised fine-tuned model
$\pi_{\phi}^{RL}$: RL policy, initialized from $\pi^{SFT}$
$\beta$: KL reward coefficient, controls the strength of the KL penalty
$\gamma$: Pretraining loss coefficient; $\gamma = 0$ gives plain PPO, and the larger it is, the closer the objective is to GPT-3 pretraining
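
A minimal sketch of estimating this objective from samples; the input arrays and their names are assumptions, only the functional form follows the equation above:

```python
import numpy as np

def ppo_ptx_objective(r, logp_rl, logp_sft, logp_rl_pretrain, beta, gamma):
    """Sample-based estimate of objective(phi).

    r:                 r_theta(x, y) for samples (x, y) drawn from the RL policy
    logp_rl, logp_sft: log pi_phi^RL(y|x) and log pi^SFT(y|x) for those samples
    logp_rl_pretrain:  log pi_phi^RL(x) on samples x drawn from D_PT
    """
    kl_penalized_reward = np.mean(np.asarray(r)
                                  - beta * (np.asarray(logp_rl) - np.asarray(logp_sft)))
    pretraining_term = gamma * np.mean(logp_rl_pretrain)
    return kl_penalized_reward + pretraining_term
```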

Explanation

  1. Input prompt $x$ into the RL model to generate output $y$, then feed $y$ into the pre-trained RM to obtain a score
    A maximal score means the output generated by the RL model ranks highest among the human rankings

  2. Penalty: as the model updates and the data distributions drift apart,
    the KL divergence in the loss function measures the difference between the two probability distributions (the RL policy and the SFT model)

  3. Objective function of GPT-3: prevents the model from only performing well on human-ranking results

Conclusion

PPO-PTX

  • PPO objective function (fine-tuned on newly annotated data)
  • GPT-3 objective function (pre-trained on large-scale data)

Reference

Ouyang, Long, et al. “Training language models to follow instructions with human feedback.” Advances in Neural Information Processing Systems 35 (2022): 27730-27744.