Review

Transformer

Attention mechanism

$\hat{\alpha} = \text{attn}(h_{s}, h_{t-1})$
$\alpha = \text{softmax}(\hat{\alpha})$
$h_{s}$: State of the source sequence at time step $s$
$h_{t-1}$: State of the target sequence at time step $t-1$
$\hat{\alpha}_{s}$: Attention score for source position $s$
$\hat{\alpha} = [\hat{\alpha}_1, \hat{\alpha}_2, \cdots, \hat{\alpha}_L]$

Examples of attn functions

$$\text{attn}(q,k) = \begin{cases} \omega^T \tanh(W[q;k]) & \text{MLP} \\ q^T W k & \text{Bilinear} \\ q^T k & \text{Dot Product} \\ \frac{q^T k}{\sqrt{d}} & \text{Scaled Dot Product (avoids overly large scores when the vector dimension } d \text{ is large)} \end{cases}$$
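
A minimal NumPy sketch of the four scoring variants above; the function names and the shapes assumed for $q$, $k$, $W$, and $\omega$ are illustrative assumptions, not part of the original notes:

```python
import numpy as np

def attn_mlp(q, k, W, w):
    # MLP (additive): w^T tanh(W [q; k]); assumed shapes W: (h, 2d), w: (h,)
    return w @ np.tanh(W @ np.concatenate([q, k]))

def attn_bilinear(q, k, W):
    # Bilinear: q^T W k; assumed shape W: (d, d)
    return q @ W @ k

def attn_dot(q, k):
    # Dot product: q^T k
    return q @ k

def attn_scaled_dot(q, k):
    # Scaled dot product: q^T k / sqrt(d), keeps scores moderate when d is large
    return q @ k / np.sqrt(q.shape[-1])
```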

Position Information

  1. Position Embedding
    Similar to word embeddings:
    assign a continuous, low-dimensional, dense vector to each absolute position in the sequence

  2. Positional Encoding

    $f: \mathbb{N} \rightarrow \mathbb{R}^{d}$
    Maps a position index to a $d$-dimensional vector (see the sketch after this list)
    $$\text{PosEnc}(p, i) = \begin{cases} \sin\left(\frac{p}{10000^{\frac{i}{d}}}\right) & i \text{ is even} \\ \cos\left(\frac{p}{10000^{\frac{i-1}{d}}}\right) & i \text{ is odd} \end{cases}$$

    $p$: Position index

    $0 \leq i < d$: Position vector index
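
A minimal NumPy sketch of the sinusoidal encoding above; the function name pos_enc and the plain double loop are illustrative choices only:

```python
import numpy as np

def pos_enc(n_positions, d):
    """Sinusoidal positional encoding following PosEnc(p, i) above."""
    pe = np.zeros((n_positions, d))
    for p in range(n_positions):      # position index p
        for i in range(d):            # position vector index, 0 <= i < d
            if i % 2 == 0:
                pe[p, i] = np.sin(p / 10000 ** (i / d))
            else:
                pe[p, i] = np.cos(p / 10000 ** ((i - 1) / d))
    return pe
```

For example, pos_enc(128, 512) gives a (128, 512) matrix that can be added to the word embeddings of a length-128 sequence.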

Query, Key, Value

Three different parameter matrices $W^q$, $W^k$, and $W^v$ are used to transform the input vectors $x_i$ into new vectors:

Query: $q_i = W^q x_i$
Key: $k_i = W^k x_i$
Value: $v_i = W^v x_i$

$\hat{\alpha}_{i} = [\hat{\alpha}_{i1}, \hat{\alpha}_{i2}, \cdots, \hat{\alpha}_{iL}]$

New output vectors can be calculated as follows:

$$y_i = \sum\limits_{j=1}^n \alpha_{ij} v_j \\ \alpha_{ij} = \text{Softmax}(\hat{\alpha}_{i})_j \\ \hat{\alpha}_{ij} = \text{attn}(q_i, k_j)$$
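
A minimal single-head sketch of these equations in NumPy, using the scaled dot-product score; the function name and the row-vector convention (the $x_i$ as rows of X) are assumptions:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention over the rows of X (a sketch).

    X:          (L, d_in) matrix whose rows are the input vectors x_i
    Wq, Wk, Wv: (d_in, d) projections (row-vector form of W^q, W^k, W^v)
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv              # q_i, k_i, v_i for every position
    scores = Q @ K.T / np.sqrt(K.shape[-1])       # attn(q_i, k_j), scaled dot product
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax -> alpha_ij
    return weights @ V                              # y_i = sum_j alpha_ij v_j
```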

Additional tricks

  1. Multi-layer self-attention
  2. Multi-head attention
    Combine several self-attention mechanisms with different matrices, which are called attention heads (see the sketch below)
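
A minimal sketch of multi-head attention built on the self_attention function above; the list-of-heads interface and the output projection Wo are illustrative assumptions:

```python
import numpy as np

def multi_head_attention(X, heads, Wo):
    """Run one self-attention per head and mix the concatenated outputs.

    heads: list of (Wq, Wk, Wv) tuples, one set of matrices per attention head
    Wo:    (n_heads * d, d_model) output projection
    """
    outputs = [self_attention(X, Wq, Wk, Wv) for Wq, Wk, Wv in heads]
    return np.concatenate(outputs, axis=-1) @ Wo
```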

To conclude, the Transformer excels at capturing long-range dependencies in sequences, but may require significant computational resources because self-attention scales quadratically with the sequence length

Generative Pre-Training

Unsupervised Pre-training

Use a conventional language model to optimize the maximum likelihood estimate (MLE) of the given text sequence $x_1, x_2, \cdots, x_n$

$$h^{[0]} = e_{x'} W^{e} + W^{p} \\ h^{[l]} = \text{Transformer-Block}(h^{[l-1]}), \forall l \in \{1, 2, \cdots, L\} \\ P(x) = \text{Softmax}(h^{[L]} {W^{e}}^{T}) \\ \mathcal{L}^{\mathrm{PT}}(x) = \sum_i \log P\left(x_i \mid x_{i-k} \cdots x_{i-1}; \boldsymbol{\theta}\right)$$
$e_{x'}$: One-hot vector of $x'$
$W^{e} \in \mathbb{R}^{|\mathbb{V}| \times d}$: Word vector matrix
$W^{p} \in \mathbb{R}^{n \times d}$: Position vector matrix
$L$: Number of layers
$\boldsymbol{\theta}$: Model parameters, which can be optimized using SGD

Predict the next word based on the previous $k$ (window size) words
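
A minimal sketch of the pre-training objective $\mathcal{L}^{\mathrm{PT}}$; the layout of log_probs and the skipping of the first $k$ positions are assumptions made for illustration:

```python
def pretrain_objective(log_probs, x, k):
    """L^PT(x) = sum_i log P(x_i | x_{i-k} ... x_{i-1}; theta), to be maximized.

    log_probs: (n, |V|) array; row i holds log P(. | x_{i-k} ... x_{i-1}),
               i.e. the log of the model's Softmax output above
    x:         length-n sequence of token ids
    k:         window size; positions without a full window are skipped here
    """
    return sum(log_probs[i, x[i]] for i in range(k, len(x)))
```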

Supervised Fine-tuning

$D$: annotated data of the downstream task

$$P(y \mid x) = \operatorname{Softmax}\left(h^{[L]} W^{y}\right) \\ \mathcal{L}^{\mathrm{FT}}(D) = \sum\limits_{(x,y)} \log P(y \mid x_1 \cdots x_n) \\ \mathcal{L}(D) = \mathcal{L}^{\mathrm{PT}}(D) + \lambda \mathcal{L}^{\mathrm{FT}}(D)$$
$\lambda$ is usually set to 0.5
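
A minimal sketch of the combined fine-tuning objective; the layout of label_log_probs is an assumption, only the $\mathcal{L}(D) = \mathcal{L}^{\mathrm{PT}}(D) + \lambda \mathcal{L}^{\mathrm{FT}}(D)$ combination comes from the formulas above:

```python
def finetune_objective(label_log_probs, y, l_pt, lam=0.5):
    """L(D) = L^PT(D) + lambda * L^FT(D), to be maximized.

    label_log_probs: (|D|, n_labels) array of log Softmax(h^[L] W^y) per example
    y:               gold labels of the |D| examples
    l_pt:            auxiliary language-modeling objective on the same data
    lam:             the lambda weight, usually 0.5 as noted above
    """
    l_ft = sum(label_log_probs[i, y[i]] for i in range(len(y)))
    return l_pt + lam * l_ft
```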

Prompt Learning

Transform the data of downstream tasks into natural language form
Prompt Engineering; Answer Engineering
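
A hypothetical sentiment-classification example of this transformation; the template text and the label words are invented purely for illustration:

```python
# Prompt engineering: wrap the task input in a natural-language template.
template = "Review: {text} Overall, the movie was [MASK]."
prompt = template.format(text="The plot was gripping and the cast superb.")

# Answer engineering: map the LM's answer word for [MASK] back to a task label.
verbalizer = {"great": "positive", "terrible": "negative"}
```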

Instruction Learning

From “Finetuned Language Models Are Zero-Shot Learners”
Code: FLAN
The LM learns to follow instructions even on unseen tasks

  1. Generate hard (discrete) instruction prompts for each task
  2. Fine-tune on several tasks with their full training data (full-shot)
  3. Evaluate on unseen tasks in a zero-shot setting

InstructGPT

Suggests three goals for LMs:

  • helpful
  • honest
  • harmless

Train the reward model (RM) on training samples constructed through instruction learning

Guide the training of the RL model through RM scoring

Methodology

  1. Collect demonstration data and SFT
    A pre-trained model is fine-tuned on the data using supervised learning

  2. Reward Model Training
    Labeling is ranking-based or discriminative
    The RM learns to evaluate LM outputs based on human feedback

  3. RL via PPO
    Use the outputs of the SFT model as inputs to the RM for scoring
    Optimize the policy parameters by RL

  4. Human Feedback and Labeling

  5. Mitigation of performance regressions

Models

Reward Model

Uses a pairwise ranking loss

$$\text{loss}(\theta) = -\frac{1}{C_k^2} E_{(x, y_w, y_l) \sim D}\left[\log \sigma\left(r_{\theta}(x, y_w) - r_{\theta}(x, y_l)\right)\right]$$
$r_{\theta}(x,y)$: Scalar output of the RM for prompt $x$ and completion $y$, with parameters $\theta$
$y_w$: Preferred completion out of the pair $y_w$ and $y_l$
$D$: Comparison dataset
Choose $k = 9$, so one batch contains 36 ($C_9^2$) pairs
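
A minimal NumPy sketch of this loss for a single prompt; taking the completions in labeler-preference order is an assumption about the input format:

```python
import numpy as np
from itertools import combinations

def rm_pairwise_loss(ranked_rewards):
    """Average -log sigma(r(x, y_w) - r(x, y_l)) over all C(k, 2) pairs.

    ranked_rewards: r_theta(x, y) for the k completions of one prompt x,
                    ordered from most to least preferred; k = 9 gives 36 pairs.
    """
    pairs = list(combinations(ranked_rewards, 2))  # (r_w, r_l): w ranked above l
    losses = [np.log1p(np.exp(rl - rw)) for rw, rl in pairs]  # -log sigmoid(r_w - r_l)
    return sum(losses) / len(pairs)
```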

RL, PPO

The SFT model is further fine-tuned using Proximal Policy Optimization (PPO) in a bandit environment
The RM scores the outputs of the SFT model
A per-token KL penalty from the SFT model is used to mitigate over-optimization of the reward model

$$\text{objective}(\phi) = E_{(x,y) \sim D_{\pi_{\phi}^{RL}}}\left[r_{\theta}(x,y) - \beta \log\left(\pi_{\phi}^{RL}(y \mid x) / \pi^{SFT}(y \mid x)\right)\right] + \gamma E_{x \sim D_{PT}}\left[\log\left(\pi_{\phi}^{RL}(x)\right)\right]$$
$D_{PT}$: Pretraining distribution
$\pi^{SFT}$: Supervised fine-tuned model
$\pi_{\phi}^{RL}$: RL policy, initialized from $\pi^{SFT}$
$\beta$: KL reward coefficient, controls the strength of the KL penalty
$\gamma$: Pretraining loss coefficient; $\gamma = 0$ gives plain PPO, and the larger it is, the closer the objective is to GPT-3 pretraining
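
A minimal sketch of estimating this objective from samples; the input arrays and their names are assumptions, only the functional form follows the equation above:

```python
import numpy as np

def ppo_ptx_objective(r, logp_rl, logp_sft, logp_rl_pretrain, beta, gamma):
    """Sample-based estimate of objective(phi).

    r:                 r_theta(x, y) for samples (x, y) drawn from the RL policy
    logp_rl, logp_sft: log pi_phi^RL(y|x) and log pi^SFT(y|x) for those samples
    logp_rl_pretrain:  log pi_phi^RL(x) on samples x drawn from D_PT
    """
    kl_penalized_reward = np.mean(np.asarray(r)
                                  - beta * (np.asarray(logp_rl) - np.asarray(logp_sft)))
    pretraining_term = gamma * np.mean(logp_rl_pretrain)
    return kl_penalized_reward + pretraining_term
```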

Explanation

  1. Input prompt $x$ into the RL model to generate output $y$, then feed $y$ into the pre-trained RM to obtain a score
    A maximal score means the output generated by the RL model ranks highest among the human rankings

  2. Penalty: as the model updates and the data distributions drift apart,
    the KL divergence in the loss function measures the difference between the two probability distributions (the RL policy and the SFT model)

  3. Objective function of GPT-3: prevents the model from only performing well on human-ranking results

Conclusion

PPO-PTX

  • PPO objective function (fine-tuned on newly annotated data)
  • GPT-3 objective function (pre-trained on large-scale data)

Reference

Ouyang, Long, et al. “Training language models to follow instructions with human feedback.” Advances in Neural Information Processing Systems 35 (2022): 27730-27744.