InstructGPT
Review
Transformer
Attention mechanism
$h_s$: state of the source sequence at time step $s$
$h_{t-1}$: state of the target sequence at time step $t-1$
$\hat{\alpha}_s = \mathrm{attn}(h_s, h_{t-1})$: attention score
Examples of attention scoring functions ($\mathrm{attn}$): dot product, scaled dot product, and additive (MLP); see the sketch below
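A minimal NumPy sketch of these scoring functions (the weight shapes in `additive_attn` are illustrative assumptions, not from the notes):

```python
import numpy as np

def dot_attn(h_s, h_t):
    """Dot-product score: h_s . h_t (requires equal dimensions)."""
    return np.dot(h_s, h_t)

def scaled_dot_attn(h_s, h_t):
    """Scaled dot-product score; dividing by sqrt(d) keeps magnitudes stable."""
    return np.dot(h_s, h_t) / np.sqrt(h_s.shape[-1])

def additive_attn(h_s, h_t, W_s, W_t, v):
    """Additive (MLP) score: v . tanh(W_s @ h_s + W_t @ h_t)."""
    return np.dot(v, np.tanh(W_s @ h_s + W_t @ h_t))

def attn_weights(scores):
    """Attention weights are the softmax of the scores over all source states."""
    e = np.exp(scores - np.max(scores))
    return e / e.sum()
```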
Position Information
Position Embedding
Similar to word embeddings
assigning a continuous, low-dimensional, dense vector representation to each absolute position in a sequence
Positional Encoding
Map each position index onto a $d$-dimensional vector
$p$: position index
$i$: position vector (dimension) index
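For the positional-encoding variant, the standard sinusoidal mapping from the Transformer paper, with $p$ the position index, $i$ the vector-dimension index, and $d$ the encoding dimension:

$$\mathrm{PE}(p, 2i) = \sin\!\Big(\frac{p}{10000^{2i/d}}\Big), \qquad \mathrm{PE}(p, 2i+1) = \cos\!\Big(\frac{p}{10000^{2i/d}}\Big)$$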
Query, Key, Value
Three different parameter matrices $W^q$, $W^k$, and $W^v$ are used to transform the input vectors $x_i$ into new vectors:
Query: $q_i = W^q x_i$
Key: $k_i = W^k x_i$
Value: $v_i = W^v x_i$
New output vectors can be calculated as follows: $Y = \mathrm{softmax}\!\big(QK^\top / \sqrt{d_k}\big)\, V$, where the rows of $Q$, $K$, $V$ stack the $q_i$, $k_i$, $v_i$
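A minimal NumPy sketch of this computation for one attention head, under assumed shapes ($X$: $n \times d$ inputs; $W_q, W_k, W_v$: $d \times d_k$ projections):

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention for one head.

    X:             (n, d)   input vectors, one row per token
    W_q, W_k, W_v: (d, d_k) projection matrices
    Returns:       (n, d_k) new output vectors.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v             # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])         # (n, n) attention scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted sum of values
```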
Additional tricks
- Multi-layer self-attention
- Multi-head attention
Combine several self-attention mechanisms with different projection matrices, which are called attention heads; their outputs are concatenated (see the sketch below)
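A sketch of multi-head attention reusing the single-head `self_attention` function from the previous snippet; the output projection `W_o` is an assumed extra parameter:

```python
import numpy as np  # relies on self_attention() from the sketch above

def multi_head_attention(X, heads, W_o):
    """heads: list of (W_q, W_k, W_v) tuples, one set of projections per head.
    Each head attends independently; the head outputs are concatenated along
    the feature axis and mixed by the output projection W_o."""
    outputs = [self_attention(X, W_q, W_k, W_v) for (W_q, W_k, W_v) in heads]
    return np.concatenate(outputs, axis=-1) @ W_o
```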
To conclude, the Transformer excels at capturing long-range dependencies in sequences,
but may require significant computational resources due to the self-attention mechanism
Generative Pre-Training
Unsupervised Pre-training
Use a conventional language-modeling objective to maximize the likelihood (MLE) of the given text sequence
$W^e$: word vector (token embedding) matrix
$W^p$: position vector matrix
$L$: number of Transformer layers
$\theta$: model parameters, which can be optimized using SGD
Predict the next word based on the previous $k$ (window size) words
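Written out with the notation above (following the GPT paper), the pre-training objective over an unlabeled corpus $\mathcal{U} = \{u_1, \dots, u_n\}$ is:

$$\mathcal{L}_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \dots, u_{i-1}; \theta)$$
$$h^{(0)} = U W^e + W^p, \qquad h^{(l)} = \mathrm{TransformerBlock}\big(h^{(l-1)}\big),\ l = 1, \dots, L, \qquad P(u) = \mathrm{softmax}\big(h^{(L)} (W^e)^\top\big)$$

where $U$ is the matrix of context tokens $(u_{i-k}, \dots, u_{i-1})$.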
Supervised Fine-tuning
$\mathcal{D}$: annotated data for the downstream task
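Concretely, in GPT-paper notation (with $W_y$ the linear output layer added for the downstream task), the label is predicted from the final token's top-layer state, and the LM objective is usually kept as an auxiliary term:

$$P(y \mid x^1, \dots, x^m) = \mathrm{softmax}\big(h^{(L)}_m W_y\big), \qquad \mathcal{L}_2(\mathcal{D}) = \sum_{(x, y)} \log P(y \mid x^1, \dots, x^m), \qquad \mathcal{L}_3(\mathcal{D}) = \mathcal{L}_2(\mathcal{D}) + \lambda\, \mathcal{L}_1(\mathcal{D})$$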
Prompt Learning
Transform the data of downstream tasks into natural language forms
Prompt engineering; answer engineering
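A toy illustration of the idea (the template and answer mapping below are hypothetical, for sentiment classification):

```python
# Prompt engineering: wrap the raw input in a natural-language template.
def build_prompt(review: str) -> str:
    return f"Review: {review} Overall, the movie was [MASK]."

# Answer engineering: map words the LM may predict at [MASK] back to labels.
answer_map = {"great": "positive", "good": "positive",
              "terrible": "negative", "boring": "negative"}

prompt = build_prompt("The plot kept me on the edge of my seat.")
# The LM fills in [MASK]; answer_map turns its word prediction into a label.
```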
Instruction Learning
From “Finetuned Language Models Are Zero-Shot Learners”
Model: FLAN
LM learns to follow instructions even for unseen tasks
- Generate hard (discrete, natural-language) instruction templates for each task
- Fine-tune on several full-shot tasks
- Evaluate on zero-shot tasks
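A hypothetical instruction-formatted training sample in the FLAN style (the template wording is illustrative, not taken from the paper):

```python
# Each task is rewritten as a natural-language instruction plus its answer,
# so the model learns to follow instructions rather than task-specific heads.
example = {
    "instruction": "Translate the following sentence to French: "
                   "'The weather is nice today.'",
    "target": "Il fait beau aujourd'hui.",
}
# Fine-tune on many such instruction-formatted tasks (full-shot),
# then evaluate on instructions for tasks never seen in training (zero-shot).
```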
InstructGPT
Suggests three goals for LMs:
- helpful
- honest
- harmless
Train the reward model (RM) on training samples constructed through instruction learning
Guide the training of the RL model through RM scoring
Methodology
Collect demonstration data and train a supervised policy (SFT)
A pre-trained model is fine-tuned on this data using supervised learning
Reward Model Training
Ranking-based or discriminative labeling
The RM evaluates LM outputs based on human feedback
RL via PPO
Use the outputs of the SFT model (the policy) as inputs to the RM for scoring
Optimize the policy parameters with RL
Human Feedback and Labeling
Mitigation of performance regressions (alignment tax)
Models
Reward Model
Use the pairwise ranking loss
$$\mathrm{loss}(\theta) = -\frac{1}{\binom{K}{2}}\, \mathbb{E}_{(x,\, y_w,\, y_l)\sim D}\Big[\log\big(\sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\big)\Big]$$
$r_\theta(x, y)$: scalar output of the reward model for prompt $x$ and completion $y$
$y_w$: preferred completion out of the pair of $y_w$ and $y_l$
$D$: comparison dataset
Choose $K = 9$, so one batch contains all $\binom{9}{2} = 36$ comparison pairs from a single prompt
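A minimal PyTorch-style sketch of this ranking loss; `reward_model` is an assumed callable that maps a (prompt, completion) pair to a scalar score:

```python
import torch
import torch.nn.functional as F

def rm_ranking_loss(reward_model, x, y_w, y_l):
    """Pairwise ranking loss for one comparison:
    the preferred completion y_w should score higher than y_l."""
    r_w = reward_model(x, y_w)          # scalar reward for the preferred output
    r_l = reward_model(x, y_l)          # scalar reward for the dispreferred output
    return -F.logsigmoid(r_w - r_l)     # -log(sigmoid(r_w - r_l))

# With K = 9 completions per prompt, all C(9, 2) = 36 pairs from one prompt
# are kept in the same batch and the per-pair losses are averaged.
```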
RL, PPO
The SFT model is further fine-tuned using Proximal Policy Optimization (PPO) in a bandit environment
Use the RM to score the outputs of the policy (initialized from the SFT model)
A per-token KL penalty from the SFT Model is used to mitigate the overoptimization of the reward model
Objective (PPO-ptx):
$$\mathrm{objective}(\phi) = \mathbb{E}_{(x, y)\sim D_{\pi_\phi^{\mathrm{RL}}}}\Big[r_\theta(x, y) - \beta \log\frac{\pi_\phi^{\mathrm{RL}}(y \mid x)}{\pi^{\mathrm{SFT}}(y \mid x)}\Big] + \gamma\, \mathbb{E}_{x\sim D_{\mathrm{pretrain}}}\big[\log \pi_\phi^{\mathrm{RL}}(x)\big]$$
$\pi^{\mathrm{SFT}}$: supervised (SFT) trained model
$\pi_\phi^{\mathrm{RL}}$: policy of the RL model, initialized from $\pi^{\mathrm{SFT}}$
$\beta$: KL reward coefficient, controls the strength of the KL penalty
$\gamma$: pretraining loss coefficient; $\gamma = 0$ gives plain PPO, and the larger it is, the more the model behaves like GPT-3
Explanation
Input a prompt $x$ into the RL model to generate an output $y$, then feed $y$ into the trained RM to obtain a score
Score: maximizing the score means the outputs generated by the RL model rank highest under human preferences
Penalty: as the model updates and the data distribution drifts, the KL-divergence term in the loss measures the difference between the two probability distributions ($\pi_\phi^{\mathrm{RL}}$ vs. $\pi^{\mathrm{SFT}}$)
GPT-3 objective: prevents the model from only performing well on human-ranking results
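A sketch of how the KL penalty enters the per-completion reward during PPO, under the assumption that per-token log-probabilities are available from both the RL policy and the frozen SFT model:

```python
import torch

def rl_reward(rm_score, logprobs_rl, logprobs_sft, beta):
    """Reward for one sampled completion y given prompt x.

    rm_score:     scalar score from the reward model for (x, y)
    logprobs_rl:  per-token log-probs of y under the RL policy pi_RL
    logprobs_sft: per-token log-probs of y under the frozen SFT model pi_SFT
    beta:         KL reward coefficient
    """
    kl_per_token = logprobs_rl - logprobs_sft       # log(pi_RL / pi_SFT) per token
    return rm_score - beta * kl_per_token.sum()     # penalize drift from the SFT model
```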
Conclusion
PPO-PTX
- PPO objective function (fine-tuned on newly annotated data)
- GPT-3 objective function (pre-trained on large-scale data)
Reference
Ouyang, Long, et al. “Training language models to follow instructions with human feedback.” Advances in Neural Information Processing Systems 35 (2022): 27730-27744.