RLHF and Alignment
Question 1 of 3 · intermediate (4/10) · compare
Standard RLHF (as used for InstructGPT and early ChatGPT) has three stages. Which sequence is correct?
A. Supervised fine-tuning on demonstrations; reward model training on preferences; RL fine-tuning (PPO) against the reward model
B. Unsupervised embedding alignment; contrastive pretraining; reward optimization
C. RL pre-training from scratch; supervised fine-tuning; reward model filtering
D. Reward modeling on base outputs; supervised fine-tuning; RL against reward
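For reference after answering: the three-stage pipeline the question describes (per the InstructGPT paper) can be sketched as toy code. All function bodies and data below are placeholders of my own invention, not a real training implementation; the point is only the ordering of the stages and what each one consumes and produces.

```python
# Toy sketch of the standard RLHF pipeline (Ouyang et al., 2022).
# Stage outputs are plain dicts standing in for trained models.

def supervised_fine_tune(base_policy, demonstrations):
    # Stage 1: supervised fine-tuning (SFT) on human demonstrations.
    return {"policy": "sft_policy", "trained_on": len(demonstrations)}

def train_reward_model(sft_policy, preference_pairs):
    # Stage 2: fit a reward model on human preference comparisons
    # (chosen vs. rejected responses sampled from the SFT policy).
    return {"rm": "reward_model", "pairs": len(preference_pairs)}

def ppo_fine_tune(sft_policy, reward_model, prompts):
    # Stage 3: optimize the policy with PPO against the learned reward,
    # typically with a KL penalty toward the SFT policy.
    return {"policy": "rlhf_policy", "prompts": len(prompts)}

def rlhf_pipeline():
    demos = ["demo_1", "demo_2"]        # human-written demonstrations
    prefs = [("chosen", "rejected")]    # human preference pairs
    prompts = ["prompt_1", "prompt_2"]  # prompts for the RL stage
    sft = supervised_fine_tune("base_lm", demos)
    rm = train_reward_model(sft, prefs)
    final = ppo_fine_tune(sft, rm, prompts)
    return ["SFT", "reward_model", "PPO"], final

stages, final = rlhf_pipeline()
print(stages)  # → ['SFT', 'reward_model', 'PPO']
```

Note the data dependencies that fix the order: the reward model is trained on preferences over SFT-policy outputs, and PPO needs both the SFT policy (as initialization and KL anchor) and the reward model, so neither of the later stages can come first.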