Unlock: Reinforcement Learning from Human Feedback
The RLHF pipeline as math, not folklore: the Bradley-Terry preference model and its sycophancy and intransitivity pathologies, KL-shielded PPO with overoptimization mitigation, the DPO implicit-reward identity and its likelihood-displacement failure mode, online versus offline preference learning, Nash learning from human feedback (NLHF), and the LIMA challenge to whether you need RL at all.
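As a quick anchor for the notation, here is a minimal sketch of two of the identities named above, written in their standard textbook form (the symbols r, β, π_θ, π_ref, and Z(x) follow the usual RLHF/DPO conventions and are not taken from this page):

```latex
% Bradley-Terry preference model: probability that completion y_1 is
% preferred to y_2 for prompt x, under a scalar reward model r.
P(y_1 \succ y_2 \mid x) = \sigma\!\bigl(r(x, y_1) - r(x, y_2)\bigr)

% DPO implicit-reward identity: the optimum of the KL-regularized RLHF
% objective lets the reward be reparameterized through the policy itself,
% up to a prompt-only term \beta \log Z(x) that cancels in pairwise losses.
r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)
```

Substituting the second identity into the first is what turns pairwise preference data into the DPO objective without an explicit reward model.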
Prerequisite topics probed for this unit: Floating-Point Arithmetic; Graph Neural Networks (Advanced); Numerical Linear Algebra (Foundations); Policy Gradient Theorem (Advanced); RLHF and Alignment (Research).