Reinforcement Learning from Human Feedback (RLHF)

RLHF is a training methodology where a reinforcement learning agent learns to align its behavior with human preferences by optimizing for a reward signal derived from human judgments. Instead of hand-crafting a reward function, humans provide feedback on the agent's actions or outputs, which then guides the learning process. This is particularly effective for open-ended tasks where objective metrics are difficult to define.
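One common way to turn human judgments into a reward signal is to collect pairwise comparisons ("output A is better than output B") and fit a reward model to them. The sketch below illustrates only that step, using a Bradley-Terry style preference loss on a tiny linear model. The feature vectors, the toy comparison data, and the function names (preference_loss, fit_linear_reward, reward) are all assumptions made for illustration. Real systems typically use a neural network as the reward model and then optimize the agent against it with a reinforcement learning algorithm, which is not shown here.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def preference_loss(r_chosen, r_rejected):
    """Negative log-probability that the human-preferred output wins,
    under a Bradley-Terry model: P(chosen beats rejected) = sigmoid(r_chosen - r_rejected)."""
    return -math.log(sigmoid(r_chosen - r_rejected))


def reward(w, x):
    """Hypothetical linear reward model: reward(x) = w . x"""
    return sum(wi * xi for wi, xi in zip(w, x))


def fit_linear_reward(pairs, n_features, lr=0.1, epochs=200):
    """Fit the weights w by gradient descent on the preference loss.

    pairs is a list of (chosen_features, rejected_features) tuples,
    one per human comparison.
    """
    w = [0.0] * n_features
    for _ in range(epochs):
        for chosen, rejected in pairs:
            diff = [c - r for c, r in zip(chosen, rejected)]
            margin = sum(wi * di for wi, di in zip(w, diff))
            # Gradient of -log(sigmoid(margin)) w.r.t. w is -(1 - sigmoid(margin)) * diff,
            # so a descent step moves w along +diff.
            step = lr * (1.0 - sigmoid(margin))
            w = [wi + step * di for wi, di in zip(w, diff)]
    return w


# Toy data (made up): each output is described by two hypothetical features,
# [politeness, rudeness]. In both comparisons the human preferred the politer output.
pairs = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([0.8, 0.1], [0.2, 0.9]),
]

w = fit_linear_reward(pairs, n_features=2)
print("learned weights:", w)
print("reward(polite):", reward(w, [1.0, 0.0]))
print("reward(rude):  ", reward(w, [0.0, 1.0]))
```

After fitting, the model scores the polite output above the rude one even though no one ever wrote down a rule saying "politeness is good". The preference was inferred from the comparisons alone, which is the sense in which the reward is derived from human judgments rather than hand-crafted.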

In plain terms

It's like teaching a dog tricks, not by following a fixed rule for when to hand out treats, but by showing it which behaviors you like more and which you like less.

Why it matters

RLHF is key to making AI models, especially large language models, more helpful, harmless, and honest according to human values.
