Reinforcement Learning from Human Feedback (RLHF)

RLHF is a training technique that aligns AI models, particularly large language models, with human preferences and values. After an initial training phase, human raters judge the quality or desirability of the model's outputs, often by ranking several responses to the same prompt. These judgments are used to train a reward model, which learns to score outputs the way the raters would. A reinforcement learning step then fine-tunes the original model toward outputs the reward model scores highly, making them safer, more helpful, and closer to human expectations.
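To make the reward-model step concrete, here is a minimal sketch of learning from ranked pairs. It assumes a pairwise logistic (Bradley-Terry style) loss, a common choice that this card does not itself specify, plus a toy linear reward model over two made-up features. The feature names, data, and function names are all hypothetical, and the later reinforcement learning step is not shown.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def reward(weights, features):
    """Toy linear reward model: a weighted sum of response features."""
    return sum(w * f for w, f in zip(weights, features))

def pairwise_loss(weights, chosen, rejected):
    """Assumed loss: -log sigmoid(r(chosen) - r(rejected)).
    It is small when the preferred response gets the higher reward."""
    margin = reward(weights, chosen) - reward(weights, rejected)
    return -math.log(sigmoid(margin))

def train_reward_model(pairs, dim, lr=0.5, epochs=200):
    """Plain gradient descent on the pairwise loss."""
    weights = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = reward(weights, chosen) - reward(weights, rejected)
            # Derivative of the loss with respect to the margin.
            grad_scale = -(1.0 - sigmoid(margin))
            for i in range(dim):
                weights[i] -= lr * grad_scale * (chosen[i] - rejected[i])
    return weights

# Hypothetical features per response: [helpfulness, rudeness].
# Each pair is (response the rater preferred, response the rater rejected).
pairs = [
    ([0.9, 0.1], [0.2, 0.8]),
    ([0.7, 0.0], [0.6, 0.9]),
    ([0.8, 0.2], [0.1, 0.3]),
]

weights = train_reward_model(pairs, dim=2)
for chosen, rejected in pairs:
    print(reward(weights, chosen) > reward(weights, rejected))
```

In a real system the reward model is usually a neural network that reads the prompt and response text directly, but the idea is the same: it is trained so that responses people preferred score higher than the ones they rejected.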

In plain terms

Imagine teaching a dog tricks, not just by showing it what to do, but by praising it when it performs the trick well and correcting it when it doesn't, allowing the dog to learn preferred behaviors.

Why it matters

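Qualities like helpfulness, tone, and safety are hard to write down as explicit rules but comparatively easy for people to judge. RLHF lets a model learn them from those judgments, which is why it is central to building AI systems that are not just capable but also safe, helpful, and in line with human nuance and societal norms.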
