Concept Whitening

Concept whitening is an interpretability technique for neural networks that disentangles the internal representations (features, or 'concepts') in a layer's latent space. It first whitens the layer's activations, transforming them so that the features are decorrelated and have unit variance, and then rotates the whitened space so that individual axes line up with meaningful concepts of interest. Whitening removes linear correlation between features; it does not by itself make them statistically independent. Because the aligned axes are orthogonal to one another, it becomes easier to identify which high-level concept each direction captures, which makes the internal workings of the model more transparent.
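The two steps described above, whitening and then aligning an axis with a concept, can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the published training procedure: the activations are synthetic, the "concept" is a hypothetical subset of samples, and the rotation is built in closed form with a QR decomposition rather than learned. The names `zca_whiten`, `rotation_to_concept`, and `concept_mask` are made up for this example.

```python
import numpy as np


def zca_whiten(features, eps=1e-8):
    """Center the rows of `features` and decorrelate them to unit variance."""
    centered = features - features.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / centered.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)
    # ZCA whitening matrix: cov^(-1/2), with eps guarding tiny eigenvalues.
    whitening = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return centered @ whitening


def rotation_to_concept(direction, rng):
    """Orthogonal matrix whose first column is the normalized `direction`."""
    unit = direction / np.linalg.norm(direction)
    filler = rng.normal(size=(unit.shape[0], unit.shape[0] - 1))
    q, _ = np.linalg.qr(np.column_stack([unit, filler]))
    if q[:, 0] @ unit < 0:  # QR may flip the sign of the first column
        q[:, 0] *= -1.0
    return q


rng = np.random.default_rng(0)

# Synthetic, correlated "activations" (assumption: 2000 samples, 4 features).
mixing = np.array([[2.0, 1.0, 0.0, 0.0],
                   [0.0, 1.0, 1.0, 0.0],
                   [0.0, 0.0, 1.5, 0.5],
                   [0.5, 0.0, 0.0, 1.0]])
activations = rng.normal(size=(2000, 4)) @ mixing

# Step 1: whiten. The covariance becomes (approximately) the identity.
whitened = zca_whiten(activations)
whitened_cov = whitened.T @ whitened / whitened.shape[0]

# Step 2: rotate so axis 0 points at the mean of the concept examples.
# Hypothetical concept: samples whose first raw activation is large.
concept_mask = activations[:, 0] > 1.0
concept_mean = whitened[concept_mask].mean(axis=0)
rotation = rotation_to_concept(concept_mean, rng)
rotated = whitened @ rotation

# The rotation is orthogonal, so the features stay whitened.
rotated_cov = rotated.T @ rotated / rotated.shape[0]
concept_mean_rotated = rotated[concept_mask].mean(axis=0)
```

After the rotation, the concept examples' mean lies entirely along axis 0 while the covariance is still the identity, so that single axis can be read as "how much of this concept is present" without interference from the others.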

In plain terms

Imagine a palette of muddied, mixed paints. Concept whitening separates them back into distinct, pure colors, each one clearly identifiable.

Why it matters

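When each axis of a layer's latent space corresponds to one concept, you can inspect a single axis to see how strongly the model responds to that concept, instead of untangling many correlated neurons. That makes the concepts inside a complex AI model more distinct and easier to analyze.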
