Scale Labs Introduces On-Policy Expert Corrections (OEC) for Robust Multi-Turn LM Agents
Scale Labs unveiled On-Policy Expert Corrections (OEC), a method for training more robust multi-turn language model agents by addressing covariate shift, the mismatch between the states covered by expert demonstrations and the states a student model actually reaches on its own. A lightweight adaptation of DAgger, OEC rolls out a student model partway through a trajectory, then switches to an expert model that inherits the student's history and completes the trajectory. The resulting data is in-distribution for the student yet still supplies expert actions to learn from, combining benefits of imitation learning and reinforcement learning. Scale Labs reports that this significantly improves performance over pure behavioral cloning for software engineering agents.
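The rollout described above can be illustrated with a minimal sketch. It is not Scale Labs' implementation: the `Policy` interface, the `oec_rollout` function, the fixed `switch_turn` parameter, and the choice to record only expert turns as training examples are all assumptions made for illustration, and the environment is reduced to a plain list of actions.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: a policy maps the history so far to the next action.
Policy = Callable[[List[str]], str]
Example = Tuple[List[str], str]  # (history seen by the expert, expert action)


def oec_rollout(
    student: Policy,
    expert: Policy,
    max_turns: int,
    switch_turn: int,
) -> Tuple[List[str], List[Example]]:
    """Roll out the student, then hand the same history to the expert.

    Turns before `switch_turn` are produced by the student (on-policy states).
    From `switch_turn` onward the expert acts, inheriting the student's history.
    Only expert turns are recorded as supervised examples (an assumption here).
    """
    history: List[str] = []
    expert_examples: List[Example] = []
    for turn in range(max_turns):
        if turn < switch_turn:
            action = student(list(history))
        else:
            action = expert(list(history))
            # Snapshot the history the expert saw, including student turns.
            expert_examples.append((list(history), action))
        history.append(action)
    return history, expert_examples


if __name__ == "__main__":
    # Toy policies standing in for real models.
    toy_student: Policy = lambda h: f"student_step_{len(h)}"
    toy_expert: Policy = lambda h: f"expert_step_{len(h)}"
    trajectory, examples = oec_rollout(toy_student, toy_expert, max_turns=5, switch_turn=2)
    print(trajectory)
    print(len(examples))
```

Each recorded example pairs an expert action with a history that begins with the student's own turns, which is what makes the data in-distribution for the student. How the switch point is chosen in practice (for example, sampled at random per trajectory) is not stated here and is left as a parameter.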
OEC offers a practical and efficient way to train more robust multi-turn LLM agents, especially on complex, long-horizon tasks such as software engineering, where covariate shift is a particular concern.