Training Robust Multi-Turn LM Agents with On-Policy Expert Corrections
Scale Labs introduced 'On-Policy Expert Corrections' (OEC), a method for training robust multi-turn language model agents. An adaptation of DAgger, OEC combines the benefits of imitation learning and reinforcement learning: a student model starts a trajectory, then an expert model takes over and completes it, inheriting the student's history. This addresses covariate shift in long-horizon LLM agents, where the student's own mistakes lead it into states that expert-only demonstrations never cover. The problem is especially visible on software engineering tasks, and OEC can be implemented on existing agent scaffolds. The research found that OEC significantly outperforms pure behavioral cloning and that the quality of training trajectories, beyond just verifiable rewards, is crucial for stable training.
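The student-then-expert handoff described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `Policy` type, the `switch_turn` parameter, the `is_done` callback, and the choice to keep only expert turns as supervised targets are all assumptions made for the example.

```python
from typing import Callable, List, Tuple

# Hypothetical types for illustration: a step records who acted and what they did,
# and a policy maps the history so far to the next action.
Step = Tuple[str, str]  # (actor, action)
Policy = Callable[[List[Step]], str]


def rollout_with_expert_correction(
    student: Policy,
    expert: Policy,
    max_turns: int,
    switch_turn: int,
    is_done: Callable[[List[Step]], bool],
) -> List[Step]:
    """Let the student act for the first `switch_turn` turns, then hand the
    same history to the expert, who finishes the trajectory."""
    history: List[Step] = []
    for turn in range(max_turns):
        if turn < switch_turn:
            actor, policy = "student", student
        else:
            actor, policy = "expert", expert
        # The expert sees the student's earlier steps as part of its context.
        history.append((actor, policy(history)))
        if is_done(history):
            break
    return history


def expert_training_targets(history: List[Step]) -> List[Tuple[List[Step], str]]:
    """Assumed labeling scheme: student turns serve only as context, and each
    expert action becomes a (context, target) pair for supervised training."""
    return [
        (history[:i], action)
        for i, (actor, action) in enumerate(history)
        if actor == "expert"
    ]


# Toy usage with stand-in policies that just label their own turn number.
toy_student: Policy = lambda h: f"s{len(h)}"
toy_expert: Policy = lambda h: f"e{len(h)}"
toy_history = rollout_with_expert_correction(
    toy_student, toy_expert, max_turns=6, switch_turn=2,
    is_done=lambda h: len(h) >= 4,
)
toy_targets = expert_training_targets(toy_history)
```

Because the expert's targets are conditioned on states the student actually reached, the resulting pairs are closer to the student's own distribution than expert-only demonstrations would be. In practice the switch point would likely be varied across rollouts rather than fixed.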
This makes OEC a practical way to improve the efficiency and robustness of multi-turn LLM agents, particularly in complex domains like software engineering: it supplies more in-distribution training data and so mitigates covariate shift.