On-Policy Expert Corrections (OEC) for Robust Multi-Turn LM Agents
Scale Labs has proposed On-Policy Expert Corrections (OEC), a lightweight adaptation of DAgger, to improve the training efficiency and robustness of multi-turn large language model (LLM) agents, particularly in software engineering tasks. OEC addresses covariate shift: a model trained only on expert demonstrations ends up, at deployment, in states that its own mistakes produced and that the training data never covered. The method starts a trajectory with the student model, switches partway through to an expert model that completes it, and then uses only the expert portion for supervised fine-tuning.
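The rollout described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `oec_rollout`, `sft_examples`, `toy_student`, `toy_expert`, and the fixed `switch_turn` are all hypothetical, the policies are toy functions standing in for real models, and environment observations are left out for brevity.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A policy maps the history so far to the next action (hypothetical interface).
Policy = Callable[[List[str]], str]


@dataclass
class Step:
    actor: str                 # "student" or "expert"
    context: Tuple[str, ...]   # history visible when the action was chosen
    action: str


def oec_rollout(student: Policy, expert: Policy, task: str,
                switch_turn: int, max_turns: int = 10) -> List[Step]:
    """Student acts for the first `switch_turn` turns, then the expert finishes."""
    history = [task]
    steps: List[Step] = []
    for turn in range(max_turns):
        if turn < switch_turn:
            actor, policy = "student", student
        else:
            actor, policy = "expert", expert
        action = policy(history)
        steps.append(Step(actor, tuple(history), action))
        history.append(action)
        if action == "submit":  # toy termination signal (assumption)
            break
    return steps


def sft_examples(steps: List[Step]) -> List[Tuple[Tuple[str, ...], str]]:
    """Keep only expert actions as targets; student turns remain in the context."""
    return [(s.context, s.action) for s in steps if s.actor == "expert"]


# Toy stand-ins for real models (assumptions for illustration only).
def toy_student(history: List[str]) -> str:
    return f"student_edit_{len(history)}"


def toy_expert(history: List[str]) -> str:
    return "submit" if len(history) >= 4 else f"expert_fix_{len(history)}"


steps = oec_rollout(toy_student, toy_expert, "fix failing test", switch_turn=2)
examples = sft_examples(steps)
```

Each training example pairs an expert action with a context that includes the student's earlier turns, so the fine-tuned model sees how an expert recovers from states the student actually reaches.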
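OEC offers a practical way to train more robust and efficient LLM agents by combining expert supervision, as in imitation learning, with on-policy data of the kind reinforcement learning relies on. This matters most for long-horizon tasks, where small early mistakes compound over many turns.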