https://arxiv.org/abs/2410.03131
Key Contributions:
- Identification of Single Evaluator Limitations: The authors highlight that relying on a single LLM evaluator for complex tasks such as code generation often leaves errors undetected and yields suboptimal performance.
- Theoretical Framework: They propose that an optimal evaluation policy can be approximated by a linear combination of multiple evaluators, each focusing on a distinct criterion.
- AIME Protocol: AIME uses multiple LLMs, each independently assessing a specific criterion (e.g., correctness, readability, efficiency), and combines their evaluations to guide the optimization process (a minimal sketch follows this list).
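The sketch below illustrates the general idea under stated assumptions; it is not the authors' reference implementation. The function and class names (`Evaluator`, `combined_score`, `optimize`), the 0-10 scoring scale, the weights, and the refinement loop are all illustrative choices. Each evaluator makes an independent LLM call scoped to one criterion, and the per-criterion scores are combined as a weighted linear combination, mirroring the framework described above.

```python
# Hypothetical sketch of an AIME-style multi-evaluator loop (assumed names and
# parameters, not the paper's actual code). Each evaluator is a separate LLM
# call that scores ONE criterion; scores are merged via a linear combination.

from dataclasses import dataclass
from typing import Callable

# `LLM` stands in for any text-completion call (e.g., a chat API wrapper) that
# takes a prompt string and returns the model's text response.
LLM = Callable[[str], str]

@dataclass
class Evaluator:
    criterion: str   # e.g., "correctness", "readability", "efficiency"
    weight: float    # contribution to the combined score (assumed, not from the paper)
    llm: LLM

    def score(self, task: str, code: str) -> float:
        """Ask the LLM to rate the code on this single criterion, 0-10."""
        prompt = (
            f"Evaluate the following solution ONLY for {self.criterion}.\n"
            f"Task:\n{task}\n\nCode:\n{code}\n\n"
            "Reply with a single integer score from 0 (worst) to 10 (best)."
        )
        reply = self.llm(prompt)
        digits = [int(tok) for tok in reply.split() if tok.isdigit()]
        return float(digits[0]) if digits else 0.0


def combined_score(evaluators: list[Evaluator], task: str, code: str) -> float:
    """Weighted linear combination of the independent per-criterion scores."""
    total_weight = sum(e.weight for e in evaluators)
    return sum(e.weight * e.score(task, code) for e in evaluators) / total_weight


def optimize(task: str, generate: LLM, evaluators: list[Evaluator],
             rounds: int = 3, threshold: float = 8.0) -> str:
    """Generate code, score it with all evaluators, and refine until the
    combined score clears the threshold or the round budget is exhausted."""
    code = generate(f"Write a solution for:\n{task}")
    for _ in range(rounds):
        if combined_score(evaluators, task, code) >= threshold:
            break
        code = generate(
            f"Improve this solution for:\n{task}\n\nCurrent code:\n{code}"
        )
    return code
```

The design point this sketch tries to capture is that each criterion gets its own independent evaluation call, rather than asking one model to judge all criteria in a single prompt; the combination step then aggregates those specialized judgments into one optimization signal.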
Empirical Findings:
- Enhanced Error Detection: AIME achieved up to a 62% higher error detection rate than single-LLM evaluation protocols on benchmarks such as LeetCodeHard and HumanEval.
- Improved Success Rates: The protocol demonstrated up to a 16% increase in success rates over single-evaluator methods on code generation tasks.
- Impact of Evaluator Selection: The study found that the number and selection of evaluators significantly affect performance, with success rates varying by up to 12% based on these factors.
Conclusion:
The research suggests that employing multiple specialized LLM evaluators can substantially improve AI system optimization, particularly in complex tasks that require multifaceted evaluation. The findings advocate a shift from single-evaluator to multi-evaluator protocols to achieve more robust and accurate AI system outputs.