https://arxiv.org/abs/2410.03131
Key Contributions:
- Identification of Single Evaluator Limitations: The authors highlight that relying on a single LLM evaluator for complex tasks such as code generation often leaves errors undetected and yields suboptimal performance.
- Theoretical Framework: They propose that an optimal evaluation policy can be approximated by a linear combination of multiple evaluators, each focusing on a distinct criterion.
- AIME Protocol: AIME uses multiple LLMs, each independently assessing a specific criterion (e.g., correctness, readability, efficiency), and combines their evaluations to guide the optimization process (a minimal sketch follows this list).
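The sketch below illustrates the general idea under stated assumptions; it is not the authors' reference implementation. The function and class names (`Evaluator`, `combined_score`, `optimize`), the 0-10 scoring scale, the weights, and the refinement loop are all illustrative choices. Each evaluator makes an independent LLM call scoped to one criterion, and the per-criterion scores are combined as a weighted linear combination, mirroring the framework described above.

```python
# Hypothetical sketch of an AIME-style multi-evaluator loop (assumed names and
# parameters, not the paper's actual code). Each evaluator is a separate LLM
# call that scores ONE criterion; scores are merged via a linear combination.

from dataclasses import dataclass
from typing import Callable

# `LLM` stands in for any text-completion call (e.g., a chat API wrapper) that
# takes a prompt string and returns the model's text response.
LLM = Callable[[str], str]

@dataclass
class Evaluator:
    criterion: str   # e.g., "correctness", "readability", "efficiency"
    weight: float    # contribution to the combined score (assumed, not from the paper)
    llm: LLM

    def score(self, task: str, code: str) -> float:
        """Ask the LLM to rate the code on this single criterion, 0-10."""
        prompt = (
            f"Evaluate the following solution ONLY for {self.criterion}.\n"
            f"Task:\n{task}\n\nCode:\n{code}\n\n"
            "Reply with a single integer score from 0 (worst) to 10 (best)."
        )
        reply = self.llm(prompt)
        digits = [int(tok) for tok in reply.split() if tok.isdigit()]
        return float(digits[0]) if digits else 0.0


def combined_score(evaluators: list[Evaluator], task: str, code: str) -> float:
    """Weighted linear combination of the independent per-criterion scores."""
    total_weight = sum(e.weight for e in evaluators)
    return sum(e.weight * e.score(task, code) for e in evaluators) / total_weight


def optimize(task: str, generate: LLM, evaluators: list[Evaluator],
             rounds: int = 3, threshold: float = 8.0) -> str:
    """Generate code, score it with all evaluators, and refine until the
    combined score clears the threshold or the round budget is exhausted."""
    code = generate(f"Write a solution for:\n{task}")
    for _ in range(rounds):
        if combined_score(evaluators, task, code) >= threshold:
            break
        code = generate(
            f"Improve this solution for:\n{task}\n\nCurrent code:\n{code}"
        )
    return code
```

The design point this sketch tries to capture is that each criterion gets its own independent evaluation call, rather than asking one model to judge all criteria in a single prompt; the combination step then aggregates those specialized judgments into one optimization signal.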
Empirical Findings:
- Enhanced Error Detection: AIME achieved up to a 62% higher error detection rate than single-LLM evaluation protocols on benchmarks such as LeetCodeHard and HumanEval.
- Improved Success Rates: The protocol demonstrated up to a 16% increase in success rates over single-evaluator methods on code generation tasks.
- Impact of Evaluator Selection: The study found that the number and selection of evaluators significantly affect performance, with success rates varying by up to 12% based on these factors.
Conclusion:
The research suggests that employing multiple specialized LLM evaluators can substantially improve AI system optimization, particularly in complex tasks that require multifaceted evaluation. The findings advocate a shift from single-evaluator to multi-evaluator protocols to achieve more robust and accurate AI system outputs.