https://arxiv.org/pdf/2404.18796
May 2024
The paper "Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models" addresses the challenges in assessing the quality of outputs from large language models (LLMs). Traditional evaluation methods, such as BLEU and ROUGE scores, often fall short in capturing the nuances of generative tasks. To overcome these limitations, the authors propose the use of a Panel of LLM evaluators (PoLL) drawn from different model families, rather than relying on a single large judge like GPT-4.
Key Contributions:
- Panel of LLM Evaluators (PoLL): Introducing PoLL, a panel of multiple smaller models drawn from diverse model families that evaluates LLM outputs collectively. This aims to reduce intra-model bias and improve correlation with human judgments (a minimal aggregation sketch follows this list).
- Cost Efficiency: Demonstrating that PoLL is over seven times less expensive than using a single large model like GPT-4 for evaluations, making it a more accessible and scalable solution.
- Improved Correlation with Human Judgments: Showing that PoLL correlates better with human evaluations compared to a single large judge, enhancing the reliability of automated assessments.
- Reduced Intra-Model Bias: Highlighting that pooling judgments from a heterogeneous panel of models mitigates biases inherent in individual models, leading to more objective evaluations.
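The core mechanism is aggregating the panel members' individual verdicts into one score. Below is a minimal sketch of that idea for binary QA correctness judgments, using simple majority voting; the judge wrappers and stub models are hypothetical placeholders, not the authors' exact prompts or voting functions.

```python
from collections import Counter
from typing import Callable, List

# Hypothetical judge wrappers: each takes (question, reference, candidate_answer)
# and returns 1 if the candidate is judged correct, 0 otherwise. In practice each
# would call a different model family's API with a judge prompt.
JudgeFn = Callable[[str, str, str], int]

def poll_verdict(judges: List[JudgeFn], question: str, reference: str, answer: str) -> int:
    """Aggregate binary verdicts from a panel of judges by majority vote."""
    votes = [judge(question, reference, answer) for judge in judges]
    # The most common verdict wins.
    return Counter(votes).most_common(1)[0][0]

# Example with stub judges standing in for three different model families.
judges = [
    lambda q, ref, ans: 1,  # stub for judge model A
    lambda q, ref, ans: 1,  # stub for judge model B
    lambda q, ref, ans: 0,  # stub for judge model C
]
print(poll_verdict(judges, "Who wrote Hamlet?", "William Shakespeare", "Shakespeare"))  # -> 1
```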
Methodology:
- The authors conducted experiments across three settings: single-hop question answering (QA), multi-hop QA, and Chatbot Arena, spanning six datasets. They compared the performance of PoLL against a single large judge (GPT-4) in terms of correlation with human judgments and cost efficiency.
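One way to quantify "correlation with human judgments" for binary QA verdicts is an agreement statistic such as Cohen's kappa between a judge's labels and human labels. The sketch below uses scikit-learn with made-up labels purely for illustration; the authors' actual correlation measures and data are reported in the paper.

```python
from sklearn.metrics import cohen_kappa_score

# Illustrative, made-up binary verdicts (1 = answer judged correct, 0 = incorrect).
human_labels = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
single_judge = [1, 1, 1, 0, 0, 1, 0, 1, 1, 1]   # hypothetical single large judge
panel_votes  = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]   # hypothetical PoLL majority votes

print("single judge kappa:", cohen_kappa_score(human_labels, single_judge))
print("panel kappa:       ", cohen_kappa_score(human_labels, panel_votes))
```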
Findings:
- Across the QA and Chatbot Arena settings, PoLL correlates more strongly with human judgments than a single large judge. Because its members come from disjoint model families, it also exhibits less intra-model bias, and it costs over seven times less than evaluating with GPT-4 alone.
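To make the cost claim concrete, here is a back-of-the-envelope calculation with purely hypothetical per-token prices and token counts; the actual figures behind the paper's "over seven times less expensive" result are in the paper, not reproduced here.

```python
# Hypothetical per-1M-token prices; NOT the actual prices used in the paper.
PRICE_PER_M_TOKENS = {
    "large_judge": 30.0,    # e.g. a GPT-4-class model
    "small_judge_a": 1.0,   # three smaller panel members
    "small_judge_b": 1.5,
    "small_judge_c": 1.0,
}

tokens_per_eval = 1_000   # hypothetical prompt + completion tokens per judgment
n_evals = 10_000          # hypothetical number of items to evaluate
total_tokens = tokens_per_eval * n_evals

large_cost = total_tokens / 1e6 * PRICE_PER_M_TOKENS["large_judge"]
panel_cost = sum(
    total_tokens / 1e6 * PRICE_PER_M_TOKENS[k]
    for k in ("small_judge_a", "small_judge_b", "small_judge_c")
)
print(f"single large judge: ${large_cost:.2f}")
print(f"three-model panel:  ${panel_cost:.2f}  ({large_cost / panel_cost:.1f}x cheaper)")
```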
Implications:
The study suggests that employing a diverse panel of smaller LLMs for evaluation purposes can enhance the accuracy, objectivity, and cost-effectiveness of assessing LLM outputs. This approach offers a scalable alternative to relying solely on large, expensive models and addresses concerns related to intra-model bias in evaluations.