https://arxiv.org/pdf/2404.18796

May 2024

The paper "Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models" addresses the challenges in assessing the quality of outputs from large language models (LLMs). Traditional evaluation methods, such as BLEU and ROUGE scores, often fall short in capturing the nuances of generative tasks. To overcome these limitations, the authors propose using a Panel of LLM evaluators (PoLL) drawn from multiple model families, whose individual judgments are aggregated by voting, rather than relying on a single large judge model such as GPT-4.
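
To make the panel-plus-voting setup concrete, here is a minimal Python sketch. The model names, the yes/no judge prompt, and the `query_judge` wrapper are illustrative assumptions rather than the paper's exact prompts or APIs; only the overall pattern (several smaller judges from different model families, aggregated by voting) reflects the paper.

```python
from collections import Counter

# Illustrative judge models standing in for the three model families the
# paper draws from; swap in whatever judge models and API wrappers you use.
PANEL = ["command-r", "gpt-3.5-turbo", "claude-3-haiku"]

JUDGE_TEMPLATE = (
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Candidate answer: {candidate}\n"
    "Is the candidate answer correct? Reply with exactly 'yes' or 'no'."
)


def query_judge(model: str, prompt: str) -> str:
    """Placeholder for a call to a judge model's API; should return 'yes' or 'no'."""
    raise NotImplementedError("wire this up to your LLM provider of choice")


def poll_verdict(question: str, reference: str, candidate: str) -> str:
    """Ask every panel member for a verdict and aggregate by majority vote."""
    prompt = JUDGE_TEMPLATE.format(
        question=question, reference=reference, candidate=candidate
    )
    votes = [query_judge(model, prompt).strip().lower() for model in PANEL]
    # Max (majority) voting: the most common verdict wins; with three judges
    # and binary verdicts there is always a strict majority.
    return Counter(votes).most_common(1)[0][0]
```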

Key Contributions:

- Proposes the Panel of LLM evaluators (PoLL): several smaller judge models drawn from disjoint model families whose individual verdicts are pooled by a voting function, used in place of a single large judge.
- Demonstrates that this panel setup improves agreement with human judgments while lowering evaluation cost and reducing the bias a single judge introduces.

Methodology:

- A panel of three judges from three different model families (in the paper's main setup, Command R, GPT-3.5, and Claude 3 Haiku) scores each model output independently.
- Individual verdicts are aggregated by voting: max (majority) voting for QA-style correctness judgments and average pooling for pairwise Chatbot Arena comparisons.
- Judge quality is measured as agreement with human annotations across three evaluation settings (single-hop QA, multi-hop QA, and Chatbot Arena) spanning six datasets.

Findings:

- The PoLL correlates more strongly with human judgments than a single GPT-4 judge across the evaluated settings, as measured by Cohen's kappa (sketched below).
- A single large judge exhibits intra-model bias, scoring outputs from its own model family more favorably; pooling judges from different families dampens this effect.
- The panel is also substantially cheaper to run, more than seven times less expensive than using GPT-4 as the judge.

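Since the findings are stated in terms of Cohen's kappa, the short sketch below shows how that agreement score is computed between a judge's verdicts and human labels. The label lists are made-up toy data, not figures from the paper.

```python
from collections import Counter


def cohen_kappa(judge_labels, human_labels):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    assert len(judge_labels) == len(human_labels)
    n = len(judge_labels)
    # Observed agreement: fraction of items where the two raters match.
    p_o = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    # Expected agreement: probability of matching by chance, given each
    # rater's own label frequencies.
    judge_freq = Counter(judge_labels)
    human_freq = Counter(human_labels)
    p_e = sum(
        (judge_freq[label] / n) * (human_freq[label] / n)
        for label in set(judge_labels) | set(human_labels)
    )
    return (p_o - p_e) / (1 - p_e)


# Toy example: a judge agrees with human annotators on 8 of 10 items.
judge = ["yes", "yes", "no", "yes", "no", "no", "yes", "yes", "no", "yes"]
human = ["yes", "no", "no", "yes", "no", "yes", "yes", "yes", "no", "yes"]
print(round(cohen_kappa(judge, human), 3))
```
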
Implications:

The study suggests that employing a diverse panel of smaller LLMs for evaluation purposes can enhance the accuracy, objectivity, and cost-effectiveness of assessing LLM outputs. This approach offers a scalable alternative to relying solely on large, expensive models and addresses concerns related to intra-model bias in evaluations.