https://arxiv.org/abs/2307.02762
2024
The paper "PRD: Peer Rank and Discussion Improve Large Language Model based Evaluations" addresses challenges in evaluating responses generated by large language models (LLMs). Traditional methods often rely on a single, state-of-the-art LLM to judge the quality of responses, which can introduce biases such as self-enhancement and positional bias.
To mitigate these issues, the authors propose the Peer Rank and Discussion (PRD) framework, inspired by peer evaluation techniques in educational psychology. This framework comprises two main components:
Peer Rank (PR): This algorithm aggregates each peer LLM's pairwise preferences over all answer pairs into a final ranking of models. By pooling the judgments of multiple peer reviewers rather than relying on a single judge, PR aims to reduce individual biases and produce a more accurate evaluation (a simplified sketch of the aggregation step appears after these descriptions).
Peer Discussion (PD): In this approach, two reviewer LLMs engage in a multi-turn discussion to reach a consensus on which of two answers is preferable. This interactive process seeks to provide a more nuanced, mutually agreed-upon evaluation by leveraging the collective reasoning of multiple models (a sketch of the discussion loop also follows below).
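To make the Peer Rank idea concrete, here is a minimal, illustrative Python sketch. It assumes a simple weighted win-rate aggregation in which each reviewer's judgments are weighted by its own standing as a contestant, updated over a few iterations; the paper's exact scoring and weighting scheme may differ, and the model names and the `peer_rank` helper are hypothetical.

```python
from collections import defaultdict

def peer_rank(judgments, num_rounds=5):
    """Aggregate pairwise preferences from multiple reviewer models into a ranking.

    `judgments` is a list of tuples (reviewer, contestant_a, contestant_b, winner),
    where `winner` is contestant_a, contestant_b, or None for a tie. Reviewer
    weights start uniform and are re-estimated each round from the reviewer's own
    score as a contestant (a simplified take on letting stronger models count
    more as judges).
    """
    contestants = sorted({c for _, a, b, _ in judgments for c in (a, b)})
    reviewers = sorted({r for r, _, _, _ in judgments})
    weights = {r: 1.0 / len(reviewers) for r in reviewers}

    scores = {}
    for _ in range(num_rounds):
        # Weighted win rate for each contestant across all battles and reviewers.
        wins, totals = defaultdict(float), defaultdict(float)
        for reviewer, a, b, winner in judgments:
            w = weights[reviewer]
            for c in (a, b):
                totals[c] += w
            if winner is None:        # tie: split the credit
                wins[a] += w / 2
                wins[b] += w / 2
            else:
                wins[winner] += w
        scores = {c: wins[c] / totals[c] for c in contestants if totals[c] > 0}

        # Re-weight reviewers by their own (normalized) contestant score,
        # so weaker judges are down-weighted in the next round.
        total = sum(scores.get(r, 0.0) for r in reviewers) or 1.0
        weights = {r: scores.get(r, 0.0) / total for r in reviewers}

    return sorted(scores, key=scores.get, reverse=True)


# Toy usage with hypothetical model names and judgments.
judgments = [
    ("gpt-4", "gpt-4", "claude", "gpt-4"),
    ("claude", "gpt-4", "claude", "gpt-4"),
    ("vicuna", "claude", "vicuna", "claude"),
    ("gpt-4", "claude", "vicuna", "claude"),
]
print(peer_rank(judgments))  # ['gpt-4', 'claude', 'vicuna']
```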
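Similarly, a rough sketch of the Peer Discussion loop is shown below. The `chat(model, prompt)` callable is a placeholder for whatever completion API is actually used, and the prompt wording, turn limit, preference parsing, and stopping rule are illustrative assumptions rather than the paper's exact protocol.

```python
def peer_discussion(question, answer_1, answer_2, reviewer_a, reviewer_b,
                    chat, max_turns=3):
    """Two reviewer models discuss which answer is better until they agree.

    `chat(model, prompt)` should return the model's reply as a string. Each
    turn, one reviewer sees the question, both answers, and the discussion so
    far, then states a preference ("1" or "2") with a short justification.
    """
    history = []
    preferences = {}
    reviewers = [reviewer_a, reviewer_b]

    for turn in range(max_turns):
        reviewer = reviewers[turn % 2]
        prompt = (
            f"Question: {question}\n"
            f"Answer 1: {answer_1}\nAnswer 2: {answer_2}\n"
            "Discussion so far:\n" + "\n".join(history) + "\n"
            "Considering the points raised above, which answer is better? "
            "Reply with your reasoning, ending with 'Preference: 1' or 'Preference: 2'."
        )
        reply = chat(reviewer, prompt)
        history.append(f"{reviewer}: {reply}")
        # Naive parsing of the stated preference; a real system would be stricter.
        preferences[reviewer] = "1" if "Preference: 1" in reply else "2"

        # Stop early once both reviewers have spoken and agree.
        if len(preferences) == 2 and len(set(preferences.values())) == 1:
            break

    return preferences, history


# Example with a stubbed chat function (a real implementation would call an LLM API).
stub = lambda model, prompt: "Answer 1 is more complete. Preference: 1"
prefs, transcript = peer_discussion(
    "What causes tides?", "The Moon's gravity...", "Ocean currents...",
    "reviewer-a", "reviewer-b", chat=stub,
)
print(prefs)  # {'reviewer-a': '1', 'reviewer-b': '1'}
```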
The authors conducted experiments on two benchmark datasets to assess the effectiveness of the PRD framework. The results indicate that both PR and PD achieve higher accuracy and closer alignment with human judgments than single-judge evaluation baselines. Notably, the Peer Rank algorithm produced a relatively accurate self-ranking of models even when model identities were anonymized.
This research contributes to the field by introducing a novel evaluation framework that leverages peer interactions among LLMs, offering a promising direction for more reliable and unbiased assessments of language model outputs.