https://arxiv.org/abs/2306.05685
24th December 2023
Key Highlights
- The paper addresses the challenge of evaluating large language model (LLM) based chat assistants, particularly capturing the human preferences that traditional benchmarks may miss.
Proposed Benchmarks
- MT-Bench:
  - Comprises 80 high-quality multi-turn questions.
  - Evaluates a chatbot's multi-turn conversational and instruction-following ability across eight categories:
    - Writing
    - Roleplay
    - Information extraction
    - Reasoning
    - Math
    - Coding
    - Knowledge I (STEM)
    - Knowledge II (humanities/social science)
- Chatbot Arena:
  - A crowdsourced platform for head-to-head chatbot comparisons.
  - Users pose the same question to two anonymous chatbots and vote for the preferred response.
  - Captures real-world user preferences, which are aggregated into Elo ratings (see the sketch below).
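Chatbot Arena aggregates these pairwise votes into Elo ratings to rank models. Below is a minimal sketch of that aggregation using a simple online Elo update; the battle log, starting rating, and K-factor are illustrative assumptions, not the paper's exact computation.

```python
from collections import defaultdict

def update_elo(ratings, model_a, model_b, winner, k=32, base=10, scale=400):
    """Apply one online Elo update for a single pairwise battle.

    winner is "model_a", "model_b", or "tie".
    """
    ra, rb = ratings[model_a], ratings[model_b]
    # Expected score of model_a against model_b under the Elo model.
    expected_a = 1 / (1 + base ** ((rb - ra) / scale))
    score_a = {"model_a": 1.0, "model_b": 0.0, "tie": 0.5}[winner]
    ratings[model_a] = ra + k * (score_a - expected_a)
    ratings[model_b] = rb + k * ((1 - score_a) - (1 - expected_a))

# Illustrative battle log in the style of Chatbot Arena votes.
battles = [
    ("vicuna-13b", "alpaca-13b", "model_a"),
    ("gpt-4", "vicuna-13b", "model_a"),
    ("gpt-3.5-turbo", "gpt-4", "tie"),
]

ratings = defaultdict(lambda: 1000.0)  # every model starts at the same rating
for model_a, model_b, winner in battles:
    update_elo(ratings, model_a, model_b, winner)

print(dict(ratings))
```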
Insights on LLM-as-a-Judge
- Strong LLMs (e.g., GPT-4) are proposed as judges to assess other models on open-ended questions, via pairwise comparison or single-answer grading (a pairwise sketch follows this list).
- Identified biases and limitations of the LLM-as-a-judge method:
  - Position bias: a tendency to favor answers presented in a particular position (e.g., the first answer shown).
  - Verbosity bias: a preference for longer responses, even when they are not clearer or more accurate.
  - Self-enhancement bias: a judge favoring answers generated by itself.
  - Limited reasoning ability: difficulty grading math and reasoning questions reliably.
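In the paper's pairwise-comparison judging mode, the judge model receives a question plus two candidate answers and returns a verdict. The sketch below shows that flow under simplifying assumptions: the prompt wording is condensed from the style of the paper's judge prompts, and `query_judge` is a hypothetical callable standing in for whatever client sends a prompt to the judge model (e.g., GPT-4) and returns its text reply.

```python
import re

def build_pairwise_prompt(question: str, answer_a: str, answer_b: str) -> str:
    """Assemble a pairwise judge prompt (condensed from the paper's style)."""
    return (
        "You are an impartial judge. Compare the two AI assistant answers to the "
        "user question below. Do not let answer position or length influence you.\n\n"
        f"[Question]\n{question}\n\n"
        f"[Answer A]\n{answer_a}\n\n"
        f"[Answer B]\n{answer_b}\n\n"
        'Output your verdict as "[[A]]", "[[B]]", or "[[C]]" for a tie.'
    )

def parse_verdict(judge_output: str) -> str:
    """Map the judge's free-form reply to a discrete label."""
    match = re.search(r"\[\[(A|B|C)\]\]", judge_output)
    labels = {"A": "answer_a", "B": "answer_b", "C": "tie"}
    return labels.get(match.group(1), "unparsed") if match else "unparsed"

def judge_pair(question, answer_a, answer_b, query_judge):
    """query_judge: hypothetical callable that sends a prompt to the judge
    model and returns its text reply."""
    prompt = build_pairwise_prompt(question, answer_a, answer_b)
    return parse_verdict(query_judge(prompt))
```

Instructing the judge to ignore answer position and length mirrors the prompt-level countermeasures the paper discusses for position and verbosity bias.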
Bias Mitigation
- Proposed mitigations include swapping answer positions and counting only verdicts that are consistent across both orders, few-shot judging, and chain-of-thought or reference-guided judging for math and reasoning questions (a position-swap sketch follows below).
- Results show that strong LLM judges such as GPT-4 agree with human preferences more than 80% of the time, the same level of agreement as between humans.
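One of the paper's mitigations for position bias is to run each comparison twice with the answer order swapped and to count a win only when the two verdicts agree. A minimal sketch of that consistency rule; `pairwise_judge` is any callable with the contract described in the docstring (for instance, a thin wrapper around the hypothetical `judge_pair` above).

```python
def judge_with_swap(question, answer_a, answer_b, pairwise_judge):
    """pairwise_judge: any callable taking (question, answer_a, answer_b) and
    returning 'answer_a', 'answer_b', or 'tie' for that answer ordering."""
    first = pairwise_judge(question, answer_a, answer_b)
    second = pairwise_judge(question, answer_b, answer_a)  # swapped order

    # The second call saw the answers swapped, so map its verdict back.
    swapped_back = {"answer_a": "answer_b", "answer_b": "answer_a", "tie": "tie"}
    second = swapped_back.get(second, "tie")

    # Only a verdict that survives the position swap counts as a win.
    return first if first == second else "tie"
```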
Hybrid Evaluation Framework
- Combines:
  - Traditional capability-based benchmarks (e.g., MMLU).
  - Preference-based benchmarks using LLM-as-a-judge (e.g., MT-Bench scores from single-answer grading, sketched below).
- Enables:
  - Swift and automated evaluations.
  - Assessment of both technical capabilities and alignment with human preferences.
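The preference-based half of this framework can also run in the paper's single-answer grading mode, where the judge assigns each response a score from 1 to 10 that can sit next to traditional benchmark numbers. A minimal sketch under the same assumptions as before: the prompt wording is condensed, and `query_judge` is a hypothetical stand-in for the judge-model client.

```python
import re

def build_grading_prompt(question: str, answer: str) -> str:
    """Single-answer grading prompt (wording condensed from the paper's style)."""
    return (
        "You are an impartial judge. Evaluate the AI assistant's answer to the "
        "user question below for helpfulness, relevance, accuracy, depth, and detail.\n\n"
        f"[Question]\n{question}\n\n[Answer]\n{answer}\n\n"
        'Give a single rating from 1 to 10 in the form "Rating: [[7]]".'
    )

def grade_answer(question, answer, query_judge):
    """Return the judge's 1-10 score, or None if the reply cannot be parsed."""
    reply = query_judge(build_grading_prompt(question, answer))
    match = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", reply)
    return float(match.group(1)) if match else None

def combined_report(capability_scores, mt_bench_scores):
    """Place traditional benchmark scores next to judge-based MT-Bench scores."""
    models = set(capability_scores) | set(mt_bench_scores)
    return {m: {"capability": capability_scores.get(m),
                "mt_bench": mt_bench_scores.get(m)} for m in models}
```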
Public Resources
- Released resources include:
  - The 80 MT-Bench questions.
  - 3,000 expert-level human votes on MT-Bench answers.
  - 30,000 Chatbot Arena conversations with human preference votes.
- All datasets are made publicly available for research (a loading sketch follows below).
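A sketch of loading the released data with the Hugging Face `datasets` library. The dataset IDs below are assumptions based on the LMSYS releases on the Hugging Face Hub; check the paper's repository for the canonical links, and note that access to the Arena conversations may require accepting terms on the Hub.

```python
from datasets import load_dataset

# Assumed Hugging Face dataset IDs for the released resources; verify them
# against the paper's repository before relying on this snippet.
expert_votes = load_dataset("lmsys/mt_bench_human_judgments")
arena_conversations = load_dataset("lmsys/chatbot_arena_conversations")

print(expert_votes)
print(arena_conversations)
```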
Conclusion
- The paper demonstrates that LLM-as-a-judge is a scalable, explainable, and effective method for approximating human preferences.