https://arxiv.org/abs/2306.05685
24th December 2023
Key Highlights
- The paper addresses the challenge of evaluating large language model (LLM) based chat assistants, particularly capturing the human preferences that traditional benchmarks may miss.
Proposed Benchmarks
- MT-Bench:
  - Comprises 80 high-quality multi-turn questions.
  - Evaluates a chatbot's multi-turn conversational and instruction-following ability across eight categories:
    - Writing
    - Roleplay
    - Information extraction
    - Reasoning
    - Math
    - Coding
    - Knowledge I (STEM)
    - Knowledge II (humanities/social science)
- Chatbot Arena:
  - A crowdsourced platform for head-to-head chatbot comparisons.
  - Users pose the same question to two anonymous chatbots and vote for the preferred response.
  - Captures real-world user preferences, which are aggregated into Elo ratings (see the sketch below).
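Chatbot Arena aggregates these pairwise votes into Elo ratings to rank models. Below is a minimal sketch of that aggregation using a simple online Elo update; the battle log, starting rating, and K-factor are illustrative assumptions, not the paper's exact computation.

```python
from collections import defaultdict

def update_elo(ratings, model_a, model_b, winner, k=32, base=10, scale=400):
    """Apply one online Elo update for a single pairwise battle.

    winner is "model_a", "model_b", or "tie".
    """
    ra, rb = ratings[model_a], ratings[model_b]
    # Expected score of model_a against model_b under the Elo model.
    expected_a = 1 / (1 + base ** ((rb - ra) / scale))
    score_a = {"model_a": 1.0, "model_b": 0.0, "tie": 0.5}[winner]
    ratings[model_a] = ra + k * (score_a - expected_a)
    ratings[model_b] = rb + k * ((1 - score_a) - (1 - expected_a))

# Illustrative battle log in the style of Chatbot Arena votes.
battles = [
    ("vicuna-13b", "alpaca-13b", "model_a"),
    ("gpt-4", "vicuna-13b", "model_a"),
    ("gpt-3.5-turbo", "gpt-4", "tie"),
]

ratings = defaultdict(lambda: 1000.0)  # every model starts at the same rating
for model_a, model_b, winner in battles:
    update_elo(ratings, model_a, model_b, winner)

print(dict(ratings))
```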
Insights on LLM-as-a-Judge
- Strong LLMs (e.g., GPT-4) are proposed as judges to assess other models on open-ended questions, via pairwise comparison or single-answer grading (a pairwise sketch follows this list).
- Identified biases and limitations of the LLM-as-a-judge method:
  - Position bias: a tendency to favor answers presented in a particular position (e.g., the first answer shown).
  - Verbosity bias: a preference for longer responses, even when they are not clearer or more accurate.
  - Self-enhancement bias: a judge favoring answers generated by itself.
  - Limited reasoning ability: difficulty grading math and reasoning questions reliably.
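In the paper's pairwise-comparison judging mode, the judge model receives a question plus two candidate answers and returns a verdict. The sketch below shows that flow under simplifying assumptions: the prompt wording is condensed from the style of the paper's judge prompts, and `query_judge` is a hypothetical callable standing in for whatever client sends a prompt to the judge model (e.g., GPT-4) and returns its text reply.

```python
import re

def build_pairwise_prompt(question: str, answer_a: str, answer_b: str) -> str:
    """Assemble a pairwise judge prompt (condensed from the paper's style)."""
    return (
        "You are an impartial judge. Compare the two AI assistant answers to the "
        "user question below. Do not let answer position or length influence you.\n\n"
        f"[Question]\n{question}\n\n"
        f"[Answer A]\n{answer_a}\n\n"
        f"[Answer B]\n{answer_b}\n\n"
        'Output your verdict as "[[A]]", "[[B]]", or "[[C]]" for a tie.'
    )

def parse_verdict(judge_output: str) -> str:
    """Map the judge's free-form reply to a discrete label."""
    match = re.search(r"\[\[(A|B|C)\]\]", judge_output)
    labels = {"A": "answer_a", "B": "answer_b", "C": "tie"}
    return labels.get(match.group(1), "unparsed") if match else "unparsed"

def judge_pair(question, answer_a, answer_b, query_judge):
    """query_judge: hypothetical callable that sends a prompt to the judge
    model and returns its text reply."""
    prompt = build_pairwise_prompt(question, answer_a, answer_b)
    return parse_verdict(query_judge(prompt))
```

Instructing the judge to ignore answer position and length mirrors the prompt-level countermeasures the paper discusses for position and verbosity bias.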
Bias Mitigation
- Proposed mitigations include swapping answer positions and counting only verdicts that are consistent across both orders, few-shot judging, and chain-of-thought or reference-guided judging for math and reasoning questions (a position-swap sketch follows below).
- Results show that strong LLM judges such as GPT-4 agree with human preferences more than 80% of the time, the same level of agreement as between humans.
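One of the paper's mitigations for position bias is to run each comparison twice with the answer order swapped and to count a win only when the two verdicts agree. A minimal sketch of that consistency rule; `pairwise_judge` is any callable with the contract described in the docstring (for instance, a thin wrapper around the hypothetical `judge_pair` above).

```python
def judge_with_swap(question, answer_a, answer_b, pairwise_judge):
    """pairwise_judge: any callable taking (question, answer_a, answer_b) and
    returning 'answer_a', 'answer_b', or 'tie' for that answer ordering."""
    first = pairwise_judge(question, answer_a, answer_b)
    second = pairwise_judge(question, answer_b, answer_a)  # swapped order

    # The second call saw the answers swapped, so map its verdict back.
    swapped_back = {"answer_a": "answer_b", "answer_b": "answer_a", "tie": "tie"}
    second = swapped_back.get(second, "tie")

    # Only a verdict that survives the position swap counts as a win.
    return first if first == second else "tie"
```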
Hybrid Evaluation Framework
- Combines:
  - Traditional capability-based benchmarks (e.g., MMLU).
  - Preference-based benchmarks using LLM-as-a-judge (e.g., MT-Bench scores from single-answer grading, sketched below).
- Enables:
  - Swift and automated evaluations.
  - Assessment of both technical capabilities and alignment with human preferences.
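The preference-based half of this framework can also run in the paper's single-answer grading mode, where the judge assigns each response a score from 1 to 10 that can sit next to traditional benchmark numbers. A minimal sketch under the same assumptions as before: the prompt wording is condensed, and `query_judge` is a hypothetical stand-in for the judge-model client.

```python
import re

def build_grading_prompt(question: str, answer: str) -> str:
    """Single-answer grading prompt (wording condensed from the paper's style)."""
    return (
        "You are an impartial judge. Evaluate the AI assistant's answer to the "
        "user question below for helpfulness, relevance, accuracy, depth, and detail.\n\n"
        f"[Question]\n{question}\n\n[Answer]\n{answer}\n\n"
        'Give a single rating from 1 to 10 in the form "Rating: [[7]]".'
    )

def grade_answer(question, answer, query_judge):
    """Return the judge's 1-10 score, or None if the reply cannot be parsed."""
    reply = query_judge(build_grading_prompt(question, answer))
    match = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", reply)
    return float(match.group(1)) if match else None

def combined_report(capability_scores, mt_bench_scores):
    """Place traditional benchmark scores next to judge-based MT-Bench scores."""
    models = set(capability_scores) | set(mt_bench_scores)
    return {m: {"capability": capability_scores.get(m),
                "mt_bench": mt_bench_scores.get(m)} for m in models}
```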
Public Resources
- Released resources include:
  - The 80 MT-Bench questions.
  - 3,000 expert-level human votes on MT-Bench answers.
  - 30,000 Chatbot Arena conversations with human preference votes.
- All datasets are made publicly available for research (a loading sketch follows below).
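A sketch of loading the released data with the Hugging Face `datasets` library. The dataset IDs below are assumptions based on the LMSYS releases on the Hugging Face Hub; check the paper's repository for the canonical links, and note that access to the Arena conversations may require accepting terms on the Hub.

```python
from datasets import load_dataset

# Assumed Hugging Face dataset IDs for the released resources; verify them
# against the paper's repository before relying on this snippet.
expert_votes = load_dataset("lmsys/mt_bench_human_judgments")
arena_conversations = load_dataset("lmsys/chatbot_arena_conversations")

print(expert_votes)
print(arena_conversations)
```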
Conclusion
- The paper demonstrates that LLM-as-a-judge is a scalable, explainable, and effective method for approximating human preferences.