-
MMLU is no longer effective, and such a general test seems outdated for evaluating an LLM.
- MMMU is similar to MMLU but focuses on multimodal tasks
-
An effective way to evaluate LLMs is to use a test data set made for the specific task (by humans or AI). This allows you to test and compare several models; a minimal sketch follows these points.
- You must ensure the test data set is good quality and suited for testing the specific model.
- If this is done in-house, the data might be tampered with for marketing reasons.
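
A rough sketch of this in Python, assuming a hypothetical `query_model` callable per model and a simple exact-match metric; a real test set and metric would be task-specific.

```python
# Minimal sketch: scoring several models against the same task-specific test set.
# `query_model` is a hypothetical stand-in for whatever API or local client you use.
from typing import Callable

test_set = [  # hypothetical, hand-written examples for the specific task
    {"prompt": "Translate to French: good morning", "expected": "bonjour"},
    {"prompt": "Translate to French: thank you", "expected": "merci"},
]

def accuracy(query_model: Callable[[str], str]) -> float:
    """Exact-match accuracy over the test set; swap in a task-appropriate metric."""
    hits = sum(
        query_model(ex["prompt"]).strip().lower() == ex["expected"]
        for ex in test_set
    )
    return hits / len(test_set)

# Compare candidate models with the same data and the same metric.
models = {"model-a": lambda p: "bonjour", "model-b": lambda p: "merci"}  # stubs
for name, fn in models.items():
    print(f"{name}: {accuracy(fn):.2f}")
```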
-
When using another LLM to evaluate a model, it should be small and open-source, not a large closed-source model.
-
AIME: AI System Optimization via Multiple LLM Evaluators
-
Closed-source LLMs have been used in several cases for evaluation, even though it is debatable whether this is a good way to evaluate LLMs.
-
It’s more challenging to evaluate LLMs used for general purposes than LLMs used for a specific purpose.
- Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models
- Most well-performing general models are closed-source. Although you can read about their performance on the company website, you cannot replicate the test because some parts of the tests aren’t open to the public.
- The more specific the purpose an LLM is used for, the easier it is to build a test data set and evaluate with another LLM, and, hopefully, the cheaper and easier it is to do human evaluation.
- When a model is general and trained on data from different fields, I hypothesise that it can start hallucinating more quickly than a model focused on a specific field. This is my gut feeling, and I haven’t yet read anything that would back this up or contradict it.
-
A human scores better than an AI in long-term projects (> 8h) and worse in short-term projects (< 8h).
-
A powerful closed-source LLM can work well as an LLM evaluator if you do some tricks and tweaks, especially if used alongside a human evaluator; see the sketch after these points.
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
- I think using a large, closed-source LLM as an evaluator is counterproductive when the public can’t evaluate the evaluation: you rely solely on the company that created and evaluated the LLM, while they make money off it. And what happens if the company changes something in the LLM you use for tests, your CI suddenly breaks, and you don’t know why?
- When tweaking and customising things—in this case, LLMs as evaluators—to make them work, there is always a risk that the changes cannot be generalised.
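
A minimal sketch of an LLM-as-judge call, assuming a hypothetical `chat(model, prompt)` helper for whichever provider you use; the pinned version string is made up, but pinning an exact version rather than a moving alias is one way to reduce the CI-breakage risk mentioned above.

```python
# Minimal LLM-as-judge sketch with a rubric prompt and a pinned judge version.
import re

JUDGE_MODEL = "judge-model-2024-06-01"  # hypothetical pinned version, not a moving alias

RUBRIC = (
    "Rate the answer from 1 to 5 for factual accuracy and relevance to the question.\n"
    "Reply with only the number."
)

def judge(chat, question: str, answer: str) -> int:
    """Ask the judge model for a 1-5 score and parse it from the reply."""
    prompt = f"{RUBRIC}\n\nQuestion: {question}\nAnswer: {answer}"
    reply = chat(JUDGE_MODEL, prompt)
    match = re.search(r"[1-5]", reply)
    if match is None:
        raise ValueError(f"Judge returned no score: {reply!r}")
    return int(match.group())

# Usage with a stubbed chat function; real code would call the provider's API here.
print(judge(lambda model, prompt: "4", "What is 2+2?", "4"))
```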
-
Using several smaller LLMs as a panel to evaluate an LLM, rather than just one big evaluator LLM, can give better results and be more cost-efficient.
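
A minimal sketch of such a panel in the spirit of the juries paper, assuming each judge is a callable returning a 1-5 score from a different small open-source model (the stubs below are hypothetical); the panel verdict here is simply the mean.

```python
# Minimal judge-panel sketch: several independent judges score the same answer.
from statistics import mean
from typing import Callable, Sequence

def panel_score(judges: Sequence[Callable[[str, str], int]],
                question: str, answer: str) -> float:
    """Average the 1-5 scores from several independent judge models."""
    return mean(j(question, answer) for j in judges)

# Stub judges standing in for, e.g., three different small open-weight models.
judges = [lambda q, a: 4, lambda q, a: 5, lambda q, a: 4]
print(panel_score(judges, "What is 2+2?", "4"))  # -> 4.33...
```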
-
When using an LLM as an evaluator, there are three biases: