I asked ChatGPT, “What are the best methods to evaluate LLM models?” There was nothing new in the answer it gave me, but it made me realize how broad a concept evaluation is. Here I will focus on whether the answer given by the model in question is correct. The evaluator should verify that the answer is correct, much like a teacher does.
With simpler machine learning models, evaluating performance is easier: you typically split your data into train, validation and test sets, and you can use metrics like Exact Match to see how the model performs. With pre-trained LLMs, however, you fine-tune them for your purpose and most likely don’t have access to the data the model was trained on. The answers are also so complex and open-ended that Exact Match, for example, becomes a really poor metric.
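To make that concrete, here is a minimal sketch of an Exact Match check in Python. The helper and its normalization are my own illustration, not a standard library function; the point is that two answers with the same meaning rarely match character for character.

```python
# Minimal Exact Match sketch (illustrative, not a standard metric implementation).
def exact_match(prediction: str, reference: str) -> bool:
    """Return True only if the normalized strings are identical."""
    normalize = lambda s: " ".join(s.lower().strip().split())
    return normalize(prediction) == normalize(reference)

print(exact_match("Paris", "paris"))                    # True
print(exact_match("The capital of France is Paris.",
                  "Paris is the capital of France."))   # False, despite the same meaning
```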
https://www.youtube.com/watch?v=2CIIQ5KZWUM
According to Josh Tobin, the best approach in 2024 is to use another LLM to evaluate your LLM for bulk evaluation and, for the more sensitive evaluations, to use human evaluation. LLMs are easy to use as evaluators and are cost-efficient compared to humans, but they are not always reliable. An LLM tends to favour its own family of models; for example, OpenAI’s models score OpenAI outputs higher than Anthropic’s. LLMs also tend to favour long answers, which are not always better.
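Below is a rough LLM-as-judge sketch using the OpenAI Python SDK. The prompt wording, the CORRECT/INCORRECT format and the judge model name are assumptions for illustration, not a standard recipe, and given the self-preference and length biases mentioned above, the verdict should be treated as a noisy signal rather than ground truth.

```python
# LLM-as-judge sketch (assumptions: prompt format, judge model name, grading scale).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_answer(question: str, answer: str, reference: str,
                 model: str = "gpt-4o-mini") -> str:
    """Ask a judge LLM whether the model's answer matches the reference."""
    prompt = (
        "You are grading a model's answer.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {answer}\n"
        "Reply with a single word: CORRECT or INCORRECT."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic grading
    )
    return response.choices[0].message.content.strip()

print(judge_answer("What is the capital of France?", "It's Paris.", "Paris"))
```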
Human evaluation takes a lot of time and costs a lot. Thumbs-up or thumbs-down buttons in an LLM product are not a good way to collect feedback either, because that signal is really generic. And if you hire a team of people to evaluate the answers of your fine-tuned LLM, you had better be rich.
There have also been experiments where several LLMs act as a jury to evaluate a single LLM. Depending on the models used, this also tends to be costly.
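A jury like that can be sketched as a simple majority vote over several judge models. This reuses the hypothetical judge_answer() helper from the earlier sketch, and the model names are placeholders; it also shows where the cost comes from, since every extra judge is another paid API call per answer.

```python
# Multi-judge "jury" sketch: majority vote over several judge models.
# JUDGE_MODELS are placeholder names; judge_answer() is the hypothetical
# helper defined in the previous sketch.
from collections import Counter

JUDGE_MODELS = ["judge-model-a", "judge-model-b", "judge-model-c"]

def jury_verdict(question: str, answer: str, reference: str) -> str:
    votes = [
        judge_answer(question, answer, reference, model=model)
        for model in JUDGE_MODELS  # one API call per judge, per answer
    ]
    return Counter(votes).most_common(1)[0][0]  # majority verdict
```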
https://zhengdongwang.com/2024/12/29/2024-letter.html
MMLU is a benchmark that tests whether an LLM understands the question rather than how well the model actually performs. Other benchmarks and evaluations that come up:
Contextual memory
MMMU
OpenAI’s own evaluations
AIME
RE-bench (Evaluating frontier AI R&D capabilities of language model agents against human experts)
Many of the tests above are tailored to a specific use case.
“In fact, you might even say that the only time AI researchers are doing AI research is when they choose the evaluation. The rest of the time, they’re just optimizing a number.”