I asked ChatGPT, “What are the best methods to evaluate LLM models?” There was nothing new in the answer it gave me, but it made me realize how broad a concept evaluation is. Here I will focus on whether the answer given by the model in question is correct. The evaluator should verify that the answer is correct, much like a teacher does.
With simpler machine learning models, evaluating performance is easier: you typically split your data into train, validation and test sets, and you can use metrics like Exact Match to see how the model performs. With pre-trained LLMs, however, you fine-tune them for your purpose and most likely don’t have access to the data the model was trained on. The answers are also so complex and open-ended that Exact Match, for example, becomes a really poor metric.
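To make that concrete, here is a minimal sketch of an Exact Match check in Python. The helper and its normalization are my own illustration, not a standard library function; the point is that two answers with the same meaning rarely match character for character.

```python
# Minimal Exact Match sketch (illustrative, not a standard metric implementation).
def exact_match(prediction: str, reference: str) -> bool:
    """Return True only if the normalized strings are identical."""
    normalize = lambda s: " ".join(s.lower().strip().split())
    return normalize(prediction) == normalize(reference)

print(exact_match("Paris", "paris"))                    # True
print(exact_match("The capital of France is Paris.",
                  "Paris is the capital of France."))   # False, despite the same meaning
```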
https://www.youtube.com/watch?v=2CIIQ5KZWUM
According to Josh Tobin, the best approach in 2024 is to use another LLM to evaluate your LLM for bulk evaluation and, for the more sensitive evaluations, to use human evaluation. LLMs are easy to use as evaluators and are cost-efficient compared to humans, but they are not always reliable. An LLM tends to favour its own family of models; for example, OpenAI’s models score OpenAI outputs higher than Anthropic’s. LLMs also tend to favour long answers, which are not always better.
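Below is a rough LLM-as-judge sketch using the OpenAI Python SDK. The prompt wording, the CORRECT/INCORRECT format and the judge model name are assumptions for illustration, not a standard recipe, and given the self-preference and length biases mentioned above, the verdict should be treated as a noisy signal rather than ground truth.

```python
# LLM-as-judge sketch (assumptions: prompt format, judge model name, grading scale).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_answer(question: str, answer: str, reference: str,
                 model: str = "gpt-4o-mini") -> str:
    """Ask a judge LLM whether the model's answer matches the reference."""
    prompt = (
        "You are grading a model's answer.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {answer}\n"
        "Reply with a single word: CORRECT or INCORRECT."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic grading
    )
    return response.choices[0].message.content.strip()

print(judge_answer("What is the capital of France?", "It's Paris.", "Paris"))
```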
Human evaluation takes a lot of time and costs a lot. Thumbs-up or thumbs-down buttons in an LLM product are not a good way to collect feedback either, because that signal is really generic. And if you hire a team of people to evaluate the answers of your fine-tuned LLM, you had better be rich.
There have also been experiments where several LLMs act as a jury to evaluate a single LLM. Depending on the models used, this also tends to be costly.
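A jury like that can be sketched as a simple majority vote over several judge models. This reuses the hypothetical judge_answer() helper from the earlier sketch, and the model names are placeholders; it also shows where the cost comes from, since every extra judge is another paid API call per answer.

```python
# Multi-judge "jury" sketch: majority vote over several judge models.
# JUDGE_MODELS are placeholder names; judge_answer() is the hypothetical
# helper defined in the previous sketch.
from collections import Counter

JUDGE_MODELS = ["judge-model-a", "judge-model-b", "judge-model-c"]

def jury_verdict(question: str, answer: str, reference: str) -> str:
    votes = [
        judge_answer(question, answer, reference, model=model)
        for model in JUDGE_MODELS  # one API call per judge, per answer
    ]
    return Counter(votes).most_common(1)[0][0]  # majority verdict
```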
https://zhengdongwang.com/2024/12/29/2024-letter.html
MMLU is a benchmark that tests whether an LLM understands the question rather than how well the model actually performs. Other benchmarks and evaluations that come up:
Contextual memory
MMMU
OpenAI’s own evaluations
AIME
RE-bench (Evaluating frontier AI R&D capabilities of language model agents against human experts)
Many of the tests above are tailored to a specific use case.
“In fact, you might even say that the only time AI researchers are doing AI research is when they choose the evaluation. The rest of the time, they’re just optimizing a number.”