This project is for me to learn about and improve LLM evaluation. While working on a mathematics chatbot, I tried different model evaluation libraries (LightEval from Hugging Face and DeepEval from Confident AI, as of 12 December 2024), but they didn’t work as expected. After that, I started reading research papers and trying different evaluation methods. I have summarised the research on this Notion page and added my thoughts. There is a link to a GitHub repo where I try different approaches to evaluating the mathematics chatbot.
This page contains my research, notes and thoughts on the project. If you have any thoughts or comments, message me on LinkedIn.
GitHub: https://github.com/minettebrink/llm_eval
LinkedIn: https://www.linkedin.com/in/minette-kaunismäki-8b138b166/
X: @MinetteKaum
Research
Thoughts on what I’ve read
Summary of my reading, with my thoughts
- If using LLMs to evaluate, it is better to use several smaller LLMs as judges than a single large one.
- When using an LLM as an evaluator, avoid having it assign absolute scores; have it rank outputs instead.
- How does one define intelligence and build a dataset that can test intelligence?
- When a test dataset is used to evaluate big general models (like GPT-4), it reminds me of Principia Mathematica, but for intelligence. About 20 years after its publication, Gödel’s incompleteness theorems showed that the system of Principia Mathematica was incomplete.
- I believe using datasets like MMLU, MMMU, and ARC will run into the same kind of incompleteness as Principia Mathematica.
- I want to emphasise that this only applies to large general models that try to mimic intelligence. For smaller models built for a specific task, I don’t think the test dataset runs into the same issue.
- Exact match is a good evaluator when the model’s output needs to be correct in a specific form, such as in mathematics.
- This only applies to the mathematical part of the answer. Exact match is not a good evaluator for the explanations that accompany the mathematics; using an LLM as an evaluator could be a good solution there.
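As a sketch of what I mean by exact match for the mathematical part: compare the model’s answer to the reference after light normalization. The function names and normalization steps here are my own illustration, not from any particular library:

```python
def normalize(answer: str) -> str:
    """Normalize an answer before comparison: trim whitespace,
    lower-case, and strip a trailing period."""
    return answer.strip().lower().rstrip(".")

def exact_match(prediction: str, reference: str) -> bool:
    """Return True if the normalized prediction equals the normalized reference."""
    return normalize(prediction) == normalize(reference)

print(exact_match(" 42 ", "42"))    # True: normalization removes the whitespace
print(exact_match("x = 2", "x = 3"))  # False: the mathematical content differs
```

How much normalization is appropriate depends on the task: for symbolic answers, one might also want to compare expressions for mathematical equivalence rather than string equality.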
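On the points above about using several smaller LLM judges and ranking instead of scoring: one simple way to combine rankings from multiple judges is a Borda count. This is a minimal sketch of the aggregation step only (the judge prompts and API calls are left out, and the model names are made up):

```python
from collections import defaultdict

def aggregate_rankings(rankings: list[list[str]]) -> list[str]:
    """Combine per-judge rankings with a Borda count: in each judge's
    ranking, the top model gets the most points and the last gets zero.
    `rankings` is a list of ordered model-name lists, one per judge."""
    scores: dict[str, int] = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, model in enumerate(ranking):
            scores[model] += n - 1 - position
    # Sort models by total points, highest first.
    return sorted(scores, key=lambda m: scores[m], reverse=True)

# Three hypothetical judges, each ranking the same three models:
judges = [
    ["model-a", "model-b", "model-c"],
    ["model-a", "model-c", "model-b"],
    ["model-b", "model-a", "model-c"],
]
print(aggregate_rankings(judges))  # ['model-a', 'model-b', 'model-c']
```

A Borda count is only one aggregation choice; majority vote over pairwise comparisons is another common option, and it avoids any judge implicitly assigning absolute scores.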