Question: “What are the best methods to evaluate LLMs?”
Answer:
1. Intrinsic Evaluation
This evaluates the model's output quality directly against predefined tasks or datasets, using automatic metrics rather than human judgment. Common methods include:
Perplexity
- Measures how well a model predicts a held-out sequence of tokens; formally, it is the exponentiated average negative log-likelihood per token.
- Lower perplexity indicates better predictive performance.
- Useful for language modeling tasks (see the sketch below).
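For concreteness, here is a minimal perplexity sketch. It assumes the Hugging Face `transformers` library and the small `gpt2` checkpoint; both are illustrative choices, and any causal language model would work the same way.

```python
# Minimal perplexity sketch (assumes `pip install torch transformers`;
# `gpt2` is an illustrative model choice, not a recommendation).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "The quick brown fox jumps over the lazy dog."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels=input_ids makes the model return the mean
    # cross-entropy (average negative log-likelihood per token).
    loss = model(**inputs, labels=inputs["input_ids"]).loss

perplexity = torch.exp(loss).item()
print(f"Perplexity: {perplexity:.2f}")  # lower is better
```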
BLEU, ROUGE, and METEOR
- Compare generated text to one or more reference texts; BLEU and ROUGE rely on n-gram overlap, while METEOR additionally matches stems and synonyms.
- Commonly used in tasks like translation and summarization (see the sketch below).
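A hedged sketch of the overlap metrics, assuming the `nltk` and `rouge-score` packages are installed (`pip install nltk rouge-score`); the example strings are made up for illustration.

```python
# N-gram overlap metrics sketch (assumes `nltk` and `rouge-score`).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

reference = "the cat sat on the mat"
candidate = "the cat is on the mat"

# BLEU: n-gram precision against one or more tokenized references.
bleu = sentence_bleu(
    [reference.split()], candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE: recall-oriented overlap, common for summarization.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)

print(f"BLEU: {bleu:.3f}")
print(f"ROUGE-1 F1: {rouge['rouge1'].fmeasure:.3f}")
print(f"ROUGE-L F1: {rouge['rougeL'].fmeasure:.3f}")
```

BLEU is smoothed here because a single sentence often has no matching higher-order n-grams, which would otherwise zero out the score.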
Exact Match (EM) and F1 Score
- EM checks whether the model’s output exactly matches the expected answer, typically after normalization (lowercasing, stripping punctuation and articles).
- F1 Score measures token-level overlap between the prediction and the gold answer, balancing precision and recall; it is standard for extractive question answering (see the sketch below).
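Both metrics are simple enough to implement directly. The sketch below follows the common SQuAD-style normalization convention (lowercase, drop punctuation and English articles); those normalization details are one standard choice, not the only one.

```python
# Self-contained Exact Match and token-level F1, SQuAD-style.
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse spaces."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> float:
    return float(normalize(prediction) == normalize(gold))

def f1_score(prediction: str, gold: str) -> float:
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The Eiffel Tower", "eiffel tower"))   # 1.0
print(f1_score("the tall Eiffel Tower", "eiffel tower"))  # 0.8
```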
Log-Likelihood or Probability
- Measures the (log-)probability the model assigns to correct outputs or gold-standard labels; a common use is scoring each option of a multiple-choice question and picking the highest-scoring one (see the sketch below).
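A sketch of log-likelihood scoring for multiple-choice style evaluation, again assuming `transformers` and the illustrative `gpt2` model. One caveat, noted in the code: tokenizing prompt and completion together can differ slightly at the boundary from tokenizing them separately.

```python
# Log-likelihood scoring sketch (assumes `torch` and `transformers`).
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def completion_logprob(prompt: str, completion: str) -> float:
    """Total log-probability the model assigns to the completion tokens."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Logits at position i predict token i+1, so shift by one.
    log_probs = F.log_softmax(logits[0, :-1], dim=-1)
    targets = full_ids[0, 1:]
    # Score only the completion tokens. Caveat: tokenization at the
    # prompt/completion boundary may differ slightly for some tokenizers.
    start = prompt_len - 1
    idx = torch.arange(start, targets.shape[0])
    return log_probs[idx, targets[start:]].sum().item()

prompt = "The capital of France is"
for candidate in [" Paris", " London"]:
    print(f"{candidate!r}: {completion_logprob(prompt, candidate):.2f}")
```

The candidate with the highest total log-probability is taken as the model's answer; this avoids free-form generation entirely and makes the comparison deterministic.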
2. Extrinsic Evaluation
This assesses the model's performance in real-world tasks and applications. Examples include:
Task-Specific Accuracy