https://mmmu-benchmark.github.io/
2023 (MMMU-Pro follow-up: 2024)
The blog post introduces MMMU (Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark), a comprehensive evaluation framework designed to assess large multimodal models (LMMs) on complex, college-level tasks that require advanced subject knowledge and deliberate reasoning.
Key Features of MMMU:
- Extensive Dataset: Comprises 11.5K meticulously collected multimodal questions sourced from college exams, quizzes, and textbooks. These questions span six core disciplines:
  - Art & Design
  - Business
  - Science
  - Health & Medicine
  - Humanities & Social Science
  - Tech & Engineering
- Diverse Subject Coverage: Encompasses 30 subjects and 183 subfields, featuring 30 highly heterogeneous image types, including charts, diagrams, maps, tables, music sheets, and chemical structures (see the data-loading sketch after this list).
- Evaluation Focus: Assesses models on three essential skills:
  - Perception: Ability to process and understand information across different modalities.
  - Knowledge: Possession of subject-specific information necessary for task comprehension.
  - Reasoning: Capability to apply deliberate reasoning with domain-specific knowledge to derive solutions.
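To make the dataset structure concrete, here is a minimal loading sketch in Python. It assumes the data is hosted on the Hugging Face Hub under the dataset ID `MMMU/MMMU` with one config per subject and `dev`/`validation`/`test` splits, and that the field names (`question`, `options`, `answer`, `image_1` through `image_7`) follow the Hub dataset card; verify both against the card before relying on them.

```python
from datasets import load_dataset

# Assumed layout: one config per subject (e.g. "Art"), with dev used for
# few-shot examples, validation for local scoring, and test answers withheld.
ds = load_dataset("MMMU/MMMU", "Art", split="validation")

example = ds[0]
print(example["question"])  # question text; may embed <image 1> placeholders
print(example["options"])   # answer choices, assumed stored as a stringified list
print(example["answer"])    # gold option letter, e.g. "B"
img = example["image_1"]    # PIL image; up to seven images per question
```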
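Because most questions are multiple-choice, all three skills are typically probed through a single prompt built from one record. Below is a plain zero-shot formatting sketch under the same field-name assumptions; the official evaluation harness uses its own template, and `build_prompt` is a hypothetical helper.

```python
import ast

def build_prompt(example: dict) -> str:
    """Format one MMMU record as a zero-shot multiple-choice prompt.

    Images referenced as <image 1> etc. in the question text are passed
    to the model separately, alongside this text.
    """
    options = ast.literal_eval(example["options"])  # "['...', '...']" -> list
    letters = "ABCDEFGHI"
    lines = [example["question"], ""]
    lines += [f"({letters[i]}) {opt}" for i, opt in enumerate(options)]
    lines += ["", "Answer with the option's letter from the given choices directly."]
    return "\n".join(lines)
```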
Challenges Highlighted by MMMU:
- Questions demand expert-level visual perception: models must read highly heterogeneous inputs such as diagrams, chemical structures, and music sheets, often interleaved with text.
- Solving them additionally requires deliberate reasoning grounded in college-level, subject-specific knowledge, not just general-purpose visual understanding.
Performance Insights:
- Evaluations of 14 open-source LMMs as well as proprietary models like GPT-4V(ision) show that MMMU poses substantial challenges.
- Even the strongest model evaluated, GPT-4V, reaches only around 56% accuracy, leaving significant room for improvement (a minimal scoring sketch follows this list).
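Accuracies such as the ~56% figure come from matching an option letter parsed out of the model's free-form response against the gold answer. A simplified scoring sketch follows; the official MMMU harness applies more elaborate answer parsing, and `extract_choice` and `accuracy` are hypothetical helpers.

```python
import re

def extract_choice(response: str) -> str | None:
    """Pull the first standalone option letter (A-E) from a model response.

    A simplification of real answer parsing, which must handle phrasing
    like "The correct option is (B)" as well as unparseable outputs.
    """
    m = re.search(r"\b([A-E])\b", response)
    return m.group(1) if m else None

def accuracy(responses: list[str], gold: list[str]) -> float:
    """Fraction of responses whose parsed letter matches the gold letter."""
    hits = sum(extract_choice(r) == g for r, g in zip(responses, gold))
    return hits / len(gold)

# Example: accuracy(["The answer is B.", "I'd pick C"], ["B", "D"]) == 0.5
```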
Significance:
- MMMU aims to stimulate the development of next-generation multimodal foundation models toward expert artificial general intelligence (AGI).
- By centering advanced perception and deliberate reasoning with domain-specific knowledge, MMMU measures progress toward expert-level performance and pushes the boundaries of multimodal AI capabilities.