https://openai.com/index/learning-to-reason-with-llms/
https://arxiv.org/pdf/2305.20050
The paper titled "Let's Verify Step by Step" investigates methods to enhance the reliability of large language models (LLMs) in performing complex multi-step reasoning tasks, particularly in mathematical problem-solving.
Key Contributions:
- Comparison of Supervision Methods: The authors compare two distinct methods for training reward models (a minimal sketch contrasting the two labeling schemes appears after this list):
  - Outcome Supervision: Provides feedback based solely on the final result of the reasoning process.
  - Process Supervision: Offers feedback for each intermediate reasoning step, allowing errors to be identified and corrected more precisely.
- Empirical Evaluation: The study demonstrates that process supervision significantly outperforms outcome supervision for training reward models on problems from the challenging MATH dataset. When the process-supervised reward model is used to select the best of many sampled solutions, it solves 78% of problems from a representative subset of the MATH test set (see the reranking sketch after this list).
- Active Learning Enhancement: Incorporating an active learning strategy for choosing which solutions to label further improves process supervision, increasing its data efficiency by a factor of 2.6 (see the selection sketch after this list).
- PRM800K Dataset Release: To support further research, the authors have released PRM800K, a comprehensive dataset containing 800,000 step-level human feedback labels used to train their reward model.
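To make the distinction between the two supervision schemes concrete, here is a minimal sketch (not the paper's code or data format). The `Solution` dataclass, the toy algebra problem, and the `step_correct` field standing in for human step-level judgments are all illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Solution:
    """Hypothetical container for a model-generated solution split into steps."""
    steps: list[str]          # intermediate reasoning steps
    final_answer: str         # answer extracted from the last step
    step_correct: list[bool]  # per-step human judgments (process supervision)

def outcome_labels(sol: Solution, reference_answer: str) -> list[int]:
    # Outcome supervision: a single label for the whole solution,
    # based only on whether the final answer matches the reference.
    return [int(sol.final_answer == reference_answer)]

def process_labels(sol: Solution) -> list[int]:
    # Process supervision: one label per intermediate step,
    # which pinpoints exactly where the reasoning went wrong.
    return [int(ok) for ok in sol.step_correct]

# Toy example: solving 2x + 3 = 11 with an arithmetic slip in the last step.
sol = Solution(
    steps=["2x + 3 = 11", "2x = 8", "x = 5"],
    final_answer="5",
    step_correct=[True, True, False],
)
print(outcome_labels(sol, reference_answer="4"))  # [0]      -> only the end result is judged
print(process_labels(sol))                        # [1, 1, 0] -> the error is localized to step 3
```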
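The 78% figure is a best-of-N result: many solutions are sampled per problem, the process reward model (PRM) scores each step, and the highest-scoring solution is selected. Below is a minimal sketch of that reranking under stated assumptions: `fake_prm` and its made-up per-step probabilities stand in for a trained PRM, and the solution score is taken as the product of per-step correctness probabilities (i.e., the probability that every step is correct), a natural aggregation consistent with how the paper scores solutions.

```python
import math
from typing import Callable, Sequence

StepScorer = Callable[[Sequence[str]], Sequence[float]]  # steps -> per-step P(correct)

def solution_score(step_probs: Sequence[float]) -> float:
    # Product of per-step correctness probabilities: the probability
    # that every step in the solution is correct.
    return math.prod(step_probs)

def best_of_n(candidates: Sequence[Sequence[str]], scorer: StepScorer) -> int:
    # Return the index of the candidate solution the reward model ranks highest.
    scores = [solution_score(scorer(steps)) for steps in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)

# Made-up per-step probabilities for two candidate chains (illustrative only).
made_up_scores = {
    ("2x + 3 = 11", "2x = 8", "x = 5"): [0.98, 0.97, 0.30],  # a late step looks wrong
    ("2x + 3 = 11", "2x = 8", "x = 4"): [0.98, 0.97, 0.95],  # every step looks plausible
}

def fake_prm(steps: Sequence[str]) -> Sequence[float]:
    # Stand-in for a trained PRM; real scores would come from a neural model.
    return made_up_scores[tuple(steps)]

candidates = [list(k) for k in made_up_scores]
print(best_of_n(candidates, fake_prm))  # -> 1, the chain whose steps all look correct
```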
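The active-learning gain comes from spending labels where they are most informative: the paper surfaces "convincing wrong-answer" solutions, i.e., samples the current reward model scores highly even though the final answer is wrong, and sends those for human annotation. A minimal sketch of that selection step, with made-up scores and correctness flags, is below.

```python
from typing import Sequence

def select_for_labeling(
    prm_scores: Sequence[float],     # current PRM's score for each sampled solution
    answer_correct: Sequence[bool],  # whether each final answer matches the reference
    k: int,
) -> list[int]:
    # Pick the k "convincing wrong-answer" solutions: highly scored by the
    # current PRM but ending in an incorrect answer. Step-level labels on
    # these examples are the most informative for the next training round.
    wrong = [i for i, ok in enumerate(answer_correct) if not ok]
    wrong.sort(key=lambda i: prm_scores[i], reverse=True)
    return wrong[:k]

# Hypothetical batch of six sampled solutions:
scores  = [0.91, 0.40, 0.87, 0.15, 0.78, 0.95]
correct = [False, False, True, False, False, True]
print(select_for_labeling(scores, correct, k=2))  # -> [0, 4]
```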
Conclusion:
The findings suggest that providing feedback at each step of the reasoning process enables LLMs to develop more accurate and reliable problem-solving capabilities, particularly in complex mathematical tasks. This step-by-step verification approach holds promise for improving the alignment and performance of AI systems in domains requiring intricate reasoning.