DocPuzzle: A Process-Aware Benchmark for Evaluating Realistic Long-Context Reasoning Capabilities
- URL: http://arxiv.org/abs/2502.17807v1
- Date: Tue, 25 Feb 2025 03:29:53 GMT
- Title: DocPuzzle: A Process-Aware Benchmark for Evaluating Realistic Long-Context Reasoning Capabilities
- Authors: Tianyi Zhuang, Chuqiao Kuang, Xiaoguang Li, Yihua Teng, Jihao Wu, Yasheng Wang, Lifeng Shang,
- Abstract summary: We present DocPuzzle, a rigorously constructed benchmark for evaluating long-context reasoning capabilities in large language models (LLMs). This benchmark comprises 100 expert-level QA problems requiring multi-step reasoning over long real-world documents. We introduce an innovative evaluation framework that mitigates guessing bias through checklist-guided process analysis.
- Score: 39.68147391225923
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present DocPuzzle, a rigorously constructed benchmark for evaluating long-context reasoning capabilities in large language models (LLMs). This benchmark comprises 100 expert-level QA problems requiring multi-step reasoning over long real-world documents. To ensure task quality and complexity, we implement a human-AI collaborative annotation-validation pipeline. DocPuzzle introduces an innovative evaluation framework that mitigates guessing bias through checklist-guided process analysis, establishing new standards for assessing reasoning capacities in LLMs. Our evaluation results show that: 1) advanced slow-thinking reasoning models like o1-preview (69.7%) and DeepSeek-R1 (66.3%) significantly outperform the best general instruct models like Claude 3.5 Sonnet (57.7%); 2) distilled reasoning models like DeepSeek-R1-Distill-Qwen-32B (41.3%) fall far behind the teacher model, suggesting that distillation alone struggles to preserve the generality of reasoning capabilities.
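Checklist-guided process analysis is the abstract's key evaluation idea: credit an answer only when the model's reasoning trace covers the required intermediate steps, so a lucky final answer scores nothing. Below is a minimal sketch of such a grader; the checklist schema, keyword-matching rule, and coverage threshold are illustrative assumptions, not DocPuzzle's actual implementation.

```python
# Sketch of checklist-guided process analysis (assumed structure, not
# DocPuzzle's real code): an answer earns credit only if the reasoning
# trace covers enough checklist items, mitigating guessing bias.

from dataclasses import dataclass, field

@dataclass
class ChecklistItem:
    description: str
    keywords: list[str]  # evidence the trace must mention (assumed heuristic)

@dataclass
class Problem:
    question: str
    gold_answer: str
    checklist: list[ChecklistItem] = field(default_factory=list)

def coverage(trace: str, checklist: list[ChecklistItem]) -> float:
    """Fraction of checklist items whose evidence appears in the trace."""
    trace_lower = trace.lower()
    hits = sum(
        1 for item in checklist
        if all(kw.lower() in trace_lower for kw in item.keywords)
    )
    return hits / len(checklist) if checklist else 0.0

def score(problem: Problem, trace: str, answer: str,
          min_coverage: float = 0.5) -> bool:
    """Credit an answer only if the reasoning process supports it,
    so a lucky guess with an empty trace scores zero."""
    correct = answer.strip().lower() == problem.gold_answer.strip().lower()
    return correct and coverage(trace, problem.checklist) >= min_coverage
```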
Related papers
- FINEREASON: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving [90.88021670297664]
FINEREASON is a logic-puzzle benchmark for evaluating the reasoning capabilities of large language models.
We introduce two tasks, state checking and state transition, for a comprehensive evaluation of how models assess the current situation and plan the next move.
We show that models trained on our state checking and transition data demonstrate gains in math reasoning by up to 5.1% on GSM8K.
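To make the two task types concrete, here is a hypothetical rendering of state checking and state transition on a toy 8-puzzle; the task framing comes from the abstract, but the puzzle and ground-truth functions are assumptions for illustration.

```python
# Toy 8-puzzle versions of FINEREASON's two tasks (illustrative only).

State = tuple[int, ...]  # 9 tiles in row-major order, 0 = blank

def state_check(state: State) -> bool:
    """State checking: is this a legal 8-puzzle configuration?"""
    return sorted(state) == list(range(9))

def state_transitions(state: State) -> list[State]:
    """State transition: enumerate all states reachable in one blank move."""
    i = state.index(0)
    row, col = divmod(i, 3)
    moves = []
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        r, c = row + dr, col + dc
        if 0 <= r < 3 and 0 <= c < 3:
            j = r * 3 + c
            nxt = list(state)
            nxt[i], nxt[j] = nxt[j], nxt[i]
            moves.append(tuple(nxt))
    return moves

# An evaluation item would pose these questions to a model and compare
# against the ground truth computed here:
start = (1, 2, 3, 4, 0, 5, 6, 7, 8)
assert state_check(start)
assert len(state_transitions(start)) == 4  # blank in the center has 4 moves
```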
arXiv Detail & Related papers (2025-02-27T16:23:25Z)
- RuozhiBench: Evaluating LLMs with Logical Fallacies and Misleading Premises [41.39610589639382]
We present RuozhiBench, a dataset of 677 carefully curated questions that contain various forms of deceptive reasoning. We evaluate 17 large language models (LLMs) from 5 series on RuozhiBench using both open-ended and two-choice formats. LLMs showed limited ability to detect and reason correctly about logical fallacies: even the best-performing model, Claude-3-haiku, achieved only 62% accuracy, compared to human accuracy of more than 90%.
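As a concrete illustration of the two-choice format, a minimal deterministic scorer might look like the sketch below; the option-letter extraction heuristic and field layout are assumptions, not the paper's protocol.

```python
# Assumed two-choice scorer: extract the chosen option letter and compare.
import re

def score_two_choice(model_output: str, gold: str) -> bool:
    """Find the first standalone A/B in the output and check it."""
    m = re.search(r"\b([AB])\b", model_output.upper())
    return bool(m) and m.group(1) == gold.upper()

def accuracy(outputs: list[str], golds: list[str]) -> float:
    hits = sum(score_two_choice(o, g) for o, g in zip(outputs, golds))
    return hits / len(golds)

print(accuracy(
    ["The answer is A.", "B, because the premise is false."],
    ["A", "B"],
))  # 1.0
```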
arXiv Detail & Related papers (2025-02-18T18:47:11Z)
- Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning [65.2421542320293]
Reasoning abilities are crucial components of general intelligence. Recent advances by proprietary companies, such as the o-series models of OpenAI, have made remarkable progress on reasoning tasks. This paper proposes a new RL framework, termed OREAL, to pursue the performance limit that can be achieved through Outcome REwArd-based reinforcement Learning for mathematical reasoning tasks.
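The core ingredient named in the abstract, an outcome reward, can be sketched minimally: the policy is rewarded only for its final answer, never for intermediate steps. The answer-extraction regex and reward values below are illustrative assumptions, not OREAL's implementation.

```python
# Minimal sketch of a binary outcome-reward signal (assumed format).
import re

def extract_final_answer(completion: str) -> str | None:
    """Pull the last number from a chain-of-thought completion."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return numbers[-1] if numbers else None

def outcome_reward(completion: str, gold_answer: str) -> float:
    """1.0 if the final answer matches, else 0.0; intermediate
    reasoning earns no direct credit under this scheme."""
    pred = extract_final_answer(completion)
    return 1.0 if pred is not None and pred == gold_answer else 0.0

# In an RL loop this scalar would weight policy-gradient updates, e.g.
# loss = -outcome_reward(sample, gold) * logprob(sample)
print(outcome_reward("3 + 4 = 7, so the answer is 7", "7"))  # 1.0
```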
arXiv Detail & Related papers (2025-02-10T18:57:29Z)
- LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs [103.0226977561914]
We propose a comprehensive framework for advancing step-by-step visual reasoning in large language models. First, we introduce a visual reasoning benchmark specifically designed to evaluate multi-step reasoning tasks. Second, we propose a novel metric that assesses visual reasoning quality at the granularity of individual steps. Third, we present a new multimodal visual reasoning model, named LlamaV-o1, trained using a multi-step curriculum learning approach.
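A step-granular metric of the kind the second contribution describes could, in spirit, report the fraction of reference reasoning steps a prediction covers; the token-overlap similarity in this sketch is a stand-in for whatever the paper actually uses, and all names are hypothetical.

```python
# Sketch of a step-level reasoning score (illustrative stand-in).

def step_similarity(pred: str, ref: str) -> float:
    """Jaccard overlap of word sets as a crude per-step similarity proxy."""
    a, b = set(pred.lower().split()), set(ref.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def step_level_score(pred_steps: list[str], ref_steps: list[str],
                     threshold: float = 0.5) -> float:
    """Greedily align predicted steps to reference steps in order and
    report the fraction of reference steps adequately covered."""
    matched, j = 0, 0
    for ref in ref_steps:
        while j < len(pred_steps):
            if step_similarity(pred_steps[j], ref) >= threshold:
                matched += 1
                j += 1
                break
            j += 1
    return matched / len(ref_steps) if ref_steps else 0.0

print(step_level_score(
    ["read the chart axes", "sum the two bars", "report 12"],
    ["identify the chart axes", "sum the two bar values", "state the total 12"],
))  # ~0.67: two of three reference steps covered
```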
arXiv Detail & Related papers (2025-01-10T18:59:51Z)
- EQUATOR: A Deterministic Framework for Evaluating LLM Reasoning with Open-Ended Questions. # v1.0.0-beta [2.1249213103048414]
We introduce the EQUATOR Evaluator, which combines deterministic scoring with a focus on factual accuracy and robust reasoning assessment. Using a vector database, EQUATOR pairs open-ended questions with human-evaluated answers, enabling more precise and scalable evaluations. Our results demonstrate that this framework significantly outperforms traditional multiple-choice evaluations while maintaining high accuracy standards.
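The retrieve-then-compare idea the abstract describes can be sketched as follows; the bag-of-words "embedding" and cosine threshold stand in for a real vector database and encoder, and the reference store is hypothetical.

```python
# Sketch of deterministic scoring via nearest human-evaluated answer
# (toy embedding; a real system would use a vector DB and an encoder).

from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    return Counter(text.lower().split())  # toy embedding (assumption)

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical store of human-evaluated reference answers for one question.
reference_answers = [
    ("water boils at 100 degrees celsius at sea level", True),
    ("water boils at 50 degrees celsius", False),
]

def deterministic_score(model_answer: str, threshold: float = 0.6) -> bool:
    """Accept the answer iff its nearest stored reference is similar
    enough and was human-labeled correct. No LLM judge involved."""
    vec = embed(model_answer)
    sim, verdict = max((cosine(vec, embed(ref)), ok)
                       for ref, ok in reference_answers)
    return sim >= threshold and verdict

print(deterministic_score("at sea level water boils at 100 degrees celsius"))
```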
arXiv Detail & Related papers (2024-12-31T03:56:17Z)
- A NotSo Simple Way to Beat Simple Bench [0.0]
This paper presents a novel framework for enhancing reasoning capabilities in large language models (LLMs). We propose a multi-step prompting strategy coupled with global consistency checks to improve model accuracy and robustness. Our results reveal model-specific strengths: Claude excels in maintaining logical consistency, while GPT-4o exhibits exploratory creativity but struggles with ambiguous prompts.
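A minimal rendering of multi-step prompting with a global consistency check might look like this; `call_model` is a stub standing in for a real LLM API, and the prompt framings are illustrative assumptions.

```python
# Sketch of multi-step prompting plus a global consistency check.

from collections import Counter

def call_model(prompt: str) -> str:
    """Stub: replace with a real LLM call in practice."""
    return "42"

def solve_with_consistency(question: str, n_paths: int = 3) -> str | None:
    # Step 1: elicit several reasoning paths with varied framings.
    framings = [
        "Solve step by step: ",
        "Restate the problem in your own words, then solve: ",
        "List the constraints first, then solve: ",
    ]
    answers = [call_model(framings[i % len(framings)] + question)
               for i in range(n_paths)]
    # Step 2: global consistency check. Accept only an answer a clear
    # majority of paths agree on; otherwise flag the question as ambiguous.
    best, count = Counter(answers).most_common(1)[0]
    return best if count > n_paths // 2 else None

print(solve_with_consistency("What is 6 * 7?"))  # "42" from the stub
```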
arXiv Detail & Related papers (2024-12-12T16:04:31Z)
- A Comparative Study on Reasoning Patterns of OpenAI's o1 Model [69.08287909042421]
We show that OpenAI's o1 model has achieved the best performance on most datasets.
We also provide a detailed analysis on several reasoning benchmarks.
arXiv Detail & Related papers (2024-10-17T15:09:03Z)
- TurtleBench: Evaluating Top Language Models via Real-World Yes/No Puzzles [2.8839090723566296]
TurtleBench collects real user guesses from our online Turtle Soup Puzzle platform.
TurtleBench includes 1,532 user guesses along with correctness labels assigned after annotation.
We thoroughly evaluated nine of the most advanced Large Language Models available today.
arXiv Detail & Related papers (2024-10-07T17:58:47Z)
- Advancing LLM Reasoning Generalists with Preference Trees [119.57169648859707]
We introduce Eurus, a suite of large language models (LLMs) optimized for reasoning.
Eurus models achieve state-of-the-art results among open-source models on a diverse set of benchmarks.
arXiv Detail & Related papers (2024-04-02T16:25:30Z)