Related papers: ARES: Multimodal Adaptive Reasoning via Difficulty-Aware Token-Level Entropy Shaping

URL: http://arxiv.org/abs/2510.08457v1
Date: Thu, 09 Oct 2025 17:03:28 GMT
Title: ARES: Multimodal Adaptive Reasoning via Difficulty-Aware Token-Level Entropy Shaping
Authors: Shuang Chen, Yue Guo, Yimeng Ye, Shijue Huang, Wenbo Hu, Haoxi Li, Manyuan Zhang, Jiayu Chen, Song Guo, Nanyun Peng,
Abstract summary: We propose ARES, a unified framework for adaptive reasoning that dynamically allocates exploration effort based on task difficulty. Our approach is motivated by two key empirical findings: (i) while single-token entropy is noisy, high window-entropy (HWE) tokens can reliably capture reasoning-critical moments. In the Adaptive Cold-Start stage, we curate multimodal and textual data paired with reasoning traces of length proportional to problem difficulty, equipping the model with initial difficulty awareness. In the second stage, we develop Adaptive Entropy Policy Optimization (AEPO), which uses HWE tokens as exploration triggers.
Score: 54.37497695483689
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advances in multimodal large reasoning models (MLRMs) have substantially improved their ability to solve complex textual and visual tasks. However, these models tend to overthink on simple problems, producing unnecessarily lengthy reasoning traces, while under-exploring on challenging ones, leading to missed solutions. To address this imbalance, we propose ARES, a unified open-source framework for adaptive reasoning that dynamically allocates exploration effort based on task difficulty. Our approach is motivated by two key empirical findings: (i) while single-token entropy is noisy, high window-entropy (HWE) tokens (token-level entropies averaged under a sliding window) can reliably capture reasoning-critical moments; and (ii) reducing HWE usage benefits easy problems, while increasing it is essential for solving hard ones. Building on these insights, ARES introduces a two-stage training pipeline. In the Adaptive Cold-Start stage, we curate multimodal and textual data paired with reasoning traces of length proportional to problem difficulty, equipping the model with initial difficulty awareness. In the second stage, we develop Adaptive Entropy Policy Optimization (AEPO), which uses HWE tokens as exploration triggers to decide when to explore, and a hierarchical entropy reward with dynamic KL control to decide how much to explore. Extensive experiments demonstrate that ARES achieves superior performance and reasoning efficiency across diverse mathematical, logical, and multimodal benchmarks, while closing the gap to leading commercial systems under significantly lower inference costs.
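
Illustration: the abstract describes high window-entropy (HWE) tokens as token-level entropies averaged under a sliding window. The Python sketch below shows one way that idea could look. The trailing-window shape, the window size of 3, the threshold of 1.0, and all function names are illustrative assumptions, not details taken from the paper.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def window_entropy(entropies: Sequence[float], window: int) -> List[float]:
    """Average each token's entropy with up to `window - 1` preceding tokens.

    Assumption: a trailing window; the paper may use a different window shape.
    """
    smoothed = []
    for i in range(len(entropies)):
        start = max(0, i - window + 1)
        chunk = entropies[start:i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def high_window_entropy_positions(
    entropies: Sequence[float], window: int, threshold: float
) -> List[int]:
    """Indices whose window-averaged entropy exceeds `threshold` (hypothetical rule)."""
    smoothed = window_entropy(entropies, window)
    return [i for i, h in enumerate(smoothed) if h > threshold]


# Made-up per-token entropies: one isolated spike (index 2) and one sustained run (5-7).
entropies = [0.1, 0.1, 2.0, 0.1, 0.1, 1.5, 1.6, 1.7, 0.2]

single_token_hits = [i for i, h in enumerate(entropies) if h > 1.0]
hwe_hits = high_window_entropy_positions(entropies, window=3, threshold=1.0)

print(single_token_hits)  # [2, 5, 6, 7]
print(hwe_hits)           # [6, 7, 8]
```

In this toy input, thresholding raw entropies flags the isolated spike at index 2, while the windowed average only crosses the threshold once high entropy is sustained, which matches the abstract's point that single-token entropy is noisy.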

Related papers

SIGHT: Reinforcement Learning with Self-Evidence and Information-Gain Diverse Branching for Search Agent [39.43590030917357]
SIGHT is a framework that enhances search-based reasoning through Self-Evidence Support (SES) and Information-Gain Driven Diverse Branching. SIGHT distills search results into high-fidelity evidence via SES and calculates an Information Gain score to pinpoint pivotal states. Experiments on single-hop and multi-hop QA benchmarks demonstrate that SIGHT significantly outperforms existing approaches.
arXiv Detail & Related papers (2026-02-12T04:16:55Z)
DeepCompress: A Dual Reward Strategy for Dynamically Exploring and Compressing Reasoning Chains [56.708381920156256]
Large Reasoning Models (LRMs) have demonstrated impressive capabilities but suffer from cognitive inefficiencies like "overthinking" simple problems and "underthinking" complex ones. This paper introduces DeepCompress, a novel framework that simultaneously enhances both the accuracy and efficiency of LRMs.
arXiv Detail & Related papers (2025-10-31T12:13:11Z)
DiffAdapt: Difficulty-Adaptive Reasoning for Token-Efficient LLM Inference [68.05879215304641]
Recent reasoning Large Language Models (LLMs) demonstrate remarkable problem-solving abilities but often generate long thinking traces whose utility is unclear. Our work aims to improve their efficiency, enabling them to reach high performance without overthinking. We introduce DiffAdapt, a lightweight framework that selects Easy/Normal/Hard inference strategies per question based on their difficulty and reasoning trace entropy.
arXiv Detail & Related papers (2025-10-22T15:16:06Z)
Rediscovering Entropy Regularization: Adaptive Coefficient Unlocks Its Potential for LLM Reinforcement Learning [55.59724323303857]
We propose a framework that balances exploration and exploitation via three components: difficulty-aware coefficient allocation, initial-anchored target entropy, and dynamic global coefficient adjustment. Experiments on multiple mathematical reasoning benchmarks show that AER consistently outperforms baselines, improving both reasoning accuracy and exploration capability.
arXiv Detail & Related papers (2025-10-13T03:10:26Z)
From Long to Short: LLMs Excel at Trimming Own Reasoning Chains [48.692414597960244]
O1/R1-style large reasoning models (LRMs) signal a substantial leap forward over conventional instruction-following LLMs. Recent studies show that LRMs are prone to overthinking. We propose a test-time scaling method, EDIT, which efficiently guides LRMs to identify the shortest correct reasoning paths at test time.
arXiv Detail & Related papers (2025-09-07T19:00:44Z)
GRADE: Generating multi-hop QA and fine-gRAined Difficulty matrix for RAG Evaluation [5.002953635224383]
Retrieval-Augmented Generation (RAG) systems are widely adopted in knowledge-intensive NLP tasks. Current evaluations often overlook the structural complexity and multi-step reasoning required in real-world scenarios. We propose GRADE, a novel evaluation framework that models task difficulty along two dimensions.
arXiv Detail & Related papers (2025-08-23T11:26:41Z)
Light-IF: Endowing LLMs with Generalizable Reasoning via Preview and Self-Checking for Complex Instruction Following [10.119219532863767]
Lazy reasoning during the thinking stage is the primary factor contributing to poor instruction adherence. We propose a comprehensive framework designed to enable rigorous reasoning processes involving preview and self-checking. Our Light-IF-32B model surpasses both larger open-source models such as DeepSeek-R1 and closed-source models like Doubao-1.6.
arXiv Detail & Related papers (2025-08-05T07:42:00Z)
Decomposing the Entropy-Performance Exchange: The Missing Keys to Unlocking Effective Reinforcement Learning [106.68304931854038]
Reinforcement learning with verifiable rewards (RLVR) has been widely used for enhancing the reasoning abilities of large language models (LLMs). We conduct a systematic empirical analysis of the entropy-performance exchange mechanism of RLVR across different levels of granularity. Our analysis reveals that, in the rising stage, entropy reduction in negative samples facilitates the learning of effective reasoning patterns. In the plateau stage, learning efficiency strongly correlates with high-entropy tokens present in low-perplexity samples and those located at the end of sequences.
arXiv Detail & Related papers (2025-08-04T10:08:10Z)
TACO: Think-Answer Consistency for Optimized Long-Chain Reasoning and Efficient Data Learning via Reinforcement Learning in LVLMs [50.820065021136024]
DeepSeek R1 has significantly advanced complex reasoning for large language models (LLMs). Recent methods have attempted to replicate R1's reasoning capabilities in multimodal settings. We propose TACO, a novel reinforcement learning algorithm for visual reasoning.
arXiv Detail & Related papers (2025-05-27T06:30:48Z)
Incentivizing Dual Process Thinking for Efficient Large Language Model Reasoning [75.04643265875072]
Large reasoning models (LRMs) have demonstrated strong performance on complex reasoning tasks, but often suffer from overthinking. Inspired by the dual process theory in cognitive science, we propose Adaptive Cognition Policy Optimization (ACPO). ACPO enables LRMs to achieve efficient reasoning through adaptive cognitive allocation and dynamic system switch.
arXiv Detail & Related papers (2025-05-22T07:15:08Z)
Unlocking the Potential of Difficulty Prior in RL-based Multimodal Reasoning [69.64809103333839]
We investigate how explicitly modeling a problem's difficulty prior shapes the effectiveness of reinforcement-learning-based fine-tuning for multimodal reasoning. Our approach demonstrates strong performance across various multimodal mathematical reasoning benchmarks with only 2K+0.6K two-stage training data.
arXiv Detail & Related papers (2025-05-19T15:43:10Z)
a1: Steep Test-time Scaling Law via Environment Augmented Generation [45.19240207975418]
Environment Augmented Generation (EAG) is a framework that enhances large language models' reasoning through real-time environmental feedback. EAG enables deliberate backtracking and strategic replanning through tight integration of execution feedback and branching exploration. The A1-32B model achieves state-of-the-art performance among similar-sized models across all benchmarks.
arXiv Detail & Related papers (2025-04-20T12:55:59Z)
A Survey of Efficient Reasoning for Large Reasoning Models: Language, Multimodality, and Beyond [88.5807076505261]
Large Reasoning Models (LRMs) have demonstrated strong performance gains by scaling up the length of Chain-of-Thought (CoT) reasoning during inference. A growing concern lies in their tendency to produce excessively long reasoning traces. This inefficiency introduces significant challenges for training, inference, and real-world deployment.
arXiv Detail & Related papers (2025-03-27T15:36:30Z)

This list is automatically generated from the titles and abstracts of the papers in this site.