Chain of Simulation: A Dual-Mode Reasoning Framework for Large Language Models with Dynamic Problem Routing
- URL: http://arxiv.org/abs/2602.02842v1
- Date: Mon, 02 Feb 2026 21:44:01 GMT
- Title: Chain of Simulation: A Dual-Mode Reasoning Framework for Large Language Models with Dynamic Problem Routing
- Authors: Saeid Sheikhi
- Abstract summary: Chain of Simulation (CoS) is a novel dual-mode reasoning framework that dynamically routes problems to specialized reasoning strategies. CoS employs three distinct reasoning modes: computational flow with self-consistency for mathematical problems, symbolic state tracking with JSON representations for spatial reasoning, and hybrid fact-extraction for multi-hop inference.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present Chain of Simulation (CoS), a novel dual-mode reasoning framework that dynamically routes problems to specialized reasoning strategies in Large Language Models (LLMs). Unlike existing uniform prompting approaches, CoS employs three distinct reasoning modes: (1) computational flow with self-consistency for mathematical problems, (2) symbolic state tracking with JSON representations for spatial reasoning, and (3) hybrid fact-extraction for multi-hop inference. Through comprehensive evaluation on GSM8K, StrategyQA, and bAbI benchmarks using four state-of-the-art models (Gemma-3 27B, LLaMA-3.1 8B, Mistral 7B, and Qwen-2.5 14B), we demonstrate that CoS achieves 71.5% accuracy on GSM8K (1.0% absolute improvement), 90.0% on StrategyQA (2.5% improvement), and 19.0% on bAbI (65.2% relative improvement) compared to the strongest baselines. The analysis reveals that problem-specific mode selection is crucial, with computational mode achieving 81.2% accuracy when correctly applied to mathematical problems, while misrouting leads to 0% accuracy. We provide detailed algorithms for mode selection, state tracking, and answer extraction, establishing CoS as an effective approach for improving LLM reasoning without additional training. The framework provides superior trade-offs between accuracy and efficiency compared to Self-Consistency, achieving comparable performance at 54% lower computational cost.
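The abstract's central idea, routing each problem to one of three specialized reasoning modes before prompting, can be illustrated with a small dispatcher. This is a hedged sketch: the keyword heuristics, function names, and example problems below are illustrative assumptions, not the paper's actual mode-selection algorithm.

```python
import re

# Three reasoning modes named in the abstract; the routing rules here
# are invented keyword heuristics for illustration only.
MODES = ("computational", "symbolic", "hybrid")

def route(problem: str) -> str:
    """Pick a reasoning mode for a problem statement (heuristic sketch)."""
    text = problem.lower()
    # Math word problems: digits plus quantity language -> computational flow
    if re.search(r"\d", text) and any(w in text for w in ("how many", "total", "cost", "sum")):
        return "computational"
    # Spatial / state-tracking cues -> symbolic mode with JSON state
    if any(w in text for w in ("left of", "north", "room", "moved", "picked up")):
        return "symbolic"
    # Everything else -> hybrid fact-extraction for multi-hop inference
    return "hybrid"

print(route("Tom buys 3 apples at $2 each. How many dollars in total?"))
print(route("Mary moved to the kitchen. Where is Mary?"))
print(route("Did Aristotle use a laptop?"))
```

As the abstract's misrouting result (0% accuracy under the wrong mode) suggests, the quality of this dispatch step dominates end-to-end performance, which is why the paper devotes a dedicated algorithm to it.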
Related papers
- ACAR: Adaptive Complexity Routing for Multi-Model Ensembles with Auditable Decision Traces [3.151184728006369]
We present ACAR, a measurement framework for studying multi-model orchestration under auditable conditions. ACAR uses self-consistency variance (sigma) computed from N=3 probe samples to route tasks across single-model, two-model, and three-model execution modes. We evaluate ACAR on 1,510 tasks spanning four benchmarks, producing more than 7,550 auditable runs.
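The variance-based routing described above can be sketched in a few lines. The thresholds and score representation are illustrative assumptions; the abstract only states that disagreement among N=3 probe samples drives the choice of execution mode.

```python
import statistics

# Hypothetical sketch of ACAR-style routing: measure disagreement among
# N=3 probe samples and escalate to larger ensembles when probes disagree.
# The lo/hi thresholds are invented for illustration.
def route_by_variance(probe_scores, lo=0.01, hi=0.1):
    sigma2 = statistics.pvariance(probe_scores)  # self-consistency variance
    if sigma2 < lo:
        return "single-model"   # probes agree -> cheapest mode suffices
    if sigma2 < hi:
        return "two-model"
    return "three-model"        # high disagreement -> full ensemble

print(route_by_variance([0.9, 0.9, 0.91]))  # near-identical probes
print(route_by_variance([0.1, 0.9, 0.5]))   # strongly disagreeing probes
```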
arXiv Detail & Related papers (2026-02-06T23:27:17Z)
- PRIME: Policy-Reinforced Iterative Multi-agent Execution for Algorithmic Reasoning in Large Language Models [5.598141218271656]
Large language models have demonstrated remarkable capabilities across diverse reasoning tasks, yet their performance on algorithmic reasoning remains limited. We propose PRIME, a framework comprising three specialized agents: an executor for step-by-step reasoning, a verifier for constraint checking, and a coordinator for backtracking control. For comprehensive evaluation, we introduce PRIME-Bench, the largest algorithmic reasoning benchmark to date, comprising 86 tasks across 12 categories with 51,600 instances.
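The executor/verifier/coordinator loop with backtracking can be sketched as a small search. The role names come from the abstract; the constraint and problem below are toy stand-ins, not PRIME's actual agents or tasks.

```python
# Toy sketch of a PRIME-style loop (role names from the abstract):
# the executor proposes next steps, the verifier checks constraints,
# and the coordinator backtracks from rejected branches.
def executor(partial, options):
    return [partial + [o] for o in options]      # one reasoning step per option

def verifier(partial, limit):
    return sum(partial) <= limit                 # illustrative constraint check

def coordinator(limit, options, target, partial=()):
    partial = list(partial)
    if sum(partial) == target:
        return partial                           # goal reached
    for cand in executor(partial, options):
        if verifier(cand, limit):                # prune constraint violations
            found = coordinator(limit, options, target, cand)
            if found is not None:
                return found
    return None                                  # dead end: backtrack

print(coordinator(limit=7, options=[5, 3, 2], target=7))
```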
arXiv Detail & Related papers (2026-01-19T07:57:01Z)
- Scaling Trends for Multi-Hop Contextual Reasoning in Mid-Scale Language Models [0.0]
We present a controlled study of multi-hop contextual reasoning in large language models. We show that multi-agent systems exhibit the inverse pattern, achieving up to 80% on reasoning tasks where rule-based methods fail.
arXiv Detail & Related papers (2026-01-06T20:18:55Z)
- CoT-X: An Adaptive Framework for Cross-Model Chain-of-Thought Transfer and Optimization [5.857877898558651]
Chain-of-Thought (CoT) reasoning enhances the problem-solving ability of large language models (LLMs) but leads to substantial inference overhead. This paper investigates efficient CoT transfer across models of different scales and architectures through an adaptive reasoning summarization framework.
arXiv Detail & Related papers (2025-11-07T22:35:31Z) - Once Upon an Input: Reasoning via Per-Instance Program Synthesis [19.86168542588911]
We introduce Per-Instance Program Synthesis (PIPS), a method that generates and refines programs at the instance level using structural feedback. To further improve performance, PIPS incorporates a confidence metric that dynamically chooses between direct inference and program synthesis on a per-instance basis.
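The per-instance choice between direct inference and program synthesis reduces to a thresholded dispatch. This is a minimal sketch under stated assumptions: the threshold value and both strategy functions are hypothetical stand-ins, since the abstract does not specify how confidence is computed.

```python
# Sketch of PIPS's per-instance decision (details assumed): answer directly
# when confidence is high, otherwise synthesize and run a program.
def solve(instance, direct_confidence, threshold=0.8):
    if direct_confidence >= threshold:
        return ("direct", direct_answer(instance))
    return ("program", run_synthesized_program(instance))

# Hypothetical stand-ins for the two strategies.
def direct_answer(instance):
    return f"answer({instance})"

def run_synthesized_program(instance):
    # In PIPS this program would be generated and refined with
    # structural feedback; here it is a placeholder.
    return f"program_result({instance})"

print(solve("q1", 0.95))  # confident -> direct inference
print(solve("q2", 0.40))  # uncertain -> per-instance program
```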
arXiv Detail & Related papers (2025-10-26T21:58:33Z) - Think Right: Learning to Mitigate Under-Over Thinking via Adaptive, Attentive Compression [68.69801176669843]
We propose an online post-training RL method that prunes redundant steps and estimates difficulty. TRAAC (Think Right with Adaptive, Attentive Compression) achieves an average absolute accuracy gain of 8.4%. Although our models are trained on math datasets, they show accuracy and efficiency gains on out-of-distribution non-math datasets.
arXiv Detail & Related papers (2025-10-02T02:00:20Z) - From Static to Dynamic: Adaptive Monte Carlo Search for Mathematical Process Supervision [49.59309446816251]
Existing methods estimate the quality of reasoning steps with a fixed-budget sampling strategy. We propose Adaptive Monte Carlo Search (AMCS), a framework that transforms data generation from fixed and static to adaptive. AMCS adaptively refines estimation by allocating more samples to uncertain reasoning steps while using fewer samples for those that are easier to estimate.
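The core idea of adaptive sample allocation can be sketched with a stopping rule: keep sampling a step's success rate until the estimate stabilizes. The stopping criterion (binomial standard error) and budget parameters below are illustrative assumptions, not AMCS's actual algorithm.

```python
import random

def estimate_step_quality(rollout, min_n=4, max_n=64, tol=0.05):
    """Adaptively sample a step's success rate until the estimate is stable."""
    wins = 0
    for n in range(1, max_n + 1):
        wins += rollout()                  # one Monte Carlo rollout (0 or 1)
        p = wins / n
        # Stop early once the binomial standard error drops below tol.
        if n >= min_n and (p * (1 - p) / n) ** 0.5 < tol:
            break
    return p, n

easy = lambda: 1                           # deterministic step
hard = lambda: random.random() < 0.5       # highly uncertain step
print(estimate_step_quality(easy))         # stops at the minimum sample budget
print(estimate_step_quality(hard))         # keeps sampling while uncertain
```

Easy steps terminate at the minimum budget, while a near-50/50 step exhausts the maximum budget, which is exactly the fixed-vs-adaptive contrast the abstract describes.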
arXiv Detail & Related papers (2025-09-29T06:52:35Z) - Learning Adaptive Parallel Reasoning with Language Models [70.1745752819628]
We propose Adaptive Parallel Reasoning (APR), a novel reasoning framework that enables language models to orchestrate both serialized and parallel computations end-to-end. APR generalizes existing reasoning methods by enabling adaptive multi-threaded inference using spawn() and join() operations. A key innovation is our end-to-end reinforcement learning strategy, optimizing both parent and child inference threads to enhance task success rate without requiring predefined reasoning structures.
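The spawn()/join() pattern named in the abstract maps naturally onto ordinary thread pools. This sketch borrows only those two operation names; the child "reasoning call" is a placeholder, and APR's learned policy for deciding when to spawn is not modeled.

```python
from concurrent.futures import ThreadPoolExecutor

# Minimal sketch of APR-style orchestration: a parent thread spawns child
# reasoning calls in parallel and joins their results before continuing.
def spawn(executor, subproblem):
    return executor.submit(solve_child, subproblem)

def join(futures):
    return [f.result() for f in futures]   # block until all children finish

def solve_child(subproblem):
    return f"solved:{subproblem}"          # stand-in for a child LLM call

with ThreadPoolExecutor(max_workers=3) as ex:
    children = [spawn(ex, s) for s in ("factor 91", "factor 77")]
    print(join(children))
```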
arXiv Detail & Related papers (2025-04-21T22:29:02Z) - Inference Scaling vs Reasoning: An Empirical Analysis of Compute-Optimal LLM Problem-Solving [0.0]
Recent advances in large language models (LLMs) have predominantly focused on maximizing accuracy and reasoning capabilities. This paper investigates the potential synergy between reasoning enhancement and computational efficiency by analyzing the integration of two contrasting approaches.
arXiv Detail & Related papers (2024-12-20T08:42:45Z) - Building Math Agents with Multi-Turn Iterative Preference Learning [56.71330214021884]
This paper studies the complementary direct preference learning approach to further improve model performance. Existing direct preference learning algorithms are originally designed for the single-turn chat task. We introduce a multi-turn direct preference learning framework tailored for this context.
arXiv Detail & Related papers (2024-09-04T02:41:04Z) - Advancing LLM Reasoning Generalists with Preference Trees [119.57169648859707]
We introduce Eurus, a suite of large language models (LLMs) optimized for reasoning.
Eurus models achieve state-of-the-art results among open-source models on a diverse set of benchmarks.
arXiv Detail & Related papers (2024-04-02T16:25:30Z) - Cumulative Reasoning with Large Language Models [12.267474250936123]
Cumulative Reasoning (CR) is a structured framework that enhances the problem-solving of large language models (LLMs). CR orchestrates LLMs in three distinct roles (Proposer, Verifier(s), and Reporter) to systematically decompose tasks, generate and validate intermediate reasoning steps, and compose them into a solution.
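The propose-verify-accumulate loop can be sketched with toy arithmetic in place of LLM calls. The three role names come from the abstract; the propose and verify rules below are illustrative, chosen only to show verified intermediate steps accumulating into a final answer.

```python
# Toy sketch of Cumulative Reasoning's three roles (role names from the
# abstract; the arithmetic here is illustrative, not the paper's prompting).
def proposer(facts):
    # Propose a new derived fact from the accumulated context.
    return facts[-1] + facts[-2]

def verifier(facts, candidate):
    return candidate > facts[-1]     # accept only strictly growing steps

def reporter(facts):
    return facts[-1]                 # compose accepted steps into an answer

facts = [1, 1]                       # initial premises
for _ in range(5):                   # accumulate verified intermediate steps
    step = proposer(facts)
    if verifier(facts, step):
        facts.append(step)
print(reporter(facts))               # Fibonacci-style accumulation -> 13
```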
arXiv Detail & Related papers (2023-08-08T16:18:20Z)
- Towards a Competitive End-to-End Speech Recognition for CHiME-6 Dinner Party Transcription [73.66530509749305]
In this paper, we argue that, even in difficult cases, some end-to-end approaches show performance close to the hybrid baseline.
We experimentally compare and analyze CTC-Attention versus RNN-Transducer approaches along with RNN versus Transformer architectures.
Our best end-to-end model based on RNN-Transducer, together with improved beam search, reaches quality only 3.8% absolute WER worse than the LF-MMI TDNN-F CHiME-6 Challenge baseline.
arXiv Detail & Related papers (2020-04-22T19:08:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.