StyleBench: Evaluating thinking styles in Large Language Models
- URL: http://arxiv.org/abs/2509.20868v1
- Date: Thu, 25 Sep 2025 08:00:39 GMT
- Title: StyleBench: Evaluating thinking styles in Large Language Models
- Authors: Junyu Guo, Shangding Gu, Ming Jin, Costas Spanos, Javad Lavaei
- Abstract summary: We introduce StyleBench, a comprehensive benchmark for evaluating reasoning styles across diverse tasks and models. We assess five representative reasoning styles: Chain of Thought (CoT), Tree of Thought (ToT), Algorithm of Thought (AoT), Sketch of Thought (SoT), and Chain-of-Draft (CoD). Our large-scale analysis reveals that no single style is universally optimal.
- Score: 19.324830531710024
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The effectiveness of Large Language Models (LLMs) is heavily influenced by the reasoning strategies, or styles of thought, employed in their prompts. However, the interplay between these reasoning styles, model architecture, and task type remains poorly understood. To address this, we introduce StyleBench, a comprehensive benchmark for systematically evaluating reasoning styles across diverse tasks and models. We assess five representative reasoning styles, including Chain of Thought (CoT), Tree of Thought (ToT), Algorithm of Thought (AoT), Sketch of Thought (SoT), and Chain-of-Draft (CoD), on five reasoning tasks, using 15 open-source models from major families (LLaMA, Qwen, Mistral, Gemma, GPT-OSS, Phi, and DeepSeek) ranging from 270M to 120B parameters. Our large-scale analysis reveals that no single style is universally optimal. We demonstrate that strategy efficacy is highly contingent on both model scale and task type: search-based methods (AoT, ToT) excel in open-ended problems but require large-scale models, while concise styles (SoT, CoD) achieve radical efficiency gains on well-defined tasks. Furthermore, we identify key behavioral patterns: smaller models frequently fail to follow output instructions and default to guessing, while reasoning robustness emerges as a function of scale. Our findings offer a crucial roadmap for selecting optimal reasoning strategies based on specific constraints. We open-source the benchmark at https://github.com/JamesJunyuGuo/Style_Bench.
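The abstract describes benchmarking prompt-level reasoning styles. A minimal sketch of how style-specific prompt templates might be applied is shown below; the style names come from the abstract, but the template wording and function names are illustrative assumptions, not the benchmark's actual prompts.

```python
# Hypothetical sketch: rendering a question under different reasoning-style
# prompt templates (CoT, SoT, CoD). Template text is an assumption for
# illustration only -- it is not taken from StyleBench itself.
STYLE_TEMPLATES = {
    "CoT": "Q: {question}\nLet's think step by step.",
    "SoT": "Q: {question}\nSketch only the key steps, then give the answer.",
    "CoD": "Q: {question}\nWrite a minimal chain of short drafts, then answer.",
}

def build_prompt(style: str, question: str) -> str:
    """Render the prompt for a given reasoning style."""
    return STYLE_TEMPLATES[style].format(question=question)

prompt = build_prompt("CoT", "What is 17 * 24?")
```

Evaluating a model then reduces to looping over styles, tasks, and model checkpoints and scoring each rendered prompt's completion.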
Related papers
- Plan before Solving: Problem-Aware Strategy Routing for Mathematical Reasoning with LLMs [49.995906301946]
Existing methods usually leverage a fixed strategy to guide Large Language Models (LLMs) to perform mathematical reasoning. Our analysis reveals that a single strategy cannot adapt to problem-specific requirements and thus overlooks the trade-off between effectiveness and efficiency. We propose Planning and Routing through Instance-Specific Modeling (PRISM), a novel framework that decouples mathematical reasoning into two stages: strategy planning and targeted execution.
arXiv Detail & Related papers (2025-09-29T07:22:41Z)
- Prompting Strategies for Language Model-Based Item Generation in K-12 Education: Bridging the Gap Between Small and Large Language Models [5.584522240405349]
This study explores automatic item generation (AIG) using language models to create multiple-choice questions (MCQs) for morphological assessment. We evaluated seven structured prompting strategies, including zero-shot, few-shot, chain-of-thought, role-based, sequential, and combinations. Results show that structured prompting, especially strategies combining chain-of-thought and sequential design, significantly improved Gemma's outputs.
arXiv Detail & Related papers (2025-08-27T18:54:32Z)
- JT-Math: A Multi-Stage Framework for Advanced Mathematical Reasoning in Large Language Models [15.054016881151254]
We introduce JT-Math-8B, a series of open-source models comprising base, instruct, and thinking versions. The Instruct Model is optimized for direct, concise answers through Supervised Fine-Tuning (SFT) and a GRPO-based reinforcement learning (RL) method. The Thinking Model is trained for complex problem-solving using a Long Chain-of-Thought (Long CoT) approach, combining SFT with a novel, multi-stage RL curriculum.
arXiv Detail & Related papers (2025-07-26T02:45:10Z)
- Fractional Reasoning via Latent Steering Vectors Improves Inference Time Compute [60.151643048803145]
We propose Fractional Reasoning, a framework that enables continuous control over reasoning intensity at inference time. Our method operates by extracting the latent steering vector associated with deeper reasoning and reapplying it with a tunable scaling factor. Experiments on GSM8K, MATH500, and GPQA demonstrate that Fractional Reasoning consistently improves performance across diverse reasoning tasks and models.
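The core operation described in this abstract, reapplying a steering vector with a tunable scale, amounts to adding a scaled direction to a hidden state. A toy sketch under that reading (the vectors here are made-up values, and the real method operates on actual model activations):

```python
# Illustrative sketch of latent steering with a tunable scaling factor,
# as described in the Fractional Reasoning abstract: h' = h + alpha * v.
# The hidden state and steering vector below are toy values (assumption).
def steer(hidden, vector, alpha):
    """Re-apply an extracted steering vector with scaling factor alpha."""
    return [h + alpha * v for h, v in zip(hidden, vector)]

h = [0.5, -1.0, 2.0]          # toy hidden state
v = [1.0, 0.0, -0.5]          # toy "deeper reasoning" direction
steered = steer(h, v, alpha=0.8)
```

Varying `alpha` continuously is what gives "fractional" control over reasoning intensity, rather than an on/off prompt switch.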
arXiv Detail & Related papers (2025-06-18T21:15:59Z)
- The CoT Encyclopedia: Analyzing, Predicting, and Controlling how a Reasoning Model will Think [81.38614558541772]
We introduce the CoT Encyclopedia, a framework for analyzing and steering model reasoning. Our method automatically extracts diverse reasoning criteria from model-generated CoTs. We show that this framework produces more interpretable and comprehensive analyses than existing methods.
arXiv Detail & Related papers (2025-05-15T11:31:02Z)
- Ada-R1: Hybrid-CoT via Bi-Level Adaptive Reasoning Optimization [86.56120216550232]
We propose a novel two-stage framework for adaptive and efficient reasoning. First, we construct a hybrid reasoning model by merging long and short CoT models. Second, we apply bi-level preference training to guide the model to select suitable reasoning styles.
arXiv Detail & Related papers (2025-04-30T14:01:45Z)
- Visualizing Thought: Conceptual Diagrams Enable Robust Combinatorial Planning in LMMs [59.66595230543127]
Conceptual diagrams (e.g., a sketch drawn by a human to aid reasoning) externalize mental models. Large Language Models (LLMs) and Large Multimodal Models (LMMs) predominantly reason through text. We propose Visual Thinking, a zero-shot framework that enables LMMs to reason through multiple chains of (self-generated) conceptual diagrams.
arXiv Detail & Related papers (2025-03-14T18:27:02Z)
- Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning [113.49074603075032]
Recent studies have shown that making a model spend more time thinking through longer Chains of Thought (CoTs) enables it to gain significant improvements in complex reasoning tasks. We explore whether scaling with longer CoTs can indeed impair the reasoning performance of Large Language Models (LLMs) in certain domains.
arXiv Detail & Related papers (2025-02-25T10:48:05Z)
- Large Language Models Are Reasoning Teachers [9.290757451344673]
Fine-tune-CoT is a method that generates reasoning samples from very large teacher models to fine-tune smaller models.
We find that Fine-tune-CoT enables substantial reasoning capability in small models, far outperforming prompt-based baselines and even the teacher model in many tasks.
arXiv Detail & Related papers (2022-12-20T08:24:45Z)
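The Fine-tune-CoT entry above describes a teacher-to-student distillation pipeline: a large teacher generates rationales, and only those whose final answer is correct become fine-tuning data for a smaller model. A minimal sketch under that reading; `teacher` is a stand-in callable and the data format is an assumption, not the paper's exact implementation:

```python
# Hypothetical sketch of the Fine-tune-CoT pipeline from the abstract:
# sample rationales from a teacher, keep answer-correct ones, and collect
# them as (prompt, completion) pairs for fine-tuning a smaller student.
def build_finetune_set(questions, answers, teacher, samples_per_q=4):
    """Keep only teacher rationales whose final answer matches the reference."""
    dataset = []
    for q, gold in zip(questions, answers):
        for _ in range(samples_per_q):
            rationale, pred = teacher(q)
            if pred == gold:  # filter by answer correctness
                dataset.append({"prompt": q, "completion": rationale})
    return dataset

# Toy teacher that always answers correctly, for demonstration only.
demo = build_finetune_set(["2+2?"], ["4"], lambda q: ("2 plus 2 is 4.", "4"),
                          samples_per_q=2)
```

The correctness filter is the key design choice: it lets a noisy teacher still yield a clean supervised dataset.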
This list is automatically generated from the titles and abstracts of the papers on this site. This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.