ThinkPatterns-21k: A Systematic Study on the Impact of Thinking Patterns in LLMs
- URL: http://arxiv.org/abs/2503.12918v1
- Date: Mon, 17 Mar 2025 08:29:04 GMT
- Title: ThinkPatterns-21k: A Systematic Study on the Impact of Thinking Patterns in LLMs
- Authors: Pengcheng Wen, Jiaming Ji, Chi-Min Chan, Juntao Dai, Donghai Hong, Yaodong Yang, Sirui Han, Yike Guo,
- Abstract summary: We conduct a comprehensive analysis of the impact of various thinking types on model performance. We introduce ThinkPatterns-21k, a curated dataset comprising 21k instruction-response pairs. We have two key findings: (1) smaller models (<30B parameters) can benefit from most structured thinking patterns, while structured thinking like decomposition can degrade the performance of larger models (32B).
- Score: 15.798087244817134
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have demonstrated enhanced performance through the \textit{Thinking then Responding} paradigm, where models generate internal thoughts before final responses (aka, System 2 thinking). However, existing research lacks a systematic understanding of the mechanisms underlying how thinking patterns affect performance across model sizes. In this work, we conduct a comprehensive analysis of the impact of various thinking types on model performance and introduce ThinkPatterns-21k, a curated dataset comprising 21k instruction-response (QA) pairs collected from existing instruction-following datasets. For each pair, we augment it with five distinct internal thinking patterns: one unstructured pattern (monologue) and four structured variants (decomposition, self-ask, self-debate and self-critic), while maintaining the same instruction and response. Through extensive evaluation across different model sizes (3B-32B parameters), we have two key findings: (1) smaller models (<30B parameters) can benefit from most structured thinking patterns, while for larger models (32B) structured thinking like decomposition can degrade performance, and (2) unstructured monologue demonstrates broad effectiveness across different model sizes. Finally, we release all of our datasets, checkpoints, and training logs across the diverse thinking patterns to support reproducibility, aiming to facilitate further research in this direction.
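As a rough illustration of the augmentation described above, the sketch below fans one instruction-response pair out into five training examples, one per thinking pattern. The field names and helper function are illustrative assumptions, not the released dataset's actual schema.

```python
# Hypothetical sketch of a ThinkPatterns-21k record: one instruction-response
# pair augmented with five internal thinking patterns. Field names are
# illustrative assumptions, not the released schema.

THINKING_PATTERNS = [
    "monologue",      # unstructured free-form thinking
    "decomposition",  # break the task into sub-problems
    "self-ask",       # pose and answer follow-up questions
    "self-debate",    # argue both sides before deciding
    "self-critic",    # draft, then critique and revise
]

def augment(instruction: str, response: str, thoughts: dict[str, str]) -> list[dict]:
    """Produce one training example per thinking pattern, keeping the
    instruction and final response fixed across all five variants."""
    return [
        {
            "instruction": instruction,
            "thinking_pattern": pattern,
            "internal_thought": thoughts[pattern],
            "response": response,
        }
        for pattern in THINKING_PATTERNS
    ]
```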
Related papers
- Embodied-R: Collaborative Framework for Activating Embodied Spatial Reasoning in Foundation Models via Reinforcement Learning [58.86928947970342]
Embodied-R is a framework combining large-scale Vision-Language Models for perception and small-scale Language Models for reasoning.
After training on only 5k embodied video samples, Embodied-R with a 3B LM matches state-of-the-art multimodal reasoning models.
Embodied-R also exhibits emergent thinking patterns such as systematic analysis and contextual integration.
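A minimal sketch of the collaborative split described here, assuming a generic two-stage pipeline: a large VLM turns raw frames into a scene description, and a small LM reasons over it. The `vlm_generate` and `lm_generate` callables are placeholders, not Embodied-R's actual interface.

```python
# Hypothetical sketch of a perception/reasoning split: a large VLM describes
# the scene, and a small LM reasons over that description. The two generate
# callables stand in for real model APIs and are assumptions.

from typing import Callable

def embodied_answer(
    video_frames: list,
    question: str,
    vlm_generate: Callable[[list, str], str],   # large vision-language model
    lm_generate: Callable[[str], str],          # small reasoning language model
) -> str:
    # Step 1: perception -- the VLM turns raw frames into a text description.
    perception = vlm_generate(video_frames, "Describe the spatial layout of the scene.")
    # Step 2: reasoning -- the small LM answers using only that description.
    prompt = (
        f"Scene description:\n{perception}\n\n"
        f"Question: {question}\nReason step by step, then answer."
    )
    return lm_generate(prompt)
```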
arXiv Detail & Related papers (2025-04-17T06:16:11Z)
- SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models [88.29990536278167]
We introduce SPaR, a self-play framework integrating tree-search self-refinement to yield valid and comparable preference pairs.
Our experiments show that a LLaMA3-8B model, trained over three iterations guided by SPaR, surpasses GPT-4-Turbo on the IFEval benchmark without losing general capabilities.
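One schematic reading of the self-play idea, leaving out the tree search: the model critiques and refines its own failed response, and the refined/original pair becomes a preference pair. All three model calls below are placeholders, not SPaR's actual algorithm.

```python
# Schematic self-refinement loop producing preference pairs, loosely inspired
# by the summary above; SPaR's actual tree-search procedure is more involved.

def make_preference_pair(instruction, generate, judge, refine, max_steps=3):
    """Return (chosen, rejected) or None if the first response already passes.
    `generate`, `judge`, and `refine` are placeholder model calls."""
    response = generate(instruction)
    if judge(instruction, response):          # already follows the instruction
        return None
    rejected = response
    for _ in range(max_steps):                # iteratively repair the response
        response = refine(instruction, response)
        if judge(instruction, response):
            return response, rejected         # minimally-different pair
    return None                               # refinement failed; discard
```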
arXiv Detail & Related papers (2024-12-16T09:47:43Z)
- A NotSo Simple Way to Beat Simple Bench [0.0]
This paper presents a novel framework for enhancing reasoning capabilities in large language models (LLMs).
We propose a multi-step prompting strategy coupled with global consistency checks to improve model accuracy and robustness.
Our results reveal model-specific strengths: Claude excels in maintaining logical consistency, while GPT-4o exhibits exploratory creativity but struggles with ambiguous prompts.
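A compact sketch of one way "multi-step prompting with global consistency checks" could look, assuming a self-consistency-style majority vote over sampled reasoning chains; the paper's actual framework may differ.

```python
# Generic sketch: sample several reasoning chains and take the majority
# answer as a global consistency check. `generate` and `extract_answer`
# are placeholder helpers, not the paper's implementation.

from collections import Counter

def consistent_answer(question: str, generate, extract_answer, n_samples: int = 5) -> str:
    answers = []
    for _ in range(n_samples):
        chain = generate(f"{question}\nThink step by step.")  # one reasoning chain
        answers.append(extract_answer(chain))
    # Keep the answer that the most chains agree on.
    return Counter(answers).most_common(1)[0][0]
```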
arXiv Detail & Related papers (2024-12-12T16:04:31Z)
- Benchmarking Mental State Representations in Language Models [9.318796743761224]
Research into the models' internal representation of mental states remains limited.
Recent work has used probing to demonstrate that LMs can represent beliefs of themselves and others.
We report an extensive benchmark covering various LM types and model sizes.
We are the first to study how prompt variations impact probing performance on theory of mind tasks.
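Probing in this sense usually means fitting a simple classifier on frozen hidden states. The sketch below, a generic illustration rather than this paper's code, trains a logistic-regression probe on one layer's activations.

```python
# Generic probing sketch: train a linear classifier on frozen hidden states
# to test whether a belief label is linearly decodable. Uses scikit-learn;
# extraction of `hidden_states` is assumed to happen elsewhere.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_layer(hidden_states: np.ndarray, labels: np.ndarray) -> float:
    """hidden_states: (n_examples, hidden_dim) activations from one layer.
    Returns held-out probing accuracy for the belief labels."""
    X_train, X_test, y_train, y_test = train_test_split(
        hidden_states, labels, test_size=0.2, random_state=0
    )
    probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return probe.score(X_test, y_test)
```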
arXiv Detail & Related papers (2024-06-25T12:51:06Z)
- Brainstorming Brings Power to Large Language Models of Knowledge Reasoning [17.14501985068287]
Large Language Models (LLMs) have demonstrated amazing capabilities in language generation, text comprehension, and knowledge reasoning.
Recent studies have further improved the model's reasoning ability on a wide range of tasks by introducing multi-model collaboration.
We propose prompt-based multi-model brainstorming, which incorporates different models into a group for brainstorming; after multiple rounds of reasoning elaboration and re-inference, a consensus answer is reached.
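A hedged sketch of a round-based brainstorming loop consistent with this description: each model sees its peers' latest answers, re-reasons, and the loop stops at consensus. The prompting and stopping rule are illustrative assumptions, not the paper's setup.

```python
# Illustrative multi-model brainstorming loop: each round, every model sees
# the others' latest answers and re-reasons; stop when all agree. The
# `models` are placeholder callables taking a prompt and returning an answer.

def brainstorm(question: str, models: list, max_rounds: int = 4) -> str:
    answers = [m(question) for m in models]            # initial independent answers
    for _ in range(max_rounds):
        if len(set(answers)) == 1:                     # consensus reached
            return answers[0]
        context = "\n".join(f"Peer {i}: {a}" for i, a in enumerate(answers))
        answers = [
            m(f"{question}\nOther models answered:\n{context}\nReconsider and answer.")
            for m in models
        ]
    # No full consensus: fall back to the majority answer.
    return max(set(answers), key=answers.count)
```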
arXiv Detail & Related papers (2024-06-02T14:47:14Z)
- The Buffer Mechanism for Multi-Step Information Reasoning in Language Models [52.77133661679439]
Investigating internal reasoning mechanisms of large language models can help us design better model architectures and training strategies.
In this study, we constructed a symbolic dataset to investigate the mechanisms by which Transformer models employ a vertical thinking strategy.
We propose a random-matrix-based algorithm to enhance the model's reasoning ability, resulting in a 75% reduction in the training time required for the GPT-2 model.
arXiv Detail & Related papers (2024-05-24T07:41:26Z)
- Self-Discover: Large Language Models Self-Compose Reasoning Structures [136.48389510481758]
We introduce SELF-DISCOVER, a framework for self-discovering task-intrinsic reasoning structures.
SELF-DISCOVER substantially improves GPT-4 and PaLM 2's performance on challenging reasoning benchmarks.
We show that the self-discovered reasoning structures are universally applicable across model families.
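SELF-DISCOVER composes a task-specific reasoning structure in three prompt-driven stages (select, adapt, implement) before solving. The sketch below mirrors those stages with paraphrased prompts; the module pool and wording are simplified from the paper.

```python
# Prompt-level sketch of SELF-DISCOVER's select/adapt/implement stages.
# Prompt wording is paraphrased; the paper draws from a larger pool of
# seed reasoning modules.

REASONING_MODULES = [
    "Break the problem into sub-problems.",
    "Think about similar problems and their solutions.",
    "Critically evaluate each assumption.",
]

def self_discover(task_examples: str, solve_input: str, generate) -> str:
    modules = "\n".join(REASONING_MODULES)
    # Stage 1: SELECT the modules relevant to this task family.
    selected = generate(f"Select modules useful for these tasks:\n{task_examples}\n{modules}")
    # Stage 2: ADAPT the selected modules to the task's specifics.
    adapted = generate(f"Adapt the selected modules to the task:\n{selected}\n{task_examples}")
    # Stage 3: IMPLEMENT them as an explicit step-by-step reasoning plan.
    structure = generate(f"Turn the adapted modules into a step-by-step reasoning plan:\n{adapted}")
    # Reuse the discovered structure to solve each new instance of the task.
    return generate(f"Follow this reasoning structure to solve:\n{structure}\nTask: {solve_input}")
```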
arXiv Detail & Related papers (2024-02-06T01:13:53Z)
- Evaluating Concurrent Robustness of Language Models Across Diverse Challenge Sets [46.19529338280716]
Language models, characterized by their black-box nature, often hallucinate and display sensitivity to input perturbations.
We introduce a methodology designed to examine how input perturbations affect language models across various scales.
We present three distinct fine-tuning strategies to address robustness against multiple perturbations.
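To give the flavor of input perturbations, here is a minimal sketch of two simple corruptions, one character-level and one word-level; the paper's actual perturbation suite may differ.

```python
# Simple illustrative input perturbations (character swap and word drop).
# These only convey the idea; they are not the paper's perturbation suite.

import random

def swap_adjacent_chars(text: str, rng: random.Random) -> str:
    """Swap one random adjacent character pair (a typo-style perturbation)."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]

def drop_random_word(text: str, rng: random.Random) -> str:
    """Delete one random word (a word-level perturbation)."""
    words = text.split()
    if len(words) < 2:
        return text
    words.pop(rng.randrange(len(words)))
    return " ".join(words)

rng = random.Random(0)
print(swap_adjacent_chars("The cat sat on the mat", rng))
print(drop_random_word("The cat sat on the mat", rng))
```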
arXiv Detail & Related papers (2023-11-15T02:59:10Z)
- Exploring The Landscape of Distributional Robustness for Question Answering Models [47.178481044045505]
Our investigation spans over 350 models and 16 question answering datasets.
We find that, in many cases, model variations do not affect robustness.
We release all evaluations to encourage researchers to further analyze robustness trends for question answering models.
arXiv Detail & Related papers (2022-10-22T18:17:31Z)
- Unifying Language Learning Paradigms [96.35981503087567]
We present a unified framework for pre-training models that are universally effective across datasets and setups.
We show how different pre-training objectives can be cast as one another and how interpolating between different objectives can be effective.
Our model also achieves strong results at in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization.
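One way to picture "casting objectives as one another": a single span-corruption routine whose mean span length and corruption rate are dialed to mimic different denoising objectives. This is a simplified sketch with illustrative parameters, not the paper's implementation.

```python
# Illustrative span corruption: varying mean span length and corruption rate
# interpolates between denoising-style objectives (short spans resemble
# masked LM; very long spans approach prefix-LM behavior). Values are
# illustrative assumptions.

import random

def span_corrupt(tokens, mean_span_len, corruption_rate, rng):
    """Replace random spans with sentinel tokens; return (inputs, targets)."""
    n_corrupt = max(1, int(len(tokens) * corruption_rate))
    inputs, targets, i, sentinel = [], [], 0, 0
    while i < len(tokens):
        if n_corrupt > 0 and rng.random() < corruption_rate:
            span = max(1, int(rng.expovariate(1 / mean_span_len)))
            span = min(span, n_corrupt, len(tokens) - i)
            inputs.append(f"<extra_id_{sentinel}>")       # mask the span in inputs
            targets.append(f"<extra_id_{sentinel}>")      # predict it in targets
            targets.extend(tokens[i:i + span])
            sentinel += 1
            n_corrupt -= span
            i += span
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, targets

rng = random.Random(0)
print(span_corrupt("a b c d e f g h".split(), mean_span_len=2, corruption_rate=0.3, rng=rng))
```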
arXiv Detail & Related papers (2022-05-10T19:32:20Z)
- A Minimalist Dataset for Systematic Generalization of Perception, Syntax, and Semantics [131.93113552146195]
We present a new dataset, Handwritten arithmetic with INTegers (HINT), to examine machines' capability of learning generalizable concepts.
In HINT, machines are tasked with learning how concepts are perceived from raw signals such as images.
We undertake extensive experiments with various sequence-to-sequence models, including RNNs, Transformers, and GPT-3.
arXiv Detail & Related papers (2021-03-02T01:32:54Z)