The Imitation Game: Turing Machine Imitator is Length Generalizable Reasoner
- URL: http://arxiv.org/abs/2507.13332v1
- Date: Thu, 17 Jul 2025 17:50:07 GMT
- Title: The Imitation Game: Turing Machine Imitator is Length Generalizable Reasoner
- Authors: Zhouqi Hua, Wenwei Zhang, Chengqi Lyu, Yuzhe Gu, Songyang Gao, Kuikun Liu, Kai Chen,
- Abstract summary: This paper proposes Turing MAchine Imitation Learning (TAIL) to improve the length generalization ability of large language models. TAIL synthesizes chain-of-thought (CoT) data that imitates the execution process of a Turing Machine by computer programs. Without bells and whistles, TAIL significantly improves the length generalization ability as well as the performance of Qwen2.5-7B on various tasks.
- Score: 16.32735852049866
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Length generalization, the ability to solve problems with longer sequences than those observed during training, poses a core challenge for Transformer-based large language models (LLMs). Although existing studies have predominantly focused on data-driven approaches for arithmetic operations and symbolic manipulation tasks, these approaches tend to be task-specific with limited overall performance. To pursue a more general solution, this paper focuses on the broader class of computable reasoning problems, i.e., problems that algorithms can solve and that a Turing Machine can therefore solve. From this perspective, this paper proposes Turing MAchine Imitation Learning (TAIL) to improve the length generalization ability of LLMs. TAIL uses computer programs to synthesize chain-of-thought (CoT) data that imitates the execution process of a Turing Machine: it linearly expands the reasoning steps into atomic states to alleviate shortcut learning, and adds an explicit memory fetch mechanism to reduce the difficulty of dynamic, long-range data access in elementary operations. To validate the reliability and universality of TAIL, we construct a challenging synthetic dataset covering 8 classes of algorithms and 18 tasks. Without bells and whistles, TAIL significantly improves both the length generalization ability and the performance of Qwen2.5-7B on various tasks using only synthetic data, surpassing previous methods and DeepSeek-R1. The experimental results reveal that the key concepts of the Turing Machine, rather than the thinking styles, are indispensable to TAIL's length generalization; with them, the model exhibits read-and-write behaviors in its attention layers consistent with the properties of the Turing Machine. This work provides a promising direction for future research on learning LLM reasoning from synthetic data.
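To make the synthesis procedure described in the abstract concrete, below is a minimal Python sketch (an illustration under our own assumptions, not the authors' released pipeline; the function name and trace format are hypothetical). It simulates a one-tape Turing Machine for binary increment and serializes every transition as an atomic step with an explicit tape read and write, the two ingredients the abstract emphasizes: linearly expanded atomic states and an explicit memory fetch.

```python
# Sketch of TAIL-style CoT synthesis: simulate a Turing Machine (binary
# increment) and record every transition as an atomic textual step.

def synthesize_increment_trace(bits: str) -> list[str]:
    """Simulate binary increment on a tape, one trace line per atomic step."""
    tape = list(bits)
    head = len(tape) - 1          # start at the least-significant bit
    state = "CARRY"
    trace = []
    while state != "HALT":
        symbol = tape[head] if head >= 0 else "_"   # explicit memory fetch
        trace.append(f"state={state} head={head} read={symbol}")
        if symbol == "1":
            tape[head] = "0"                         # write 0, keep carrying
            trace.append(f"write=0 move=LEFT tape={''.join(tape)}")
            head -= 1
        elif symbol == "0":
            tape[head] = "1"                         # write 1, carry resolved
            trace.append(f"write=1 tape={''.join(tape)}")
            state = "HALT"
        else:                                        # fell off the left edge
            tape.insert(0, "1")
            head = 0
            trace.append(f"write=1 (extend tape) tape={''.join(tape)}")
            state = "HALT"
    trace.append(f"answer={''.join(tape)}")
    return trace

if __name__ == "__main__":
    for line in synthesize_increment_trace("1011"):
        print(line)
```

Running the sketch on the input 1011 prints a step-by-step trace ending in answer=1100; TAIL-style training data would pair such machine-generated traces with questions at many input lengths, so the model sees the same atomic read/write mechanics regardless of sequence length.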
Related papers
- FOL-Pretrain: A complexity annotated corpus of first-order logic [16.061040115094592]
Transformer-based large language models (LLMs) have demonstrated remarkable reasoning capabilities. Despite recent efforts to reverse-engineer LLM behavior, our understanding of how these models internalize and execute complex algorithms remains limited. We introduce a large-scale, fully open, complexity-annotated dataset of first-order logic reasoning traces.
arXiv Detail & Related papers (2025-05-20T21:38:28Z) - RV-Syn: Rational and Verifiable Mathematical Reasoning Data Synthesis based on Structured Function Library [58.404895570822184]
RV-Syn is a novel mathematical synthesis approach. It generates graphs as solutions by combining Python-formatted functions from this library. Based on the constructed graph, we achieve solution-guided logic-aware problem generation.
arXiv Detail & Related papers (2025-04-29T04:42:02Z) - Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs [76.43407125275202]
o1-like models can emulate human-like long-time thinking during inference. This paper presents the first comprehensive study on the prevalent issue of overthinking in these models. We propose strategies to mitigate overthinking, streamlining reasoning processes without compromising accuracy.
arXiv Detail & Related papers (2024-12-30T18:55:12Z) - Learning Linear Attention in Polynomial Time [115.68795790532289]
We provide the first results on learnability of single-layer Transformers with linear attention.
We show that linear attention may be viewed as a linear predictor in a suitably defined RKHS.
We show how to efficiently identify training datasets for which every empirical risk minimizer is equivalent to the linear Transformer.
arXiv Detail & Related papers (2024-10-14T02:41:01Z) - Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data [76.90128359866462]
We introduce an extended concept of memorization, distributional memorization, which measures the correlation between the output probabilities and the pretraining data frequency. This study demonstrates that memorization plays a larger role in simpler, knowledge-intensive tasks, while generalization is the key for harder, reasoning-based tasks.
arXiv Detail & Related papers (2024-07-20T21:24:40Z) - Discovering physical laws with parallel combinatorial tree search [57.05912962368898]
Symbolic regression plays a crucial role in scientific research thanks to its capability of discovering concise and interpretable mathematical expressions from data. Existing algorithms have faced a critical bottleneck of accuracy and efficiency over a decade. We introduce a parallel combinatorial tree search (PCTS) model to efficiently distill generic mathematical expressions from limited data.
arXiv Detail & Related papers (2024-07-05T10:41:15Z) - Evaluating LLMs' Mathematical and Coding Competency through Ontology-guided Interventions [47.83142414018448]
We focus on two popular reasoning tasks: arithmetic reasoning and code generation.
We introduce (i) a general ontology of perturbations for math and coding questions, (ii) a semi-automatic method to apply these perturbations, and (iii) two datasets.
We show a significant performance drop across all the models against perturbed questions.
arXiv Detail & Related papers (2024-01-17T18:13:07Z) - MARIO: MAth Reasoning with code Interpreter Output -- A Reproducible Pipeline [12.186691561822256]
We postulate that the inherent nature of large language models (LLMs) presents challenges in modeling mathematical reasoning.
This paper introduces a novel math dataset, enhanced with a capability to utilize a Python code interpreter.
We propose a tentative, easily replicable protocol for the fine-tuning of math-specific LLMs.
arXiv Detail & Related papers (2024-01-16T08:08:01Z) - MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning [63.80739044622555]
We introduce MuSR, a dataset for evaluating language models on soft reasoning tasks specified in a natural language narrative.
This dataset has two crucial features. First, it is created through a novel neurosymbolic synthetic-to-natural generation algorithm.
Second, our dataset instances are free text narratives corresponding to real-world domains of reasoning.
arXiv Detail & Related papers (2023-10-24T17:59:20Z) - Auto-Regressive Next-Token Predictors are Universal Learners [17.416520406390415]
We show that even simple models such as linear next-token predictors can approximate any function efficiently computed by a Turing machine.
We also show experimentally that simple next-token predictors, such as linear networks and shallow Multi-Layer Perceptrons (MLPs), display non-trivial performance on text generation and arithmetic tasks.
arXiv Detail & Related papers (2023-09-13T14:15:03Z) - Towards Revealing the Mystery behind Chain of Thought: A Theoretical Perspective [39.47116013338394]
Chain-of-Thought prompting (CoT) can dramatically improve the performance of Large Language Models (LLMs).
We show that CoT can handle a general class of decision-making problems known as Dynamic Programming.
arXiv Detail & Related papers (2023-05-24T17:59:21Z) - Self-Supervised Learning via Maximum Entropy Coding [57.56570417545023]
We propose Maximum Entropy Coding (MEC) as a principled objective that explicitly optimizes the structure of the representation.
MEC learns a more generalizable representation than previous methods based on specific pretext tasks.
It achieves state-of-the-art performance consistently on various downstream tasks, including not only ImageNet linear probe, but also semi-supervised classification, object detection, instance segmentation, and object tracking.
arXiv Detail & Related papers (2022-10-20T17:58:30Z) - PAC-learning gains of Turing machines over circuits and neural networks [1.4502611532302039]
We study the potential gains in sample efficiency that the principle of minimum description length can bring.
We use Turing machines to represent universal models and circuits.
We highlight close relationships between classical open problems in Circuit Complexity and the tightness of these gains.
arXiv Detail & Related papers (2021-03-23T17:03:10Z)