Chain of Thought in Order: Discovering Learning-Friendly Orders for Arithmetic
- URL: http://arxiv.org/abs/2506.23875v1
- Date: Mon, 30 Jun 2025 14:05:53 GMT
- Title: Chain of Thought in Order: Discovering Learning-Friendly Orders for Arithmetic
- Authors: Yuta Sato, Kazuhiko Kawamoto, Hiroshi Kera
- Abstract summary: This study addresses a novel task of unraveling chain of thought - reordering decoder input tokens into a learning-friendly sequence for Transformers to learn arithmetic tasks. The proposed pipeline first trains a Transformer on a mixture of target sequences arranged in different orders and then identifies benign orders as those with fast loss drops in the early stage. Experiments on four order-sensitive arithmetic tasks show that our method identifies a learning-friendly order out of a few billion candidates.
- Score: 5.2980803808373516
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Chain of thought is fundamental to how Transformers perform step-by-step reasoning. Beyond which intermediate steps are used, the order of those steps critically affects how difficult the reasoning is to learn. This study addresses a novel task of unraveling chain of thought - reordering decoder input tokens into a learning-friendly sequence for Transformers to learn arithmetic tasks. The proposed pipeline first trains a Transformer on a mixture of target sequences arranged in different orders and then identifies benign orders as those whose loss drops fastest in the early stage of training. Because the search space grows factorially with sequence length, we propose a two-stage hierarchical approach for inter- and intra-block reordering. Experiments on four order-sensitive arithmetic tasks show that our method identifies a learning-friendly order out of a few billion candidates. Notably, on the multiplication task, it recovered the reverse-digit order reported in prior studies.
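Read from the abstract alone, the pipeline has two ingredients: joint training on a mixture of candidate orders, and ranking the orders by how quickly their loss falls early in training. The sketch below is a minimal, assumption-laden illustration of that idea, not the authors' implementation: the toy 2-digit addition task, the `TinyLM` causal-Transformer stand-in, the order tags 'A'/'B', the padding scheme, and the fixed 200-step "early stage" are all choices of mine for brevity, and only two candidate orders are searched rather than billions.

```python
# Minimal sketch (assumptions, not the paper's code): train one small causal LM
# on a mixture of candidate target orders, then rank the orders by their loss
# after a short "early stage" of joint training.
import random

import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = list("0123456789+=|AB")          # 'A'/'B' tag which candidate order a sample uses
stoi = {ch: i for i, ch in enumerate(VOCAB)}

def make_sample(order: str) -> str:
    """Toy 2-digit addition; order 'A' = plain answer digits, 'B' = reversed digits."""
    a, b = random.randint(0, 99), random.randint(0, 99)
    ans = str(a + b)
    return f"{order}{a}+{b}={ans if order == 'A' else ans[::-1]}|"

class TinyLM(nn.Module):
    """Small causal Transformer language model; a stand-in for the paper's decoder."""
    def __init__(self, d: int = 64):
        super().__init__()
        self.emb = nn.Embedding(len(VOCAB), d)
        layer = nn.TransformerEncoderLayer(d, nhead=4, dim_feedforward=128, batch_first=True)
        self.enc = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d, len(VOCAB))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1))  # causal mask
        return self.head(self.enc(self.emb(x), mask=mask))

def make_batch(orders: list, n: int = 32) -> torch.Tensor:
    seqs = [make_sample(random.choice(orders)) for _ in range(n)]
    width = max(len(s) for s in seqs)
    ids = torch.full((n, width), stoi["|"])   # pad with the terminator, for simplicity
    for i, s in enumerate(seqs):
        ids[i, : len(s)] = torch.tensor([stoi[c] for c in s])
    return ids

def lm_loss(model: nn.Module, ids: torch.Tensor) -> torch.Tensor:
    logits = model(ids[:, :-1])               # next-token prediction loss
    return F.cross_entropy(logits.reshape(-1, len(VOCAB)), ids[:, 1:].reshape(-1))

orders = ["A", "B"]                           # tiny stand-in for billions of candidates
model = TinyLM()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
for step in range(200):                       # "early stage" only: brief joint training
    loss = lm_loss(model, make_batch(orders))
    opt.zero_grad()
    loss.backward()
    opt.step()

# Rank orders by loss after the short joint run (a crude proxy for "fast loss drop").
model.eval()
with torch.no_grad():
    ranking = sorted(orders, key=lambda o: lm_loss(model, make_batch([o], n=256)).item())
print("learning-friendly order candidate:", ranking[0])
```

The ranking step here only compares per-order losses after a fixed warm-up; the paper's actual criterion (the speed of the early loss drop) and its hierarchical inter-/intra-block search over the factorial space are not reproduced in this sketch.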
Related papers
- How Transformers Learn Regular Language Recognition: A Theoretical Study on Training Dynamics and Implicit Bias [48.9399496805422]
We focus on two representative tasks in the category of regular language recognition, known as 'even pairs' and 'parity check'. Our goal is to explore how a one-layer transformer, consisting of an attention layer followed by a linear layer, learns to solve these tasks.
arXiv Detail & Related papers (2025-05-02T00:07:35Z) - Mechanism and Emergence of Stacked Attention Heads in Multi-Layer Transformers [0.0]
I introduce the retrieval problem, a simple yet common reasoning task that can be solved only by transformers with a minimum number of layers. I empirically show that large language models can solve the task under different prompting formulations without any fine-tuning.
arXiv Detail & Related papers (2024-11-18T23:12:13Z) - Algorithmic Capabilities of Random Transformers [49.73113518329544]
We investigate what functions can be learned by randomly initialized transformers in which only the embedding layers are optimized.
We find that these random transformers can perform a wide range of meaningful algorithmic tasks.
Our results indicate that some algorithmic capabilities are present in transformers even before these models are trained.
arXiv Detail & Related papers (2024-10-06T06:04:23Z) - Reverse That Number! Decoding Order Matters in Arithmetic Learning [49.5504492920404]
Our work introduces a novel strategy that reevaluates the digit order by prioritizing output from the least significant digit.
Compared to the previous state-of-the-art (SOTA) method, our findings reveal an overall improvement in accuracy while requiring only a third of the tokens typically used during training; a minimal illustration of the reversed-digit target format appears after this list.
arXiv Detail & Related papers (2024-03-09T09:04:53Z) - Chunk, Align, Select: A Simple Long-sequence Processing Method for Transformers [24.109312575970456]
We propose a simple framework to enable off-the-shelf pre-trained transformers to process much longer sequences.
Our method divides each long-sequence input into a batch of chunks, then aligns the inter-chunk information during the encoding steps.
We learn an effective hidden selection policy, which regards the decoders of transformers as environments.
arXiv Detail & Related papers (2023-08-25T05:52:05Z) - STEPS: A Benchmark for Order Reasoning in Sequential Tasks [16.52934509949172]
We describe the data construction and task formulations, and benchmark most of the significant Large Language Models (LLMs).
The experimental results demonstrate that commonsense reasoning about action orders in sequential tasks is challenging to resolve via zero-shot prompting or few-shot in-context learning.
arXiv Detail & Related papers (2023-06-07T13:58:55Z) - Faith and Fate: Limits of Transformers on Compositionality [109.79516190693415]
We investigate the limits of transformer large language models across three representative compositional tasks.
These tasks require breaking problems down into sub-steps and synthesizing these steps into a precise answer.
Our empirical findings suggest that transformer LLMs solve compositional tasks by reducing multi-step compositional reasoning into linearized subgraph matching.
arXiv Detail & Related papers (2023-05-29T23:24:14Z) - SVIP: Sequence VerIfication for Procedures in Videos [68.07865790764237]
We propose a novel sequence verification task that aims to distinguish positive video pairs performing the same action sequence from negative ones with step-level transformations.
Such a challenging task resides in an open-set setting without prior action detection or segmentation.
We collect a scripted video dataset enumerating all kinds of step-level transformations in chemical experiments.
arXiv Detail & Related papers (2021-12-13T07:03:36Z) - Discovering Non-monotonic Autoregressive Orderings with Variational Inference [67.27561153666211]
We develop an unsupervised parallelizable learner that discovers high-quality generation orders purely from training data.
We implement the encoder as a Transformer with non-causal attention that outputs permutations in one forward pass.
Empirical results in language modeling tasks demonstrate that our method is context-aware and discovers orderings that are competitive with or even better than fixed orders.
arXiv Detail & Related papers (2021-10-27T16:08:09Z) - Topological Sort for Sentence Ordering [133.05105352571715]
We propose a new framing of this task as a constraint solving problem and introduce a new technique to solve it; a generic topological-sort sketch appears after this list.
The results on both automatic and human metrics across four different datasets show that this new technique is better at capturing coherence in documents.
arXiv Detail & Related papers (2020-05-01T15:07:59Z)
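The "Reverse That Number!" entry above, and the main abstract's note that the reverse-digit order was recovered for multiplication, both turn on one formatting choice: emit the answer least-significant-digit first, so each generated digit depends only on digits already produced and the carry propagates in generation order. A minimal sketch of that target format, using a toy encoding of my own rather than either paper's exact scheme:

```python
# Toy illustration of least-significant-digit-first targets for addition.
# The "a+b=answer" encoding below is an assumption, not the papers' format.
def addition_example(a: int, b: int, reverse_digits: bool = True) -> str:
    answer = str(a + b)
    target = answer[::-1] if reverse_digits else answer
    return f"{a}+{b}={target}"

print(addition_example(478, 964))         # 478+964=2441  (answer 1442, digits reversed)
print(addition_example(478, 964, False))  # 478+964=1442  (conventional order)
```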
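The "Topological Sort for Sentence Ordering" entry above frames ordering as constraint solving. One standard way to realize that framing, sketched under the assumption that some upstream model supplies pairwise "i before j" predictions (the constraints below are made up, and the paper's pairwise classifier is not reproduced), is Kahn's topological sort:

```python
# Kahn's algorithm over predicted "sentence i precedes sentence j" constraints.
# The constraint list is illustrative only; a real system would predict it.
from collections import deque

def topological_order(n: int, constraints: list[tuple[int, int]]) -> list[int]:
    """Return an order of n sentences consistent with the constraints (assumes no cycles)."""
    succ = [[] for _ in range(n)]
    indeg = [0] * n
    for i, j in constraints:
        succ[i].append(j)
        indeg[j] += 1
    queue = deque(k for k in range(n) if indeg[k] == 0)
    order = []
    while queue:
        k = queue.popleft()
        order.append(k)
        for j in succ[k]:
            indeg[j] -= 1
            if indeg[j] == 0:
                queue.append(j)
    return order

# 4 sentences; the made-up pairwise predictions imply the order 2, 0, 3, 1.
print(topological_order(4, [(2, 0), (0, 3), (3, 1), (2, 3)]))  # [2, 0, 3, 1]
```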