TQL: Scaling Q-Functions with Transformers by Preventing Attention Collapse
- URL: http://arxiv.org/abs/2602.01439v1
- Date: Sun, 01 Feb 2026 21:10:43 GMT
- Title: TQL: Scaling Q-Functions with Transformers by Preventing Attention Collapse
- Authors: Perry Dong, Kuo-Han Hung, Alexander Swerdlow, Dorsa Sadigh, Chelsea Finn,
- Abstract summary: Transformer Q-Learning unlocks the scaling potential of transformers in learning value functions in reinforcement learning. Our approach yields up to a 43% improvement in performance when scaling from the smallest to the largest network sizes.
- Score: 100.14462819905822
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite scale driving substantial recent advancements in machine learning, reinforcement learning (RL) methods still primarily use small value functions. Naively scaling value functions -- including with a transformer architecture, which is known to be highly scalable -- often results in learning instability and worse performance. In this work, we ask what prevents transformers from scaling effectively for value functions? Through empirical analysis, we identify the critical failure mode in this scaling: attention scores collapse as capacity increases. Our key insight is that we can effectively prevent this collapse and stabilize training by controlling the entropy of the attention scores, thereby enabling the use of larger models. To this end, we propose Transformer Q-Learning (TQL), a method that unlocks the scaling potential of transformers in learning value functions in RL. Our approach yields up to a 43% improvement in performance when scaling from the smallest to the largest network sizes, while prior methods suffer from performance degradation.
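The abstract does not spell out how the entropy of the attention scores is controlled. Below is a minimal sketch of one plausible realization, an entropy bonus added to the TD objective; the function names, coefficient, and loss form are illustrative assumptions, not the paper's exact formulation.

```python
import torch.nn.functional as F

def attention_entropy(attn_weights, eps=1e-8):
    # attn_weights: (batch, heads, queries, keys); each row sums to 1.
    ent = -(attn_weights * (attn_weights + eps).log()).sum(dim=-1)
    return ent.mean()

def regularized_td_loss(q_pred, q_target, attn_weights, entropy_coef=1e-3):
    # Standard TD regression loss plus an entropy bonus on the attention
    # distributions; encouraging higher entropy discourages the scores from
    # collapsing onto a single key as model capacity grows.
    td_loss = F.mse_loss(q_pred, q_target)
    return td_loss - entropy_coef * attention_entropy(attn_weights)
```

Whether the regularizer is applied per layer, per head, or annealed over training is not stated in the abstract; this sketch simply averages over all attention rows.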
Related papers
- Bottleneck Transformer-Based Approach for Improved Automatic STOI Score Prediction [16.426476430697587]
We present a novel approach to predict the Short-Time Objective Intelligibility (STOI) metric using a bottleneck transformer architecture. Our model has shown higher correlation and lower mean squared error for both seen and unseen scenarios.
arXiv Detail & Related papers (2026-02-17T10:46:54Z) - Provable In-Context Learning of Nonlinear Regression with Transformers [66.99048542127768]
In-context learning (ICL) is the ability to perform unseen tasks using task-specific prompts without updating parameters. Recent research has actively explored the training dynamics behind ICL, with much of the focus on relatively simple tasks. This paper investigates more complex nonlinear regression tasks, aiming to uncover how transformers acquire in-context learning capabilities.
arXiv Detail & Related papers (2025-07-28T00:09:28Z) - Kolmogorov-Arnold Transformer [72.88137795439407]
We introduce the Kolmogorov-Arnold Transformer (KAT), a novel architecture that replaces MLP layers with Kolmogorov-Arnold Network (KAN) layers.
We identify three key challenges: (C1) base function, (C2) inefficiency, and (C3) weight initialization.
With these designs, KAT outperforms traditional MLP-based transformers.
arXiv Detail & Related papers (2024-09-16T17:54:51Z) - Investigating Low-Rank Training in Transformer Language Models: Efficiency and Scaling Analysis [16.253898272659242]
This study focuses on Transformer-based LLMs, specifically applying low-rank parametrization to feedforward networks (FFNs).
Experiments on the large RefinedWeb dataset show that low-rank parametrization is both efficient (e.g., a 2.6$\times$ FFN speed-up with 32% of the parameters) and effective during training.
Motivated by this finding, we develop wide and structured networks that surpass current medium- and large-sized Transformers in both perplexity and throughput.
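As a rough illustration of low-rank FFN parametrization (the dimensions, rank, and module structure below are placeholder assumptions, not the paper's configuration), each dense projection can be factorized into two thin linear maps:

```python
import torch.nn as nn

class LowRankFFN(nn.Module):
    # Feedforward block with each dense projection factorized as W ~= A @ B.
    def __init__(self, d_model=1024, d_hidden=4096, rank=128):
        super().__init__()
        self.up = nn.Sequential(nn.Linear(d_model, rank, bias=False),
                                nn.Linear(rank, d_hidden))
        self.down = nn.Sequential(nn.Linear(d_hidden, rank, bias=False),
                                  nn.Linear(rank, d_model))
        self.act = nn.GELU()

    def forward(self, x):
        return self.down(self.act(self.up(x)))
```

For a rank r much smaller than d_model and d_hidden, the per-projection parameter count drops from d_model * d_hidden to roughly r * (d_model + d_hidden), which is the source of the compute and memory savings.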
arXiv Detail & Related papers (2024-07-13T10:08:55Z) - Stop Regressing: Training Value Functions via Classification for Scalable Deep RL [109.44370201929246]
We show that training value functions with categorical cross-entropy improves performance and scalability in a variety of domains.
These include: single-task RL on Atari 2600 games with SoftMoEs, multi-task RL on Atari with large-scale ResNets, robotic manipulation with Q-transformers, playing Chess without search, and a language-agent Wordle task with high-capacity Transformers.
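A minimal sketch of one classification-style value target, a two-hot encoding over fixed return bins trained with cross-entropy, is shown below; the bin range, bin count, and helper names are illustrative assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def two_hot(values, v_min=-10.0, v_max=10.0, num_bins=51):
    # Encode scalar returns as a distribution over fixed bins: the two bins
    # bracketing each value receive weight proportional to proximity.
    values = values.clamp(v_min, v_max)
    bins = torch.linspace(v_min, v_max, num_bins, device=values.device)
    idx = torch.bucketize(values, bins).clamp(1, num_bins - 1)
    lo, hi = bins[idx - 1], bins[idx]
    w_hi = (values - lo) / (hi - lo)
    target = torch.zeros(*values.shape, num_bins, device=values.device)
    target.scatter_(-1, (idx - 1).unsqueeze(-1), (1 - w_hi).unsqueeze(-1))
    target.scatter_(-1, idx.unsqueeze(-1), w_hi.unsqueeze(-1))
    return target

def value_loss(logits, returns):
    # Cross-entropy between predicted bin logits and the two-hot targets,
    # replacing the usual MSE regression loss on scalar values.
    return F.cross_entropy(logits, two_hot(returns))
```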
arXiv Detail & Related papers (2024-03-06T18:55:47Z) - Power Transformer Fault Prediction Based on Knowledge Graphs [9.690455133923667]
The scarcity of extensive fault data makes it difficult to apply machine learning techniques effectively.
We propose a novel approach that leverages knowledge graph (KG) technology in combination with gradient boosting decision trees (GBDT).
This method is designed to efficiently learn from a small set of high-dimensional data, integrating various factors influencing transformer faults and historical operational data.
arXiv Detail & Related papers (2024-02-11T19:14:28Z) - Q-Transformer: Scalable Offline Reinforcement Learning via Autoregressive Q-Functions [143.89572689302497]
We present a scalable reinforcement learning method for training multi-task policies from large offline datasets.
Our method uses a Transformer to provide a scalable representation for Q-functions trained via offline temporal difference backups.
We show that Q-Transformer outperforms prior offline RL algorithms and imitation learning techniques on a large diverse real-world robotic manipulation task suite.
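A simplified sketch of the autoregressive, per-dimension Q-function idea follows: each action dimension is discretized into bins and chosen greedily, conditioned on the dimensions already selected. The network sizes, bin count, and class names below are assumptions for illustration, not the Q-Transformer architecture itself.

```python
import torch
import torch.nn as nn

class AutoregressiveQ(nn.Module):
    # Per-dimension Q-values: dimension i is conditioned on dimensions < i.
    def __init__(self, state_dim=32, action_dims=6, num_bins=256, d_model=128):
        super().__init__()
        self.embed_state = nn.Linear(state_dim, d_model)
        self.embed_action = nn.Embedding(num_bins, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.q_head = nn.Linear(d_model, num_bins)
        self.action_dims = action_dims

    def greedy_action(self, state):
        # Pick each discretized action dimension by maximizing its Q-values,
        # feeding earlier choices back in as context tokens.
        tokens = self.embed_state(state).unsqueeze(1)           # (B, 1, d)
        chosen = []
        for _ in range(self.action_dims):
            h = self.backbone(tokens)
            q = self.q_head(h[:, -1])                           # (B, num_bins)
            a = q.argmax(dim=-1)                                # greedy bin
            chosen.append(a)
            tokens = torch.cat([tokens, self.embed_action(a).unsqueeze(1)], dim=1)
        return torch.stack(chosen, dim=-1)                      # (B, action_dims)
```

Training with temporal-difference backups over these per-dimension values is omitted here; the sketch only illustrates how a transformer can represent a Q-function over a high-dimensional discretized action space.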
arXiv Detail & Related papers (2023-09-18T21:00:38Z) - Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing [18.673619610942197]
Modern transformer models tend to learn strong outliers in their activations, making them difficult to quantize.
We show that strong outliers are related to very specific behavior of attention heads that try to learn a "no-op" or just a partial update of the residual.
We propose two simple (independent) modifications to the attention mechanism - clipped softmax and gated attention.
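A rough sketch of the clipped-softmax modification is shown below; the stretch constants are illustrative defaults, not the paper's tuned values. The idea is that attention weights can reach exactly zero (a true "no-op") without the extreme logit magnitudes that produce activation outliers.

```python
import torch

def clipped_softmax(logits, zeta=1.03, gamma=-0.03, dim=-1):
    # Softmax stretched linearly to [gamma, zeta] and clipped back to [0, 1],
    # so a head can assign exactly 0 (or 1) attention with moderate logits.
    probs = torch.softmax(logits, dim=dim)
    stretched = probs * (zeta - gamma) + gamma
    return torch.clamp(stretched, min=0.0, max=1.0)
```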
arXiv Detail & Related papers (2023-06-22T14:39:04Z) - Center Smoothing for Certifiably Robust Vector-Valued Functions [59.46976586742266]
We produce certifiable robustness guarantees for vector-valued functions by bounding the change in output caused by a small change in the input.
We demonstrate the effectiveness of our method on multiple learning tasks involving vector-valued functions with a wide range of input and output dimensionalities.
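A hedged sketch of what a center-smoothed prediction might look like is given below: the base function is evaluated on Gaussian-perturbed copies of the input and an approximate center of the resulting outputs is returned. The medoid approximation and all parameter values here are assumptions; the actual method computes (and certifies) the center of a minimum enclosing ball of the outputs.

```python
import torch

def center_smoothed_prediction(f, x, sigma=0.25, num_samples=64):
    # Evaluate f on noisy copies of x and return the sample output that is
    # closest to all others (a medoid stand-in for the ball center).
    noisy = x.unsqueeze(0) + sigma * torch.randn(num_samples, *x.shape)
    outputs = torch.stack([f(z) for z in noisy])   # (num_samples, out_dim)
    dists = torch.cdist(outputs, outputs)          # pairwise output distances
    medoid = dists.median(dim=1).values.argmin()   # most central sample
    return outputs[medoid]
```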
arXiv Detail & Related papers (2021-02-19T01:34:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.