CodeSSM: Towards State Space Models for Code Understanding
- URL: http://arxiv.org/abs/2505.01475v2
- Date: Wed, 21 May 2025 15:24:04 GMT
- Title: CodeSSM: Towards State Space Models for Code Understanding
- Authors: Shweta Verma, Abhinav Anand, Mira Mezini
- Abstract summary: State Space Models (SSMs) are a potential alternative to transformers for code understanding tasks. SSMs are more compute-efficient than transformers. We show that SSMs are also more sample-efficient and can effectively extrapolate to longer contexts.
- Score: 1.8838588087156363
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Although transformers are widely used for various code-specific tasks, they have some significant limitations. In this paper, we investigate State Space Models (SSMs) as a potential alternative to transformers for code understanding tasks, such as code retrieval, classification, and clone detection. Previous research has already demonstrated that SSMs are more compute-efficient than transformers. In our work, we show that SSMs are also more sample-efficient and can effectively extrapolate to longer contexts (beyond the pretraining context) during fine-tuning. Through comprehensive experiments, we demonstrate that SSMs could serve as a viable alternative to transformers for code understanding tasks, while addressing some of the major limitations associated with transformers.
Related papers
- Towards Understanding What State Space Models Learn About Code [5.605881212882263]
State Space Models (SSMs) have emerged as an efficient alternative to the transformer architecture. Recent studies show that SSMs can match or surpass Transformers on code understanding tasks, such as code retrieval, when trained under similar conditions. We present the first systematic analysis of what SSM-based code models actually learn and perform the first comparative analysis of SSM and Transformer-based code models.
arXiv Detail & Related papers (2026-02-06T15:29:46Z) - Structured Sparse Transition Matrices to Enable State Tracking in State-Space Models [68.31088463716269]
We propose a structured sparse parametrization of transition matrices in state-space models (SSMs). Our method, PD-SSM, parametrizes the transition matrix as the product of a column one-hot matrix ($P$) and a complex-valued diagonal matrix ($D$). The model significantly outperforms a wide collection of modern SSM variants on various FSA state tracking tasks.
arXiv Detail & Related papers (2025-09-26T12:46:30Z) - RoboSSM: Scalable In-context Imitation Learning via State-Space Models [35.91619896213736]
In-context imitation learning (ICIL) enables robots to learn tasks from prompts consisting of just a handful of demonstrations. Recent ICIL methods rely on Transformers, which have computational limitations. We introduce RoboSSM, a scalable recipe for in-context imitation learning based on state-space models.
arXiv Detail & Related papers (2025-09-24T00:26:15Z) - STree: Speculative Tree Decoding for Hybrid State-Space Models [41.65137016153309]
We propose a scalable algorithm to perform tree-based speculative decoding in state-space models (SSMs) and hybrid architectures of SSMs and Transformer layers. Along with the algorithm, we describe a hardware-aware implementation that improves on the naive application of AR Transformer tree-based speculative decoding methods to SSMs.
arXiv Detail & Related papers (2025-05-20T23:12:16Z) - On the Expressiveness and Length Generalization of Selective State-Space Models on Regular Languages [56.22289522687125]
Selective state-space models (SSMs) are an emerging alternative to the Transformer. We analyze their expressiveness and length generalization performance on regular language tasks. We introduce the Selective Dense State-Space Model (SD-SSM), the first selective SSM that exhibits perfect length generalization.
arXiv Detail & Related papers (2024-12-26T20:53:04Z) - Longhorn: State Space Models are Amortized Online Learners [51.10124201221601]
State-space models (SSMs) offer linear decoding efficiency while maintaining parallelism during training.
In this work, we explore SSM design through the lens of online learning, conceptualizing SSMs as meta-modules for specific online learning problems.
We introduce a novel deep SSM architecture, Longhorn, whose update resembles the closed-form solution for solving the online associative recall problem.
arXiv Detail & Related papers (2024-07-19T11:12:08Z) - The Expressive Capacity of State Space Models: A Formal Language Perspective [0.8948475969696075]
Recurrent models based on linear state space models (SSMs) have shown promising performance in language modeling (LM), competitive with transformers.
We present a comprehensive theoretical study of the capacity of such SSMs as it compares to that of transformers and traditional RNNs.
arXiv Detail & Related papers (2024-05-27T17:46:57Z) - MambaAD: Exploring State Space Models for Multi-class Unsupervised Anomaly Detection [53.03687787922032]
Mamba-based models with superior long-range modeling and linear efficiency have garnered substantial attention. This study pioneers the application of Mamba to multi-class unsupervised anomaly detection, presenting MambaAD. The proposed LSS module, integrating parallel cascaded Hybrid State Space (HSS) blocks and multi-kernel convolution operations, effectively captures both long-range and local information.
arXiv Detail & Related papers (2024-04-09T18:28:55Z) - RAGGED: Towards Informed Design of Retrieval Augmented Generation Systems [51.171355532527365]
Retrieval-augmented generation (RAG) can significantly improve the performance of language models (LMs).
RAGGED is a framework for analyzing RAG configurations across various document-based question answering tasks.
arXiv Detail & Related papers (2024-03-14T02:26:31Z) - Do Efficient Transformers Really Save Computation? [32.919672616480135]
We focus on the capabilities and limitations of efficient Transformers, specifically the Sparse Transformer and the Linear Transformer.
Our results show that while these models are expressive enough to solve general DP tasks, contrary to expectations, they require a model size that scales with the problem size.
We identify a class of DP problems for which these models can be more efficient than the standard Transformer.
arXiv Detail & Related papers (2024-02-21T17:00:56Z) - Repeat After Me: Transformers are Better than State Space Models at Copying [53.47717661441142]
We show that while generalized state space models are promising in terms of inference-time efficiency, they are limited compared to transformer models on tasks that require copying from the input context.
arXiv Detail & Related papers (2024-02-01T21:44:11Z) - Representation Learning in a Decomposed Encoder Design for Bio-inspired Hebbian Learning [5.67478985222587]
We propose a modular framework trained with a bio-inspired variant of contrastive predictive coding, comprising parallel encoders that leverage different invariant visual descriptors as inductive biases. Our findings indicate that this form of inductive bias significantly improves the robustness of learned representations and narrows the performance gap between models.
arXiv Detail & Related papers (2023-11-22T07:58:14Z) - TransformCode: A Contrastive Learning Framework for Code Embedding via Subtree Transformation [9.477734501499274]
We present TransformCode, a novel framework that learns code embeddings in a contrastive learning manner.
Our framework is encoder-agnostic and language-agnostic, which means that it can leverage any encoder model and handle any programming language.
arXiv Detail & Related papers (2023-11-10T09:05:23Z) - Disentanglement via Latent Quantization [60.37109712033694]
In this work, we construct an inductive bias towards encoding to and decoding from an organized latent space.
We demonstrate the broad applicability of this approach by adding it to both basic data-reconstructing (vanilla autoencoder) and latent-reconstructing (InfoGAN) generative models.
arXiv Detail & Related papers (2023-05-28T06:30:29Z) - TransCoder: Towards Unified Transferable Code Representation Learning Inspired by Human Skills [31.75121546422898]
We present TransCoder, a unified Transferable fine-tuning strategy for Code representation learning.
We employ a tunable prefix encoder as the meta-learner to capture cross-task and cross-language transferable knowledge.
Our method can lead to superior performance on various code-related tasks and encourage mutual reinforcement.
arXiv Detail & Related papers (2023-05-23T06:59:22Z) - Dynamic Grained Encoder for Vision Transformers [150.02797954201424]
This paper introduces sparse queries for vision transformers to exploit the intrinsic spatial redundancy of natural images.
We propose a Dynamic Grained Encoder for vision transformers, which can adaptively assign a suitable number of queries to each spatial region.
Our encoder allows the state-of-the-art vision transformers to reduce computational complexity by 40%-60% while maintaining comparable performance on image classification.
arXiv Detail & Related papers (2023-01-10T07:55:29Z) - Pretraining Without Attention [114.99187017618408]
This work explores pretraining without attention by using recent advances in sequence routing based on state-space models (SSMs).
BiGS is able to match BERT pretraining accuracy on GLUE and can be extended to long-form pretraining of 4096 tokens without approximation.
arXiv Detail & Related papers (2022-12-20T18:50:08Z) - MASTER: Multi-task Pre-trained Bottlenecked Masked Autoencoders are Better Dense Retrievers [140.0479479231558]
In this work, we aim to unify a variety of pre-training tasks into a multi-task pre-trained model, namely MASTER.
MASTER utilizes a shared-encoder multi-decoder architecture that can construct a representation bottleneck to compress the abundant semantic information across tasks into dense vectors.
arXiv Detail & Related papers (2022-12-15T13:57:07Z) - Systematic Generalization and Emergent Structures in Transformers Trained on Structured Tasks [6.525090891505941]
We show how a causal transformer can perform a set of algorithmic tasks, including copying, sorting, and hierarchical compositions.
We show that two-layer transformers learn generalizable solutions to multi-level problems and develop signs of systematic task decomposition.
These results provide key insights into how transformer models may be capable of decomposing complex decisions into reusable, multi-level policies.
arXiv Detail & Related papers (2022-10-02T00:46:36Z) - Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training [88.80694147730883]
We investigate a variety of Modality-Shared Contrastive Language-Image Pre-training (MS-CLIP) frameworks.
In studied conditions, we observe that a mostly unified encoder for vision and language signals outperforms all other variations that separate more parameters.
Our approach outperforms vanilla CLIP by 1.6 points in linear probing on a collection of 24 downstream vision tasks.
arXiv Detail & Related papers (2022-07-26T05:19:16Z) - CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning [92.36705236706678]
"CodeRL" is a new framework for program synthesis tasks through pretrained LMs and deep reinforcement learning.
During inference, we introduce a new generation procedure with a critical sampling strategy.
For the model backbones, we extended the encoder-decoder architecture of CodeT5 with enhanced learning objectives.
arXiv Detail & Related papers (2022-07-05T02:42:15Z) - Learning Bounded Context-Free-Grammar via LSTM and the Transformer: Difference and Explanations [51.77000472945441]
Long Short-Term Memory (LSTM) and Transformers are two popular neural architectures used for natural language processing tasks.
In practice, it is often observed that Transformer models have better representation power than LSTM.
We study such practical differences between LSTM and Transformer and propose an explanation based on their latent space decomposition patterns.
arXiv Detail & Related papers (2021-12-16T19:56:44Z) - Thinking Like Transformers [64.96770952820691]
We propose a computational model for the transformer-encoder in the form of a programming language.
We show how RASP can be used to program solutions to tasks that could conceivably be learned by a Transformer.
We provide RASP programs for histograms, sorting, and Dyck-languages.
arXiv Detail & Related papers (2021-06-13T13:04:46Z) - Scalable Transformers for Neural Machine Translation [86.4530299266897]
Transformer has been widely adopted in Neural Machine Translation (NMT) because of its large capacity and parallel training of sequence generation.
We propose novel scalable Transformers, which naturally contain sub-Transformers of different scales that share parameters.
A three-stage training scheme is proposed to tackle the difficulty of training the scalable Transformers.
arXiv Detail & Related papers (2021-06-04T04:04:10Z) - AutoTrans: Automating Transformer Design via Reinforced Architecture Search [52.48985245743108]
This paper empirically explores how to set layer-norm, whether to scale, number of layers, number of heads, activation function, etc., so that one can obtain a transformer architecture that better suits the tasks at hand.
Experiments on CoNLL03, Multi-30k, IWSLT14 and WMT-14 show that the searched transformer model can outperform the standard transformers.
arXiv Detail & Related papers (2020-09-04T08:46:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.