Related papers: Understanding Long Programming Languages with Structure-Aware Sparse Attention

Understanding Long Programming Languages with Structure-Aware Sparse Attention

URL: http://arxiv.org/abs/2205.13730v1
Date: Fri, 27 May 2022 02:50:57 GMT
Title: Understanding Long Programming Languages with Structure-Aware Sparse Attention
Authors: Tingting Liu, Chengyu Wang, Cen Chen, Ming Gao, Aoying Zhou
Abstract summary: We present SASA, a Structure-Aware Sparse Attention mechanism, which reduces the complexity and improves performance for long code understanding tasks. The key components in SASA are top-$k$ sparse attention and Abstract Syntax Tree (AST)-based structure-aware attention. Experiments on CodeXGLUE tasks show that SASA achieves better performance than the competing baselines.
Score: 32.21325784213584
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Programming-based Pre-trained Language Models (PPLMs) such as CodeBERT have achieved great success in many downstream code-related tasks. Since the memory and computational complexity of self-attention in the Transformer grow quadratically with the sequence length, PPLMs typically limit the code length to 512. However, codes in real-world applications are generally long, such as code searches, which cannot be processed efficiently by existing PPLMs. To solve this problem, in this paper, we present SASA, a Structure-Aware Sparse Attention mechanism, which reduces the complexity and improves performance for long code understanding tasks. The key components in SASA are top-$k$ sparse attention and Abstract Syntax Tree (AST)-based structure-aware attention. With top-$k$ sparse attention, the most crucial attention relation can be obtained with a lower computational cost. As the code structure represents the logic of the code statements, which is a complement to the code sequence characteristics, we further introduce AST structures into attention. Extensive experiments on CodeXGLUE tasks show that SASA achieves better performance than the competing baselines.

Related papers

FastCode: Fast and Cost-Efficient Code Understanding and Reasoning [32.264145740214616]
Repository-scale code reasoning is a cornerstone of modern AI-assisted software engineering.<n>FastCode is a framework that decouples repository exploration from content consumption.
arXiv Detail & Related papers (2026-03-01T09:41:15Z)
Do It for HER: First-Order Temporal Logic Reward Specification in Reinforcement Learning (Extended Version) [49.462399222747024]
We propose a novel framework for the logical specification of non-Markovian rewards in Decision Processes (MDPs) with large state spaces.<n>Our approach leverages Linear Temporal Logic Modulo Theories over finite traces (LTLfMT)<n>We introduce a method based on reward machines and Hindsight Experience Replay (HER) to translate first-order logic specifications and address reward sparsity.
arXiv Detail & Related papers (2026-02-05T22:11:28Z)
LongCodeZip: Compress Long Context for Code Language Models [16.940525379087326]
LongCodeZip is a novel plug-and-play code compression framework designed specifically for Large Language Models (LLMs)<n>By effectively reducing context size while preserving essential information, LongCodeZip enables LLMs to better scale to real-world, large-scale code scenarios.
arXiv Detail & Related papers (2025-10-01T02:54:57Z)
Post-Incorporating Code Structural Knowledge into LLMs via In-Context Learning for Code Translation [10.77747590700758]
Large language models (LLMs) have achieved significant advancements in software mining. handling the syntactic structure of source code remains a challenge. This paper employs incontext learning (ICL) to integrate code structural knowledge into pre-trained LLMs.
arXiv Detail & Related papers (2025-03-28T10:59:42Z)
EpiCoder: Encompassing Diversity and Complexity in Code Generation [49.170195362149386]
We introduce a novel feature tree-based synthesis framework inspired by Abstract Syntax Trees (AST) Unlike AST, which captures syntactic structure of code, our framework models semantic relationships between code elements. We fine-tuned widely-used base models to create the EpiCoder series, achieving state-of-the-art performance at both the function and file levels.
arXiv Detail & Related papers (2025-01-08T18:58:15Z)
Enhancing LLM's Cognition via Structurization [41.13997892843677]
Large language models (LLMs) process input contexts through a causal and sequential perspective. This paper presents a novel concept of context structurization. Specifically, we transform the plain, unordered contextual sentences into well-ordered and hierarchically structurized elements.
arXiv Detail & Related papers (2024-07-23T12:33:58Z)
Large Language Model for Verilog Generation with Code-Structure-Guided Reinforcement Learning [29.135207235743795]
This paper introduces VeriSeek, an LLM enhanced by reinforcement learning to achieve high Verilog code generation performance. Our reinforcement learning approach employs code structure information as feedback signals to refine the pre-trained model. Experiments show that VeriSeek outperforms state-of-the-art methods across multiple benchmarks.
arXiv Detail & Related papers (2024-07-21T11:25:21Z)
Comments as Natural Logic Pivots: Improve Code Generation via Comment Perspective [85.48043537327258]
We propose MANGO (comMents As Natural loGic pivOts), including a comment contrastive training strategy and a corresponding logical comment decoding strategy. Results indicate that MANGO significantly improves the code pass rate based on the strong baselines. The robustness of the logical comment decoding strategy is notably higher than the Chain-of-thoughts prompting.
arXiv Detail & Related papers (2024-04-11T08:30:46Z)
StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback [58.20547418182074]
We introduce StepCoder, a novel framework for code generation, consisting of two main components. CCCS addresses the exploration challenge by breaking the long sequences code generation task into a Curriculum of Code Completion Subtasks. FGO only optimize the model by masking the unexecuted code segments to provide Fine-Grained Optimization. Our method improves the ability to explore the output space and outperforms state-of-the-art approaches in corresponding benchmarks.
arXiv Detail & Related papers (2024-02-02T13:14:31Z)
SparseCoder: Identifier-Aware Sparse Transformer for File-Level Code Summarization [51.67317895094664]
This paper studies file-level code summarization, which can assist programmers in understanding and maintaining large source code projects. We propose SparseCoder, an identifier-aware sparse transformer for effectively handling long code sequences.
arXiv Detail & Related papers (2024-01-26T09:23:27Z)
When Do Program-of-Thoughts Work for Reasoning? [51.2699797837818]
We propose complexity-impacted reasoning score (CIRS) to measure correlation between code and reasoning abilities. Specifically, we use the abstract syntax tree to encode the structural information and calculate logical complexity. Code will be integrated into the EasyInstruct framework at https://github.com/zjunlp/EasyInstruct.
arXiv Detail & Related papers (2023-08-29T17:22:39Z)
AST-MHSA : Code Summarization using Multi-Head Self-Attention [1.588193964339148]
We present a model, AST-MHSA, that uses multi-head attention to extract semantic information from the abstract syntax tree (AST) of the code. The model is trained on a dataset of code and summaries, and the parameters are optimized to minimize the loss between the generated summaries and the ground-truth summaries.
arXiv Detail & Related papers (2023-08-10T15:43:46Z)
LongCoder: A Long-Range Pre-trained Language Model for Code Completion [56.813974784131624]
LongCoder employs a sliding window mechanism for self-attention and introduces two types of globally accessible tokens. Bridge tokens are inserted throughout the input sequence to aggregate local information and facilitate global interaction. memory tokens are included to highlight important statements that may be invoked later and need to be memorized.
arXiv Detail & Related papers (2023-06-26T17:59:24Z)
Improving Code Summarization with Block-wise Abstract Syntax Tree Splitting [15.28941592388958]
Abstract Syntax Tree (AST), which depicts the source code's syntactic structure, has been incorporated to guide the generation of code summaries. Existing AST based methods suffer from the difficulty of training and generate inadequate code summaries. We present the Block-wise Abstract Syntax Tree Splitting method (BASTS), which fully utilizes the rich tree-form syntax structure in ASTs.
arXiv Detail & Related papers (2021-03-14T05:04:06Z)
GraphCodeBERT: Pre-training Code Representations with Data Flow [97.00641522327699]
We present GraphCodeBERT, a pre-trained model for programming language that considers the inherent structure of code. We use data flow in the pre-training stage, which is a semantic-level structure of code that encodes the relation of "where-the-value-comes-from" between variables. We evaluate our model on four tasks, including code search, clone detection, code translation, and code refinement.
arXiv Detail & Related papers (2020-09-17T15:25:56Z)

This list is automatically generated from the titles and abstracts of the papers in this site.