GramTrans: A Better Code Representation Approach in Code Generation
- URL: http://arxiv.org/abs/2510.02887v1
- Date: Fri, 03 Oct 2025 10:49:33 GMT
- Title: GramTrans: A Better Code Representation Approach in Code Generation
- Authors: Zhao Zhang, Qingyuan Liang, Zeyu Sun, Yizhou Chen, Guoqing Wang, Yican Sun, Lu Zhang, Ge Li, Yingfei Xiong
- Abstract summary: This paper proposes a conjecture: the easier a representation is to parse, the better performance the model achieves. We present GramTrans, a general approach that automatically transforms a context-free language into a representation within the LL(1) class.
- Score: 31.09799107794881
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Code generation has shown great promise in assisting software development. A fundamental yet underexplored question is how the choice of code representation affects model performance. While existing studies employ various representations, such as treating code as plain text, grammar rule sequences, or syntax tree sequences, they lack a principled understanding of the relationship between parsing difficulty and model effectiveness. This paper proposes a conjecture: the easier a representation is to parse, the better performance the model achieves. We formalize this idea using grammar classes, where representations in simpler classes (e.g., LL(1)) are easier to parse. Through a controlled experiment on a Python-based DSL, we show that parsing difficulty strongly correlates with model performance. Motivated by this finding, we present GramTrans, a general approach that automatically transforms a context-free language into a representation within the LL(1) class. GramTrans introduces a novel hierarchical conflict elimination algorithm, enabling a flexible trade-off between syntactic simplicity and token efficiency. We evaluate GramTrans on both Python and Java using three code generation models: StarCoder 1B, DeepSeek-Coder 1.3B, and Qwen2.5 1.5B. Across multiple benchmarks, GramTrans consistently delivers significant improvements over baseline representations. Furthermore, our analysis of existing representations reconfirms the strong alignment between parsing difficulty and model performance, providing additional support for the conjecture.
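The abstract does not spell out the hierarchical conflict elimination algorithm, but the kind of grammar transformation involved can be illustrated with classic left factoring, which removes FIRST/FIRST conflicts, one of the conflicts an LL(1) parser cannot tolerate. The sketch below is a minimal illustration only: the dictionary-based grammar encoding, the fresh-nonterminal naming, and the toy dangling-else grammar are assumptions made for exposition and are not taken from the paper.

```python
from collections import defaultdict

def left_factor(grammar):
    """Repeatedly left-factor a grammar until no two alternatives of the
    same nonterminal start with the same symbol (no FIRST/FIRST conflict
    on the leading token).  `grammar` maps a nonterminal to a list of
    right-hand sides; each right-hand side is a list of symbols."""
    fresh_id = 0
    changed = True
    while changed:
        changed = False
        result = {}
        for nt, alternatives in grammar.items():
            # Group alternatives by their leading symbol (None = empty rhs).
            groups = defaultdict(list)
            for alt in alternatives:
                groups[alt[0] if alt else None].append(alt)
            factored = []
            for head, alts in groups.items():
                if head is None or len(alts) == 1:
                    factored.extend(alts)
                else:
                    # Several alternatives share the leading symbol `head`:
                    # keep `head` here and push the differing suffixes into
                    # a fresh nonterminal.
                    fresh_id += 1
                    fresh = f"{nt}_{fresh_id}"
                    factored.append([head, fresh])
                    result[fresh] = [alt[1:] for alt in alts]
                    changed = True
            result[nt] = factored
        grammar = result
    return grammar

# Toy grammar with the classic dangling-else FIRST/FIRST conflict:
#   stmt -> "if" expr "then" stmt | "if" expr "then" stmt "else" stmt
toy = {
    "stmt": [
        ["if", "expr", "then", "stmt"],
        ["if", "expr", "then", "stmt", "else", "stmt"],
    ],
}

for nt, alts in left_factor(toy).items():
    print(nt, "->", " | ".join(" ".join(alt) if alt else "ε" for alt in alts))
```

Left factoring alone does not reproduce GramTrans, which additionally trades off syntactic simplicity against token efficiency, but it shows how a grammar can be mechanically rewritten so that a single token of lookahead suffices to choose a production.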
Related papers
- On the Effect of Token Merging on Pre-trained Models for Code [11.029842116504726]
We investigate the effect of merging the hidden representations of subtokens that belong to the same semantic unit.
We propose two strategies: one based on averaging the representations and another that leverages a learning-based approach.
Results show that these strategies can reduce the number of floating-point operations by 1% to 19%.
arXiv Detail & Related papers (2025-07-19T00:48:20Z)
- Improving Arithmetic Reasoning Ability of Large Language Models through Relation Tuples, Verification and Dynamic Feedback [14.938401898546553]
We propose to use a semi-structured form to represent reasoning steps of large language models.
Specifically, we use relations, which are not only human- but also machine-friendly and easier to verify than natural language.
arXiv Detail & Related papers (2024-06-25T18:21:00Z)
- SparseCoder: Identifier-Aware Sparse Transformer for File-Level Code Summarization [51.67317895094664]
This paper studies file-level code summarization, which can assist programmers in understanding and maintaining large source code projects.
We propose SparseCoder, an identifier-aware sparse transformer for effectively handling long code sequences.
arXiv Detail & Related papers (2024-01-26T09:23:27Z)
- Abstract Syntax Tree for Programming Language Understanding and Representation: How Far Are We? [23.52632194060246]
Programming language understanding and representation (a.k.a. code representation learning) has always been a hot and challenging task in software engineering.
The abstract syntax tree (AST), a fundamental code feature, illustrates the syntactic information of the source code and has been widely used in code representation learning.
We compare the performance of models trained with code token sequence (Token for short) based code representation and AST-based code representation on three popular types of code-related tasks.
arXiv Detail & Related papers (2023-12-01T08:37:27Z)
- Alleviating Over-smoothing for Unsupervised Sentence Representation [96.19497378628594]
We present a Simple method named Self-Contrastive Learning (SSCL) to alleviate this issue.
Our proposed method is quite simple and can be easily extended to various state-of-the-art models for performance boosting.
arXiv Detail & Related papers (2023-05-09T11:00:02Z)
- Outline, Then Details: Syntactically Guided Coarse-To-Fine Code Generation [61.50286000143233]
ChainCoder is a program synthesis language model that generates Python code progressively.
A tailored transformer architecture is leveraged to jointly encode the natural language descriptions and syntactically aligned I/O data samples.
arXiv Detail & Related papers (2023-04-28T01:47:09Z)
- A Syntax-Guided Multi-Task Learning Approach for Turducken-Style Code Generation [19.489202790935902]
We propose a syntax-guided multi-task learning approach TurduckenGen.
Specifically, we first explicitly append the type information to the code tokens to capture the representation of syntactic constraints.
Then we formalize code generation with syntactic constraint representation as an auxiliary task to enable the model to learn the syntactic constraints of the code.
arXiv Detail & Related papers (2023-03-09T06:22:07Z)
- BenchCLAMP: A Benchmark for Evaluating Language Models on Syntactic and Semantic Parsing [55.058258437125524]
We introduce BenchCLAMP, a Benchmark to evaluate Constrained LAnguage Model Parsing.
We benchmark eight language models, including two GPT-3 variants available only through an API.
Our experiments show that encoder-decoder pretrained language models can achieve similar performance or surpass state-of-the-art methods for syntactic and semantic parsing when the model output is constrained to be valid.
arXiv Detail & Related papers (2022-06-21T18:34:11Z)
- UnifieR: A Unified Retriever for Large-Scale Retrieval [84.61239936314597]
Large-scale retrieval aims to recall relevant documents from a huge collection given a query.
Recent retrieval methods based on pre-trained language models (PLM) can be coarsely categorized into either dense-vector or lexicon-based paradigms.
We propose a new learning framework, UnifieR, which unifies dense-vector and lexicon-based retrieval in one model with a dual-representing capability.
arXiv Detail & Related papers (2022-05-23T11:01:59Z)
- Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.