LLM Based Long Code Translation using Identifier Replacement
- URL: http://arxiv.org/abs/2510.09045v2
- Date: Fri, 31 Oct 2025 08:20:14 GMT
- Title: LLM Based Long Code Translation using Identifier Replacement
- Authors: Manojit Chakraborty, Madhusudan Ghosh, Rishabh Gupta,
- Abstract summary: We propose a novel zero-shot code translation method that incorporates identifier replacement. By substituting user-given long identifiers with generalized placeholders during translation, our method improves the efficiency and cost-effectiveness of long code translation.
- Score: 3.833075202213095
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the domain of software development, LLMs have been utilized to automate tasks such as code translation, where source code from one programming language is translated to another while preserving its functionality. However, LLMs often struggle with long source code that does not fit into the context window, which leads to inaccurate translations. To address this, we propose a novel zero-shot code translation method that incorporates identifier replacement. By substituting user-given long identifiers with generalized placeholders during translation, our method allows the LLM to focus on the logical structure of the code while reducing token count and memory usage, which improves the efficiency and cost-effectiveness of long code translation. Our empirical results demonstrate that our approach preserves syntactical and hierarchical information and produces translation results with reduced tokens.
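The identifier-replacement idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `id_N` placeholder format, and the `min_length` threshold are all assumptions made for illustration, and the regex-based matching is deliberately naive (it would also rewrite long names that appear inside string literals or comments, and it assumes the source does not already use names of the form `id_N`).

```python
import keyword
import re

# Matches whole identifiers; the pattern is an illustrative simplification.
IDENT = re.compile(r"\b[A-Za-z_][A-Za-z0-9_]*\b")


def replace_identifiers(source, min_length=12):
    """Swap long identifiers for short placeholders.

    Returns the shortened source and a placeholder -> original mapping.
    The min_length threshold is a hypothetical parameter.
    """
    mapping = {}  # placeholder -> original identifier
    reverse = {}  # original identifier -> placeholder

    def substitute(match):
        name = match.group(0)
        # Leave short names and language keywords untouched.
        if len(name) < min_length or keyword.iskeyword(name):
            return name
        if name not in reverse:
            placeholder = f"id_{len(reverse)}"
            reverse[name] = placeholder
            mapping[placeholder] = name
        return reverse[name]

    return IDENT.sub(substitute, source), mapping


def restore_identifiers(translated, mapping):
    """Put the original identifiers back into the translated code."""
    pattern = re.compile(r"\bid_\d+\b")
    return pattern.sub(lambda m: mapping.get(m.group(0), m.group(0)), translated)


source = (
    "total_invoice_amount = base_invoice_amount_before_tax + 1\n"
    "print(total_invoice_amount)"
)
shortened, mapping = replace_identifiers(source)
# `shortened` would be sent to the LLM for translation; here the round trip
# is shown without a model call.
restored = restore_identifiers(shortened, mapping)
```

In this sketch the shortened text is what would be passed to the model, and the mapping is applied to the model's output afterwards, so the long names never consume tokens in the prompt or the response.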
Related papers
- Can Emulating Semantic Translation Help LLMs with Code Translation? A Study Based on Pseudocode [9.384417259861438]
Pseudocode-based translation emulates human semantic translation by first interpreting the program's intent and logic into pseudocode. We find that pseudocode-based translation helps translate programs that direct translation struggles to handle.
arXiv Detail & Related papers (2025-10-01T13:58:19Z)
- Decoding in Latent Spaces for Efficient Inference in LLM-based Recommendation [75.72196852363116]
Light Latent-space Decoding (L2D) is an effective and efficient latent-space decoding method. L2D is more than 10x faster than language-space decoding while maintaining or enhancing performance.
arXiv Detail & Related papers (2025-09-15T02:30:35Z)
- IFEvalCode: Controlled Code Generation [69.28317223249358]
The paper introduces forward and backward constraints generation to improve the instruction-following capabilities of Code LLMs. The authors present IFEvalCode, a multilingual benchmark comprising 1.6K test samples across seven programming languages.
arXiv Detail & Related papers (2025-07-30T08:08:48Z)
- Function-to-Style Guidance of LLMs for Code Translation [59.487054943812836]
We propose F2STrans, a function-to-style guiding paradigm designed to improve the performance of large language models in code translation. Our approach comprises two key stages: (1) Functional learning, which optimizes translation correctness using high-quality source-target code pairs. We introduce a novel code translation benchmark that includes up-to-date source code, extensive test cases, and manually annotated ground-truth translations.
arXiv Detail & Related papers (2025-07-15T08:25:02Z)
- ExeCoder: Empowering Large Language Models with Executability Representation for Code Translation [57.604506522287814]
Existing large language models (LLMs) only learn the contextual semantics of code during pre-training. We propose ExeCoder to utilize executability representations such as functional semantics, syntax structures, and variable dependencies. We show that ExeCoder achieves state-of-the-art performance in code translation, surpassing existing open-source code LLMs by over 10.88% to 38.78% and over 27.44% to 42.97% on two metrics.
arXiv Detail & Related papers (2025-01-30T16:18:52Z)
- Scalable, Validated Code Translation of Entire Projects using Large Language Models [13.059046327936393]
Large language models (LLMs) show promise in code translation due to their ability to generate idiomatic code. Existing works have shown a drop in translation success rates for code exceeding around 100 lines. We develop a modular approach to translation, where we partition the code into small code fragments which can be independently translated. We show that we can consistently generate reliable Rust for projects up to 6,600 lines of code and 369 functions, with an average of 73% of functions successfully validated for I/O equivalence.
arXiv Detail & Related papers (2024-12-11T02:31:46Z)
- Specification-Driven Code Translation Powered by Large Language Models: How Far Are We? [8.534857249221844]
We investigate using NL-specification as an intermediate representation for code translation. Our results show that using NL-specification alone does not lead to performance improvements. Besides analyzing the performance of code translation, we also investigate the quality of the translated code.
arXiv Detail & Related papers (2024-12-05T20:10:21Z)
- Semantic Alignment-Enhanced Code Translation via an LLM-Based Multi-Agent System [24.52067108242477]
Code translation is crucial for software migration, system refactoring, and cross-platform development. Traditional rule-based methods rely on manually written rules, which can be time-consuming and often result in less readable code. More recently, the advance of Large Language Models (LLMs) has further boosted learning-based code translation. We propose a novel multi-agent system, TRANSAGENT, which enhances LLM-based code translation by fixing syntax errors and semantic errors.
arXiv Detail & Related papers (2024-09-30T02:53:03Z)
- CodeIP: A Grammar-Guided Multi-Bit Watermark for Large Language Models of Code [56.019447113206006]
Large Language Models (LLMs) have achieved remarkable progress in code generation. CodeIP is a novel multi-bit watermarking technique that inserts additional information to preserve provenance details. Experiments conducted on a real-world dataset across five programming languages demonstrate the effectiveness of CodeIP.
arXiv Detail & Related papers (2024-04-24T04:25:04Z)
- CodecLM: Aligning Language Models with Tailored Synthetic Data [51.59223474427153]
We introduce CodecLM, a framework for adaptively generating high-quality synthetic data for instruction-following abilities.
We first encode seed instructions into metadata, which are concise keywords generated on-the-fly to capture the target instruction distribution.
We also introduce Self-Rubrics and Contrastive Filtering during decoding to tailor data-efficient samples.
arXiv Detail & Related papers (2024-04-08T21:15:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.