Can Emulating Semantic Translation Help LLMs with Code Translation? A Study Based on Pseudocode
- URL: http://arxiv.org/abs/2510.00920v2
- Date: Fri, 31 Oct 2025 09:01:40 GMT
- Title: Can Emulating Semantic Translation Help LLMs with Code Translation? A Study Based on Pseudocode
- Authors: Songqiang Chen, Congying Xu, Jingyi Chen, Jialun Cao, Jiarong Wu, Shing-Chi Cheung,
- Abstract summary: Pseudocode-based translation emulates the human semantic translation by first interpreting the program's intent and logic into pseudocode.<n>We find that pseudocode-based translation helps translate programs that direct translation struggles to handle.
- Score: 9.384417259861438
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) show great potential in code translation. However, accurate translation remains challenging when using the commonly adopted direct code-to-code translation approach, which converts a program into the target programming language (PL) in a single step. Inspired by the success of incorporating intermediate steps to guide LLMs in resolving challenging tasks, we explore pseudocode-based code translation, which emulates the human semantic translation by first interpreting the program's intent and logic into pseudocode and then implementing it in the target PL. We find that pseudocode-based translation helps translate programs that direct translation struggles to handle. Nonetheless, the effectiveness, advantages, and limitations of this approach remain underexplored. To bridge this gap, we present an empirical study on pseudocode-based code translation, aiming to investigate its effectiveness in enhancing the direct translation approach, illuminate its effective usage, and identify limitations hindering its potential benefits. By comparing direct and pseudocode-based translation approaches on 9,690 translation tasks across six PLs with five popular LLMs, we demonstrate that pseudocode-based translation can effectively complement direct translation, particularly when translating from flexible to rigid PLs or dealing with low-resource Rust. Based on these findings, we suggest adopting strategies that combine the complementary strengths of both approaches to enhance code translation accuracy. We also reveal the advantages of pseudocode-based translation in disentangling translations of complicated programs and mitigating distractions from detailed implementations in original programs, as well as its limitations due to incorrect, incomplete, or ambiguous pseudocode.
Related papers
- LLM Based Long Code Translation using Identifier Replacement [3.833075202213095]
We propose a novel zero-shot code translation method that incorporates identifier replacement.<n>By substituting user-given long identifiers with generalized placeholders during translation, our method improves the efficiency and cost-effectiveness of long code translation.
arXiv Detail & Related papers (2025-10-10T06:28:15Z) - Function-to-Style Guidance of LLMs for Code Translation [59.487054943812836]
We propose F2STrans, a function-to-style guiding paradigm designed to improve the performance of large language models in code translation.<n>Our approach comprises two key stages: (1) Functional learning, which optimize translation correctness using high-quality source-target code pairs.<n>We introduce a novel code translation benchmark that includes up-to-date source code, extensive test cases, and manually annotated ground-truth translations.
arXiv Detail & Related papers (2025-07-15T08:25:02Z) - ObscuraCoder: Powering Efficient Code LM Pre-Training Via Obfuscation Grounding [60.37988508851391]
Language models (LMs) have become a staple of the code-writing toolbox.<n>Research exploring modifications to Code-LMs' pre-training objectives, geared towards improving data efficiency and better disentangling between syntax and semantics, has been noticeably sparse.<n>In this work, we examine grounding on obfuscated code as a means of helping Code-LMs look beyond the surface-form syntax and enhance their pre-training sample efficiency.
arXiv Detail & Related papers (2025-03-27T23:08:53Z) - Lost in Literalism: How Supervised Training Shapes Translationese in LLMs [51.04435855143767]
Large language models (LLMs) have achieved remarkable success in machine translation.<n>However, translationese, characterized by overly literal and unnatural translations, remains a persistent challenge.<n>We introduce methods to mitigate these biases, including polishing golden references and filtering unnatural training instances.
arXiv Detail & Related papers (2025-03-06T12:14:45Z) - ExeCoder: Empowering Large Language Models with Executability Representation for Code Translation [57.604506522287814]
Existing large language models (LLMs) only learn the contextual semantics of code during pre-training.<n>We propose ExeCoder to utilize executability representations such as functional semantics, syntax structures, and variable dependencies.<n>We show that ExeCoder achieves state-of-the-art performance in code translation, surpassing existing open-source code LLMs by over 10.88% to 38.78% and over 27.44% to 42.97% on two metrics.
arXiv Detail & Related papers (2025-01-30T16:18:52Z) - Specification-Driven Code Translation Powered by Large Language Models: How Far Are We? [8.534857249221844]
We investigate using NL-specification as an intermediate representation for code translation.<n>Our results show that using NL-specification alone does not lead to performance improvements.<n>Besides analyzing the performance of code translation, we also investigate the quality of the translated code.
arXiv Detail & Related papers (2024-12-05T20:10:21Z) - LLM-based Translation Inference with Iterative Bilingual Understanding [52.46978502902928]
We propose a novel Iterative Bilingual Understanding Translation method based on the cross-lingual capabilities of large language models (LLMs)<n>The cross-lingual capability of LLMs enables the generation of contextual understanding for both the source and target languages separately.<n>The proposed IBUT outperforms several strong comparison methods.
arXiv Detail & Related papers (2024-10-16T13:21:46Z) - Semantic Alignment-Enhanced Code Translation via an LLM-Based Multi-Agent System [24.52067108242477]
Code translation is crucial for software migration, system ablation, and cross-platform development.<n>Traditional rule-based methods rely on manually-written rules, which can be time-consuming and often result in less readable code.<n>More recently, the advance of Large Language Models (LLMs) further boosts learning-based code translation.<n>We propose a novel multi-agent system TRANSAGENT, which enhances LLM-based code translation by fixing the syntax errors and semantic errors.
arXiv Detail & Related papers (2024-09-30T02:53:03Z) - Towards Translating Real-World Code with LLMs: A Study of Translating to Rust [13.743967357458287]
Large language models (LLMs) show promise in code translation due to their ability to write code in most programming languages.<n>We conduct our study on code extracted from real-world open source projects.<n> FLOURINE is an end-to-end code translation tool that uses differential fuzzing to check if a Rust translation is I/O equivalent to the original source program.
arXiv Detail & Related papers (2024-05-19T10:54:03Z) - Exploring and Unleashing the Power of Large Language Models in Automated Code Translation [40.25727029618665]
This paper investigates diverse LLMs and learning-based transpilers for automated code translation tasks.
UniTrans is a Unified code Translation framework, applicable to various LLMs.
Three recent LLMs of diverse sizes are tested with UniTrans, and all achieve substantial improvements.
arXiv Detail & Related papers (2024-04-23T00:49:46Z) - Building Accurate Translation-Tailored LLMs with Language Aware Instruction Tuning [57.323716555996114]
Off-target translation remains an unsolved problem, especially for low-resource languages.
Recent works have either designed advanced prompting strategies to highlight the functionality of translation instructions or exploited the in-context learning ability of LLMs.
In this work, we design a two-stage fine-tuning algorithm to improve the instruction-following ability (especially the translation direction) of LLMs.
arXiv Detail & Related papers (2024-03-21T13:47:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.