REMODEL-LLM: Transforming C code to Java using LLMs
- URL: http://arxiv.org/abs/2512.11402v1
- Date: Fri, 12 Dec 2025 09:25:10 GMT
- Title: REMODEL-LLM: Transforming C code to Java using LLMs
- Authors: Aryan Gupta, Y. Raghu Reddy
- Abstract summary: We use a novel, hybrid pipeline that leverages Abstract Syntax Trees (ASTs) for semantic decomposition and employs a highly constrained, rule based prompting strategy. The vast majority of models (Tier 3, e.g., llama3.1, gemma3, starcoder2) failed 100% of the tests, proving incapable of generating even basic, runnable Java boilerplate. A small middle tier (Tier 2, e.g., mistral-nemo and mistral) produced runnable code but was plagued by dangerous semantic failures and wrong translations.
- Score: 4.189643331553923
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: The automated translation of C code to Java code is a notoriously difficult task, fraught with challenges stemming from fundamental paradigm shifts (procedural vs. object-oriented), memory models (manual pointers vs. garbage collection), and incompatible data types. This paper investigates the efficacy of 19 small, quantized LLMs (under 20 billion parameters) for the C-to-Java translation task. We use a novel, hybrid pipeline that leverages Abstract Syntax Trees (ASTs) for semantic decomposition and employs a highly constrained, rule-based prompting strategy. The results are stark: a clear multi-tiered performance divide emerged. The vast majority of models (Tier 3, e.g., llama3.1, gemma3, starcoder2) failed 100% of the tests, proving incapable of generating even basic, runnable Java boilerplate. A small middle tier (Tier 2, e.g., mistral-nemo and mistral) produced runnable code but was plagued by dangerous semantic failures and wrong translations. Only three models (Tier 1: phi4, deepseek-coder-v2, codeqwen) proved viable, passing over 50% of the test suite. Even these top models failed on the most complex C concepts, such as function pointers, sizeof, and enum logic, revealing a hard ceiling for the reasoning capabilities of current quantized models.
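To make the hardest cases named in the abstract concrete (function pointers, sizeof, and enum logic), the following is a minimal hand-written sketch of one plausible Java rendering of each construct. It is not output from the paper's pipeline or from any of the evaluated models; the class name `CToJavaSketch`, the `Color` enum, and the helper methods are hypothetical names chosen for illustration only.

```java
import java.util.function.IntBinaryOperator;

// Hypothetical illustration: one way each C construct could be expressed in Java.
class CToJavaSketch {

    // C: enum Color { RED, GREEN = 5, BLUE };
    // In C, BLUE implicitly becomes 6. Java enums carry no integer value,
    // so the value has to be stored explicitly to preserve the C semantics.
    enum Color {
        RED(0), GREEN(5), BLUE(6);

        final int value;

        Color(int value) {
            this.value = value;
        }
    }

    // C: int apply(int (*op)(int, int), int a, int b) { return op(a, b); }
    // A function pointer can be modeled with a functional interface.
    static int apply(IntBinaryOperator op, int a, int b) {
        return op.applyAsInt(a, b);
    }

    static int add(int a, int b) {
        return a + b;
    }

    // C: size_t bytes = n * sizeof(int);
    // Java has no sizeof operator. Integer.BYTES is the fixed size of a Java int (4),
    // which matches sizeof(int) only on platforms where a C int is also 4 bytes.
    static int bufferBytes(int n) {
        return n * Integer.BYTES;
    }

    public static void main(String[] args) {
        System.out.println(apply(CToJavaSketch::add, 2, 3)); // 5
        System.out.println(Color.BLUE.value);                // 6
        System.out.println(bufferBytes(10));                 // 40
    }
}
```

Each mapping requires a semantic decision rather than a syntactic substitution (explicit enum values, a choice of functional interface, a platform assumption for sizeof), which is consistent with the abstract's observation that these constructs are where translations tend to break.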
Related papers
- AlgoVeri: An Aligned Benchmark for Verified Code Generation on Classical Algorithms [54.99368693313797]
Existing benchmarks test only individual languages/tools, so the performance numbers are not directly comparable. We address this gap with AlgoVeri, a benchmark that evaluates vericoding of 77 classical algorithms in Dafny, Verus, and Lean.
arXiv Detail & Related papers (2026-02-10T06:58:26Z) - GramTrans: A Better Code Representation Approach in Code Generation [31.09799107794881]
This paper proposes a conjecture: the easier a representation is to parse, the better performance the model achieves. We present GramTrans, a general approach that automatically transforms a context-free language into a representation within the LL(1) class.
arXiv Detail & Related papers (2025-10-03T10:49:33Z) - On the Effect of Token Merging on Pre-trained Models for Code [11.029842116504726]
We investigate the effect of merging the hidden representations of subtokens that belong to the same semantic unit. We propose two strategies: one based on averaging the representations and another that leverages a learning-based approach. Results show that these strategies can reduce the number of floating-point operations by 1% to 19%.
arXiv Detail & Related papers (2025-07-19T00:48:20Z) - Simplicity by Obfuscation: Evaluating LLM-Driven Code Transformation with Semantic Elasticity [4.458584890504334]
Code obfuscation aims to prevent reverse engineering and intellectual property theft. The recent development of large language models paves the way for practical applications in different domains. This work performs an empirical study on the ability of LLMs to obfuscate Python source code.
arXiv Detail & Related papers (2025-04-18T18:29:23Z) - Type-Constrained Code Generation with Language Models [51.03439021895432]
We introduce a type-constrained decoding approach that leverages type systems to guide code generation. For this purpose, we develop novel prefix automata and a search over inhabitable types, forming a sound approach to enforce well-typedness on LLM-generated code. Our approach reduces compilation errors by more than half and significantly increases functional correctness in code synthesis, translation, and repair tasks.
arXiv Detail & Related papers (2025-04-12T15:03:00Z) - Unmasking the Genuine Type Inference Capabilities of LLMs for Java Code Snippets [8.294192850975767]
Large Language Models (LLMs) are used to perform type inference for online code snippets. StatType-SO, the benchmark used for evaluation, has been publicly available on GitHub since 2017. This paper strives to comprehensively evaluate the genuine type inference capabilities of LLMs on Java code snippets.
arXiv Detail & Related papers (2025-03-06T04:13:40Z) - LINC: A Neurosymbolic Approach for Logical Reasoning by Combining Language Models with First-Order Logic Provers [60.009969929857704]
Logical reasoning is an important task for artificial intelligence with potential impacts on science, mathematics, and society.
In this work, we reformulate such tasks as modular neurosymbolic programming, which we call LINC.
We observe significant performance gains on FOLIO and a balanced subset of ProofWriter for three different models in nearly all experimental conditions we evaluate.
arXiv Detail & Related papers (2023-10-23T17:58:40Z) - A Comprehensive Review of State-of-The-Art Methods for Java Code Generation from Natural Language Text [0.0]
This paper provides a comprehensive review of the evolution and progress of deep learning models in the Java code generation task.
We focus on the most important methods and present their merits and limitations, as well as the objective functions used by the community.
arXiv Detail & Related papers (2023-06-10T07:27:51Z) - A Static Evaluation of Code Completion by Large Language Models [65.18008807383816]
Execution-based benchmarks have been proposed to evaluate functional correctness of model-generated code on simple programming problems.
Static analysis tools such as linters, which can detect errors without running the program, haven't been well explored for evaluating code generation models.
We propose a static evaluation framework to quantify static errors in Python code completions, by leveraging Abstract Syntax Trees.
arXiv Detail & Related papers (2023-06-05T19:23:34Z) - PAL: Program-aided Language Models [112.94785609781503]
We present Program-Aided Language models (PaL) to understand natural language problems.
PaL offloads the solution step to a programmatic runtime such as a Python interpreter.
We set new state-of-the-art results in all 12 benchmarks.
arXiv Detail & Related papers (2022-11-18T18:56:13Z) - Interactive Code Generation via Test-Driven User-Intent Formalization [60.90035204567797]
Large language models (LLMs) produce code from informal natural language (NL) intent.
It is hard to define a notion of correctness since natural language can be ambiguous and lacks a formal semantics.
We describe a language-agnostic abstract algorithm and a concrete implementation TiCoder.
arXiv Detail & Related papers (2022-08-11T17:41:08Z) - Evaluating few shot and Contrastive learning Methods for Code Clone Detection [5.1623866691702744]
Code Clone Detection is a software engineering task that is used for plagiarism detection, code search, and code comprehension.
Deep learning-based models have achieved an F1 score (a metric used to assess classifiers) of ~95% on the CodeXGLUE benchmark.
No previous study evaluates the generalizability of these models where a limited amount of annotated data is available.
arXiv Detail & Related papers (2022-04-15T15:01:55Z)