Quality Evaluation of COBOL to Java Code Transformation
- URL: http://arxiv.org/abs/2507.23356v1
- Date: Thu, 31 Jul 2025 09:06:20 GMT
- Title: Quality Evaluation of COBOL to Java Code Transformation
- Authors: Shmulik Froimovich, Raviv Gal, Wesam Ibraheem, Avi Ziv
- Abstract summary: We present an automated evaluation system for assessing code translation within IBM's watsonx Code Assistant for Z (WCA4Z). The system addresses key challenges in evaluating LLM-based translators, including model opacity and the complexity of translation quality assessment.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present an automated evaluation system for assessing COBOL-to-Java code translation within IBM's watsonx Code Assistant for Z (WCA4Z). The system addresses key challenges in evaluating LLM-based translators, including model opacity and the complexity of translation quality assessment. Our approach combines analytic checkers with LLM-as-a-judge (LaaJ) techniques to deliver scalable, multi-faceted evaluations. The system supports continuous integration workflows, enables large-scale benchmarking, and reduces reliance on manual review. We describe the system architecture, evaluation strategies, and reporting mechanisms that provide actionable insights for developers and project managers, facilitating the evolution of high-quality, modernized codebases.
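As an illustration of how analytic checkers and an LLM-as-a-judge (LaaJ) rating might be blended into a single report, here is a minimal Python sketch; the checker heuristics, the judge callable, and the equal weighting are assumptions for illustration, not the WCA4Z implementation:

```python
import re
from typing import Callable

def compiles(java_source: str) -> float:
    """Toy analytic checker: a real pipeline would invoke javac and parse diagnostics."""
    balanced = java_source.count("{") == java_source.count("}")
    return 1.0 if "class" in java_source and balanced else 0.0

def identifier_coverage(cobol_source: str, java_source: str) -> float:
    """Rough structural checker: fraction of COBOL paragraph names echoed in the Java output."""
    names = set(re.findall(r"^ {7}([A-Z0-9-]+)\.", cobol_source, re.MULTILINE))
    if not names:
        return 1.0
    hits = sum(1 for n in names if n.replace("-", "").lower() in java_source.lower())
    return hits / len(names)

def evaluate(cobol_source: str, java_source: str,
             judge: Callable[[str, str], float], judge_weight: float = 0.5) -> dict:
    """Blend deterministic checker scores with an LLM-as-a-judge rating (all in 0..1)."""
    checkers = {
        "compiles": compiles(java_source),
        "identifier_coverage": identifier_coverage(cobol_source, java_source),
    }
    analytic = sum(checkers.values()) / len(checkers)
    judge_score = judge(cobol_source, java_source)
    overall = (1 - judge_weight) * analytic + judge_weight * judge_score
    return {"checkers": checkers, "judge": judge_score, "overall": overall}

# Usage with a placeholder judge; a real judge would prompt an LLM for a
# semantic-equivalence rating and normalize it to 0..1.
report = evaluate("       MAIN-PARA.\n           DISPLAY 'HI'.",
                  "public class Main { void mainPara() { System.out.println(\"HI\"); } }",
                  judge=lambda cobol, java: 0.8)
print(report)
```

In a CI setting, a report like this can be emitted per translated program and aggregated across a benchmark to track quality over model versions.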
Related papers
- Automated Validation of LLM-based Evaluators for Software Engineering Artifacts [0.7548538278943616]
REFINE (Ranking Evaluators for FIne-grained Nuanced Evaluation) is an automated framework for benchmarking large language model (LLM)-based evaluators. REFINE applies novel generation techniques to automatically synthesize artifacts with progressively reduced quality. It quantifies each candidate evaluator configuration by measuring how closely its rankings align with the expected ordering.
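A minimal sketch of that ranking-alignment measurement, assuming the synthesized artifacts come with a known degradation order and have already been scored by one evaluator configuration (the rank-correlation choice and the scale are illustrative, not REFINE's exact metric):

```python
from itertools import combinations

def kendall_tau(expected_rank: list[int], evaluator_scores: list[float]) -> float:
    """Kendall-style agreement: how often the evaluator orders a pair of artifacts
    the same way as the expected (degradation-based) ordering."""
    concordant = discordant = 0
    for i, j in combinations(range(len(expected_rank)), 2):
        expected = expected_rank[i] - expected_rank[j]
        observed = evaluator_scores[i] - evaluator_scores[j]
        if expected * observed > 0:
            concordant += 1
        elif expected * observed < 0:
            discordant += 1
    total = concordant + discordant
    return (concordant - discordant) / total if total else 0.0

# Artifact 0 is the original; artifacts 1..3 are progressively degraded copies.
expected_rank = [4, 3, 2, 1]                  # higher = better quality
candidate_scores = [0.92, 0.85, 0.40, 0.55]   # scores from one evaluator configuration
print(kendall_tau(expected_rank, candidate_scores))  # closer to 1.0 = better alignment
```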
arXiv Detail & Related papers (2025-08-04T18:52:01Z) - evalSmarT: An LLM-Based Framework for Evaluating Smart Contract Generated Comments [0.0]
We present evalSmarT, a modular framework that leverages large language models (LLMs) as evaluators. We demonstrate its application in benchmarking comment generation tools and selecting the most informative outputs.
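Selecting the most informative comment among candidates with an LLM evaluator could look like the sketch below; the `pick_best_comment` helper, the scoring callable, and the toy judge are hypothetical stand-ins for whatever model and rubric the framework actually uses:

```python
from typing import Callable

def pick_best_comment(contract_source: str,
                      candidates: list[str],
                      score_comment: Callable[[str, str], float]) -> str:
    """Return the candidate comment the evaluator rates highest for the given contract."""
    scored = [(score_comment(contract_source, c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[0][1]

# Toy stand-in for an LLM judge: rewards longer comments that mention 'transfer'.
dummy_judge = lambda src, comment: len(comment) / 100 + ("transfer" in comment)

best = pick_best_comment(
    "function transfer(address to, uint amount) public { ... }",
    ["Moves tokens.",
     "Transfers `amount` tokens to `to`, reverting on insufficient balance."],
    dummy_judge,
)
print(best)
```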
arXiv Detail & Related papers (2025-07-28T12:37:43Z) - Discrete Tokenization for Multimodal LLMs: A Comprehensive Survey [69.45421620616486]
This work presents the first structured taxonomy and analysis of discrete tokenization methods designed for large language models (LLMs). We categorize 8 representative VQ variants that span classical and modern paradigms and analyze their algorithmic principles, training dynamics, and integration challenges with LLM pipelines. We identify key challenges including codebook collapse, unstable gradient estimation, and modality-specific encoding constraints.
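The basic vector-quantization step these VQ variants build on can be sketched in a few lines of NumPy: nearest-codeword assignment plus an exponential-moving-average codebook update, one common remedy for codebook collapse. The dimensions, decay, and data here are arbitrary illustration values:

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(64, 16))   # K codewords of dimension d
ema_counts = np.ones(64)
ema_sums = codebook.copy()
decay = 0.99

def quantize(z: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Map each continuous vector in z (shape N x d) to its nearest codeword."""
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (N, K) squared distances
    idx = d2.argmin(axis=1)
    return codebook[idx], idx

def ema_update(z: np.ndarray, idx: np.ndarray) -> None:
    """EMA codebook update: keeps codeword statistics smooth so rarely used
    entries do not die off immediately, mitigating codebook collapse."""
    global codebook
    one_hot = np.eye(codebook.shape[0])[idx]                     # (N, K)
    ema_counts[:] = decay * ema_counts + (1 - decay) * one_hot.sum(0)
    ema_sums[:] = decay * ema_sums + (1 - decay) * (one_hot.T @ z)
    codebook = ema_sums / ema_counts[:, None]

z = rng.normal(size=(128, 16))          # a batch of encoder outputs
zq, idx = quantize(z)
ema_update(z, idx)
# In an end-to-end model, gradients would bypass the non-differentiable argmin via a
# straight-through estimator (z + (zq - z).detach() in PyTorch notation); the unstable
# gradient estimation mentioned above refers to exactly this step.
```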
arXiv Detail & Related papers (2025-07-21T10:52:14Z) - MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks [56.34018316319873]
We propose MERA Code, a benchmark for evaluating the latest code generation LLMs on tasks formulated in Russian. The benchmark includes 11 evaluation tasks that span 8 programming languages. We evaluate open LLMs and frontier API models, analyzing their limitations on practical coding tasks in non-English languages.
arXiv Detail & Related papers (2025-07-16T14:31:33Z) - SIMCOPILOT: Evaluating Large Language Models for Copilot-Style Code Generation [5.880496520248658]
SIMCOPILOT is a benchmark that simulates the role of large language models (LLMs) as interactive, "copilot"-style coding assistants. The benchmark comprises dedicated sub-benchmarks for Java (SIMCOPILOTJ) and Python.
arXiv Detail & Related papers (2025-05-21T04:59:44Z) - Skeleton-Guided-Translation: A Benchmarking Framework for Code Repository Translation with Fine-Grained Quality Evaluation [37.25839260805938]
Skeleton-Guided-Translation is a framework for repository-level Java to C# code translation with fine-grained quality evaluation. We present TRANSREPO-BENCH, a benchmark of high-quality open-source Java repositories and their corresponding C# skeletons.
arXiv Detail & Related papers (2025-01-27T13:44:51Z) - Can Large Language Models Serve as Evaluators for Code Summarization? [47.21347974031545]
Large Language Models (LLMs) can serve as effective evaluators for code summarization methods. The proposed approach, CODERPE, prompts LLMs to play diverse roles, such as code reviewer, code author, code editor, and system analyst. CODERPE achieves an 81.59% Spearman correlation with human evaluations, outperforming the existing BERTScore metric by 17.27%.
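A hedged sketch of the role-playing idea: query the same model under several reviewer personas, average the ratings, and check agreement with human scores via Spearman correlation. The role names come from the abstract; the prompt wording, the 1-5 scale, and the aggregation are assumptions, not CODERPE's exact recipe:

```python
from statistics import mean

ROLES = ["code reviewer", "code author", "code editor", "system analyst"]

def role_play_score(code: str, summary: str, ask_llm) -> float:
    """Average the 1-5 ratings the model gives when adopting each role.
    ask_llm is assumed to return a string parseable as a number."""
    ratings = []
    for role in ROLES:
        prompt = (f"You are a {role}. Rate how well this summary describes the code "
                  f"on a 1-5 scale.\nCode:\n{code}\nSummary:\n{summary}\nRating:")
        ratings.append(float(ask_llm(prompt)))
    return mean(ratings)

def spearman(xs: list[float], ys: list[float]) -> float:
    """Spearman correlation = Pearson correlation of the ranks (ties ignored for brevity)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var = (sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)) ** 0.5
    return cov / var if var else 0.0
```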
arXiv Detail & Related papers (2024-12-02T09:56:18Z) - Leveraging Large Language Models for NLG Evaluation: Advances and Challenges [57.88520765782177]
Large Language Models (LLMs) have opened new avenues for assessing generated content quality, e.g., coherence, creativity, and context relevance.
We propose a coherent taxonomy for organizing existing LLM-based evaluation metrics, offering a structured framework to understand and compare these methods.
By discussing unresolved challenges, including bias, robustness, domain-specificity, and unified evaluation, this paper seeks to offer insights to researchers and advocate for fairer and more advanced NLG evaluation techniques.
arXiv Detail & Related papers (2024-01-13T15:59:09Z) - DIALIGHT: Lightweight Multilingual Development and Evaluation of Task-Oriented Dialogue Systems with Large Language Models [76.79929883963275]
DIALIGHT is a toolkit for developing and evaluating multilingual Task-Oriented Dialogue (ToD) systems.
It features a secure, user-friendly web interface for fine-grained human evaluation at both local utterance level and global dialogue level.
Our evaluations reveal that while PLM fine-tuning leads to higher accuracy and coherence, LLM-based systems excel in producing diverse and likeable responses.
arXiv Detail & Related papers (2024-01-04T11:27:48Z) - AlignBench: Benchmarking Chinese Alignment of Large Language Models [99.24597941555277]
We introduce AlignBench, a comprehensive benchmark for evaluating Chinese Large Language Models' alignment.
We design a human-in-the-loop data curation pipeline, containing eight main categories, 683 real-scenario rooted queries and corresponding human verified references.
For automatic evaluation, our benchmark employs a rule-calibrated multi-dimensional LLM-as-Judge (Zheng et al., 2023) approach with Chain-of-Thought to generate explanations and final ratings.
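The multi-dimensional chain-of-thought judging step could be set up roughly as follows; the dimension names, prompt wording, and rating format here are illustrative placeholders rather than AlignBench's actual rubric:

```python
import re

DIMENSIONS = ["factual correctness", "relevance to the query", "clarity", "completeness"]

def build_judge_prompt(query: str, reference: str, answer: str) -> str:
    """Ask the judge to reason per dimension, then emit a single final rating."""
    dims = "\n".join(f"- {d}" for d in DIMENSIONS)
    return (
        "Evaluate the answer against the human-verified reference.\n"
        f"Query: {query}\nReference: {reference}\nAnswer: {answer}\n\n"
        "First reason step by step about each dimension:\n"
        f"{dims}\n"
        "Then finish with a line of the form 'Final rating: <1-10>'."
    )

def parse_rating(judge_output: str) -> int | None:
    """Pull the final 1-10 rating out of the judge's chain-of-thought output."""
    match = re.search(r"Final rating:\s*(\d+)", judge_output)
    return int(match.group(1)) if match else None

print(parse_rating("...clear and complete.\nFinal rating: 8"))  # -> 8
```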
arXiv Detail & Related papers (2023-11-30T17:41:30Z) - L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models [102.00201523306986]
We present L2CEval, a systematic evaluation of the language-to-code generation capabilities of large language models (LLMs).
We analyze the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods.
In addition to assessing model performance, we measure confidence calibration for the models and conduct human evaluations of the output programs.
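Confidence calibration for code generation is commonly summarized with expected calibration error (ECE) over pass/fail execution outcomes; a minimal version is sketched below, with the bin count and the source of the confidences (e.g., normalized model log-probabilities) left as assumptions:

```python
def expected_calibration_error(confidences: list[float],
                               passed: list[bool],
                               n_bins: int = 10) -> float:
    """ECE: weighted average gap between mean confidence and empirical pass rate per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, passed):
        bins[min(int(conf * n_bins), n_bins - 1)].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(ok for _, ok in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece

# e.g. model confidences for generated programs, each executed against its test suite
print(expected_calibration_error([0.9, 0.8, 0.3, 0.6], [True, True, False, True]))
```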
arXiv Detail & Related papers (2023-09-29T17:57:00Z)