Related papers: M2G-Eval: Enhancing and Evaluating Multi-granularity Multilingual Code Generation

M2G-Eval: Enhancing and Evaluating Multi-granularity Multilingual Code Generation

URL: http://arxiv.org/abs/2512.22628v1
Date: Sat, 27 Dec 2025 16:00:46 GMT
Title: M2G-Eval: Enhancing and Evaluating Multi-granularity Multilingual Code Generation
Authors: Fanglin Xu, Wei Zhang, Jian Yang, Guo Chen, Aishan Liu, Zhoujun Li, Xianglong Liu, Bryan Dai,
Abstract summary: We introduce M2G-Eval, a framework for evaluating code generation in large language models (LLMs) across four levels: Class, Function, Block, and Line.<n>M2G-Eval includes 17K+ training tasks and 1,286 human-annotated, contamination-controlled test instances.
Score: 42.21777678623796
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The rapid advancement of code large language models (LLMs) has sparked significant research interest in systematically evaluating their code generation capabilities, yet existing benchmarks predominantly assess models at a single structural granularity and focus on limited programming languages, obscuring fine-grained capability variations across different code scopes and multilingual scenarios. We introduce M2G-Eval, a multi-granularity, multilingual framework for evaluating code generation in large language models (LLMs) across four levels: Class, Function, Block, and Line. Spanning 18 programming languages, M2G-Eval includes 17K+ training tasks and 1,286 human-annotated, contamination-controlled test instances. We develop M2G-Eval-Coder models by training Qwen3-8B with supervised fine-tuning and Group Relative Policy Optimization. Evaluating 30 models (28 state-of-the-art LLMs plus our two M2G-Eval-Coder variants) reveals three main findings: (1) an apparent difficulty hierarchy, with Line-level tasks easiest and Class-level most challenging; (2) widening performance gaps between full- and partial-granularity languages as task complexity increases; and (3) strong cross-language correlations, suggesting that models learn transferable programming concepts. M2G-Eval enables fine-grained diagnosis of code generation capabilities and highlights persistent challenges in synthesizing complex, long-form code.

Related papers

MaDiS: Taming Masked Diffusion Language Models for Sign Language Generation [78.75809158246723]
We present MaDiS, a masked-diffusion-based language model for SLG that captures bidirectional and supports efficient parallel multi-token generation.<n>We also introduce a tri-level cross-modal pretraining scheme that jointly learns from token-, latent-Hearing, and 3D-space objectives.<n>MaDiS achieves superior performance across multiple metrics, including DTW error and two newly introduced metrics, SiBLEU and SiCLIP, while reducing inference latency by nearly 30%.
arXiv Detail & Related papers (2026-01-27T13:06:47Z)
SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models [59.90381306452982]
evaluating large language models (LLMs) for software engineering has been limited by narrow task coverage, language bias, and insufficient alignment with real-world developer.<n>We introduce SWE-1, a comprehensive benchmark that unifies heterogeneous code-related evaluations into a structured and production-aligned framework.<n>SWE- spans 8 task types, 8 programming scenarios, and 10 programming languages, with 2000 high-quality instances curated from authentic GitHub pull requests.
arXiv Detail & Related papers (2025-11-07T18:01:32Z)
AutoCodeBench: Large Language Models are Automatic Code Benchmark Generators [11.285930594120076]
We introduce AutoCodeGen, an automated method for generating high-difficulty multilingual code generation datasets without manual annotations.<n>We evaluate over 30 leading open-source and proprietary LLMs on AutoCodeBench and its simplified version AutoCodeBench-Lite.<n>The results show that even the most advanced LLMs struggle with the complexity, diversity, and multilingual nature of these tasks.
arXiv Detail & Related papers (2025-08-12T17:29:20Z)
IFEvalCode: Controlled Code Generation [69.28317223249358]
The paper introduces forward and backward constraints generation to improve the instruction-following capabilities of Code LLMs.<n>The authors present IFEvalCode, a multilingual benchmark comprising 1.6K test samples across seven programming languages.
arXiv Detail & Related papers (2025-07-30T08:08:48Z)
MGS3: A Multi-Granularity Self-Supervised Code Search Framework [22.214324677526132]
We introduce a novel Multi-Granularity Self-Supervised contrastive learning code Search framework (MGS$3$)<n>First, MGS$3$ features a Supervised Multi-Granularity Representation module (HMGR), which aggregates fine-grained information into coarser-grained representations.<n>We conduct extensive experiments on code search benchmarks across various granularities, demonstrating that the framework exhibits outstanding performance in code search tasks of multiple granularities.
arXiv Detail & Related papers (2025-05-30T06:49:39Z)
CodeVisionary: An Agent-based Framework for Evaluating Large Language Models in Code Generation [11.174059895410359]
Large language models (LLMs) have demonstrated strong capabilities in code generation.<n>Existing evaluation approaches fall into three categories, including human-centered, metric-based, and LLM-based.<n>We propose CodeVisionary, the first agent-based evaluation framework for complex code generation.
arXiv Detail & Related papers (2025-04-18T05:26:32Z)
M2rc-Eval: Massively Multilingual Repository-level Code Completion Evaluation [39.6123499117046]
We propose a massively multilingual repository-level code completion benchmark covering 18 programming languages. Two types of fine-grained annotations (i.e., bucket-level and semantic-level) are provided on different completion scenarios. We also curate a massively multilingual instruction corpora M2RC- INSTRUCT dataset to improve the repository-level code completion abilities of existing code Large Language Models.
arXiv Detail & Related papers (2024-10-28T15:58:41Z)
Multi-Programming Language Ensemble for Code Generation in Large Language Model [5.882816711878273]
Large language models (LLMs) have significantly improved code generation, particularly in one-pass code generation. Most existing approaches focus solely on generating code in a single programming language, overlooking the potential of leveraging the multi-language capabilities of LLMs. We propose Multi-Programming Language Ensemble (MPLE), a novel ensemble-based method that utilizes code generation across multiple programming languages to enhance overall performance.
arXiv Detail & Related papers (2024-09-06T08:31:18Z)
The Struggles of LLMs in Cross-lingual Code Clone Detection [3.5202378300682162]
Cross-lingual code clone detection has gained traction within the software engineering community.<n>Inspired by the significant advances in machine learning, this paper revisits cross-lingual code clone detection.<n>We evaluate the performance of five (05) Large Language Models (LLMs) and eight prompts (08) for the identification of cross-lingual code clones.
arXiv Detail & Related papers (2024-08-08T12:57:14Z)
What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [92.62952504133926]
This study evaluated the performance of three leading closed-source LLMs and six popular open-source LLMs on three commonly used benchmarks.<n>We developed a taxonomy of bugs for incorrect codes and analyzed the root cause for common bug types.<n>We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code.
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
Branch-Solve-Merge Improves Large Language Model Evaluation and Generation [136.7876524839751]
Large Language Models (LLMs) are frequently used for multi-faceted language generation and evaluation tasks. We propose Branch-Merge (BSM), a Large Language Model program (Schlag et al., 2023) for tackling such challenging natural language tasks. BSM improves the evaluation correctness and consistency for each LLM by enhancing human-LLM agreement by up to 26%.
arXiv Detail & Related papers (2023-10-23T17:29:48Z)
L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models [102.00201523306986]
We present L2CEval, a systematic evaluation of the language-to-code generation capabilities of large language models (LLMs) We analyze the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods. In addition to assessing model performance, we measure confidence calibration for the models and conduct human evaluations of the output programs.
arXiv Detail & Related papers (2023-09-29T17:57:00Z)

This list is automatically generated from the titles and abstracts of the papers in this site.