ReleaseEval: A Benchmark for Evaluating Language Models in Automated Release Note Generation
- URL: http://arxiv.org/abs/2511.02713v1
- Date: Tue, 04 Nov 2025 16:31:44 GMT
- Title: ReleaseEval: A Benchmark for Evaluating Language Models in Automated Release Note Generation
- Authors: Qianru Meng, Zhaochun Ren, Joost Visser
- Abstract summary: ReleaseEval is a benchmark designed to evaluate language models for automated release note generation. It comprises 94,987 release notes from 3,369 repositories across 6 programming languages. Both automated and human evaluations show that large language models consistently outperform traditional baselines.
- Score: 20.424587551582153
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automated release note generation addresses the challenge of documenting frequent software updates, where manual efforts are time-consuming and prone to human error. Although recent advances in language models further enhance this process, progress remains hindered by dataset limitations, including the lack of explicit licensing and limited reproducibility, and incomplete task design that relies mainly on commit messages for summarization while overlooking fine-grained contexts such as commit hierarchies and code changes. To fill this gap, we introduce ReleaseEval, a reproducible and openly licensed benchmark designed to systematically evaluate language models for automated release note generation. ReleaseEval comprises 94,987 release notes from 3,369 repositories across 6 programming languages, and supports three task settings with three levels of input granularity: (1) commit2sum, which generates release notes from commit messages; (2) tree2sum, which incorporates commit tree structures; and (3) diff2sum, which leverages fine-grained code diffs. Both automated and human evaluations show that large language models consistently outperform traditional baselines across all tasks, achieving substantial gains on tree2sum, while still struggling on diff2sum. These findings highlight LLMs' proficiency in leveraging structured information while revealing challenges in abstracting from long code diffs.
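The abstract's three task settings differ only in the granularity of the input handed to the model. The following is a minimal sketch of how such inputs might be assembled for a single release; the `Release` fields and serialization choices are illustrative assumptions, not the benchmark's actual schema or data format.

```python
# Hypothetical sketch of input construction for ReleaseEval's three task
# settings (commit2sum, tree2sum, diff2sum). Field names and formats are
# assumptions for illustration only.
from dataclasses import dataclass
from typing import List


@dataclass
class Release:
    note: str                   # ground-truth release note (generation target)
    commit_messages: List[str]  # flat list of commit messages
    commit_tree: str            # serialized commit hierarchy (e.g. merge/branch structure)
    code_diffs: List[str]       # unified diffs, one per commit


def build_input(release: Release, task: str) -> str:
    """Assemble the model input at one of three levels of granularity."""
    if task == "commit2sum":    # coarsest: commit messages only
        return "\n".join(release.commit_messages)
    if task == "tree2sum":      # adds commit tree structure
        return release.commit_tree
    if task == "diff2sum":      # finest: raw code diffs
        return "\n".join(release.code_diffs)
    raise ValueError(f"unknown task: {task}")
```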
Related papers
- DiffuRank: Effective Document Reranking with Diffusion Language Models [71.16830004674513]
We propose DiffuRank, a reranking framework built upon diffusion language models (dLLMs). dLLMs support more flexible decoding and generation processes that are not constrained to a left-to-right order. We show dLLMs achieve performance comparable to, and in some cases exceeding, that of autoregressive LLMs with similar model sizes.
arXiv Detail & Related papers (2026-02-13T02:18:14Z) - SemanticForge: Repository-Level Code Generation through Semantic Knowledge Graphs and Constraint Satisfaction [7.46733617565624]
Large language models (LLMs) have transformed software development by enabling automated code generation, yet they frequently suffer from systematic errors that limit practical deployment. We identify two critical failure modes: logical hallucination (incorrect control/data-flow reasoning) and schematic hallucination (type mismatches, signature violations, and architectural inconsistencies). This paper presents SemanticForge, which introduces four fundamental algorithmic advances for semantically-aware code generation.
arXiv Detail & Related papers (2025-11-10T19:53:23Z) - PLSemanticsBench: Large Language Models As Programming Language Interpreters [31.611330217819713]
As large language models (LLMs) excel at code reasoning, a natural question arises: can an LLM execute programs (i.e., act as an interpreter) purely based on a programming language's formal semantics? We study this question using the imperative language IMP, formalized via small-step operational semantics (SOS) and rewriting-based operational semantics (K-semantics). We introduce three evaluation sets (Human-Written, LLM-Translated, and Fuzzer-Generated) whose difficulty is controlled by code-complexity metrics spanning the size, control-flow, and data-flow axes.
arXiv Detail & Related papers (2025-10-03T18:23:26Z) - SLICET5: Static Program Slicing using Language Models with Copy Mechanism and Constrained Decoding [13.61350801915956]
Static program slicing is a fundamental technique in software engineering. SLICET5 is a novel slicing framework that reformulates static program slicing as a sequence-to-sequence task. SLICET5 consistently outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2025-09-22T03:14:47Z) - Code Review Without Borders: Evaluating Synthetic vs. Real Data for Review Recommendation [37.86790434630698]
Large Language Models (LLMs) are used to translate code changes from well-resourced languages into equivalent changes in underrepresented or emerging languages. We compare the performance of these models against models trained on real labelled data. This approach provides a scalable pathway to extend automated code review capabilities to rapidly evolving technology stacks.
arXiv Detail & Related papers (2025-09-05T05:17:14Z) - MRG-Bench: Evaluating and Exploring the Requirements of Context for Repository-Level Code Generation [0.7342677574855649]
We introduce MRG-Bench, a novel dataset that provides a more accurate evaluation of large language models. We conduct experiments covering large language models, long-context models, and RAG-related methods. Results show that the majority of methods suffer from "difficulty in understanding user requirements," failing to comprehend their assigned tasks accurately.
arXiv Detail & Related papers (2025-08-05T01:53:45Z) - SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling [58.05959902776133]
We introduce Single-Pass Annotation with Reference-Guided Evaluation (SPARE), a novel structured framework that enables efficient per-step annotation. We demonstrate SPARE's effectiveness across four diverse datasets spanning mathematical reasoning (GSM8K, MATH), multi-hop question answering (MuSiQue-Ans), and spatial reasoning (SpaRP). On ProcessBench, SPARE demonstrates data-efficient out-of-distribution generalization, using only about 16% of training samples compared to human-labeled and other synthetically trained baselines.
arXiv Detail & Related papers (2025-06-18T14:37:59Z) - Localizing Factual Inconsistencies in Attributable Text Generation [74.11403803488643]
We introduce QASemConsistency, a new formalism for localizing factual inconsistencies in attributable text generation. We show that QASemConsistency yields factual consistency scores that correlate well with human judgments.
arXiv Detail & Related papers (2024-10-09T22:53:48Z) - Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles of subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z) - Information Association for Language Model Updating by Mitigating LM-Logical Discrepancy [68.31760483418901]
Large Language Models (LLMs) struggle to provide current information because their pre-training data becomes outdated.
Existing methods for updating LLMs, such as knowledge editing and continual fine-tuning, have significant drawbacks in the generalizability of new information.
We identify the core challenge behind these drawbacks: the LM-logical discrepancy featuring the difference between language modeling probabilities and logical probabilities.
arXiv Detail & Related papers (2023-05-29T19:48:37Z) - mFACE: Multilingual Summarization with Factual Consistency Evaluation [79.60172087719356]
Abstractive summarization has enjoyed renewed interest in recent years, thanks to pre-trained language models and the availability of large-scale datasets.
Despite promising results, current models still suffer from generating factually inconsistent summaries.
We leverage factual consistency evaluation models to improve multilingual summarization.
arXiv Detail & Related papers (2022-12-20T19:52:41Z) - Confident Adaptive Language Modeling [95.45272377648773]
CALM is a framework for dynamically allocating different amounts of compute per input and generation timestep.
We demonstrate the efficacy of our framework in reducing compute, with a potential speedup of up to 3x, while provably maintaining high performance.
arXiv Detail & Related papers (2022-07-14T17:00:19Z)