Evaluating the Impact of Data Cleaning on the Quality of Generated Pull Request Descriptions
- URL: http://arxiv.org/abs/2505.01120v1
- Date: Fri, 02 May 2025 08:58:42 GMT
- Title: Evaluating the Impact of Data Cleaning on the Quality of Generated Pull Request Descriptions
- Authors: Kutay Tire, Berk Çakar, Eray Tüzün
- Abstract summary: Pull Requests (PRs) are central to collaborative coding. Many PR descriptions are incomplete, uninformative, or contain out-of-context content. This study examines the prevalence of "noisy" PRs and evaluates their impact on description generation models.
- Score: 2.2134505920972547
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pull Requests (PRs) are central to collaborative coding, summarizing code changes for reviewers. However, many PR descriptions are incomplete, uninformative, or contain out-of-context content, compromising developer workflows and hindering AI-based generation models trained on commit messages and original descriptions as "ground truth." This study examines the prevalence of "noisy" PRs and evaluates their impact on state-of-the-art description generation models. To do so, we propose four cleaning heuristics to filter noise from an initial dataset of 169K+ PRs drawn from 513 GitHub repositories. We train four models (BART, T5, PRSummarizer, and iTAPE) on both raw and cleaned datasets. Performance is measured via ROUGE-1, ROUGE-2, and ROUGE-L metrics, alongside a manual evaluation to assess description quality improvements from a human perspective. Cleaning the dataset yields significant gains: average F1 improvements of 8.6% (ROUGE-1), 8.7% (ROUGE-2), and 8.5% (ROUGE-L). Manual assessment confirms higher readability and relevance in descriptions generated by the best-performing model, BART, when it is trained on cleaned data. Dataset refinement markedly enhances PR description generation, offering a foundation for more accurate AI-driven tools and for guidelines that assist developers in crafting high-quality PR descriptions.
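The abstract does not spell out the four cleaning heuristics here, so the sketch below is only an illustration of how PR-level noise filters and the reported ROUGE-1/2/L scoring could be wired together in Python. The specific heuristics, thresholds, and field names are assumptions, not the paper's.

```python
# Illustrative sketch: filter "noisy" PRs with assumed heuristics, then score
# generated descriptions with ROUGE-1/2/L. Heuristics and field names are
# assumptions; the paper's four heuristics may differ.
from rouge_score import rouge_scorer  # pip install rouge-score

def is_clean(pr: dict) -> bool:
    """Return True if the PR passes all (assumed) cleaning heuristics."""
    desc = (pr.get("description") or "").strip()
    words = desc.split()
    if len(words) < 5:                             # heuristic 1: too short to be informative
        return False
    if pr.get("author_is_bot", False):             # heuristic 2: bot-authored PRs
        return False
    if desc.lower().startswith("## checklist"):    # heuristic 3: untouched PR template
        return False
    if desc.startswith(("http://", "https://")) and len(words) == 1:
        return False                               # heuristic 4: link-only description
    return True

def rouge_f1(reference: str, generated: str) -> dict:
    """ROUGE-1/2/L F1 between a reference description and a generated one."""
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    return {name: s.fmeasure for name, s in scorer.score(reference, generated).items()}

prs = [{"description": "Fix NPE when the config file is missing", "author_is_bot": False}]
cleaned = [pr for pr in prs if is_clean(pr)]
print(rouge_f1("Fix crash on missing config file",
               "Fix NPE when the config file is missing"))
```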
Related papers
- Refine-n-Judge: Curating High-Quality Preference Chains for LLM-Fine-Tuning [14.254037571895404]
Large Language Models (LLMs) have demonstrated remarkable progress through preference-based fine-tuning. This paper introduces Refine-n-Judge, an automated iterative approach that leverages a single LLM as both a refiner and a judge to enhance dataset quality. We demonstrate the effectiveness of Refine-n-Judge across a range of public datasets spanning five corpora, targeting tasks such as coding, math, and conversation.
arXiv Detail & Related papers (2025-08-03T01:56:03Z)
- Advancing Retrieval-Augmented Generation for Structured Enterprise and Internal Data [0.0]
Large Language Models (LLMs) have strong generative capabilities. They are limited by static pretraining, short context windows, and challenges in processing heterogeneous data formats. Conventional Retrieval-Augmented Generation (RAG) frameworks address some of these gaps but often struggle with structured and semi-structured data. This work proposes an advanced RAG framework that combines hybrid retrieval strategies using dense embeddings (all-mpnet-base-v2) and BM25, enhanced by metadata-aware filtering with SpaCy NER and cross-encoder reranking.
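A minimal sketch of the hybrid retrieval pipeline summarized in this entry (dense all-mpnet-base-v2 embeddings plus BM25, followed by cross-encoder reranking), assuming the sentence-transformers and rank-bm25 libraries. The toy corpus, score-fusion weight, and cross-encoder model choice are illustrative, and the metadata-aware SpaCy NER filtering is omitted.

```python
# Sketch of hybrid retrieval (dense + BM25) with cross-encoder reranking.
# Corpus, fusion weight, and the cross-encoder model are illustrative choices.
from rank_bm25 import BM25Okapi                      # pip install rank-bm25
from sentence_transformers import SentenceTransformer, CrossEncoder, util

corpus = ["Invoice totals are stored in the billing table.",
          "The HR policy document covers remote work rules.",
          "Quarterly revenue is aggregated per region."]
query = "Where are invoice totals kept?"

# Dense retrieval with all-mpnet-base-v2 embeddings.
dense = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
doc_emb = dense.encode(corpus, convert_to_tensor=True)
q_emb = dense.encode(query, convert_to_tensor=True)
dense_scores = util.cos_sim(q_emb, doc_emb)[0].tolist()

# Sparse retrieval with BM25 over whitespace-tokenized documents.
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
bm25_scores = bm25.get_scores(query.lower().split()).tolist()

# Naive weighted fusion of the two (unnormalized) score lists -- an assumption.
alpha = 0.5
fused = [alpha * d + (1 - alpha) * b for d, b in zip(dense_scores, bm25_scores)]
candidates = sorted(range(len(corpus)), key=lambda i: fused[i], reverse=True)[:2]

# Cross-encoder reranking of the fused candidates.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
rerank_scores = reranker.predict([(query, corpus[i]) for i in candidates])
best = candidates[max(range(len(candidates)), key=lambda j: rerank_scores[j])]
print(corpus[best])
```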
arXiv Detail & Related papers (2025-07-16T17:13:06Z)
- Enhancing Training Data Attribution with Representational Optimization [57.61977909113113]
Training data attribution methods aim to measure how training data impacts a model's predictions. We propose AirRep, a representation-based approach that closes this gap by learning task-specific and model-aligned representations explicitly for TDA. AirRep introduces two key innovations: a trainable encoder tuned for attribution quality, and an attention-based pooling mechanism that enables accurate estimation of group-wise influence.
arXiv Detail & Related papers (2025-05-24T05:17:53Z)
- Fixing Data That Hurts Performance: Cascading LLMs to Relabel Hard Negatives for Robust Information Retrieval [54.68474647525667]
We propose a simple, cost-effective approach using cascading LLM prompts to identify and relabel hard negatives. We prune 8 out of 15 datasets from the BGE collection and increase nDCG@10 on BEIR by 1.0 point. Similar gains are observed for rerankers fine-tuned on the relabeled data, such as Qwen2.5-3B on BEIR.
arXiv Detail & Related papers (2025-05-22T17:47:57Z)
- R-PRM: Reasoning-Driven Process Reward Modeling [53.06844294668382]
Process Reward Models (PRMs) have emerged as a promising solution by evaluating each reasoning step. Existing PRMs typically output evaluation scores directly, limiting both learning efficiency and evaluation accuracy. We propose Reasoning-Driven Process Reward Modeling (R-PRM). R-PRM generates seed data from limited annotations, effectively bootstrapping our model's reasoning capabilities.
arXiv Detail & Related papers (2025-03-27T09:23:08Z)
- Too Noisy To Learn: Enhancing Data Quality for Code Review Comment Generation [2.990411348977783]
Open-source datasets are used to train neural models for automated code review tasks. These datasets contain a significant amount of noisy comments that persist despite cleaning methods. We propose a novel approach using large language models (LLMs) to further clean these datasets.
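A hedged sketch of the LLM-based cleaning idea described in this entry: prompt a model to decide whether each review comment is informative enough to keep as training data. The prompt wording, the gpt-4o-mini model choice, and the OpenAI-style client are assumptions; the paper's actual prompts and models may differ.

```python
# Sketch of LLM-assisted comment cleaning: keep only comments the model judges
# to be concrete, self-contained feedback. Prompt and model are illustrative.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = ("You filter training data for code review comment generation. "
          "Answer KEEP if the comment gives concrete, self-contained feedback "
          "about the code change, otherwise answer DROP.\n\nComment: {comment}")

def is_informative(comment: str) -> bool:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": PROMPT.format(comment=comment)}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().upper().startswith("KEEP")

comments = ["LGTM", "Please null-check `config` before dereferencing it."]
cleaned = [c for c in comments if is_informative(c)]
```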
arXiv Detail & Related papers (2025-02-04T22:48:58Z)
- Prompting and Fine-tuning Large Language Models for Automated Code Review Comment Generation [5.6001617185032595]
Large language models pretrained on both programming and natural language data tend to perform well in code-oriented tasks.
We fine-tune open-source large language models (LLMs) in a parameter-efficient, quantized low-rank fashion on consumer-grade hardware to improve review comment generation.
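A sketch of the parameter-efficient, quantized low-rank (QLoRA-style) setup this entry describes, using Hugging Face transformers, bitsandbytes, and peft. The base model, LoRA rank, and target modules are illustrative assumptions rather than the paper's configuration.

```python
# Sketch of a QLoRA-style setup for review comment generation on consumer GPUs.
# Base model, LoRA rank, and target modules are illustrative, not the paper's.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"  # assumed base model
bnb = BitsAndBytesConfig(load_in_4bit=True,
                         bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, quantization_config=bnb,
                                             device_map="auto")

# Attach low-rank adapters; only these small matrices are trained.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()
# From here, train with a standard Trainer/SFT loop on (diff, review comment) pairs.
```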
arXiv Detail & Related papers (2024-11-15T12:01:38Z)
- Improving Embedding Accuracy for Document Retrieval Using Entity Relationship Maps and Model-Aware Contrastive Sampling [0.0]
APEX-Embedding-7B is a 7-billion-parameter, decoder-only text feature extraction model.
Our approach employs two training techniques that yield an emergent improvement in factual focus.
Based on our evaluations, our model establishes a new state-of-the-art standard in text feature extraction for longer context document retrieval tasks.
arXiv Detail & Related papers (2024-10-08T17:36:48Z)
- Automatic Pull Request Description Generation Using LLMs: A T5 Model Approach [0.0]
We propose an automated method for generating PR descriptions based on commit messages and source code comments.
We fine-tuned a pre-trained T5 model using a dataset containing 33,466 PRs.
Our findings reveal that the T5 model significantly outperforms LexRank.
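A small sketch of the T5-based generation step this entry describes: commit messages and source code comments are concatenated into one input and a description is decoded with beam search. The "summarize:" prefix, the t5-base stand-in checkpoint, and the decoding settings are assumptions; the paper's preprocessing may differ.

```python
# Sketch: generate a PR description with a (fine-tuned) T5 checkpoint from
# commit messages and source code comments. Input format is an assumption.
from transformers import T5ForConditionalGeneration, T5Tokenizer

checkpoint = "t5-base"  # stand-in for the fine-tuned checkpoint
tokenizer = T5Tokenizer.from_pretrained(checkpoint)
model = T5ForConditionalGeneration.from_pretrained(checkpoint)

commits = ["Add retry logic to HTTP client", "Fix flaky timeout test"]
code_comments = ["# retry up to 3 times with exponential backoff"]
source = "summarize: " + " ".join(commits + code_comments)

inputs = tokenizer(source, return_tensors="pt", truncation=True, max_length=512)
ids = model.generate(**inputs, max_new_tokens=64, num_beams=4)
print(tokenizer.decode(ids[0], skip_special_tokens=True))
```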
arXiv Detail & Related papers (2024-08-01T21:22:16Z)
- Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback [110.16220825629749]
Learning from preference feedback has emerged as an essential step for improving the generation quality and performance of modern language models.
In this work, we identify four core aspects of preference-based learning: preference data, learning algorithm, reward model, and policy training prompts.
Our findings indicate that all aspects are important for performance, with better preference data leading to the largest improvements.
arXiv Detail & Related papers (2024-06-13T16:17:21Z)
- Silkie: Preference Distillation for Large Visual Language Models [56.10697821410489]
This paper explores preference distillation for large vision language models (LVLMs)
We first build a vision-language feedback dataset utilizing AI annotation.
We adopt GPT-4V to assess the generated outputs regarding helpfulness, visual faithfulness, and ethical considerations.
The resulting model, Silkie, achieves 6.9% and 9.5% relative improvements on the MME benchmark in perception and cognition capabilities, respectively.
arXiv Detail & Related papers (2023-12-17T09:44:27Z)
- CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation [87.44350003888646]
Eval-Instruct can acquire pointwise grading critiques with pseudo references and revise these critiques via multi-path prompting.
CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines.
arXiv Detail & Related papers (2023-11-30T16:52:42Z)
- Tool-Augmented Reward Modeling [58.381678612409]
We propose Themis, a tool-augmented preference modeling approach that addresses the limitations of conventional reward models (RMs) by giving them access to external environments.
Our study delves into the integration of external tools into RMs, enabling them to interact with diverse external sources.
In human evaluations, RLHF trained with Themis attains an average win rate of 32% when compared to baselines.
arXiv Detail & Related papers (2023-10-02T09:47:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.