Sphinx: Benchmarking and Modeling for LLM-Driven Pull Request Review
- URL: http://arxiv.org/abs/2601.04252v1
- Date: Tue, 06 Jan 2026 18:49:56 GMT
- Title: Sphinx: Benchmarking and Modeling for LLM-Driven Pull Request Review
- Authors: Daoan Zhang, Shuo Zhang, Zijian Jin, Jiebo Luo, Shengyu Fu, Elsie Nallipogu
- Abstract summary: Pull request (PR) review is essential for ensuring software quality, yet it remains challenging due to noisy supervision, limited contextual understanding, and inadequate evaluation metrics. We present Sphinx, a unified framework for PR review that addresses these limitations through three key components: (1) a structured data generation pipeline that produces context-rich, semantically grounded review comments by comparing pseudo-modified and merged code; (2) a checklist-based evaluation benchmark that assesses review quality based on structured coverage of actionable verification points; and (3) Checklist Reward Policy Optimization (CRPO), a novel training paradigm that uses rule-based, interpretable rewards to align model behavior with real-world review practices.
- Score: 37.98161722413899
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pull request (PR) review is essential for ensuring software quality, yet automating this task remains challenging due to noisy supervision, limited contextual understanding, and inadequate evaluation metrics. We present Sphinx, a unified framework for LLM-based PR review that addresses these limitations through three key components: (1) a structured data generation pipeline that produces context-rich, semantically grounded review comments by comparing pseudo-modified and merged code; (2) a checklist-based evaluation benchmark that assesses review quality based on structured coverage of actionable verification points, moving beyond surface-level metrics like BLEU; and (3) Checklist Reward Policy Optimization (CRPO), a novel training paradigm that uses rule-based, interpretable rewards to align model behavior with real-world review practices. Extensive experiments show that models trained with Sphinx achieve state-of-the-art performance on review completeness and precision, outperforming both proprietary and open-source baselines by up to 40% in checklist coverage. Together, Sphinx enables the development of PR review models that are not only fluent but also context-aware, technically precise, and practically deployable in real-world development workflows. The data will be released after review.
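The abstract describes CRPO as using rule-based, interpretable rewards tied to checklist coverage. The paper's actual reward definition is not given here, but a minimal sketch of one plausible rule-based checklist reward (fraction of verification points a generated review covers) could look like the following; the function name and string-matching rule are illustrative assumptions, not the paper's method:

```python
def checklist_reward(review: str, checklist: list[str]) -> float:
    """Hypothetical rule-based reward: the fraction of checklist
    verification points that appear (case-insensitively) in the
    generated review text. A real implementation would likely use
    semantic matching rather than substring matching."""
    if not checklist:
        return 0.0
    review_lower = review.lower()
    hits = sum(1 for item in checklist if item.lower() in review_lower)
    return hits / len(checklist)
```

Such a scalar in [0, 1] could then serve as the reward signal in a standard policy-optimization loop, which is what makes the scheme interpretable: each reward value decomposes into which checklist items were and were not covered.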
Related papers
- On the Factual Consistency of Text-based Explainable Recommendation Models [2.2153783542347805]
We introduce a comprehensive framework for evaluating the factual consistency of text-based explainable recommenders. We design a prompting-based pipeline that uses LLMs to extract atomic explanatory statements from reviews. We propose statement-level alignment metrics that combine LLM- and NLI-based approaches to assess both factual consistency and relevance of generated explanations.
arXiv Detail & Related papers (2025-12-30T17:25:15Z) - Benchmarking LLMs for Fine-Grained Code Review with Enriched Context in Practice [18.222990693059756]
ContextCRBench is a benchmark for fine-grained LLM evaluation in code review. It collects 153.7K issues and pull requests from top-tier repositories. It supports three evaluation scenarios aligned with the review workflow.
arXiv Detail & Related papers (2025-11-10T12:06:35Z) - CodeFuse-CR-Bench: A Comprehensiveness-aware Benchmark for End-to-End Code Review Evaluation in Python Projects [23.9752442213364]
We introduce CodeFuse-CR-Bench, the first comprehensiveness-aware benchmark for repository-level CR evaluation. CodeFuse-CR-Bench comprises 601 high-quality instances from 70 Python projects covering nine Pull-Request (PR) problem domains. We present the first large-scale assessment of state-of-the-art Large Language Models (LLMs) on this comprehensive CR task.
arXiv Detail & Related papers (2025-09-18T11:24:09Z) - CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward [50.97588334916863]
We develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward. It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types. We introduce the VerifierBench benchmark, comprising model outputs collected from multiple data sources and augmented through manual analysis of meta-error patterns to enhance CompassVerifier.
arXiv Detail & Related papers (2025-08-05T17:55:24Z) - Meeseeks: A Feedback-Driven, Iterative Self-Correction Benchmark evaluating LLMs' Instruction Following Capability [21.96694731466089]
We introduce Meeseeks, a fully automated instruction-following benchmark equipped with an integrated feedback mechanism. Meeseeks identifies erroneous components in model responses and provides corresponding feedback accurately, thereby iteratively guiding the model toward self-correction. We conducted comprehensive analysis from both macro and instance levels, uncovering numerous common issues prevalent in current state-of-the-art models.
arXiv Detail & Related papers (2025-04-30T13:28:19Z) - StructTest: Benchmarking LLMs' Reasoning through Compositional Structured Outputs [78.84060166851805]
StructTest is a novel benchmark that evaluates large language models (LLMs) on their ability to follow compositional instructions and generate structured outputs. Assessments are conducted deterministically using a rule-based evaluator, which can be easily extended to new tasks and datasets. We demonstrate that StructTest remains challenging even for top-performing models like Deepseek-V3/R1 and GPT-4o.
arXiv Detail & Related papers (2024-12-23T22:08:40Z) - The Vulnerability of Language Model Benchmarks: Do They Accurately Reflect True LLM Performance? [1.3810901729134184]
Large Language Models (LLMs) excel at standardized tests while failing to demonstrate genuine language understanding and adaptability. Our systematic analysis of NLP evaluation frameworks reveals pervasive vulnerabilities across the evaluation spectrum. We lay the groundwork for new evaluation methods that resist manipulation, minimize data contamination, and assess domain-specific tasks.
arXiv Detail & Related papers (2024-12-02T20:49:21Z) - OpenFactCheck: Building, Benchmarking Customized Fact-Checking Systems and Evaluating the Factuality of Claims and LLMs [59.836774258359945]
OpenFactCheck is a framework for building customized automatic fact-checking systems. It allows users to easily customize an automatic fact-checker and verify the factual correctness of documents and claims. CheckerEVAL is a solution for gauging the reliability of automatic fact-checkers' verification results using human-annotated datasets.
arXiv Detail & Related papers (2024-05-09T07:15:19Z) - Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z) - Unveiling the Deficiencies of Pre-trained Text-and-Layout Models in Real-world Visually-rich Document Information Extraction [19.083538884467917]
We introduce EC-FUNSD, an entity-centric dataset crafted for benchmarking information extraction from visually-rich documents. We evaluate the real-world information extraction capabilities of PTLMs from multiple aspects, including their absolute performance, as well as generalization, robustness and fairness.
arXiv Detail & Related papers (2024-02-04T07:33:45Z) - Generating Benchmarks for Factuality Evaluation of Language Models [61.69950787311278]
We propose FACTOR: Factual Assessment via Corpus TransfORmation, a scalable approach for evaluating LM factuality.
FACTOR automatically transforms a factual corpus of interest into a benchmark evaluating an LM's propensity to generate true facts from the corpus vs. similar but incorrect statements.
We show that: (i) our benchmark scores increase with model size and improve when the LM is augmented with retrieval; (ii) benchmark score and perplexity do not always agree on model ranking; (iii) when perplexity and benchmark score disagree, the latter better reflects factuality in open-ended generation.
arXiv Detail & Related papers (2023-07-13T17:14:38Z)
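The FACTOR abstract describes scoring an LM by whether it prefers true statements from a corpus over similar but incorrect perturbations. A minimal sketch of that scoring rule follows; `factor_score` and the `log_prob` callable are hypothetical names for illustration, not the paper's API:

```python
def factor_score(cases, log_prob) -> float:
    """Sketch of a FACTOR-style benchmark score: the fraction of
    cases where the model assigns a strictly higher log-likelihood
    to the true statement than to every perturbed (false) variant.

    `cases` is a list of (true_text, [false_text, ...]) pairs;
    `log_prob` is any callable mapping a text string to the model's
    log-likelihood for it (an assumed interface)."""
    correct = sum(
        1
        for true_text, false_texts in cases
        if all(log_prob(true_text) > log_prob(f) for f in false_texts)
    )
    return correct / len(cases)
```

Requiring the true statement to beat *every* perturbation (rather than the average) matches the abstract's framing of measuring the model's propensity to generate true facts over similar incorrect ones.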
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers or information and is not responsible for any consequences.