Related papers: Time Travel: LLM-Assisted Semantic Behavior Localization with Git Bisect

URL: http://arxiv.org/abs/2511.18854v1
Date: Mon, 24 Nov 2025 07:49:59 GMT
Title: Time Travel: LLM-Assisted Semantic Behavior Localization with Git Bisect
Authors: Yujing Wang, Weize Hong
Abstract summary: We present a novel framework that integrates Large Language Models (LLMs) into the Git bisect process for semantic fault localization. Our system augments bisect traversal with structured chain-of-thought reasoning, enabling commit-by-commit analysis under noisy conditions.
Score: 8.55768450285885
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present a novel framework that integrates Large Language Models (LLMs) into the Git bisect process for semantic fault localization. Traditional bisect assumes deterministic predicates and binary failure states, assumptions that are often violated in modern software development due to flaky tests, nonmonotonic regressions, and semantic divergence from upstream repositories. Our system augments bisect traversal with structured chain-of-thought reasoning, enabling commit-by-commit analysis under noisy conditions. We evaluate multiple open-source and proprietary LLMs for their suitability and fine-tune DeepSeekCoderV2 using QLoRA on a curated dataset of semantically labeled diffs. We adopt a weak supervision workflow to reduce annotation overhead, incorporating human-in-the-loop corrections and self-consistency filtering. Experiments across multiple open-source projects show a 6.4-point absolute gain in success rate, from 74.2 to 80.6 percent, leading to significantly fewer failed traversals and, depending on the experiment, up to a 2x reduction in average bisect time. We conclude with discussions on temporal reasoning, prompt design, and fine-tuning strategies tailored for commit-level behavior analysis.

Related papers

DenoiseFlow: Uncertainty-Aware Denoising for Reliable LLM Agentic Workflows [20.319113495948294]
We formalize the multi-step reasoning process as a Noisy MDP. We propose DenoiseFlow, a closed-loop framework that performs progressive denoising through three coordinated stages.
arXiv Detail & Related papers (2026-02-28T08:11:38Z)
CORE: Context-Robust Remasking for Diffusion Language Models [51.59514489363897]
We propose Context-Robust Remasking (CORE), a training-free framework for inference-time revision. Rather than trusting static token probabilities, CORE identifies context-brittle tokens by probing their sensitivity to targeted masked-context perturbations. On LLaDA-8B-Base, CORE delivers consistent improvements across reasoning and code benchmarks, outperforming compute-matched baselines and improving MBPP by up to 9.2 percentage points.
arXiv Detail & Related papers (2026-02-04T00:12:30Z)
Detecting Multiple Semantic Concerns in Tangled Code Commits [1.2578844450585998]
Developers often bundle multiple concerns into tangled commits, obscuring intent and complicating maintenance. Recent studies have used Conventional Commits Specification (CCS) and Language Models (LMs) to capture commit intent. We present an empirical study using SLMs to detect multiple semantic concerns in tangled commits.
arXiv Detail & Related papers (2026-01-29T05:50:16Z)
CARE What Fails: Contrastive Anchored-REflection for Verifiable Multimodal [84.71254539482369]
Group-relative reinforcement learning with verifiable rewards (RLVR) often wastes the most informative data it already has: the failures. We present CARE, a failure-centric post-training framework for multimodal reasoning that turns errors into supervision. CARE improves accuracy and training smoothness while explicitly increasing the share of learning signal that comes from failures.
arXiv Detail & Related papers (2025-12-22T16:34:21Z)
CodeFuse-CommitEval: Towards Benchmarking LLM's Power on Commit Message and Code Change Inconsistency Detection [8.631593963090985]
Version control relies on commit messages to convey the rationale for code changes, but these messages are often low quality and inconsistent with their diffs, a problem known as message-code inconsistency (MCI). We introduce CODEFUSE-COMMITEVAL, the first benchmark designed for MCI detection using large language models (LLMs). We generate seven types of inconsistent messages through rule-guided mutations of originally consistent commits and apply two-fold validation to verify both positive and negative samples.
arXiv Detail & Related papers (2025-11-25T03:33:57Z)
Diffploit: Facilitating Cross-Version Exploit Migration for Open Source Library Vulnerabilities [13.559398564795048]
We propose Diffploit, an iterative, diff-driven exploit migration method structured around two key modules. We evaluate Diffploit on a large-scale dataset containing 102 Java CVEs and 689 version-migration tasks across 79 libraries. It successfully migrates 84.2% of exploits, outperforming the change-aware test repair tool TARGET by 52.0% and the rule-based tool in IDEA by 61.6%.
arXiv Detail & Related papers (2025-11-17T04:06:01Z)
SemGuard: Real-Time Semantic Evaluator for Correcting LLM-Generated Code [46.20378145112059]
Post-hoc repair pipelines detect such faults only after execution. We present SemGuard, a semantic-evaluator-driven framework that performs real-time, line-level semantic supervision.
arXiv Detail & Related papers (2025-09-29T09:21:32Z)
Probing Pre-trained Language Models on Code Changes: Insights from ReDef, a High-Confidence Just-in-Time Defect Prediction Dataset [0.0]
We present ReDef, a high-confidence benchmark of function-level modifications curated from 22 large-scale C/C++ projects. Defective cases are anchored by revert commits, while clean cases are validated through post-hoc history checks. This pipeline yields 3,164 defective and 10,268 clean modifications, offering substantially more reliable labels than existing resources.
arXiv Detail & Related papers (2025-09-11T07:07:11Z)
LLM-Based Detection of Tangled Code Changes for Higher-Quality Method-Level Bug Datasets [8.166584296080805]
We investigate the utility of Large Language Models for detecting tangled code changes by leveraging both commit messages and method-level code diffs. Our results demonstrate that combining commit messages with code diffs significantly enhances model performance. Applying our approach to 49 open-source projects improves the distributional separability of code metrics between buggy and non-buggy methods.
arXiv Detail & Related papers (2025-05-13T06:26:13Z)
Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [61.00662702026523]
We propose a unified Test-Time Compute scaling framework that leverages increased inference-time compute instead of larger models. Our framework incorporates two complementary strategies: internal TTC and external TTC. We demonstrate our 32B model achieves a 46% issue resolution rate, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1.
arXiv Detail & Related papers (2025-03-31T07:31:32Z)
RCOT: Detecting and Rectifying Factual Inconsistency in Reasoning by Reversing Chain-of-Thought [56.558892336235914]
Reversing Chain-of-Thought (RCoT) is a novel method to improve large language models' reasoning abilities. RCoT automatically detects and rectifies factual inconsistency in generated solutions. We show that manually written fine-grained feedback can dramatically improve LLMs' reasoning abilities.
arXiv Detail & Related papers (2023-05-19T08:02:52Z)
A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach [53.727460222955266]
Temporal Sentence Grounding in Videos (TSGV) aims to ground a natural language sentence in an untrimmed video. Recent studies have found that current benchmark datasets may have obvious moment annotation biases. We introduce a new evaluation metric "dR@n,IoU@m" that discounts the basic recall scores to alleviate the inflated evaluation caused by biased datasets.
arXiv Detail & Related papers (2022-03-10T08:58:18Z)
D2A: A Dataset Built for AI-Based Vulnerability Detection Methods Using Differential Analysis [55.15995704119158]
We propose D2A, a differential analysis based approach to label issues reported by static analysis tools. We use D2A to generate a large labeled dataset to train models for vulnerability identification.
arXiv Detail & Related papers (2021-02-16T07:46:53Z)

This list is automatically generated from the titles and abstracts of the papers on this site.