Related papers: Automated Repair of Ambiguous Natural Language Requirements

URL: http://arxiv.org/abs/2505.07270v2
Date: Sat, 07 Jun 2025 13:36:11 GMT
Title: Automated Repair of Ambiguous Natural Language Requirements
Authors: Haoxiang Jia, Robbie Morris, He Ye, Federica Sarro, Sergey Mechtaev
Abstract summary: Large language models (LLMs) in software engineering have amplified the role of natural language (NL). We introduce automated repair of ambiguous NL requirements, which we approach by reducing code generation uncertainty. Our results show that SpecFix modifies 23.93% of the requirements, leading to a 33.66% improvement in model Pass@1 on the modified requirements.
Score: 9.379494157034083
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The widespread adoption of large language models (LLMs) in software engineering has amplified the role of natural language (NL). The inherent ambiguity of NL threatens software quality, because ambiguous requirements may lead to faulty program generation. The complexity of ambiguity detection and resolution motivates us to introduce automated repair of ambiguous NL requirements, which we approach by reducing code generation uncertainty and aligning NL with input-output examples. Repairing ambiguity in requirements is a difficult challenge for LLMs, as it demands metacognition - the model must understand how its own interpretation changes when the text is altered. Our experiments show that directly prompting an LLM to detect and resolve ambiguities results in irrelevant or inconsistent clarifications. Our key insight is to decompose this problem into simpler sub-problems that do not require metacognitive reasoning. First, we analyze and repair the LLM's interpretation of requirements embodied by the distribution of programs they induce by using traditional testing and program repair. Second, we repair requirements based on the changes to the distribution via contrastive specification inference. We implemented this proposal, dubbed SpecFix, and evaluated it by using three state-of-the-art LLMs (GPT-4o, DeepSeek-V3 and Qwen2.5-Coder-32b) across two widely used code generation benchmarks, namely HumanEval+ and MBPP+. Our results show that SpecFix, operating autonomously without human intervention or external information, modifies 23.93% of the requirements, leading to a 33.66% improvement in model Pass@1 on the modified requirements. Across the entire benchmark, this corresponds to a 4.3% increase in overall Pass@1. Importantly, SpecFix's repairs generalize across models: requirements repaired by one model boost the performance of other models by 9.6%.
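The notion of "code generation uncertainty" over "the distribution of programs" a requirement induces can be illustrated with a minimal sketch. This is not SpecFix's implementation: it assumes, purely for illustration, that sampled programs are available as Python callables, that programs are grouped by their observable outputs on a few probe inputs, and that uncertainty is summarized as Shannon entropy over those groups. The names `behaviour_signature`, `interpretation_entropy`, `mid_upper` and `mid_lower` are hypothetical.

```python
import math
from collections import Counter
from typing import Any, Callable, Sequence, Tuple


def behaviour_signature(program: Callable[[Any], Any],
                        inputs: Sequence[Any]) -> Tuple[str, ...]:
    """Summarize a program by what it returns (or raises) on each probe input."""
    signature = []
    for value in inputs:
        try:
            signature.append(repr(program(value)))
        except Exception as exc:  # a crash is also an observable behaviour
            signature.append(f"error:{type(exc).__name__}")
    return tuple(signature)


def interpretation_entropy(programs: Sequence[Callable[[Any], Any]],
                           inputs: Sequence[Any]) -> float:
    """Shannon entropy (in bits) over groups of behaviourally identical programs.

    0.0 means every sampled program behaves the same on the probe inputs;
    larger values mean the samples split into competing interpretations.
    """
    counts = Counter(behaviour_signature(p, inputs) for p in programs)
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Hypothetical ambiguous requirement: "return the middle element of the list".
# For even-length lists, two readings are plausible.
def mid_upper(xs):
    return xs[len(xs) // 2]


def mid_lower(xs):
    return xs[(len(xs) - 1) // 2]


probe_inputs = [[1, 2, 3], [1, 2, 3, 4]]
ambiguous_samples = [mid_upper, mid_lower, mid_upper, mid_lower]
consistent_samples = [mid_upper, mid_upper, mid_upper, mid_upper]

print(interpretation_entropy(ambiguous_samples, probe_inputs))
print(interpretation_entropy(consistent_samples, probe_inputs))
```

Under these assumptions, a requirement whose samples split evenly between two behaviours scores higher than one whose samples all agree, which is the kind of signal a repair step could try to drive down.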

Related papers

Grammar-Guided Evolutionary Search for Discrete Prompt Optimisation [63.97051732013936]
We propose an evolutionary search approach to automated discrete prompt optimisation consisting of two phases. In the first phase, grammar-guided genetic programming is invoked to synthesise prompt-creating programmes. In the second phase, local search is applied to explore the neighbourhoods of best-performing programmes.
arXiv Detail & Related papers (2025-07-14T14:34:15Z)
Specification-Guided Repair of Arithmetic Errors in Dafny Programs using LLMs [84.30534714651093]
We present an innovative APR tool for Dafny, a verification-aware programming language. We localize faults through a series of steps, which include using Hoare Logic to determine the state of each statement within the program. We evaluate our approach using DafnyBench, a benchmark of real-world Dafny programs.
arXiv Detail & Related papers (2025-07-04T15:36:12Z)
MCTS-Refined CoT: High-Quality Fine-Tuning Data for LLM-Based Repository Issue Resolution [18.314436803012434]
The paper proposes MCTS-INE, an enhanced Monte Carlo Tree Search (MCTS)-based algorithm that dynamically validates and optimizes intermediate reasoning steps. Experiments on SWE-bench Lite and SWE-bench Verified demonstrate that LLMs fine-tuned with our CoT dataset achieve substantial improvements over baselines.
arXiv Detail & Related papers (2025-06-15T05:42:01Z)
Fewer Hallucinations, More Verification: A Three-Stage LLM-Based Framework for ASR Error Correction [4.304383298057423]
We propose the Reliable Correction Framework (RLLM-CF), which consists of three stages: error pre-detection, chain-of-thought sub-tasks iterative correction, and reasoning process verification. Experiments on AISHELL-1, AISHELL-2, and Librispeech show that the GPT-4o model enhanced by our framework achieves 21%, 11%, 9%, and 11.4% relative reductions in CER/WER.
arXiv Detail & Related papers (2025-05-30T08:40:49Z)
Everything You Wanted to Know About LLM-based Vulnerability Detection But Were Afraid to Ask [30.819697001992154]
Large Language Models are a promising tool for automated vulnerability detection. Despite widespread adoption, a critical question remains: Are LLMs truly effective at detecting real-world vulnerabilities? This paper challenges three widely held community beliefs: that LLMs are (i) unreliable, (ii) insensitive to code patches, and (iii) performance-plateaued across model scales.
arXiv Detail & Related papers (2025-04-18T05:32:47Z)
LLM2: Let Large Language Models Harness System 2 Reasoning [65.89293674479907]
Large language models (LLMs) have exhibited impressive capabilities across a myriad of tasks, yet they occasionally yield undesirable outputs. We introduce LLM2, a novel framework that combines an LLM with a process-based verifier. The LLM is responsible for generating plausible candidates, while the verifier provides timely process-based feedback to distinguish desirable and undesirable outputs.
arXiv Detail & Related papers (2024-12-29T06:32:36Z)
A Real-World Benchmark for Evaluating Fine-Grained Issue Solving Capabilities of Large Language Models [11.087034068992653]
FAUN-Eval is a benchmark specifically designed to evaluate the Fine-grAined issUe solviNg capabilities of LLMs.<n>It is constructed using a dataset curated from 30 well-known GitHub repositories.<n>We evaluate ten LLMs with FAUN-Eval, including four closed-source and six open-source models.
arXiv Detail & Related papers (2024-11-27T03:25:44Z)
Instruct or Interact? Exploring and Eliciting LLMs' Capability in Code Snippet Adaptation Through Prompt Engineering [19.019004855931676]
Large language models (LLMs) have confirmed their effectiveness in the code generation task with promising results. Their performance on adaptation, a reuse-oriented and context-dependent code change prediction task, is still unclear. We propose an interactive prompting approach to elicit LLMs' adaptation ability.
arXiv Detail & Related papers (2024-11-23T09:40:36Z)
What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated as compared to canonical solutions. We develop a taxonomy of bugs for incorrect codes that includes three categories and 12 sub-categories, and analyze the root cause for common bug types. We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
What Did I Do Wrong? Quantifying LLMs' Sensitivity and Consistency to Prompt Engineering [8.019873464066308]
We introduce two metrics for classification tasks, namely sensitivity and consistency. Sensitivity measures changes of predictions across rephrasings of the prompt. Instead, consistency measures how predictions vary across rephrasings for elements of the same class.
arXiv Detail & Related papers (2024-06-18T06:59:24Z)
PECC: Problem Extraction and Coding Challenges [3.287942619833188]
We introduce PECC, a novel benchmark derived from Advent Of Code (AoC) challenges and Project Euler. Unlike conventional benchmarks, PECC requires LLMs to interpret narrative-embedded problems, extract requirements, and generate code. Results show varying model performance between narrative and neutral problems, with specific challenges in the Euler math-based subset.
arXiv Detail & Related papers (2024-04-29T15:02:14Z)
Enchanting Program Specification Synthesis by Large Language Models using Static Analysis and Program Verification [15.686651364655958]
AutoSpec is an automated approach to synthesize specifications for automated program verification. It overcomes the shortcomings of existing work in specification versatility, synthesizing satisfiable and adequate specifications for full proof. It can be successfully applied to verify the programs in a real-world X509-parser project.
arXiv Detail & Related papers (2024-03-31T18:15:49Z)
DeepCode AI Fix: Fixing Security Vulnerabilities with Large Language Models [3.1690235522182104]
Large language models (LLMs) are increasingly used to solve various programming tasks. We show that the task is difficult as it requires the model to learn long-range code relationships. We propose a technique to address these challenges with a new approach for querying and fine-tuning LLMs.
arXiv Detail & Related papers (2024-02-19T18:35:40Z)
Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models [73.40350756742231]
Visually-conditioned language models (VLMs) have seen growing adoption in applications such as visual dialogue, scene understanding, and robotic task planning. Despite the volume of new releases, key design decisions around image preprocessing, architecture, and optimization are under-explored.
arXiv Detail & Related papers (2024-02-12T18:21:14Z)
Simultaneous Machine Translation with Large Language Models [51.470478122113356]
We investigate the possibility of applying Large Language Models to SimulMT tasks. We conducted experiments using the Llama2-7b-chat model on nine different languages from the MUST-C dataset. The results show that LLM outperforms dedicated MT models in terms of BLEU and LAAL metrics.
arXiv Detail & Related papers (2023-09-13T04:06:47Z)
Improving Open Information Extraction with Large Language Models: A Study on Demonstration Uncertainty [52.72790059506241]
The Open Information Extraction (OIE) task aims at extracting structured facts from unstructured text. Despite the potential of large language models (LLMs) like ChatGPT as a general task solver, they lag behind state-of-the-art (supervised) methods in OIE tasks.
arXiv Detail & Related papers (2023-09-07T01:35:24Z)
Information Association for Language Model Updating by Mitigating LM-Logical Discrepancy [68.31760483418901]
Large Language Models (LLMs) struggle with providing current information due to outdated pre-training data. Existing methods for updating LLMs, such as knowledge editing and continual fine-tuning, have significant drawbacks in generalizability of new information. We identify the core challenge behind these drawbacks: the LM-logical discrepancy featuring the difference between language modeling probabilities and logical probabilities.
arXiv Detail & Related papers (2023-05-29T19:48:37Z)
Compress, Then Prompt: Improving Accuracy-Efficiency Trade-off of LLM Inference with Transferable Prompt [96.24800696597707]
We introduce a new perspective to optimize this trade-off by prompting compressed models. We propose a soft prompt learning method where we expose the compressed model to the prompt learning process. Our experimental analysis suggests our soft prompt strategy greatly improves the performance of the 8x compressed LLaMA-7B model.
arXiv Detail & Related papers (2023-05-17T20:45:13Z)
SatLM: Satisfiability-Aided Language Models Using Declarative Prompting [68.40726892904286]
We propose a new satisfiability-aided language modeling (SatLM) approach for improving the reasoning capabilities of large language models (LLMs). We use an LLM to generate a declarative task specification rather than an imperative program and leverage an off-the-shelf automated theorem prover to derive the final answer. We evaluate SatLM on 8 different datasets and show that it consistently outperforms program-aided LMs in the imperative paradigm.
arXiv Detail & Related papers (2023-05-16T17:55:51Z)
Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback [127.75419038610455]
Large language models (LLMs) are able to generate human-like, fluent responses for many downstream tasks. This paper proposes an LLM-Augmenter system, which augments a black-box LLM with a set of plug-and-play modules.
arXiv Detail & Related papers (2023-02-24T18:48:43Z)
