U2F: Encouraging SWE-Agent to Seize Novelty without Losing Feasibility
- URL: http://arxiv.org/abs/2511.03517v1
- Date: Wed, 05 Nov 2025 14:46:58 GMT
- Title: U2F: Encouraging SWE-Agent to Seize Novelty without Losing Feasibility
- Authors: Wencheng Ye, Yan Liu
- Abstract summary: We propose U2F (Unknown Unknowns to Functional solutions), a cognitive-inspired, uncertainty-embracing multi-agent framework. U2F surfaces "Unknown Unknowns": novel solution pathways absent from initial formulations but holding innovative potential. Human experts reported a 14 percent increase in overall novelty, a 51 percent improvement in semantic novelty, and stable feasibility.
- Score: 4.711056535735579
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Large language models (LLMs) have shown strong capabilities in software engineering tasks, yet most existing LLM-based SWE-Agents mainly tackle well-defined problems using conventional methods, often overlooking alternative or innovative solutions beyond their predefined frameworks. This limitation is evident in open-world software environments, where emerging challenges transcend established paradigms. We propose U2F (Unknown Unknowns to Functional solutions), a cognitive-inspired, uncertainty-embracing multi-agent framework that systematically surfaces "Unknown Unknowns" - novel solution pathways absent from initial formulations but holding innovative potential. U2F consists of two key components: (1) a Discovery-Exploration-Integration agent system for uncovering and synthesizing potential solutions, and (2) cognitive enhancement mechanisms across three dimensions: cross-domain analogical reasoning, reverse thinking, and external validation, which strategically reframe and extend conventional solution boundaries. Applied to 218 real-world software enabler stories curated from authentic engineering tasks, U2F achieved notable improvements: human experts reported a 14 percent increase in overall novelty, 51 percent improvement in semantic novelty, and stable feasibility (4.02/5.0), corroborated by an LLM-based evaluator. These results highlight the potential of embracing uncertainty as a catalyst for innovation in software engineering.
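The Discovery-Exploration-Integration loop described in the abstract can be pictured as a three-stage pipeline. The sketch below is a hypothetical illustration only: the agent functions, scoring heuristics, and the `min_feasibility` threshold are stand-ins invented here, not U2F's actual prompts, models, or criteria.

```python
# Hypothetical sketch of a Discovery-Exploration-Integration pipeline in the
# spirit of U2F. All behaviors below are placeholder stand-ins.

def discovery_agent(problem: str) -> list[str]:
    """Surface candidate solution pathways, including unconventional ones."""
    conventional = f"direct fix: {problem}"
    # Cross-domain analogy and reverse thinking widen the candidate pool.
    analogical = f"analogy-inspired approach to: {problem}"
    reversed_view = f"reverse-thinking approach to: {problem}"
    return [conventional, analogical, reversed_view]

def exploration_agent(candidates: list[str]) -> list[dict]:
    """Elaborate each candidate and attach rough novelty/feasibility scores."""
    explored = []
    for i, c in enumerate(candidates):
        explored.append({
            "pathway": c,
            "novelty": i / max(len(candidates) - 1, 1),  # placeholder score
            "feasibility": 1.0 - 0.1 * i,                # placeholder score
        })
    return explored

def integration_agent(explored: list[dict], min_feasibility: float = 0.5) -> dict:
    """Keep the most novel pathway that still clears a feasibility bar."""
    viable = [e for e in explored if e["feasibility"] >= min_feasibility]
    return max(viable, key=lambda e: e["novelty"])

best = integration_agent(exploration_agent(discovery_agent("flaky CI pipeline")))
print(best["pathway"])
```

The feasibility floor in the integration step mirrors the paper's headline result: novelty rises while feasibility stays stable, because candidates below the bar never survive integration.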
Related papers
- BeyondSWE: Can Current Code Agent Survive Beyond Single-Repo Bug Fixing? [61.247730037229815]
We introduce BeyondSWE, a comprehensive benchmark that broadens existing evaluations along two axes: resolution scope and knowledge scope. To investigate the role of external knowledge, we develop SearchSWE, a framework that integrates deep search with coding abilities. This work offers both a realistic, challenging evaluation benchmark and a flexible framework to advance research toward more capable code agents.
arXiv Detail & Related papers (2026-03-03T17:52:01Z)
- Why Does the LLM Stop Computing: An Empirical Study of User-Reported Failures in Open-Source LLMs [50.075587392477935]
We conduct the first large-scale empirical study of 705 real-world failures from the open-source DeepSeek, Llama, and Qwen ecosystems. Our analysis reveals a paradigm shift: white-box orchestration relocates the reliability bottleneck from model algorithmic defects to the systemic fragility of the deployment stack.
arXiv Detail & Related papers (2026-01-20T06:42:56Z)
- AI-NativeBench: An Open-Source White-Box Agentic Benchmark Suite for AI-Native Systems [52.65695508605237]
We introduce AI-NativeBench, the first application-centric and white-box AI-Native benchmark suite grounded in the Model Context Protocol (MCP) and Agent-to-Agent (A2A) standards. By treating agentic spans as first-class citizens within distributed traces, our methodology enables granular analysis of engineering characteristics beyond simple capabilities. This work provides the first systematic evidence to guide the transition from measuring model capability to engineering reliable AI-Native systems.
arXiv Detail & Related papers (2026-01-14T11:32:07Z)
- LUCID: Learning-Enabled Uncertainty-Aware Certification of Stochastic Dynamical Systems [0.8574682463936006]
Traditional formal verification tools fall short when faced with systems that embed both opaque black-box AI components and complex dynamics. We introduce LUCID, a verification engine for certifying the safety of dynamical systems that embed black-box components. LUCID is the first known tool capable of establishing quantified safety guarantees for such systems.
arXiv Detail & Related papers (2025-12-12T17:46:50Z)
- InnoGym: Benchmarking the Innovation Potential of AI Agents [74.64144272881414]
InnoGym is the first benchmark designed to evaluate the innovation potential of AI agents. It introduces two complementary metrics: performance gain, which measures improvement over the best-known solutions, and novelty, which captures methodological differences from prior approaches.
arXiv Detail & Related papers (2025-12-01T16:03:04Z)
- Assertion-Conditioned Compliance: A Provenance-Aware Vulnerability in Multi-Turn Tool-Calling Agents [0.4666493857924358]
Multi-turn tool-calling LLMs have emerged as a key feature in modern AI assistants. Implementing multi-turn pipelines remains difficult for many safety-critical industries, and there is still a lack of visibility into multi-turn conversation-level robustness.
arXiv Detail & Related papers (2025-11-29T05:44:37Z)
- An Agentic Framework with LLMs for Solving Complex Vehicle Routing Problems [66.60904891478687]
We propose an Agentic Framework with LLMs (AFL) for solving complex vehicle routing problems. AFL directly extracts knowledge from raw inputs and enables self-contained code generation. We show that AFL substantially outperforms existing LLM-based baselines in both code reliability and solution feasibility.
arXiv Detail & Related papers (2025-10-19T03:59:25Z)
- MARS: Optimizing Dual-System Deep Research via Multi-Agent Reinforcement Learning [82.14973479594367]
Large Language Models (LLMs) for complex reasoning tasks require innovative approaches that bridge intuitive and deliberate cognitive processes. This paper introduces a Multi-Agent System for Deep ReSearch (MARS), enabling seamless integration of System 1's fast, intuitive thinking with System 2's deliberate reasoning.
arXiv Detail & Related papers (2025-10-06T15:42:55Z)
- Algorithm Generation via Creative Ideation [4.174203390496298]
We introduce MetaMuse, a framework for creative ideation built on three self-reflection principles. We show that MetaMuse can generate high-performing solutions for two critical problems at a global cloud provider.
arXiv Detail & Related papers (2025-10-04T15:52:31Z)
- Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [61.00662702026523]
We propose a unified Test-Time Compute (TTC) scaling framework that leverages increased inference-time computation instead of larger models. Our framework incorporates two complementary strategies: internal TTC and external TTC. We demonstrate that our 32B model achieves a 46% issue resolution rate, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1.
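External test-time compute scaling, one of the two strategies mentioned in this summary, is commonly realized as best-of-N sampling with a verifier. The sketch below illustrates that general pattern only; the generator, the verifier, and the parameter `n` are stand-ins invented here, not this paper's components.

```python
# Hypothetical best-of-N sketch of external test-time compute scaling.
# The generator and verifier below are deterministic stand-ins.

import random

def generate_patch(issue: str, seed: int) -> str:
    """Stand-in for sampling one candidate patch from a model."""
    rng = random.Random(seed)
    return f"patch-{rng.randint(0, 999)} for {issue}"

def verifier_score(patch: str) -> float:
    """Stand-in: a real verifier would run tests or a learned reward model."""
    return sum(map(ord, patch)) % 100 / 100

def best_of_n(issue: str, n: int = 8) -> str:
    """Spend more inference-time compute (larger n) to pick a better patch."""
    candidates = [generate_patch(issue, seed) for seed in range(n)]
    return max(candidates, key=verifier_score)
```

The key property is that quality improves by increasing `n` at inference time, with the model held fixed, which is the sense in which compute substitutes for model size.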
arXiv Detail & Related papers (2025-03-31T07:31:32Z)
- QUBE: Enhancing Automatic Heuristic Design via Quality-Uncertainty Balanced Evolution [14.131178103518907]
Quality-Uncertainty Balanced Evolution (QUBE) is a novel approach that enhances LLM+EA methods by redefining the priority criterion within the FunSearch framework. QUBE employs the Quality-Uncertainty Trade-off Criterion (QUTC), based on our proposed Uncertainty-Inclusive Quality metric. Through extensive experiments on challenging NP-complete problems, QUBE demonstrates significant performance improvements over FunSearch and baseline methods.
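A quality-uncertainty priority criterion of the kind this summary describes can be sketched as follows. The exact QUTC formula is not given above, so this uses an assumed UCB-style bonus; treat it as an illustration of the idea, not QUBE's definition.

```python
# Hypothetical sketch of an uncertainty-inclusive quality criterion:
# candidates in an evolutionary search are ranked by mean quality plus an
# uncertainty bonus that shrinks as they accumulate evaluations.

import math

def uncertainty_inclusive_quality(scores: list[float], beta: float = 1.0) -> float:
    """Mean quality plus an assumed UCB-style uncertainty bonus."""
    n = len(scores)
    mean = sum(scores) / n
    bonus = beta * math.sqrt(math.log(n + 1) / n)  # fewer evaluations -> larger bonus
    return mean + bonus

def select_parent(population: dict[str, list[float]]) -> str:
    """Pick the candidate heuristic with the highest uncertainty-inclusive quality."""
    return max(population, key=lambda k: uncertainty_inclusive_quality(population[k]))
```

Under this criterion a lightly evaluated candidate with modest mean quality can outrank a thoroughly evaluated high scorer, which is how the search keeps exploring instead of greedily exploiting quality alone.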
arXiv Detail & Related papers (2024-12-30T04:05:22Z)
- In-Context Experience Replay Facilitates Safety Red-Teaming of Text-to-Image Diffusion Models [104.94706600050557]
Text-to-image (T2I) models have shown remarkable progress, but their potential to generate harmful content remains a critical concern in the ML community. We propose ICER, a novel red-teaming framework that generates interpretable and semantically meaningful problematic prompts. Our work provides crucial insights for developing more robust safety mechanisms in T2I systems.
arXiv Detail & Related papers (2024-11-25T04:17:24Z)
- Enhanced POET: Open-Ended Reinforcement Learning through Unbounded Invention of Learning Challenges and their Solutions [20.671903144896742]
Paired Open-Ended Trailblazer (POET) is an algorithm that generates and solves its own challenges. POET was unable to demonstrate its full creative potential because of limitations of the algorithm itself. We introduce and empirically validate two new innovations to the original algorithm, as well as two external innovations designed to help elucidate its full potential.
arXiv Detail & Related papers (2020-03-19T01:35:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.