UCRBench: Benchmarking LLMs on Use Case Recovery
- URL: http://arxiv.org/abs/2512.13360v1
- Date: Mon, 15 Dec 2025 14:12:57 GMT
- Title: UCRBench: Benchmarking LLMs on Use Case Recovery
- Authors: Shuyuan Xiao, Yiran Zhang, Weisong Sun, Xiaohong Chen, Yang Liu, Zhi Jin
- Abstract summary: We introduce code-aligned use case benchmarks, constructed through manual validation of both user-goal and subfunction use cases. We conduct the first systematic study of large language models (LLMs) and propose a hierarchical evaluation protocol. The results show that while LLMs can partially reconstruct system functionality, their performance varies significantly across projects.
- Score: 42.35653533011503
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Use cases are widely employed to specify functional requirements, yet existing benchmarks are scarce and face the risk of being misaligned with actual system behavior, thereby limiting the rigorous evaluation of large language models (LLMs) in generating use cases from source code. We address this gap by introducing code-aligned use case benchmarks, constructed through manual validation of both user-goal and subfunction use cases across nine real-world software projects. Using this benchmark, we conduct the first systematic study of LLMs and propose a hierarchical evaluation protocol that assesses actor correctness, name accuracy, path fidelity, and behavioral coverage. The results show that while LLMs can partially reconstruct system functionality, their performance varies significantly across projects, with particularly noticeable shortcomings in domain-specific and multi-module systems. The models also exhibit high omission rates and struggle to maintain consistent abstraction when aggregating subfunctions into user-goal use cases, highlighting both the potential and current limitations of LLM-based use case reverse engineering.
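The abstract's four-level protocol (actor correctness, name accuracy, path fidelity, behavioral coverage) could be operationalized roughly as follows. This is a minimal Python sketch, not the paper's implementation: the `UseCase` fields, the token-overlap name metric, the 0.5 step-matching threshold, and the LCS-based path score are all illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class UseCase:
    # Illustrative structure; field names are assumptions, not the paper's schema.
    actor: str
    name: str
    main_path: list                              # ordered step descriptions
    behaviors: set = field(default_factory=set)  # atomic behaviors covered

def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase token sets (a stand-in for name accuracy)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def evaluate(pred: UseCase, ref: UseCase) -> dict:
    """Score one predicted use case against its reference at four levels."""
    actor_ok = pred.actor.lower() == ref.actor.lower()
    name_acc = token_overlap(pred.name, ref.name)
    # Path fidelity: fraction of reference steps matched in order
    # (longest common subsequence under a fuzzy step-match threshold).
    m, n = len(pred.main_path), len(ref.main_path)
    lcs = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if token_overlap(pred.main_path[i], ref.main_path[j]) >= 0.5:
                lcs[i + 1][j + 1] = lcs[i][j] + 1
            else:
                lcs[i + 1][j + 1] = max(lcs[i][j + 1], lcs[i + 1][j])
    path_fid = lcs[m][n] / n if n else 0.0
    # Behavioral coverage: share of reference behaviors the prediction recovers.
    coverage = len(pred.behaviors & ref.behaviors) / len(ref.behaviors) if ref.behaviors else 0.0
    return {"actor": actor_ok, "name": name_acc, "path": path_fid, "coverage": coverage}
```

A hierarchy like this lets aggregate scores distinguish a model that misidentifies actors from one that finds the right actors but omits behaviors, which is the kind of omission-rate failure the abstract reports.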
Related papers
- Are LLMs Reliable Code Reviewers? Systematic Overcorrection in Requirement Conformance Judgement [8.059802912761919]
We uncover a systematic failure of large language models (LLMs) in matching code to natural language requirements. More detailed prompt design, particularly prompts requiring explanations and proposed corrections, leads to higher misjudgment rates. We propose a Fix-guided Verification Filter that treats the model-proposed fix as executable counterfactual evidence.
arXiv Detail & Related papers (2026-02-28T08:35:25Z)
- Case-Aware LLM-as-a-Judge Evaluation for Enterprise-Scale RAG Systems [0.0]
We present a case-aware LLM-as-a-Judge evaluation framework for enterprise multi-turn RAG systems. The framework evaluates each turn using eight operationally grounded metrics that separate retrieval quality, grounding fidelity, answer utility, precision integrity, and case/workflow alignment.
arXiv Detail & Related papers (2026-02-23T21:37:06Z)
- On Selecting Few-Shot Examples for LLM-based Code Vulnerability Detection [8.460805514983816]
Large language models (LLMs) have demonstrated impressive capabilities for many coding tasks. However, detecting code vulnerabilities remains a challenging task for LLMs. In-context learning (ICL) provides few-shot examples similar to the query, along with correct answers.
arXiv Detail & Related papers (2025-10-31T17:41:58Z)
- LLM-Specific Utility: A New Perspective for Retrieval-Augmented Generation [110.610512800947]
Retrieval-augmented generation (RAG) enhances large language models (LLMs) by incorporating external knowledge. Existing studies often treat utility as a generic attribute, ignoring the fact that different LLMs may benefit differently from the same passage.
arXiv Detail & Related papers (2025-10-13T12:57:45Z)
- Evaluating Large Language Models for Functional and Maintainable Code in Industrial Settings: A Case Study at ASML [3.5515013986822073]
We present a case study conducted in collaboration with the leveling department at A. We investigate the performance of LLMs in generating functional, maintainable code within a closed, highly specialized software environment. The findings reveal that prompting techniques and model size have a significant impact on output quality.
arXiv Detail & Related papers (2025-09-15T19:39:26Z)
- CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward [50.97588334916863]
We develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward. It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types. We introduce the VerifierBench benchmark, comprising model outputs collected from multiple data sources and augmented through manual analysis of meta-error patterns to enhance CompassVerifier.
arXiv Detail & Related papers (2025-08-05T17:55:24Z)
- On LLM-Assisted Generation of Smart Contracts from Business Processes [0.08192907805418582]
Large language models (LLMs) have changed the reality of how software is produced. We present an exploratory study to investigate the use of LLMs for generating smart contract code from business process descriptions. Our results show that LLM performance falls short of the perfect reliability required for smart contract development.
arXiv Detail & Related papers (2025-07-30T20:39:45Z)
- CLEAR: Error Analysis via LLM-as-a-Judge Made Easy [9.285203198113917]
We introduce CLEAR, an interactive, open-source package for LLM-based error analysis. CLEAR first generates per-instance textual feedback, then creates a set of system-level error issues, and quantifies the prevalence of each identified issue. Our package also provides users with an interactive dashboard that allows for a comprehensive error analysis through aggregate visualizations.
arXiv Detail & Related papers (2025-07-24T13:15:21Z)
- Automated Refactoring of Non-Idiomatic Python Code: A Differentiated Replication with LLMs [54.309127753635366]
We present the results of a replication study in which we investigate GPT-4's effectiveness in recommending and suggesting idiomatic refactoring actions. Our findings underscore the potential of LLMs to achieve tasks where, in the past, implementing recommenders based on complex code analyses was required.
arXiv Detail & Related papers (2025-01-28T15:41:54Z)
- Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making [85.24399869971236]
We aim to evaluate Large Language Models (LLMs) for embodied decision making. Existing evaluations tend to rely solely on a final success rate. We propose a generalized interface (Embodied Agent Interface) that supports the formalization of various types of tasks.
arXiv Detail & Related papers (2024-10-09T17:59:00Z)
- RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework [66.93260816493553]
This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios. With a focus on factual accuracy, we propose three novel metrics: Completeness, Hallucination, and Irrelevance. Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples.
arXiv Detail & Related papers (2024-08-02T13:35:11Z)
- An Actionable Framework for Assessing Bias and Fairness in Large Language Model Use Cases [0.0]
Large language models (LLMs) can exhibit bias in a variety of ways. We propose a decision framework that allows practitioners to determine which bias and fairness metrics to use for a specific use case.
arXiv Detail & Related papers (2024-07-15T16:04:44Z)
- LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z)
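As a side note on the in-context learning idea from the few-shot vulnerability-detection entry above: selecting examples similar to the query can be sketched as a toy bag-of-words cosine retriever. The snippet pool, whitespace tokenizer, and `k` default below are illustrative assumptions; real systems typically use learned embeddings rather than word counts.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_examples(query: str, pool: list, k: int = 2) -> list:
    """Pick the k pool snippets most similar to the query (ICL-style retrieval)."""
    q = Counter(query.lower().split())
    ranked = sorted(pool, key=lambda s: cosine(q, Counter(s.lower().split())),
                    reverse=True)
    return ranked[:k]
```

The selected snippets, paired with their known labels, would then be prepended to the prompt as few-shot demonstrations.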
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.