Multi-Modal Requirements Data-based Acceptance Criteria Generation using LLMs
- URL: http://arxiv.org/abs/2508.06888v1
- Date: Sat, 09 Aug 2025 08:35:40 GMT
- Title: Multi-Modal Requirements Data-based Acceptance Criteria Generation using LLMs
- Authors: Fanyu Wang, Chetan Arora, Yonghui Liu, Kaicheng Huang, Chakkrit Tantithamthavorn, Aldeida Aleti, Dishan Sambathkumar, David Lo
- Abstract summary: We propose RAGcceptance M2RE, a novel approach to generate acceptance criteria from multi-modal requirements data. We show that our approach effectively reduces manual effort, captures nuanced stakeholder intent, and provides valuable criteria. This research underscores the potential of multi-modal RAG techniques in streamlining software validation processes and improving development efficiency.
- Score: 17.373348983049176
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Acceptance criteria (ACs) play a critical role in software development by clearly defining the conditions under which a software feature satisfies stakeholder expectations. However, manually creating accurate, comprehensive, and unambiguous acceptance criteria is challenging, particularly in user interface-intensive applications, due to the reliance on domain-specific knowledge and visual context that is not always captured by textual requirements alone. To address these challenges, we propose RAGcceptance M2RE, a novel approach that leverages Retrieval-Augmented Generation (RAG) to generate acceptance criteria from multi-modal requirements data, including both textual documentation and visual UI information. We systematically evaluated our approach in an industrial case study involving an education-focused software system used by approximately 100,000 users. The results indicate that integrating multi-modal information significantly enhances the relevance, correctness, and comprehensibility of the generated ACs. Moreover, practitioner evaluations confirm that our approach effectively reduces manual effort, captures nuanced stakeholder intent, and provides valuable criteria that domain experts may overlook, demonstrating practical utility and significant potential for industry adoption. This research underscores the potential of multi-modal RAG techniques in streamlining software validation processes and improving development efficiency. We also make our implementation and a dataset available.
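As a rough illustration of the retrieve-then-generate pattern the abstract describes, the sketch below indexes textual requirement chunks and UI-screenshot descriptions in a shared embedding space, retrieves the closest matches for a user story, and prompts an LLM for Given/When/Then acceptance criteria. The embed and complete callables, the Chunk layout, and the prompt wording are illustrative assumptions; this is not the paper's released RAGcceptance M2RE implementation.

```python
# Minimal sketch of a multi-modal retrieve-then-generate loop for
# acceptance-criteria generation. `embed` and `complete` are hypothetical
# stand-ins for whatever embedding and chat models are available; they are
# NOT the paper's API.
from dataclasses import dataclass

import numpy as np


@dataclass
class Chunk:
    """One retrievable unit: a requirement passage or a UI screenshot."""
    content: str          # text, or a caption/OCR transcript for images
    modality: str         # "text" or "ui_image"
    vector: np.ndarray    # embedding in a shared space


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))


def retrieve(query_vec: np.ndarray, index: list[Chunk], k: int = 5) -> list[Chunk]:
    """Rank textual and visual chunks together by similarity to the story."""
    return sorted(index, key=lambda c: cosine(query_vec, c.vector), reverse=True)[:k]


def generate_acceptance_criteria(user_story: str,
                                 index: list[Chunk],
                                 embed,      # hypothetical: str -> np.ndarray
                                 complete):  # hypothetical: prompt str -> str
    """Retrieve mixed-modality context, then prompt an LLM for ACs."""
    context = retrieve(embed(user_story), index)
    prompt = (
        "You are a requirements analyst. Using the requirement excerpts and "
        "UI descriptions below, write Given/When/Then acceptance criteria.\n\n"
        + "\n".join(f"[{c.modality}] {c.content}" for c in context)
        + f"\n\nUser story: {user_story}\nAcceptance criteria:"
    )
    return complete(prompt)
```

Treating screenshots as captioned text chunks keeps the sketch to a single vector space; a production system would more likely use a genuine multi-modal embedding model for the visual side.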
Related papers
- Applying a Requirements-Focused Agile Management Approach for Machine Learning-Enabled Systems [1.3704574906282525]
Machine Learning (ML)-enabled systems challenge traditional Requirements Engineering (RE) and agile management. Existing RE and agile practices remain poorly integrated and insufficiently tailored to these characteristics. This paper reports on the practical experience of applying RefineML, a requirements-focused approach for the continuous and agile refinement of ML-enabled systems.
arXiv Detail & Related papers (2026-02-04T20:49:02Z)
- Benchmarking Agents in Insurance Underwriting Environments [0.9728664856449597]
Existing benchmarks overemphasize open domains such as code, use narrow accuracy metrics, and lack authentic complexity. We present UNDERWRITE, an expert-first, multi-turn insurance underwriting benchmark designed in close collaboration with domain experts.
arXiv Detail & Related papers (2026-01-31T02:12:11Z) - Query Generation Pipeline with Enhanced Answerability Assessment for Financial Information Retrieval [1.6109077391631914]
Existing benchmarks fail to capture the complex and domain-specific information needs of real-world banking scenarios. We propose a systematic methodology for constructing domain-specific IR benchmarks through LLM-based query generation. Our experiments show that existing retrieval models struggle with the complex multi-document queries in KoBankIR.
arXiv Detail & Related papers (2025-11-07T06:06:09Z) - OutboundEval: A Dual-Dimensional Benchmark for Expert-Level Intelligent Outbound Evaluation of Xbench's Professional-Aligned Series [36.88936933010042]
OutboundEval is a comprehensive benchmark for evaluating large language models (LLMs) in intelligent outbound calling scenarios. We design a benchmark spanning six major business domains and 30 representative sub-scenarios, each with scenario-specific process decomposition, weighted scoring, and domain-adaptive metrics. We introduce a dynamic evaluation method that adapts to task variations, integrating automated and human-in-the-loop assessment to measure task execution accuracy, professional knowledge application, adaptability, and user experience quality.
arXiv Detail & Related papers (2025-10-24T08:27:58Z) - Towards Context-aware Reasoning-enhanced Generative Searching in E-commerce [61.03081096959132]
We propose a context-aware reasoning-enhanced generative search framework to better understand the complicated context. Our approach achieves superior performance compared with strong baselines, validating its effectiveness for search-based recommendation.
arXiv Detail & Related papers (2025-10-19T16:46:11Z) - Engineering RAG Systems for Real-World Applications: Design, Development, and Evaluation [4.448709087838503]
Retrieval-Augmented Generation (RAG) systems are emerging as a key approach for grounding Large Language Models (LLMs) in external knowledge. This paper presents five domain-specific RAG applications developed for real-world scenarios across governance, cybersecurity, agriculture, industrial research, and medical diagnostics.
arXiv Detail & Related papers (2025-06-25T22:40:00Z) - A Practical Guide for Evaluating LLMs and LLM-Reliant Systems [1.1715858161748576]
We present a practical evaluation framework that outlines how to proactively curate representative datasets and select meaningful evaluation metrics. The framework integrates well with the practical development and deployment of systems that must adhere to real-world requirements and meet user-facing needs.
arXiv Detail & Related papers (2025-06-16T01:18:16Z) - EIFBENCH: Extremely Complex Instruction Following Benchmark for Large Language Models [65.48902212293903]
We present the Extremely Complex Instruction Following Benchmark (EIFBENCH) for evaluating large language models (LLMs). EIFBENCH includes multi-task scenarios that enable comprehensive assessment across diverse task types concurrently. We also propose the Segment Policy Optimization (SegPO) algorithm to enhance the LLM's ability to accurately fulfill multi-task workflows.
arXiv Detail & Related papers (2025-06-10T02:39:55Z) - Rethinking Machine Unlearning in Image Generation Models [59.697750585491264]
CatIGMU is a novel hierarchical task categorization framework. EvalIGMU is a comprehensive evaluation framework. We construct DataIGM, a high-quality unlearning dataset.
arXiv Detail & Related papers (2025-06-03T11:25:14Z) - A Practical Approach for Building Production-Grade Conversational Agents with Workflow Graphs [2.7905014064567344]
Large Language Models (LLMs) have led to significant improvements in various service domains. Applying state-of-the-art (SOTA) research to industrial settings presents challenges.
arXiv Detail & Related papers (2025-05-29T02:30:27Z) - Composed Multi-modal Retrieval: A Survey of Approaches and Applications [81.54640206021757]
Composed Multi-modal Retrieval (CMR) emerges as a pivotal next-generation technology. CMR enables users to query images or videos by integrating a reference visual input with textual modifications. This paper provides a comprehensive survey of CMR, covering its fundamental challenges, technical advancements, and applications.
arXiv Detail & Related papers (2025-03-03T09:18:43Z) - Leveraging Graph-RAG and Prompt Engineering to Enhance LLM-Based Automated Requirement Traceability and Compliance Checks [8.354305051472735]
This study demonstrates that integrating a robust Graph-RAG framework with advanced prompt engineering techniques, such as Chain of Thought and Tree of Thought, can significantly enhance performance. However, the approach is costly and more complex to implement across diverse contexts, requiring careful adaptation to specific scenarios. (A minimal prompt-construction sketch in this style appears after this list.)
arXiv Detail & Related papers (2024-12-11T18:11:39Z) - Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases.
We introduce LOFT, a benchmark of real-world tasks requiring context of up to millions of tokens, designed to evaluate LCLMs' performance on in-context retrieval and reasoning.
Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z) - LEARN: Knowledge Adaptation from Large Language Model to Recommendation for Practical Industrial Application [54.984348122105516]
We propose an Llm-driven knowlEdge Adaptive RecommeNdation (LEARN) framework that synergizes open-world knowledge with collaborative knowledge.
arXiv Detail & Related papers (2024-05-07T04:00:30Z)
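As referenced in the Graph-RAG traceability entry above, the sketch below illustrates the general pattern of retrieving knowledge-graph context and then asking the model to reason step by step (Chain of Thought) before judging compliance. The triple format, the example artifact names, and the complete callable are hypothetical stand-ins; this is not the study's actual framework.

```python
# Minimal sketch of Chain-of-Thought prompting over retrieved graph facts
# for a requirement traceability check. Everything here is illustrative;
# the triple schema and complete() helper are assumptions, not a real API.

def trace_check_prompt(requirement: str,
                       related_triples: list[tuple[str, str, str]]) -> str:
    """Build a CoT prompt from a requirement and its graph neighborhood."""
    context = "\n".join(f"({s}) -[{r}]-> ({o})" for s, r, o in related_triples)
    return (
        "Knowledge-graph context:\n" + context +
        f"\n\nRequirement: {requirement}\n"
        "Think step by step: list which design artifacts the context links "
        "to this requirement, then state whether traceability is satisfied."
    )


# Usage with a hypothetical complete(prompt) -> str LLM call:
# print(complete(trace_check_prompt(
#     "REQ-12: The system shall encrypt data at rest.",
#     [("REQ-12", "implemented_by", "Module crypto.storage"),
#      ("Module crypto.storage", "verified_by", "Test TC-88")])))
```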
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.