Classification or Prompting: A Case Study on Legal Requirements Traceability
- URL: http://arxiv.org/abs/2502.04916v2
- Date: Tue, 11 Feb 2025 13:16:29 GMT
- Title: Classification or Prompting: A Case Study on Legal Requirements Traceability
- Authors: Romina Etezadi, Sallam Abualhaija, Chetan Arora, Lionel Briand
- Abstract summary: New regulations are continuously introduced to ensure that software development complies with ethical concerns and prioritizes public safety. A prerequisite for demonstrating compliance involves tracing software requirements to legal provisions. This paper investigates two automated solutions to predict trace links between requirements and legal provisions.
- Score: 6.411835643029738
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: New regulations are continuously introduced to ensure that software development complies with ethical concerns and prioritizes public safety. A prerequisite for demonstrating compliance involves tracing software requirements to legal provisions. Requirements traceability is a fundamental task where requirements engineers are supposed to analyze technical requirements against target artifacts, often under a limited time budget. Doing this analysis manually for complex systems with hundreds of requirements is infeasible. The legal dimension introduces additional challenges that only exacerbate manual effort. In this paper, we investigate two automated solutions based on large language models (LLMs) to predict trace links between requirements and legal provisions. The first solution, Kashif, is a classifier that leverages sentence transformers. The second solution prompts a recent generative LLM based on Rice, a prompt engineering framework. On a benchmark dataset, we empirically evaluate Kashif and compare it against a baseline classifier from the literature. Kashif can identify trace links with an average recall of ~67%, outperforming the baseline with a substantial gain of 54 percentage points (pp) in recall. However, on unseen, more complex requirements documents traced to the European General Data Protection Regulation (GDPR), Kashif performs poorly, yielding an average recall of 15%. On the same documents, however, our Rice-based solution yields an average recall of 84%, with a remarkable gain of about 69 pp over Kashif. Our results suggest that requirements traceability in the legal context cannot be simply addressed by building classifiers, as such solutions do not generalize and fail to perform well on complex regulations and requirements. Resorting to generative LLMs, with careful prompt engineering, is thus a more promising alternative.
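To make the two solution families in the abstract concrete, below is a minimal sketch in Python, assuming a sentence-transformers encoder for the classifier route and a role/instruction/context/example prompt layout for the generative route. The model name, similarity threshold, and prompt wording are illustrative assumptions, not the paper's actual Kashif architecture or Rice prompts.

```python
# Minimal sketch of the two strategies: (1) a similarity-based classifier
# over sentence-transformer embeddings, (2) a structured prompt for a
# generative LLM. All names and thresholds are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence transformer

def predict_trace_link(requirement: str, provision: str,
                       threshold: float = 0.5) -> bool:
    """Classify a (requirement, provision) pair by thresholding the
    cosine similarity of their embeddings."""
    emb = encoder.encode([requirement, provision], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item() >= threshold

def build_prompt(requirement: str, provisions: list[str]) -> str:
    """Assemble a role/instruction/context/example-style prompt; the
    layout loosely mirrors structured prompt engineering a la Rice."""
    numbered = "\n".join(f"({i + 1}) {p}" for i, p in enumerate(provisions))
    return (
        "Role: You are a requirements engineer checking legal compliance.\n"
        "Instruction: List the numbers of all provisions the requirement "
        "below should be traced to, or answer 'none'.\n"
        f"Context:\nRequirement: {requirement}\nProvisions:\n{numbered}\n"
        "Example: A requirement on storing user consent records traces to "
        "a provision mandating demonstrable consent.\n"
        "Answer:"
    )
```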
Related papers
- TVR: Automotive System Requirement Traceability Validation and Recovery Through Retrieval-Augmented Generation [7.50061902435987]
Traceability between stakeholder requirements and system requirements is crucial to ensure consistency, correctness, and regulatory compliance.
Existing approaches do not address traceability between stakeholder and system requirements, rely only on open-source data, and do not validate the manual links established by engineers.
We introduce TVR, a requirement Traceability Validation and Recovery approach primarily targeting automotive systems.
arXiv Detail & Related papers (2025-04-21T20:37:23Z)
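As a rough illustration of the retrieval-augmented validation idea behind TVR, here is a minimal sketch: retrieve previously approved trace links as few-shot context, then ask an LLM to judge a candidate link. The `llm_complete` helper is a hypothetical stand-in for any chat-completion call; this is not TVR's actual pipeline.

```python
# Sketch of RAG-style trace-link validation: retrieve similar approved
# links as few-shot context, then ask an LLM to judge a candidate link.
# `llm_complete` is a hypothetical stand-in for any chat-completion call.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def validate_link(stakeholder_req: str, system_req: str,
                  approved_links: list[tuple[str, str]],
                  llm_complete, k: int = 3) -> str:
    """Ask an LLM to judge a candidate link, using the k most similar
    approved links as few-shot context."""
    query = encoder.encode(stakeholder_req + " " + system_req,
                           convert_to_tensor=True)
    corpus = encoder.encode([s + " " + t for s, t in approved_links],
                            convert_to_tensor=True)
    hits = util.semantic_search(query, corpus, top_k=k)[0]
    examples = "\n".join(
        f"- {approved_links[h['corpus_id']][0]} -> "
        f"{approved_links[h['corpus_id']][1]}"
        for h in hits)
    prompt = (
        "Approved stakeholder-to-system requirement trace links:\n"
        f"{examples}\n\n"
        f"Candidate link:\n{stakeholder_req} -> {system_req}\n"
        "Is the candidate link valid? Answer 'valid' or 'invalid'."
    )
    return llm_complete(prompt)
```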
- Retrieval-Augmented Generation with Conflicting Evidence [57.66282463340297]
Large language model (LLM) agents are increasingly employing retrieval-augmented generation (RAG) to improve the factuality of their responses.
In practice, these systems often need to handle ambiguous user queries and potentially conflicting information from multiple sources.
We propose RAMDocs (Retrieval with Ambiguity and Misinformation in Documents), a new dataset that simulates complex and realistic scenarios of conflicting evidence for a user query.
arXiv Detail & Related papers (2025-04-17T16:46:11Z)
- Evaluating Retrieval Augmented Generative Models for Document Queries in Transportation Safety [0.7373617024876725]
This study evaluates the performance of three fine-tuned generative models: ChatGPT, Google's Vertex AI, and ORNL's retrieval-augmented generation (RAG) versions of LLaMA 2 and LLaMA.
We developed 100 realistic queries relevant to route planning and permitting requirements.
Results demonstrated that the RAG-augmented LLaMA models significantly outperformed Vertex AI and ChatGPT, providing more detailed and generally accurate information.
arXiv Detail & Related papers (2025-04-09T16:37:03Z)
- SUNAR: Semantic Uncertainty based Neighborhood Aware Retrieval for Complex QA [2.7703990035016868]
We introduce SUNAR, a novel approach that leverages large language models to guide a Neighborhood Aware Retrieval process.
We validate our approach through extensive experiments on two complex QA datasets.
Our results show that SUNAR significantly outperforms existing retrieve-and-reason baselines, achieving up to a 31.84% improvement in performance.
arXiv Detail & Related papers (2025-03-23T08:50:44Z)
- An Empirical Study on LLM-based Classification of Requirements-related Provisions in Food-safety Regulations [3.1776778131016368]
We conduct a Grounded Theory study of food-safety regulations. We develop a conceptual characterization of food-safety concepts that closely relate to systems and software requirements. We examine the effectiveness of two families of large language models (LLMs) in automatically classifying legal provisions.
arXiv Detail & Related papers (2025-01-24T17:59:14Z)
- The Dual-use Dilemma in LLMs: Do Empowering Ethical Capacities Make a Degraded Utility? [54.18519360412294]
Large Language Models (LLMs) must balance rejecting harmful requests for safety against accommodating legitimate ones for utility.
This paper presents a Direct Preference Optimization (DPO) based alignment framework that achieves better overall performance.
We analyze experimental results obtained from testing DeepSeek-R1 on our benchmark and reveal the critical ethical concerns raised by this highly acclaimed model.
arXiv Detail & Related papers (2025-01-20T06:35:01Z)
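Because the dual-use paper's alignment framework builds on Direct Preference Optimization, a minimal sketch of the standard DPO loss may be useful; how the log-probabilities are extracted and batched is assumed, and the paper's preference data and training setup are its own.

```python
# Minimal sketch of the standard DPO objective: prefer the chosen response
# over the rejected one relative to a frozen reference model. The inputs
# are assumed to be per-example log-probabilities summed over response tokens.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """-log sigmoid(beta * ((chosen - ref_chosen) - (rejected - ref_rejected)))."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```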
- MAIN-RAG: Multi-Agent Filtering Retrieval-Augmented Generation [34.66546005629471]
Large Language Models (LLMs) are essential tools for various natural language processing tasks but often suffer from generating outdated or incorrect information. Retrieval-Augmented Generation (RAG) addresses this issue by incorporating external, real-time information retrieval to ground LLM responses. To tackle the problem of noisy or irrelevant retrieved documents, we propose Multi-Agent Filtering Retrieval-Augmented Generation (MAIN-RAG), a training-free RAG framework that leverages multiple LLM agents to collaboratively filter and score retrieved documents.
arXiv Detail & Related papers (2024-12-31T08:07:26Z)
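A minimal sketch of the multi-agent filtering idea in MAIN-RAG, under the assumption that several LLM "judge" agents return relevance scores in [0, 1] and that documents are kept when they clear an adaptive, query-dependent threshold; the paper's actual scoring and thresholding may differ.

```python
# Sketch of multi-agent document filtering: several LLM "judge" agents
# score each retrieved document for relevance, and documents whose mean
# score clears an adaptive threshold are kept. `llm_judges` is a list of
# hypothetical scoring callables returning floats in [0, 1].
from statistics import mean

def filter_documents(query: str, documents: list[str],
                     llm_judges: list, margin: float = 0.0) -> list[str]:
    """Keep documents whose average judge score beats the batch mean."""
    scores = [mean(judge(query, doc) for judge in llm_judges)
              for doc in documents]
    threshold = mean(scores) + margin   # adaptive, query-dependent cutoff
    return [doc for doc, s in zip(documents, scores) if s >= threshold]
```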
- Methods for Legal Citation Prediction in the Age of LLMs: An Australian Law Case Study [9.30538764385435]
We focus on the problem of legal citation prediction within the Australian law context, where correctly identifying and citing relevant legislation or precedents is critical. Our findings indicate that law-specialised pre-training alone is insufficient for achieving satisfactory citation accuracy. In contrast, instruction tuning on our task-specific dataset dramatically boosts performance, reaching the best results across all settings.
arXiv Detail & Related papers (2024-12-09T07:46:14Z)
- Enhancing Legal Case Retrieval via Scaling High-quality Synthetic Query-Candidate Pairs [67.54302101989542]
Legal case retrieval aims to provide similar cases as references for a given fact description.
Existing works mainly focus on case-to-case retrieval using lengthy queries.
The available data scale is insufficient to satisfy the training requirements of existing data-hungry neural models.
arXiv Detail & Related papers (2024-10-09T06:26:39Z)
- SFR-RAG: Towards Contextually Faithful LLMs [57.666165819196486]
Retrieval Augmented Generation (RAG) is a paradigm that integrates external contextual information with large language models (LLMs) to enhance factual accuracy and relevance.
We introduce SFR-RAG, a small LLM that is instruction-tuned with an emphasis on context-grounded generation and hallucination minimization.
We also present ContextualBench, a new evaluation framework compiling multiple popular and diverse RAG benchmarks.
arXiv Detail & Related papers (2024-09-16T01:08:18Z)
- Rethinking Legal Compliance Automation: Opportunities with Large Language Models [2.9088208525097365]
We argue that the examination of (textual) legal artifacts should, first, employ a broader context than individual sentences.
We present a compliance analysis approach designed to address these limitations.
arXiv Detail & Related papers (2024-04-22T17:10:27Z)
- Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity [59.57065228857247]
Retrieval-augmented Large Language Models (LLMs) have emerged as a promising approach to enhancing response accuracy in several tasks, such as Question-Answering (QA).
We propose a novel adaptive QA framework that can dynamically select the most suitable strategy for (retrieval-augmented) LLMs based on the query complexity.
We validate our model on a set of open-domain QA datasets, covering multiple query complexities, and show that ours enhances the overall efficiency and accuracy of QA systems.
arXiv Detail & Related papers (2024-03-21T13:52:30Z)
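A minimal sketch of the complexity-based routing in Adaptive-RAG, assuming a hypothetical `classify_complexity` classifier and stub `retrieve`/`llm_answer` callables; the paper's actual classifier, labels, and multi-step strategy differ in detail.

```python
# Sketch of adaptive routing: a lightweight classifier decides whether a
# query needs no retrieval, one retrieval step, or iterative retrieval.
# `classify_complexity`, `llm_answer`, and `retrieve` are hypothetical stubs.
from typing import Callable

def adaptive_qa(query: str,
                classify_complexity: Callable[[str], str],
                llm_answer: Callable[[str, list[str]], str],
                retrieve: Callable[[str], list[str]],
                max_hops: int = 3) -> str:
    """Route a query to the cheapest strategy its complexity allows."""
    level = classify_complexity(query)  # 'simple' | 'single' | 'multi'
    if level == "simple":               # parametric knowledge suffices
        return llm_answer(query, [])
    if level == "single":               # one-shot retrieval
        return llm_answer(query, retrieve(query))
    context: list[str] = []             # iterative multi-hop retrieval
    question, answer = query, ""
    for _ in range(max_hops):
        context.extend(retrieve(question))
        answer = llm_answer(query, context)
        question = answer               # refine the next retrieval query
    return answer
```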
- From Chaos to Clarity: Claim Normalization to Empower Fact-Checking [57.024192702939736]
Claim Normalization (aka ClaimNorm) aims to decompose complex and noisy social media posts into more straightforward and understandable forms.
We propose CACN, a pioneering approach that leverages chain-of-thought and claim check-worthiness estimation.
Our experiments demonstrate that CACN outperforms several baselines across various evaluation measures.
arXiv Detail & Related papers (2023-10-22T16:07:06Z) - Status Quo and Problems of Requirements Engineering for Machine
Learning: Results from an International Survey [7.164324501049983]
Requirements Engineering (RE) can help address many problems when engineering Machine Learning-enabled systems.
We conducted a survey to gather practitioner insights into the status quo and problems of RE in ML-enabled systems.
We found significant differences in RE practices within ML projects.
arXiv Detail & Related papers (2023-10-10T15:53:50Z)