Classification or Prompting: A Case Study on Legal Requirements Traceability
- URL: http://arxiv.org/abs/2502.04916v4
- Date: Fri, 22 Aug 2025 14:52:02 GMT
- Title: Classification or Prompting: A Case Study on Legal Requirements Traceability
- Authors: Romina Etezadi, Sallam Abualhaija, Chetan Arora, Lionel Briand
- Abstract summary: Legal requirements traceability is a key task where engineers must analyze technical requirements against target artifacts. In this paper, we investigate two automated solutions based on language models, including large ones (LLMs). The first solution, Kashif, is a classifier that leverages sentence transformers and semantic similarity. The second solution, RICE_LRT, prompts a recent generative LLM based on RICE, a prompt engineering framework.
- Score: 4.629156733452248
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: New regulations are introduced to ensure software development aligns with ethical concerns and protects public safety. Showing compliance requires tracing requirements to legal provisions. Requirements traceability is a key task where engineers must analyze technical requirements against target artifacts, often within limited time. Manually analyzing complex systems with hundreds of requirements is infeasible. The legal dimension adds challenges that increase effort. In this paper, we investigate two automated solutions based on language models, including large ones (LLMs). The first solution, Kashif, is a classifier that leverages sentence transformers and semantic similarity. The second solution, RICE_LRT, prompts a recent generative LLM based on RICE, a prompt engineering framework. On a benchmark dataset, we empirically evaluate Kashif and compare it against five different baseline classifiers from the literature. Kashif can identify trace links with a recall of 67%, precision of 50%, and F2 score of 63%, outperforming the best baseline by a substantial margin of 41 percentage points (pp) in F2. However, on unseen, more complex requirements documents traced to the European General Data Protection Regulation (GDPR), Kashif performs poorly, yielding an average recall of 15%, an average precision of 10%, and an average F2 score of 13.5%. On the same documents, however, our RICE solution yields an average recall of 84%, an average precision of 30%, and an average F2 score of 61%. RICE achieved a remarkable improvement of 47.5 pp over Kashif in terms of F2 score. Our results suggest that requirements traceability in the legal context cannot be simply addressed by building classifiers, as such solutions do not generalize and fail to perform well on complex regulations and requirements. Resorting to generative LLMs, with careful prompt engineering, is thus a more promising alternative.
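The F2 scores quoted in the abstract weight recall more heavily than precision. As a minimal sketch, assuming the standard F-beta definition with beta = 2 (the abstract does not spell out the formula, and the helper name `f_beta` is hypothetical), the benchmark figure for Kashif can be checked from its reported precision and recall:

```python
def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """Standard F-beta score: a weighted harmonic mean of precision and recall.

    With beta = 2, recall counts more heavily than precision.
    Assumption: this is the textbook definition, not code from the paper.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing is retrieved correctly
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Kashif on the benchmark dataset, as reported in the abstract:
# precision 50%, recall 67%.
kashif_benchmark_f2 = f_beta(precision=0.50, recall=0.67)
print(f"{kashif_benchmark_f2:.2f}")  # prints 0.63
```

The GDPR figures in the abstract are averages over several documents, so applying the formula to the averaged precision and recall need not reproduce the averaged F2 exactly.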
Related papers
- Reliability by design: quantifying and eliminating fabrication risk in LLMs. From generative to consultative AI: a comparative analysis in the legal domain and lessons for high-stakes knowledge bases [0.0]
This paper examines how to make large language models reliable for high-stakes legal work by reducing hallucinations. It distinguishes three AI paradigms: (1) standalone generative models ("creative oracle"), (2) basic retrieval-augmented systems ("expert archivist"), and (3) an advanced, end-to-end optimized RAG system ("rigorous archivist").
arXiv Detail & Related papers (2026-01-21T21:26:42Z) - SeBERTis: A Framework for Producing Classifiers of Security-Related Issue Reports [8.545800179148442]
SEBERTIS is a framework to train Deep Neural Networks (DNNs) as classifiers independent of lexical cues. Our framework achieves a 0.9880 F1-score in detecting security-related issues of a curated corpus of 10,000 GitHub issue reports.
arXiv Detail & Related papers (2025-12-17T01:23:11Z) - RefineBench: Evaluating Refinement Capability of Language Models via Checklists [71.02281792867531]
We evaluate two refinement modes: guided refinement and self-refinement. In guided refinement, both proprietary LMs and large open-weight LMs can leverage targeted feedback to refine responses to near-perfect levels within five turns. These findings suggest that frontier LMs require breakthroughs to self-refine their incorrect responses.
arXiv Detail & Related papers (2025-11-27T07:20:52Z) - SLEAN: Simple Lightweight Ensemble Analysis Network for Multi-Provider LLM Coordination: Design, Implementation, and Vibe Coding Bug Investigation Case Study [0.0]
SLEAN operates as a simple prompt bridge between LLMs using .txt templates, requiring no deep technical knowledge for deployment. The three-phase protocol, formed by independent analysis, cross-critique, and arbitration, filters harmful AI-generated code suggestions. The file-driven, provider-agnostic architecture enables deployment without specialized coding expertise.
arXiv Detail & Related papers (2025-10-11T04:24:04Z) - Eigen-1: Adaptive Multi-Agent Refinement with Monitor-Based RAG for Scientific Reasoning [53.45095336430027]
We develop a unified framework that combines implicit retrieval and structured collaboration. On Humanity's Last Exam (HLE) Bio/Chem Gold, our framework achieves 48.3% accuracy. Results on SuperGPQA and TRQA confirm robustness across domains.
arXiv Detail & Related papers (2025-09-25T14:05:55Z) - Fine-Tuning Vision-Language Models for Markdown Conversion of Financial Tables in Malaysian Audited Financial Reports [0.0]
We propose a fine-tuned vision-language model (VLM) based on Qwen2.5-VL-7B. Our approach includes a curated dataset of 2,152 image-text pairs with augmentations and a supervised fine-tuning strategy using LoRA. Our model achieves a 92.20% overall accuracy on the criteria-based assessment and a 96.53% markdown TEDS score.
arXiv Detail & Related papers (2025-08-04T04:54:00Z) - Evaluating and Improving Large Language Models for Competitive Program Generation [18.564450345359468]
This study aims to evaluate and improve large language models (LLMs) in solving real-world competitive programming problems. We collect 117 problems from nine regional ICPC/CCPC contests held in 2024 and design four filtering criteria to construct a curated benchmark consisting of 80 problems. We evaluate its competitive program generation capabilities through the online judge (OJ) platforms, guided by a carefully designed basic prompt.
arXiv Detail & Related papers (2025-06-28T17:18:23Z) - ReqBrain: Task-Specific Instruction Tuning of LLMs for AI-Assisted Requirements Generation [4.475603469482274]
Software engineers can engage with ReqBrain through chat-based sessions to automatically generate software requirements. The top-performing model, Zephyr-7b-beta, achieved 89.30% F1 using the BERT score and a FRUGAL score of 91.20 in generating authentic and adequate requirements.
arXiv Detail & Related papers (2025-05-23T08:45:46Z) - AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios [51.46347732659174]
Large Language Models (LLMs) have demonstrated advanced capabilities in real-world agentic applications. AgentIF is the first benchmark for systematically evaluating LLM instruction following ability in agentic scenarios.
arXiv Detail & Related papers (2025-05-22T17:31:10Z) - Automated Repair of Ambiguous Natural Language Requirements [9.379494157034083]
Large language models (LLMs) in software engineering have amplified the role of natural language (NL). We introduce automated repair of ambiguous NL requirements, which we approach by reducing code generation uncertainty. Our results show that SpecFix modifies 23.93% of the requirements, leading to a 33.66% improvement in model Pass@1 on the modified requirements.
arXiv Detail & Related papers (2025-05-12T06:47:53Z) - TVR: Automotive System Requirement Traceability Validation and Recovery Through Retrieval-Augmented Generation [7.50061902435987]
Traceability between stakeholder requirements and system requirements is crucial to ensure consistency, correctness, and regulatory compliance.
Existing approaches do not address traceability between stakeholder and system requirements, rely on open-source data, and do not address the validation of manual links established by engineers.
We introduce TVR, a requirement Traceability Validation and Recovery approach primarily targeting automotive systems.
arXiv Detail & Related papers (2025-04-21T20:37:23Z) - Retrieval-Augmented Generation with Conflicting Evidence [57.66282463340297]
Large language model (LLM) agents are increasingly employing retrieval-augmented generation (RAG) to improve the factuality of their responses.
In practice, these systems often need to handle ambiguous user queries and potentially conflicting information from multiple sources.
We propose RAMDocs (Retrieval with Ambiguity and Misinformation in Documents), a new dataset that simulates complex and realistic scenarios for conflicting evidence for a user query.
arXiv Detail & Related papers (2025-04-17T16:46:11Z) - Evaluating Retrieval Augmented Generative Models for Document Queries in Transportation Safety [0.7373617024876725]
This study evaluates the performance of three fine-tuned generative models, ChatGPT, Google's Vertex AI, and ORNL Retrieval Augmented Generation augmented LLaMA 2 and LLaMA.
We developed 100 realistic queries relevant to route planning and permitting requirements.
Results demonstrated that the RAG-augmented LLaMA models significantly outperformed Vertex AI and ChatGPT, providing more detailed and generally accurate information.
arXiv Detail & Related papers (2025-04-09T16:37:03Z) - SUNAR: Semantic Uncertainty based Neighborhood Aware Retrieval for Complex QA [2.7703990035016868]
We introduce SUNAR, a novel approach that leverages large language models to guide a Neighborhood Aware Retrieval process.
We validate our approach through extensive experiments on two complex QA datasets.
Our results show that SUNAR significantly outperforms existing retrieve-and-reason baselines, achieving up to a 31.84% improvement in performance.
arXiv Detail & Related papers (2025-03-23T08:50:44Z) - Benchmarking Reasoning Robustness in Large Language Models [76.79744000300363]
We find significant performance degradation on novel or incomplete data. These findings highlight the reliance on recall over rigorous logical inference. This paper introduces a novel benchmark, termed Math-RoB, that exploits hallucinations triggered by missing information to expose reasoning gaps.
arXiv Detail & Related papers (2025-03-06T15:36:06Z) - An Empirical Study on LLM-based Classification of Requirements-related Provisions in Food-safety Regulations [3.1776778131016368]
We conduct a Grounded Theory study of food-safety regulations. We develop a conceptual characterization of food-safety concepts that closely relate to systems and software requirements. We examine the effectiveness of two families of large language models (LLMs) in automatically classifying legal provisions.
arXiv Detail & Related papers (2025-01-24T17:59:14Z) - The Dual-use Dilemma in LLMs: Do Empowering Ethical Capacities Make a Degraded Utility? [54.18519360412294]
Large Language Models (LLMs) must balance between rejecting harmful requests for safety and accommodating legitimate ones for utility.
This paper presents a Direct Preference Optimization (DPO) based alignment framework that achieves better overall performance.
We analyze experimental results obtained from testing DeepSeek-R1 on our benchmark and reveal the critical ethical concerns raised by this highly acclaimed model.
arXiv Detail & Related papers (2025-01-20T06:35:01Z) - MAIN-RAG: Multi-Agent Filtering Retrieval-Augmented Generation [34.66546005629471]
Large Language Models (LLMs) are essential tools for various natural language processing tasks but often suffer from generating outdated or incorrect information. Retrieval-Augmented Generation (RAG) addresses this issue by incorporating external, real-time information retrieval to ground LLM responses. To tackle this problem, we propose Multi-Agent Filtering Retrieval-Augmented Generation (MAIN-RAG). MAIN-RAG is a training-free RAG framework that leverages multiple LLM agents to collaboratively filter and score retrieved documents.
arXiv Detail & Related papers (2024-12-31T08:07:26Z) - Methods for Legal Citation Prediction in the Age of LLMs: An Australian Law Case Study [9.30538764385435]
We focus on the problem of legal citation prediction within the Australian law context, where correctly identifying and citing relevant legislations or precedents is critical. Our findings indicate that domain-specific pre-training alone is insufficient for achieving satisfactory citation accuracy even after law-specialised pre-training. In contrast, instruction tuning on our task-specific dataset dramatically boosts performance reaching the best results across all settings.
arXiv Detail & Related papers (2024-12-09T07:46:14Z) - Exploring Response Uncertainty in MLLMs: An Empirical Evaluation under Misleading Scenarios [49.53589774730807]
Multimodal large language models (MLLMs) have recently achieved state-of-the-art performance on tasks ranging from visual question answering to video understanding. We reveal a response uncertainty phenomenon: twelve state-of-the-art open-source MLLMs overturn a previously correct answer in 65% of cases after receiving a single deceptive cue.
arXiv Detail & Related papers (2024-11-05T01:11:28Z) - Enhancing Legal Case Retrieval via Scaling High-quality Synthetic Query-Candidate Pairs [67.54302101989542]
Legal case retrieval aims to provide similar cases as references for a given fact description.
Existing works mainly focus on case-to-case retrieval using lengthy queries.
Data scale is insufficient to satisfy the training requirements of existing data-hungry neural models.
arXiv Detail & Related papers (2024-10-09T06:26:39Z) - SFR-RAG: Towards Contextually Faithful LLMs [57.666165819196486]
Retrieval Augmented Generation (RAG) is a paradigm that integrates external contextual information with large language models (LLMs) to enhance factual accuracy and relevance.
We introduce SFR-RAG, a small LLM that is instruction-tuned with an emphasis on context-grounded generation and hallucination minimization.
We also present ConBench, a new evaluation framework compiling multiple popular and diverse RAG benchmarks.
arXiv Detail & Related papers (2024-09-16T01:08:18Z) - OR-Bench: An Over-Refusal Benchmark for Large Language Models [65.34666117785179]
Large Language Models (LLMs) require careful safety alignment to prevent malicious outputs. This study proposes a novel method for automatically generating large-scale over-refusal datasets. We introduce OR-Bench, the first large-scale over-refusal benchmark.
arXiv Detail & Related papers (2024-05-31T15:44:33Z) - Rethinking Legal Compliance Automation: Opportunities with Large Language Models [2.9088208525097365]
We argue that the examination of (textual) legal artifacts should first employ broader context than sentences.
We present a compliance analysis approach designed to address these limitations.
arXiv Detail & Related papers (2024-04-22T17:10:27Z) - Adaptive-RAG: Learning to Adapt Retrieval-Augmented Large Language Models through Question Complexity [59.57065228857247]
Retrieval-augmented Large Language Models (LLMs) have emerged as a promising approach to enhancing response accuracy in several tasks, such as Question-Answering (QA).
We propose a novel adaptive QA framework that can dynamically select the most suitable strategy for (retrieval-augmented) LLMs based on the query complexity.
We validate our model on a set of open-domain QA datasets, covering multiple query complexities, and show that ours enhances the overall efficiency and accuracy of QA systems.
arXiv Detail & Related papers (2024-03-21T13:52:30Z) - From Chaos to Clarity: Claim Normalization to Empower Fact-Checking [57.024192702939736]
Claim Normalization (aka ClaimNorm) aims to decompose complex and noisy social media posts into more straightforward and understandable forms.
We propose CACN, a pioneering approach that leverages chain-of-thought and claim check-worthiness estimation.
Our experiments demonstrate that CACN outperforms several baselines across various evaluation measures.
arXiv Detail & Related papers (2023-10-22T16:07:06Z) - Status Quo and Problems of Requirements Engineering for Machine Learning: Results from an International Survey [7.164324501049983]
Requirements Engineering (RE) can help address many problems when engineering Machine Learning-enabled systems.
We conducted a survey to gather practitioner insights into the status quo and problems of RE in ML-enabled systems.
We found significant differences in RE practices within ML projects.
arXiv Detail & Related papers (2023-10-10T15:53:50Z) - PRover: Proof Generation for Interpretable Reasoning over Rules [81.40404921232192]
We propose a transformer-based model that answers binary questions over rule-bases and generates the corresponding proofs.
Our model learns to predict nodes and edges corresponding to proof graphs in an efficient constrained training paradigm.
We conduct experiments on synthetic, hand-authored, and human-paraphrased rule-bases to show promising results for QA and proof generation.
arXiv Detail & Related papers (2020-10-06T15:47:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.