Related papers: An Empirical Study on LLM-based Classification of Requirements-related Provisions in Food-safety Regulations

An Empirical Study on LLM-based Classification of Requirements-related Provisions in Food-safety Regulations

URL: http://arxiv.org/abs/2501.14683v1
Date: Fri, 24 Jan 2025 17:59:14 GMT
Title: An Empirical Study on LLM-based Classification of Requirements-related Provisions in Food-safety Regulations
Authors: Shabnam Hassani, Mehrdad Sabetzadeh, Daniel Amyot,
Abstract summary: We conduct a Grounded Theory study of food-safety regulations.<n>We develop a conceptual characterization of food-safety concepts that closely relate to systems and software requirements.<n>We examine the effectiveness of two families of large language models (LLMs) in automatically classifying legal provisions.
Score: 3.1776778131016368
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: As Industry 4.0 transforms the food industry, the role of software in achieving compliance with food-safety regulations is becoming increasingly critical. Food-safety regulations, like those in many legal domains, have largely been articulated in a technology-independent manner to ensure their longevity and broad applicability. However, this approach leaves a gap between the regulations and the modern systems and software increasingly used to implement them. In this article, we pursue two main goals. First, we conduct a Grounded Theory study of food-safety regulations and develop a conceptual characterization of food-safety concepts that closely relate to systems and software requirements. Second, we examine the effectiveness of two families of large language models (LLMs) -- BERT and GPT -- in automatically classifying legal provisions based on requirements-related food-safety concepts. Our results show that: (a) when fine-tuned, the accuracy differences between the best-performing models in the BERT and GPT families are relatively small. Nevertheless, the most powerful model in our experiments, GPT-4o, still achieves the highest accuracy, with an average Precision of 89% and an average Recall of 87%; (b) few-shot learning with GPT-4o increases Recall to 97% but decreases Precision to 65%, suggesting a trade-off between fine-tuning and few-shot learning; (c) despite our training examples being drawn exclusively from Canadian regulations, LLM-based classification performs consistently well on test provisions from the US, indicating a degree of generalizability across regulatory jurisdictions; and (d) for our classification task, LLMs significantly outperform simpler baselines constructed using long short-term memory (LSTM) networks and automatic keyword extraction.

Related papers

Beyond Traditional Algorithms: Leveraging LLMs for Accurate Cross-Border Entity Identification [0.0]
Cross-border financial activities have underscored the necessity of accurately identifying and classifying foreign entities.<n>This paper explores Large Language Models (LLMs) as a flexible alternative to traditional algorithms.<n>We evaluate traditional methods, Hugging Face-based LLMs, and interface-based LLMs (e.g., Microsoft Copilot, Alibaba's Qwen 2.5) using a dataset of 65 Portuguese company cases.
arXiv Detail & Related papers (2025-07-15T08:28:24Z)
GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning [53.894789613838654]
We introduce SEED-Bench-R1, a benchmark with complex real-world videos requiring balanced perception and reasoning.<n>Using SEED-Bench-R1, we find that standard GRPO, while improving answer accuracy, often reduces logical coherence between reasoning steps and answers, with only a 57.9% consistency rate.<n>We propose GRPO-CARE, a consistency-aware RL framework optimizing both answer correctness and reasoning coherence without explicit supervision.
arXiv Detail & Related papers (2025-06-19T08:49:13Z)
LegalEval-Q: A New Benchmark for The Quality Evaluation of LLM-Generated Legal Text [0.0]
We develop a regression model to evaluate the quality of legal texts based on clarity, coherence, and terminology.<n>Our analysis identifies three key findings: First, model quality levels off at 14 billion parameters, with only a marginal improvement of $2.7%$ noted at 72 billion parameters.
arXiv Detail & Related papers (2025-05-30T17:30:18Z)
Classification or Prompting: A Case Study on Legal Requirements Traceability [6.411835643029738]
New regulations are continuously introduced to ensure that software development complies with the ethical concerns and prioritizes public safety. A prerequisite for demonstrating compliance involves tracing software requirements to legal provisions. This paper investigates two automated solutions to predict trace links between requirements and legal provisions.
arXiv Detail & Related papers (2025-02-07T13:33:40Z)
The Dual-use Dilemma in LLMs: Do Empowering Ethical Capacities Make a Degraded Utility? [54.18519360412294]
Large Language Models (LLMs) must balance between rejecting harmful requests for safety and accommodating legitimate ones for utility.<n>This paper presents a Direct Preference Optimization (DPO) based alignment framework that achieves better overall performance.<n>Our resulting model, LibraChem, outperforms leading LLMs including Claude-3, GPT-4o, and LLaMA-3 by margins of 13.44%, 7.16%, and 7.10% respectively.
arXiv Detail & Related papers (2025-01-20T06:35:01Z)
Adaptive Pruning for Large Language Models with Structural Importance Awareness [66.2690963378878]
Large language models (LLMs) have significantly improved language understanding and generation capabilities.<n>LLMs are difficult to deploy on resource-constrained edge devices due to their high computational and storage resource demands.<n>We propose structurally-aware adaptive pruning (SAAP) to significantly reduce the computational and memory costs while maintaining model performance.
arXiv Detail & Related papers (2024-12-19T18:08:04Z)
ConU: Conformal Uncertainty in Large Language Models with Correctness Coverage Guarantees [68.33498595506941]
We introduce a novel uncertainty measure based on self-consistency theory. We then develop a conformal uncertainty criterion by integrating the uncertainty condition aligned with correctness into the CP algorithm. Empirical evaluations indicate that our uncertainty measure outperforms prior state-of-the-art methods.
arXiv Detail & Related papers (2024-06-29T17:33:07Z)
Uncertainty Aware Learning for Language Model Alignment [97.36361196793929]
We propose uncertainty-aware learning (UAL) to improve the model alignment of different task scenarios. We implement UAL in a simple fashion -- adaptively setting the label smoothing value of training according to the uncertainty of individual samples. Experiments on widely used benchmarks demonstrate that our UAL significantly and consistently outperforms standard supervised fine-tuning.
arXiv Detail & Related papers (2024-06-07T11:37:45Z)
Enhancing Legal Compliance and Regulation Analysis with Large Language Models [0.0]
This research explores the application of Large Language Models (LLMs) to accurately classify legal provisions and automate compliance checks. Our findings demonstrate promising results, indicating LLMs' significant potential to enhance legal compliance and regulatory analysis efficiency, notably by reducing manual workload and improving accuracy within reasonable time financial constraints.
arXiv Detail & Related papers (2024-04-26T16:40:49Z)
Rethinking Legal Compliance Automation: Opportunities with Large Language Models [2.9088208525097365]
We argue that the examination of (textual) legal artifacts should, first employ broader context than sentences. We present a compliance analysis approach designed to address these limitations.
arXiv Detail & Related papers (2024-04-22T17:10:27Z)
GroundCocoa: A Benchmark for Evaluating Compositional & Conditional Reasoning in Language Models [14.108788704400643]
GroundCocoa is a lexically diverse benchmark connecting these reasoning skills to the real-world problem of flight booking. Our task involves aligning detailed user preferences with available flight options presented in a multiple-choice format. Results indicate a significant disparity in performance among current state-of-the-art LLMs with even the best performing model, GPT-4 Turbo, not exceeding 67% accuracy despite advanced prompting techniques.
arXiv Detail & Related papers (2024-04-05T17:36:26Z)
SoFA: Shielded On-the-fly Alignment via Priority Rule Following [90.32819418613407]
This paper introduces a novel alignment paradigm, priority rule following, which defines rules as the primary control mechanism in each dialog. We present PriorityDistill, a semi-automated approach for distilling priority following signals from simulations to ensure robust rule integration and adherence.
arXiv Detail & Related papers (2024-02-27T09:52:27Z)
How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts [54.07541591018305]
We present MAD-Bench, a benchmark that contains 1000 test samples divided into 5 categories, such as non-existent objects, count of objects, and spatial relationship. We provide a comprehensive analysis of popular MLLMs, ranging from GPT-4v, Reka, Gemini-Pro, to open-sourced models, such as LLaVA-NeXT and MiniCPM-Llama3. While GPT-4o achieves 82.82% accuracy on MAD-Bench, the accuracy of any other model in our experiments ranges from 9% to 50%.
arXiv Detail & Related papers (2024-02-20T18:31:27Z)
Flames: Benchmarking Value Alignment of LLMs in Chinese [86.73527292670308]
This paper proposes a value alignment benchmark named Flames. It encompasses both common harmlessness principles and a unique morality dimension that integrates specific Chinese values. Our findings indicate that all the evaluated LLMs demonstrate relatively poor performance on Flames.
arXiv Detail & Related papers (2023-11-12T17:18:21Z)
Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment [35.42539816648068]
This paper presents a comprehensive survey of key dimensions that are crucial to consider when assessing large language models (LLMs) The survey covers seven major categories of LLM trustworthiness: reliability, safety, fairness, resistance to misuse, explainability and reasoning, adherence to social norms, and robustness. Results indicate that, in general, more aligned models tend to perform better in terms of overall trustworthiness.
arXiv Detail & Related papers (2023-08-10T06:43:44Z)
Prompting GPT-3 To Be Reliable [117.23966502293796]
This work decomposes reliability into four facets: generalizability, fairness, calibration, and factuality. We find that GPT-3 outperforms smaller-scale supervised models by large margins on all these facets.
arXiv Detail & Related papers (2022-10-17T14:52:39Z)

This list is automatically generated from the titles and abstracts of the papers in this site.