Active Learning for Robust and Representative LLM Generation in Safety-Critical Scenarios
- URL: http://arxiv.org/abs/2410.11114v1
- Date: Mon, 14 Oct 2024 21:48:14 GMT
- Title: Active Learning for Robust and Representative LLM Generation in Safety-Critical Scenarios
- Authors: Sabit Hassan, Anthony Sicilia, Malihe Alikhani
- Abstract summary: Large Language Models (LLMs) can generate valuable data for safety measures, but often exhibit distributional biases.
We propose a novel framework that integrates active learning with clustering to guide LLM generation.
Our results show that the proposed framework produces a more representative set of safety scenarios without requiring prior knowledge of the underlying data distribution.
- Score: 32.16984263644299
- Abstract: Ensuring robust safety measures across a wide range of scenarios is crucial for user-facing systems. While Large Language Models (LLMs) can generate valuable data for safety measures, they often exhibit distributional biases, focusing on common scenarios and neglecting rare but critical cases. This can undermine the effectiveness of safety protocols developed using such data. To address this, we propose a novel framework that integrates active learning with clustering to guide LLM generation, enhancing the representativeness and robustness of the generated safety scenarios. We demonstrate the effectiveness of our approach by constructing a dataset of 5.4K potential safety violations through an iterative process involving LLM generation and an active learner model's feedback. Our results show that the proposed framework produces a more representative set of safety scenarios without requiring prior knowledge of the underlying data distribution. Additionally, data acquired through our method improves the accuracy and F1 score of both the active learner model and models outside the scope of the active learning process, highlighting its broad applicability.
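As a concrete illustration of the iterative loop described in the abstract, below is a minimal Python sketch of one clustering-guided active-learning acquisition round. The helpers generate_scenarios, embed, and oracle_label are hypothetical stand-ins (stubbed so the sketch runs), and the least-confidence acquisition rule is an assumption, not the paper's exact criterion.

```python
# Minimal sketch of an active-learning + clustering acquisition loop in the
# spirit of the paper; the prompts, models, and acquisition rule may differ.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-ins so the sketch runs end to end; replace with a real
# LLM call, text encoder, and human/model annotator.
rng = np.random.default_rng(0)
def generate_scenarios(feedback): return [f"scenario {i}" for i in range(200)]
def embed(texts): return rng.normal(size=(len(texts), 32))
def oracle_label(text): return int(rng.random() < 0.5)

def acquisition_round(pool_embs, learner, n_clusters=10, per_cluster=5):
    """Pick diverse, uncertain items: cluster the pool, then take the most
    uncertain candidates from each cluster (least-confidence score)."""
    clusters = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(pool_embs)
    uncertainty = 1.0 - learner.predict_proba(pool_embs).max(axis=1)
    picked = []
    for c in range(n_clusters):
        idx = np.where(clusters == c)[0]
        picked.extend(idx[np.argsort(-uncertainty[idx])[:per_cluster]].tolist())
    return picked

# Iterative loop: the LLM generates candidates, the active learner selects
# which ones to label, and its feedback conditions the next generation round.
labeled_texts, labeled_embs, labeled_y = [], [], []
learner = LogisticRegression(max_iter=1000)
for _ in range(5):
    texts = generate_scenarios(feedback=labeled_texts)
    embs = embed(texts)
    if labeled_embs:
        learner.fit(np.vstack(labeled_embs), labeled_y)
        chosen = acquisition_round(embs, learner)
    else:
        chosen = list(range(50))            # cold start: take an initial batch
    for i in chosen:
        labeled_texts.append(texts[i])
        labeled_embs.append(embs[i])
        labeled_y.append(oracle_label(texts[i]))
```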
Related papers
- Model Developmental Safety: A Safety-Centric Method and Applications in Vision-Language Models [75.8161094916476]
We study how to develop a pretrained vision-language model (i.e., the CLIP model) to acquire new capabilities or improve its existing image classification capabilities.
Our experiments on improving vision perception capabilities on autonomous driving and scene recognition datasets demonstrate the efficacy of the proposed approach.
arXiv Detail & Related papers (2024-10-04T22:34:58Z) - Towards Inference-time Category-wise Safety Steering for Large Language Models [3.712541089289745]
Large language models (LLMs) have seen unprecedented advancements in capabilities and applications across a variety of use-cases.
The fragile nature of LLMs warrants additional safety steering steps via training-free, inference-time methods.
Unlike recent inference-time safety steering works, in this paper we explore safety steering of LLM outputs using category-specific steering vectors.
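To make the mechanism concrete, here is a minimal PyTorch sketch of adding a category-specific steering vector to a transformer layer's hidden states at inference time. The layer index, the scale, and the randomly initialized vectors are illustrative assumptions; the paper's recipe for deriving the vectors is not shown.

```python
# Minimal sketch: inject a per-category steering vector into the residual
# stream of one transformer block during generation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

# One precomputed vector per safety category (e.g., a mean difference of
# hidden states between safe and unsafe continuations); random placeholders here.
hidden = model.config.hidden_size
steering_vectors = {"violence": torch.randn(hidden),
                    "self_harm": torch.randn(hidden)}

def make_hook(vec, scale=4.0):   # scale chosen arbitrarily for illustration
    def hook(module, inputs, output):
        hs = output[0]           # GPT-2 blocks return (hidden_states, ...)
        return (hs + scale * vec.to(hs.dtype), *output[1:])
    return hook

layer = model.transformer.h[6]   # a mid-depth layer; an arbitrary choice
handle = layer.register_forward_hook(make_hook(steering_vectors["violence"]))
inputs = tok("How do I handle a dispute?", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=40)
handle.remove()
print(tok.decode(out[0], skip_special_tokens=True))
```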
arXiv Detail & Related papers (2024-10-02T02:02:06Z) - Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models [94.39278422567955]
Fine-tuning large language models (LLMs) on human preferences has proven successful in enhancing their capabilities.
However, ensuring the safety of LLMs during fine-tuning remains a critical concern.
We propose a supervised learning framework called Bi-Factorial Preference Optimization (BFPO) to address this issue.
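The toy sketch below illustrates the general balancing idea as a single supervised pairwise loss; it is not BFPO's actual objective, and the factor scores, the weighting lam, and the DPO-style form are all assumptions.

```python
# Generic sketch of a bi-factorial preference loss: pairwise preferences are
# decided by a combined score that trades helpfulness against safety.
import torch
import torch.nn.functional as F

def combined_score(help_score, safe_score, lam=1.0):
    # lam > 1 prioritizes safety; lam < 1 prioritizes helpfulness
    return help_score + lam * safe_score

def bifactorial_pair_loss(logp_a, logp_b, help_a, help_b, safe_a, safe_b,
                          lam=1.0, beta=0.1):
    """DPO-style pairwise loss where the 'chosen' response is selected by the
    combined helpfulness/safety score rather than helpfulness alone."""
    prefer_a = combined_score(help_a, safe_a, lam) >= combined_score(help_b, safe_b, lam)
    margin = torch.where(prefer_a, logp_a - logp_b, logp_b - logp_a)
    return -F.logsigmoid(beta * margin).mean()

# Toy example: response B is more helpful, response A... wait, B is safer too,
# so with lam = 2.0 the loss pushes probability mass toward B.
logp_a, logp_b = torch.tensor([-5.0]), torch.tensor([-4.0])
help_a, help_b = torch.tensor([0.9]), torch.tensor([0.7])
safe_a, safe_b = torch.tensor([0.2]), torch.tensor([0.9])
print(bifactorial_pair_loss(logp_a, logp_b, help_a, help_b, safe_a, safe_b, lam=2.0))
```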
arXiv Detail & Related papers (2024-08-27T17:31:21Z) - Diffusion Models for Offline Multi-agent Reinforcement Learning with Safety Constraints [0.0]
We introduce an innovative framework integrating diffusion models within the Multi-agent Reinforcement Learning paradigm.
This approach notably enhances the safety of actions taken by multiple agents through risk mitigation while modeling coordinated action.
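A toy sketch of the general pattern follows: candidate joint actions are drawn from a (stub) reverse-diffusion sampler and filtered by a risk model. The denoiser, risk estimate, and simplified update rule are placeholders, not the paper's architecture or training procedure.

```python
# Toy sketch: sample candidate joint actions via a stubbed reverse-diffusion
# loop, then apply a risk-mitigation filter before acting.
import torch

def reverse_diffusion_sample(denoiser, n_agents, act_dim, steps=20):
    """Naive reverse loop over a joint-action tensor (simplified, not exact
    DDPM math)."""
    x = torch.randn(n_agents, act_dim)
    for t in reversed(range(steps)):
        eps_hat = denoiser(x, torch.tensor(float(t)))
        x = x - 0.05 * eps_hat
        if t > 0:
            x = x + 0.01 * torch.randn_like(x)   # small exploration noise
    return x

denoiser = lambda x, t: 0.1 * x        # placeholder network
risk = lambda a: a.abs().max()         # placeholder risk estimate

candidates = [reverse_diffusion_sample(denoiser, n_agents=3, act_dim=2)
              for _ in range(16)]
safe = [a for a in candidates if risk(a) < 1.0]   # keep low-risk candidates
action = min(safe or candidates, key=risk)        # fall back to least risky
```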
arXiv Detail & Related papers (2024-06-30T16:05:31Z) - Unveiling the Misuse Potential of Base Large Language Models via In-Context Learning [61.2224355547598]
Open-sourcing of large language models (LLMs) accelerates application development, innovation, and scientific progress, often on the assumption that base (non-instruction-tuned) models pose little misuse risk.
Our investigation exposes a critical oversight in this assumption.
By deploying carefully designed demonstrations, our research demonstrates that base LLMs could effectively interpret and execute malicious instructions.
arXiv Detail & Related papers (2024-04-16T13:22:54Z) - Empowering Autonomous Driving with Large Language Models: A Safety Perspective [82.90376711290808]
This paper explores the integration of Large Language Models (LLMs) into Autonomous Driving systems.
LLMs act as intelligent decision-makers in behavioral planning, augmented with a safety verifier shield for contextual safety learning.
We present two key studies in a simulated environment: an adaptive LLM-conditioned Model Predictive Control (MPC) and an LLM-enabled interactive behavior planning scheme with a state machine.
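A minimal sketch of the shielding pattern described above: an LLM-proposed maneuver is checked by a rule-based verifier and replaced with a conservative fallback if rejected. llm_propose (stubbed here) and the specific checks are hypothetical, not the paper's verifier.

```python
# Sketch of a "safety verifier shield" around an LLM behavioral planner.
SAFE_FALLBACK = "keep_lane_and_slow_down"

def llm_propose(context: dict) -> str:
    return "overtake"   # hypothetical LLM call, stubbed for the sketch

def verify(maneuver: str, gap_ahead_m: float, rel_speed_mps: float) -> bool:
    """Toy contextual checks; a real shield would encode many more rules."""
    if maneuver == "overtake" and gap_ahead_m < 30.0:
        return False                       # not enough room to pass safely
    if maneuver == "accelerate" and rel_speed_mps < -5.0:
        return False                       # closing on the lead vehicle fast
    return True

def shielded_plan(context: dict) -> str:
    proposal = llm_propose(context)
    if verify(proposal, context["gap_ahead_m"], context["rel_speed_mps"]):
        return proposal
    return SAFE_FALLBACK                   # conservative default on rejection

print(shielded_plan({"gap_ahead_m": 18.0, "rel_speed_mps": -2.0}))
```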
arXiv Detail & Related papers (2023-11-28T03:13:09Z) - Safety-Critical Scenario Generation Via Reinforcement Learning Based Editing [20.99962858782196]
We propose a deep reinforcement learning approach that generates safety-critical scenarios by sequential editing.
Our framework employs a reward function consisting of both risk and plausibility objectives.
Our evaluation demonstrates that the proposed method generates safety-critical scenarios of higher quality compared with previous approaches.
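As a small illustration of such a composite reward, the sketch below combines a risk term (inverse time-to-collision) with a plausibility term (trajectory likelihood under a learned traffic prior). Both terms and the weights are assumptions for illustration, not the paper's definitions.

```python
# Sketch of a composite editing reward: raise risk, penalize implausibility.
def scenario_reward(min_ttc_s: float, traj_log_likelihood: float,
                    w_risk: float = 1.0, w_plaus: float = 0.5) -> float:
    risk = 1.0 / max(min_ttc_s, 1e-3)    # lower time-to-collision -> riskier
    plausibility = traj_log_likelihood   # e.g., under a learned traffic prior
    return w_risk * risk + w_plaus * plausibility

# A near-collision edit with a plausible trajectory scores highest.
print(scenario_reward(min_ttc_s=0.8, traj_log_likelihood=-1.2))
```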
arXiv Detail & Related papers (2023-06-25T05:15:25Z) - SAMBA: Safe Model-Based & Active Reinforcement Learning [59.01424351231993]
SAMBA is a framework for safe reinforcement learning that combines aspects from probabilistic modelling, information theory, and statistics.
We evaluate our algorithm on a variety of safe dynamical system benchmarks involving both low- and high-dimensional state representations.
We provide intuition as to the effectiveness of the framework through a detailed analysis of our active metrics and safety constraints.
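One statistical ingredient such frameworks can build on is a tail-risk (CVaR) estimate of episode cost, checked against a budget before accepting a policy update. Below is a minimal numpy sketch; the alpha level, safety budget, and stand-in cost data are chosen arbitrarily for illustration.

```python
# Sketch: empirical conditional value-at-risk (CVaR) as a safety check.
import numpy as np

def cvar(costs: np.ndarray, alpha: float = 0.1) -> float:
    """Mean of the worst alpha-fraction of costs."""
    k = max(1, int(np.ceil(alpha * len(costs))))
    return float(np.sort(costs)[-k:].mean())

rollout_costs = np.random.gamma(shape=2.0, scale=1.0, size=500)  # stand-in data
if cvar(rollout_costs, alpha=0.1) > 5.0:       # hypothetical safety budget
    print("reject update: tail risk exceeds budget")
```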
arXiv Detail & Related papers (2020-06-12T10:40:46Z) - Chance-Constrained Trajectory Optimization for Safe Exploration and Learning of Nonlinear Systems [81.7983463275447]
Learning-based control algorithms require data collection with abundant supervision for training.
We present a new approach for optimal motion planning with safe exploration that integrates chance-constrained optimal control with dynamics learning and feedback control.
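The standard construction behind such chance constraints is sketched below: a probabilistic constraint P(g(x) <= 0) >= 1 - eps is replaced by a deterministic one tightened by an uncertainty-dependent margin. The Gaussian marginal assumption and the numbers are illustrative, not the paper's formulation.

```python
# Sketch: chance-constraint tightening under a Gaussian marginal assumption.
from statistics import NormalDist

def tightened_clearance(mean_dist_m: float, std_m: float, eps: float = 0.05) -> float:
    """Clearance remaining after the back-off margin; a planner enforces
    tightened_clearance(mu_k, sigma_k) >= 0 at every step k."""
    z = NormalDist().inv_cdf(1.0 - eps)   # ~1.645 for eps = 0.05
    return mean_dist_m - z * std_m

# With mean clearance 2.0 m and std 0.8 m, the tightened clearance is ~0.68 m,
# so the constraint still holds at the 95% level.
print(tightened_clearance(mean_dist_m=2.0, std_m=0.8))
```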
arXiv Detail & Related papers (2020-05-09T05:57:43Z)