Class-RAG: Real-Time Content Moderation with Retrieval Augmented Generation
- URL: http://arxiv.org/abs/2410.14881v2
- Date: Tue, 17 Dec 2024 22:07:18 GMT
- Title: Class-RAG: Real-Time Content Moderation with Retrieval Augmented Generation
- Authors: Jianfa Chen, Emily Shen, Trupti Bavalatti, Xiaowen Lin, Yongkai Wang, Shuming Hu, Harihar Subramanyam, Ksheeraj Sai Vepuri, Ming Jiang, Ji Qi, Li Chen, Nan Jiang, Ankit Jain
- Abstract summary: We propose a Classification approach employing Retrieval-Augmented Generation (Class-RAG)
Compared to model fine-tuning, Class-RAG demonstrates flexibility and transparency in decision-making, outperforms on classification and is more robust against adversarial attack.
Our findings also suggest that Class-RAG performance scales with retrieval library size, indicating that increasing the library size is a viable and low-cost approach to improve content moderation.
- Score: 15.298017013140385
- License:
- Abstract: Robust content moderation classifiers are essential for the safety of Generative AI systems. In this task, differences between safe and unsafe inputs are often extremely subtle, making it difficult for classifiers (and indeed, even humans) to properly distinguish violating vs. benign samples without context or explanation. Scaling risk discovery and mitigation through continuous model fine-tuning is also slow, challenging and costly, preventing developers from being able to respond quickly and effectively to emergent harms. We propose a Classification approach employing Retrieval-Augmented Generation (Class-RAG). Class-RAG extends the capability of its base LLM through access to a retrieval library which can be dynamically updated to enable semantic hotfixing for immediate, flexible risk mitigation. Compared to model fine-tuning, Class-RAG demonstrates flexibility and transparency in decision-making, outperforms on classification and is more robust against adversarial attack, as evidenced by empirical studies. Our findings also suggest that Class-RAG performance scales with retrieval library size, indicating that increasing the library size is a viable and low-cost approach to improve content moderation.
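As a rough illustration of the mechanism described in the abstract, the following minimal sketch shows how a retrieval-augmented classifier with a dynamically updatable library might be structured. The `RetrievalLibrary` class, the `embed_fn` and `llm_fn` callables, and the SAFE/UNSAFE prompt format are illustrative assumptions, not components released with the paper.

```python
import numpy as np

class RetrievalLibrary:
    """In-memory library of labeled examples used as retrieval context."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn      # assumed: maps str -> 1-D numpy vector
        self.embeddings = []
        self.examples = []            # list of (text, label) pairs

    def add(self, text, label):
        """Append a labeled example; this is the dynamic-update / hotfix path."""
        self.embeddings.append(self.embed_fn(text))
        self.examples.append((text, label))

    def retrieve(self, query, k=4):
        """Return the k examples most similar to the query by cosine similarity."""
        q = self.embed_fn(query)
        sims = [q @ e / (np.linalg.norm(q) * np.linalg.norm(e) + 1e-9)
                for e in self.embeddings]
        top = np.argsort(sims)[::-1][:k]
        return [self.examples[i] for i in top]

def classify(query, library, llm_fn, k=4):
    """Prompt the LLM with retrieved labeled neighbors and return its label."""
    neighbors = library.retrieve(query, k=k)
    context = "\n".join(f"Example: {t}\nLabel: {l}" for t, l in neighbors)
    prompt = (f"{context}\n\n"
              f"Using the labeled examples above as reference, classify the "
              f"following input as SAFE or UNSAFE.\nInput: {query}\nLabel:")
    return llm_fn(prompt)

# Semantic hotfix: appending freshly labeled examples takes effect on the next
# query, with no fine-tuning of the underlying model, e.g.:
# library.add("newly discovered violating phrasing", "UNSAFE")
```

Because new labeled examples are consulted at inference time, appending a handful of examples covering a newly discovered harm changes classification behavior immediately, which is the "semantic hotfixing" the abstract refers to.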
Related papers
- TrustRAG: Enhancing Robustness and Trustworthiness in RAG [31.231916859341865]
TrustRAG is a framework that systematically filters compromised and irrelevant content before it is retrieved for generation.
TrustRAG delivers substantial improvements in retrieval accuracy, efficiency, and attack resistance compared to existing approaches.
arXiv Detail & Related papers (2025-01-01T15:57:34Z)
- Enhancing AI Safety Through the Fusion of Low Rank Adapters [7.384556630042846]
Low-Rank Adapter Fusion mitigates harmful responses when faced with malicious prompts.
We show a 42% reduction in the harmfulness rate by leveraging LoRA fusion between a task adapter and a safety adapter (see the weight-fusion sketch after this list).
We also observe exaggerated safety behaviour, where the model rejects safe prompts that closely resemble unsafe ones.
arXiv Detail & Related papers (2024-12-30T13:12:27Z)
- Unsupervised dense retrieval with counterfactual contrastive learning [16.679649921935482]
We propose to improve the robustness of dense retrieval models by enhancing their sensitivity to fine-grained relevance signals.
A model that is sensitive in this context should exhibit high variance when the key passages that determine a document's relevance to a query are modified.
Motivated by causality and counterfactual analysis, we propose a series of counterfactual regularization methods.
arXiv Detail & Related papers (2024-12-30T07:01:34Z)
- Towards More Robust Retrieval-Augmented Generation: Evaluating RAG Under Adversarial Poisoning Attacks [45.07581174558107]
Retrieval-Augmented Generation (RAG) systems have emerged as a promising solution to mitigate hallucinations.
RAG systems are vulnerable to adversarial poisoning attacks, where malicious passages injected into retrieval databases can mislead the model into generating factually incorrect outputs.
This paper investigates both the retrieval and the generation components of RAG systems to understand how to enhance their robustness against such attacks.
arXiv Detail & Related papers (2024-12-21T17:31:52Z)
- UncertaintyRAG: Span-Level Uncertainty Enhanced Long-Context Modeling for Retrieval-Augmented Generation [93.38604803625294]
We present UncertaintyRAG, a novel approach for long-context Retrieval-Augmented Generation (RAG).
We use Signal-to-Noise Ratio (SNR)-based span uncertainty to estimate similarity between text chunks.
UncertaintyRAG outperforms baselines by 2.03% on LLaMA-2-7B, achieving state-of-the-art results.
arXiv Detail & Related papers (2024-10-03T17:39:38Z)
- ShieldGemma: Generative AI Content Moderation Based on Gemma [49.91147965876678]
ShieldGemma is a suite of safety content moderation models built upon Gemma2.
These models provide robust, state-of-the-art predictions of safety risks across key harm types.
arXiv Detail & Related papers (2024-07-31T17:48:14Z)
- RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content [62.685566387625975]
Current mitigation strategies, while effective, are not resilient under adversarial attacks.
This paper introduces Resilient Guardrails for Large Language Models (RigorLLM), a novel framework designed to efficiently moderate harmful and unsafe inputs.
arXiv Detail & Related papers (2024-03-19T07:25:02Z)
- Model Stealing Attack against Graph Classification with Authenticity, Uncertainty and Diversity [80.16488817177182]
GNNs are vulnerable to model stealing attacks, in which an adversary duplicates the target model by exploiting query permissions.
We introduce three model stealing attacks to adapt to different actual scenarios.
arXiv Detail & Related papers (2023-12-18T05:42:31Z)
- Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations [29.32704733570445]
We introduce Llama Guard, an input-output safeguard model geared towards Human-AI conversation use cases.
Llama Guard incorporates a safety risk taxonomy, a valuable tool for categorizing a specific set of safety risks.
Llama Guard demonstrates strong performance on existing benchmarks such as the OpenAI Moderation Evaluation dataset and ToxicChat.
arXiv Detail & Related papers (2023-12-07T19:40:50Z)
- Self-Supervised Class Incremental Learning [51.62542103481908]
Existing Class Incremental Learning (CIL) methods are based on a supervised classification framework sensitive to data labels.
When updating them based on the new class data, they suffer from catastrophic forgetting: the model cannot discern old class data clearly from the new.
In this paper, we explore the performance of Self-Supervised representation learning in Class Incremental Learning (SSCIL) for the first time.
arXiv Detail & Related papers (2021-11-18T06:58:19Z)
- Policy Smoothing for Provably Robust Reinforcement Learning [109.90239627115336]
We study the provable robustness of reinforcement learning against norm-bounded adversarial perturbations of the inputs.
We generate certificates that guarantee that the total reward obtained by the smoothed policy will not fall below a certain threshold under a norm-bounded adversarial perturbation of the input.
arXiv Detail & Related papers (2021-06-21T21:42:08Z)
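The Low Rank Adapters entry above describes fusing a task adapter with a safety adapter. The snippet below is a minimal sketch of weight-space fusion of two low-rank updates, assuming the standard LoRA parameterization (W + B A); the function name, scaling factors, and toy dimensions are illustrative, not the paper's implementation.

```python
import torch

def fuse_lora_deltas(base_weight, task_A, task_B, safety_A, safety_B,
                     task_scale=1.0, safety_scale=1.0):
    """Merge a task adapter and a safety adapter into one effective weight.

    Each LoRA adapter contributes a low-rank update B @ A to the frozen base
    weight; fusion simply sums the (scaled) updates from both adapters.
    """
    delta_task = task_B @ task_A          # (out, r) @ (r, in) -> (out, in)
    delta_safety = safety_B @ safety_A
    return base_weight + task_scale * delta_task + safety_scale * delta_safety

# Toy example: a 16x32 linear layer with rank-4 adapters.
out_dim, in_dim, r = 16, 32, 4
base = torch.randn(out_dim, in_dim)
task_A, task_B = torch.randn(r, in_dim), torch.randn(out_dim, r)
safety_A, safety_B = torch.randn(r, in_dim), torch.randn(out_dim, r)
fused = fuse_lora_deltas(base, task_A, task_B, safety_A, safety_B)
print(fused.shape)  # torch.Size([16, 32])
```

The relative scales are one way to trade off the two adapters' influence; treating them as tunable is an assumption here, not a detail taken from that paper's summary.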