SMARTER: A Data-efficient Framework to Improve Toxicity Detection with Explanation via Self-augmenting Large Language Models
- URL: http://arxiv.org/abs/2509.15174v2
- Date: Wed, 08 Oct 2025 03:00:50 GMT
- Title: SMARTER: A Data-efficient Framework to Improve Toxicity Detection with Explanation via Self-augmenting Large Language Models
- Authors: Huy Nghiem, Advik Sachdeva, Hal Daumé III,
- Abstract summary: We introduce SMARTER, a framework for explainable content moderation using Large Language Models (LLMs)<n>In Stage 1, we leverage LLMs' own outputs to generate synthetic explanations for both correct and incorrect labels.<n>In Stage 2, we refine explanation quality through cross-model training, allowing weaker models to align stylistically and semantically with stronger ones.
- Score: 23.6287295499467
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: WARNING: This paper contains examples of offensive materials. To address the proliferation of toxic content on social media, we introduce SMARTER, we introduce SMARTER, a data-efficient two-stage framework for explainable content moderation using Large Language Models (LLMs). In Stage 1, we leverage LLMs' own outputs to generate synthetic explanations for both correct and incorrect labels, enabling alignment via preference optimization with minimal human supervision. In Stage 2, we refine explanation quality through cross-model training, allowing weaker models to align stylistically and semantically with stronger ones. Experiments on three benchmark tasks -- HateXplain, Latent Hate, and Implicit Hate -- demonstrate that SMARTER enables LLMs to achieve up to a 13.5% macro-F1 improvement over standard few-shot baselines while using only a fraction of the full training data. Our framework offers a scalable strategy for low-resource settings by harnessing LLMs' self-improving capabilities for both classification and explanation.
Related papers
- DaMoC: Efficiently Selecting the Optimal Large Language Model for Fine-tuning Domain Tasks Based on Data and Model Compression [7.1654056866441245]
Large language models (LLMs) excel in general tasks but struggle with domain-specific ones, requiring fine-tuning with specific data.<n>We introduce a Data and Model Compression Framework (DaMoC) that addresses this challenge.<n>We show that we can select the optimal LLM while saving approximately 20-fold in training time.
arXiv Detail & Related papers (2025-09-01T08:06:49Z) - LLM-based Semantic Augmentation for Harmful Content Detection [5.954202581988127]
This paper introduces an approach that prompts large language models to clean noisy text and provide context-rich explanations.<n>We evaluate on the SemEval 2024 multi-label Persuasive Meme dataset and validate on the Google Jigsaw toxic comments and Facebook hateful memes datasets.<n>Our results reveal that zero-shot LLM classification underperforms on these high-context tasks compared to supervised models.
arXiv Detail & Related papers (2025-04-22T02:59:03Z) - Refining Sentence Embedding Model through Ranking Sentences Generation with Large Language Models [60.00178316095646]
Sentence embedding is essential for many NLP tasks, with contrastive learning methods achieving strong performance using datasets like NLI.<n>Recent studies leverage large language models (LLMs) to generate sentence pairs, reducing annotation dependency.<n>We propose a method for controlling the generation direction of LLMs in the latent space. Unlike unconstrained generation, the controlled approach ensures meaningful semantic divergence.<n> Experiments on multiple benchmarks demonstrate that our method achieves new SOTA performance with a modest cost in ranking sentence synthesis.
arXiv Detail & Related papers (2025-02-19T12:07:53Z) - LLMs are Also Effective Embedding Models: An In-depth Overview [41.34518828980852]
Large language models (LLMs) have revolutionized natural language processing by achieving state-of-the-art performance across various tasks.<n>Recently, their effectiveness as embedding models has gained attention, marking a paradigm shift from traditional encoder-only models like ELMo and BERT to decoder-only, large-scale LLMs like GPT, LLaMA, and Mistral.
arXiv Detail & Related papers (2024-12-17T06:48:24Z) - Large Language Models can be Strong Self-Detoxifiers [82.6594169242814]
Self-disciplined Autoregressive Sampling (SASA) is a lightweight controlled decoding algorithm for toxicity reduction of large language models (LLMs)
SASA tracks the margin of the current output to steer the generation away from the toxic subspace, by adjusting the autoregressive sampling strategy.
evaluated on LLMs of different scale and nature, namely Llama-3.1-Instruct (8B), Llama-2 (7B), and GPT2-L models with the RealToxicityPrompts, BOLD, and AttaQ benchmarks.
arXiv Detail & Related papers (2024-10-04T17:45:15Z) - ULLME: A Unified Framework for Large Language Model Embeddings with Generation-Augmented Learning [72.90823351726374]
We introduce the Unified framework for Large Language Model Embedding (ULLME), a flexible, plug-and-play implementation that enables bidirectional attention across various LLMs.
We also propose Generation-augmented Representation Learning (GRL), a novel fine-tuning method to boost LLMs for text embedding tasks.
To showcase our framework's flexibility and effectiveness, we release three pre-trained models from ULLME with different backbone architectures.
arXiv Detail & Related papers (2024-08-06T18:53:54Z) - LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders [34.421335513040795]
Large decoder-only language models (LLMs) are the state-of-the-art models on most of today's NLP tasks and benchmarks.
We introduce LLM2Vec, a simple unsupervised approach that can transform any decoder-only LLM into a strong text encoder.
arXiv Detail & Related papers (2024-04-09T02:51:05Z) - Alpaca against Vicuna: Using LLMs to Uncover Memorization of LLMs [61.04246774006429]
We introduce a black-box prompt optimization method that uses an attacker LLM agent to uncover higher levels of memorization in a victim agent.<n>We observe that our instruction-based prompts generate outputs with 23.7% higher overlap with training data compared to the baseline prefix-suffix measurements.<n>Our findings show that instruction-tuned models can expose pre-training data as much as their base-models, if not more so, and using instructions proposed by other LLMs can open a new avenue of automated attacks.
arXiv Detail & Related papers (2024-03-05T19:32:01Z) - Supervised Knowledge Makes Large Language Models Better In-context Learners [94.89301696512776]
Large Language Models (LLMs) exhibit emerging in-context learning abilities through prompt engineering.
The challenge of improving the generalizability and factuality of LLMs in natural language understanding and question answering remains under-explored.
We propose a framework that enhances the reliability of LLMs as it: 1) generalizes out-of-distribution data, 2) elucidates how LLMs benefit from discriminative models, and 3) minimizes hallucinations in generative tasks.
arXiv Detail & Related papers (2023-12-26T07:24:46Z) - Improving Open Information Extraction with Large Language Models: A
Study on Demonstration Uncertainty [52.72790059506241]
Open Information Extraction (OIE) task aims at extracting structured facts from unstructured text.
Despite the potential of large language models (LLMs) like ChatGPT as a general task solver, they lag behind state-of-the-art (supervised) methods in OIE tasks.
arXiv Detail & Related papers (2023-09-07T01:35:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.