DALD: Improving Logits-based Detector without Logits from Black-box LLMs
- URL: http://arxiv.org/abs/2406.05232v4
- Date: Sun, 27 Oct 2024 09:55:40 GMT
- Title: DALD: Improving Logits-based Detector without Logits from Black-box LLMs
- Authors: Cong Zeng, Shengkun Tang, Xianjun Yang, Yuanzhou Chen, Yiyou Sun, zhiqiang xu, Yao Li, Haifeng Chen, Wei Cheng, Dongkuan Xu,
- Abstract summary: Large Language Models (LLMs) have revolutionized text generation, producing outputs that closely mimic human writing.
We present Distribution-Aligned LLMs Detection (DALD), an innovative framework that redefines the state-of-the-art performance in black-box text detection.
DALD is designed to align the surrogate model's distribution with that of unknown target LLMs, ensuring enhanced detection capability and resilience against rapid model iterations.
- Score: 56.234109491884126
- License:
- Abstract: The advent of Large Language Models (LLMs) has revolutionized text generation, producing outputs that closely mimic human writing. This blurring of lines between machine- and human-written text presents new challenges in distinguishing one from the other a task further complicated by the frequent updates and closed nature of leading proprietary LLMs. Traditional logits-based detection methods leverage surrogate models for identifying LLM-generated content when the exact logits are unavailable from black-box LLMs. However, these methods grapple with the misalignment between the distributions of the surrogate and the often undisclosed target models, leading to performance degradation, particularly with the introduction of new, closed-source models. Furthermore, while current methodologies are generally effective when the source model is identified, they falter in scenarios where the model version remains unknown, or the test set comprises outputs from various source models. To address these limitations, we present Distribution-Aligned LLMs Detection (DALD), an innovative framework that redefines the state-of-the-art performance in black-box text detection even without logits from source LLMs. DALD is designed to align the surrogate model's distribution with that of unknown target LLMs, ensuring enhanced detection capability and resilience against rapid model iterations with minimal training investment. By leveraging corpus samples from publicly accessible outputs of advanced models such as ChatGPT, GPT-4 and Claude-3, DALD fine-tunes surrogate models to synchronize with unknown source model distributions effectively.
Related papers
- SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration [10.970637831760136]
Speculative decoding (SD) has emerged as a widely used paradigm to accelerate the inference of large language models (LLMs)
We introduce SWIFT, an on-the-fly self-speculative decoding algorithm that adaptively selects intermediate layers of LLMs to skip during inference.
We show that SWIFT can achieve over a 1.3x-1.6x speedup while preserving the original distribution of the generated text.
arXiv Detail & Related papers (2024-10-09T14:15:30Z) - Large Language Models can be Strong Self-Detoxifiers [82.6594169242814]
Self-disciplined Autoregressive Sampling (SASA) is a lightweight controlled decoding algorithm for toxicity reduction of large language models (LLMs)
SASA tracks the margin of the current output to steer the generation away from the toxic subspace, by adjusting the autoregressive sampling strategy.
evaluated on LLMs of different scale and nature, namely Llama-3.1-Instruct (8B), Llama-2 (7B), and GPT2-L models with the RealToxicityPrompts, BOLD, and AttaQ benchmarks.
arXiv Detail & Related papers (2024-10-04T17:45:15Z) - DECIDER: Leveraging Foundation Model Priors for Improved Model Failure Detection and Explanation [18.77296551727931]
We propose DECIDER, a novel approach that leverages priors from large language models (LLMs) and vision-language models (VLMs) to detect failures in image models.
DECIDER consistently achieves state-of-the-art failure detection performance, significantly outperforming baselines in terms of the overall Matthews correlation coefficient.
arXiv Detail & Related papers (2024-08-01T07:08:11Z) - Exploring Automatic Cryptographic API Misuse Detection in the Era of LLMs [60.32717556756674]
This paper introduces a systematic evaluation framework to assess Large Language Models in detecting cryptographic misuses.
Our in-depth analysis of 11,940 LLM-generated reports highlights that the inherent instabilities in LLMs can lead to over half of the reports being false positives.
The optimized approach achieves a remarkable detection rate of nearly 90%, surpassing traditional methods and uncovering previously unknown misuses in established benchmarks.
arXiv Detail & Related papers (2024-07-23T15:31:26Z) - Unlocking the Potential of Model Merging for Low-Resource Languages [66.7716891808697]
Adapting large language models to new languages typically involves continual pre-training (CT) followed by supervised fine-tuning (SFT)
We propose model merging as an alternative for low-resource languages, combining models with distinct capabilities into a single model without additional training.
Experiments based on Llama-2-7B demonstrate that model merging effectively endows LLMs for low-resource languages with task-solving abilities, outperforming CT-then-SFT in scenarios with extremely scarce data.
arXiv Detail & Related papers (2024-07-04T15:14:17Z) - A Fingerprint for Large Language Models [10.63985246068255]
We propose a novel black-box fingerprinting technique for large language models (LLMs)
Experimental results indicate that the proposed technique achieves superior performance in ownership verification and robustness against PEFT attacks.
arXiv Detail & Related papers (2024-07-01T12:25:42Z) - Preserving Knowledge in Large Language Model with Model-Agnostic Self-Decompression [40.4998607679863]
Large Language Models (LLMs) often suffer from catastrophic forgetting when post-pretrained or supervised fine-tuned (SFT) on domain-specific data.
This paper focuses on TG-SFT, which can synthetically generate SFT data for the instruction tuning steps.
arXiv Detail & Related papers (2024-06-17T09:17:40Z) - ReMoDetect: Reward Models Recognize Aligned LLM's Generations [55.06804460642062]
Large language models (LLMs) generate human-preferable texts.
In this paper, we identify the common characteristics shared by these models.
We propose two training schemes to further improve the detection ability of the reward model.
arXiv Detail & Related papers (2024-05-27T17:38:33Z) - Learnable Linguistic Watermarks for Tracing Model Extraction Attacks on Large Language Models [20.44680783275184]
Current watermarking techniques against model extraction attacks rely on signal insertion in model logits or post-processing of generated text.
We propose a novel method for embedding learnable linguistic watermarks in Large Language Models (LLMs)
Our approach subtly modifies the LLM's output distribution by introducing controlled noise into token frequency distributions, embedding a statistically identifiable watermark.
arXiv Detail & Related papers (2024-04-28T14:45:53Z) - Self-correcting LLM-controlled Diffusion Models [83.26605445217334]
We introduce Self-correcting LLM-controlled Diffusion (SLD)
SLD is a framework that generates an image from the input prompt, assesses its alignment with the prompt, and performs self-corrections on the inaccuracies in the generated image.
Our approach can rectify a majority of incorrect generations, particularly in generative numeracy, attribute binding, and spatial relationships.
arXiv Detail & Related papers (2023-11-27T18:56:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.