Related papers: Glimpse: Enabling White-Box Methods to Use Proprietary Models for Zero-Shot LLM-Generated Text Detection

Glimpse: Enabling White-Box Methods to Use Proprietary Models for Zero-Shot LLM-Generated Text Detection

URL: http://arxiv.org/abs/2412.11506v2
Date: Wed, 19 Feb 2025 07:37:55 GMT
Title: Glimpse: Enabling White-Box Methods to Use Proprietary Models for Zero-Shot LLM-Generated Text Detection
Authors: Guangsheng Bao, Yanbin Zhao, Juncai He, Yue Zhang,
Abstract summary: **Glimpse** is a probability distribution estimation approach, predicting the full distributions from partial observations.<n>We extend white-box methods like Entropy, Rank, Log-Rank, and Fast-DetectGPT to latest proprietary models.<n>Experiments show that Glimpse with Fast-DetectGPT and GPT-3.5 achieves an average AUROC of about 0.95 in five latest source models.
Score: 15.902823469821431
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Advanced large language models (LLMs) can generate text almost indistinguishable from human-written text, highlighting the importance of LLM-generated text detection. However, current zero-shot techniques face challenges as white-box methods are restricted to use weaker open-source LLMs, and black-box methods are limited by partial observation from stronger proprietary LLMs. It seems impossible to enable white-box methods to use proprietary models because API-level access to the models neither provides full predictive distributions nor inner embeddings. To traverse the divide, we propose **Glimpse**, a probability distribution estimation approach, predicting the full distributions from partial observations. Despite the simplicity of Glimpse, we successfully extend white-box methods like Entropy, Rank, Log-Rank, and Fast-DetectGPT to latest proprietary models. Experiments show that Glimpse with Fast-DetectGPT and GPT-3.5 achieves an average AUROC of about 0.95 in five latest source models, improving the score by 51% relative to the remaining space of the open source baseline. It demonstrates that the latest LLMs can effectively detect their own outputs, suggesting that advanced LLMs may be the best shield against themselves. We release our code and data at https://github.com/baoguangsheng/glimpse.

Related papers

Learning on LLM Output Signatures for gray-box LLM Behavior Analysis [52.81120759532526]
Large Language Models (LLMs) have achieved widespread adoption, yet our understanding of their behavior remains limited. We develop a transformer-based approach to process that theoretically guarantees approximation of existing techniques. Our approach achieves superior performance on hallucination and data contamination detection in gray-box settings.
arXiv Detail & Related papers (2025-03-18T09:04:37Z)
Idiosyncrasies in Large Language Models [54.26923012617675]
We unveil and study idiosyncrasies in Large Language Models (LLMs) We find that fine-tuning existing text embedding models on LLM-generated texts yields excellent classification accuracy. We leverage LLM as judges to generate detailed, open-ended descriptions of each model's idiosyncrasies.
arXiv Detail & Related papers (2025-02-17T18:59:02Z)
FDLLM: A Text Fingerprint Detection Method for LLMs in Multi-Language, Multi-Domain Black-Box Environments [18.755880639770755]
Using large language models (LLMs) can lead to potential security risks. attackers may exploit this black-box scenario to deploy malicious models and embed viruses in the code provided to users. We propose the first LLMGT fingerprint detection model, textbfFDLLM, based on Qwen2.5-7B and fine-tuned using LoRA to address these challenges.
arXiv Detail & Related papers (2025-01-27T13:18:40Z)
A Watermark for Black-Box Language Models [31.772364827073808]
We propose a principled watermarking scheme that requires only the ability to sample sequences from the LLM. We provide performance guarantees, demonstrate how it can be leveraged when white-box access is available, and show when it can outperform existing white-box schemes via comprehensive experiments.
arXiv Detail & Related papers (2024-10-02T23:39:19Z)
DALD: Improving Logits-based Detector without Logits from Black-box LLMs [56.234109491884126]
Large Language Models (LLMs) have revolutionized text generation, producing outputs that closely mimic human writing. We present Distribution-Aligned LLMs Detection (DALD), an innovative framework that redefines the state-of-the-art performance in black-box text detection. DALD is designed to align the surrogate model's distribution with that of unknown target LLMs, ensuring enhanced detection capability and resilience against rapid model iterations.
arXiv Detail & Related papers (2024-06-07T19:38:05Z)
ReMoDetect: Reward Models Recognize Aligned LLM's Generations [55.06804460642062]
Large language models (LLMs) generate human-preferable texts. In this paper, we identify the common characteristics shared by these models. We propose two training schemes to further improve the detection ability of the reward model.
arXiv Detail & Related papers (2024-05-27T17:38:33Z)
Who Wrote This? The Key to Zero-Shot LLM-Generated Text Detection Is GECScore [51.65730053591696]
We propose a simple but effective black-box zero-shot detection approach. It is predicated on the observation that human-written texts typically contain more grammatical errors than LLM-generated texts. Our method achieves an average AUROC of 98.7% and shows strong robustness against paraphrase and adversarial perturbation attacks.
arXiv Detail & Related papers (2024-05-07T12:57:01Z)
Logits of API-Protected LLMs Leak Proprietary Information [46.014638838911566]
Large language model (LLM) providers often hide the architectural details and parameters of their proprietary models by restricting public access to a limited API. We show that it is possible to learn a surprisingly large amount of non-public information about an API-protected LLM from a relatively small number of API queries.
arXiv Detail & Related papers (2024-03-14T16:27:49Z)
DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature [143.5381108333212]
We show that text sampled from an large language model tends to occupy negative curvature regions of the model's log probability function. We then define a new curvature-based criterion for judging if a passage is generated from a given LLM. We find DetectGPT is more discriminative than existing zero-shot methods for model sample detection.
arXiv Detail & Related papers (2023-01-26T18:44:06Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.