Rerouting LLM Routers
- URL: http://arxiv.org/abs/2501.01818v1
- Date: Fri, 03 Jan 2025 14:03:14 GMT
- Title: Rerouting LLM Routers
- Authors: Avital Shafran, Roei Schuster, Thomas Ristenpart, Vitaly Shmatikov
- Abstract summary: LLM routers balance quality and cost of generation by classifying queries and routing them to a cheaper or more expensive LLM depending on their complexity.
In this paper, we investigate routers' adversarial robustness.
- Score: 27.16232746301828
- License:
- Abstract: LLM routers aim to balance quality and cost of generation by classifying queries and routing them to a cheaper or more expensive LLM depending on their complexity. Routers represent one type of what we call LLM control planes: systems that orchestrate use of one or more LLMs. In this paper, we investigate routers' adversarial robustness. We first define LLM control plane integrity, i.e., robustness of LLM orchestration to adversarial inputs, as a distinct problem in AI safety. Next, we demonstrate that an adversary can generate query-independent token sequences we call "confounder gadgets" that, when added to any query, cause LLM routers to send the query to a strong LLM. Our quantitative evaluation shows that this attack is successful both in white-box and black-box settings against a variety of open-source and commercial routers, and that confounding queries do not affect the quality of LLM responses. Finally, we demonstrate that gadgets can be effective while maintaining low perplexity, so perplexity-based filtering is not an effective defense. We finish by investigating alternative defenses.
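To make the attack surface concrete, here is a minimal, illustrative sketch of the confounder-gadget idea. The router below is a toy heuristic complexity scorer, and the vocabulary and hill-climbing search are assumptions made for illustration; none of this reproduces the paper's actual gadget-optimization procedure or the routers it evaluates.

```python
# Toy sketch of a "confounder gadget" attack on an LLM router.
# The router here is a heuristic stand-in, NOT a router from the paper;
# the vocabulary and greedy search are illustrative assumptions.
import random

def toy_router_score(query: str) -> float:
    """Hypothetical router: returns a 'complexity' score in [0, 1].
    Queries above the threshold are routed to the strong (expensive) LLM."""
    hard_markers = ["prove", "derive", "theorem", "complexity", "formally", "rigorous"]
    hits = sum(query.lower().count(m) for m in hard_markers)
    return min(1.0, 0.1 + 0.15 * hits)

STRONG_THRESHOLD = 0.5
VOCAB = ["prove", "derive", "theorem", "lemma", "formally", "rigorous", "hence", "thus"]

def optimize_gadget(num_tokens: int = 4, iters: int = 200, seed: int = 0) -> str:
    """White-box-style search: greedily mutate a fixed token sequence so that the
    router's score is maximized independently of any particular query."""
    rng = random.Random(seed)
    gadget = [rng.choice(VOCAB) for _ in range(num_tokens)]
    probe_queries = ["What is 2+2?", "Name the capital of France."]  # easy probes

    def avg_score(tokens):
        prefix = " ".join(tokens) + " "
        return sum(toy_router_score(prefix + q) for q in probe_queries) / len(probe_queries)

    best = avg_score(gadget)
    for _ in range(iters):
        cand = gadget.copy()
        cand[rng.randrange(num_tokens)] = rng.choice(VOCAB)  # mutate one token
        score = avg_score(cand)
        if score > best:  # keep mutations that push the router toward the strong model
            gadget, best = cand, score
    return " ".join(gadget)

if __name__ == "__main__":
    gadget = optimize_gadget()
    query = "What is 2+2?"
    confounded = f"{gadget} {query}"
    print("plain  ->", "strong" if toy_router_score(query) > STRONG_THRESHOLD else "weak")
    print("gadget ->", "strong" if toy_router_score(confounded) > STRONG_THRESHOLD else "weak")
```

The point of the sketch is only that a fixed, query-independent prefix can push a router's score past its strong-model threshold; the paper optimizes gadgets against real learned routers in both white-box and black-box settings.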
Related papers
- Universal Model Routing for Efficient LLM Inference [72.65083061619752]
We consider the problem of dynamic routing, where new, previously unobserved LLMs are available at test time.
We propose a new approach to this problem that relies on representing each LLM as a feature vector, derived based on predictions on a set of representative prompts.
We prove that these strategies are estimates of a theoretically optimal routing rule, and provide an excess risk bound to quantify their errors.
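Read at the abstract level, one plausible instantiation is cluster-based: embed representative prompts, cluster them, and describe each LLM by its estimated quality per cluster, so that a previously unseen LLM can be added at test time simply by supplying its feature vector. The sketch below follows that reading; the clustering, quality estimates, and cost weight are illustrative assumptions, not the paper's estimator or risk bound.

```python
# Illustrative sketch of routing with per-LLM feature vectors (an assumption-laden
# reading of the abstract). Each LLM is described by its observed quality on clusters
# of representative prompts; a new prompt is assigned to its nearest cluster and sent
# to the LLM with the best quality/cost trade-off.
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 3 prompt clusters in a 4-d embedding space (stand-in for real embeddings).
cluster_centroids = rng.normal(size=(3, 4))

# Feature vector per LLM: estimated quality on each cluster, plus a per-query cost.
# All numbers are made up for illustration.
llms = {
    "cheap-llm":  {"quality": np.array([0.85, 0.40, 0.55]), "cost": 1.0},
    "strong-llm": {"quality": np.array([0.90, 0.80, 0.88]), "cost": 8.0},
}

def route(prompt_embedding: np.ndarray, cost_weight: float = 0.02) -> str:
    """Pick the LLM maximizing estimated quality minus a cost penalty,
    for the cluster the prompt falls into."""
    cluster = int(np.argmin(np.linalg.norm(cluster_centroids - prompt_embedding, axis=1)))
    scores = {name: m["quality"][cluster] - cost_weight * m["cost"] for name, m in llms.items()}
    return max(scores, key=scores.get)

# A previously unseen LLM can be added at test time just by supplying its feature vector.
llms["new-llm"] = {"quality": np.array([0.70, 0.75, 0.60]), "cost": 3.0}

print(route(rng.normal(size=4)))
```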
arXiv Detail & Related papers (2025-02-12T20:30:28Z)
- QROA: A Black-Box Query-Response Optimization Attack on LLMs [2.7624021966289605]
Large Language Models (LLMs) have surged in popularity in recent months, yet they can be manipulated into generating harmful content.
This study introduces the Query-Response Optimization Attack (QROA), an optimization-based strategy designed to exploit LLMs through a black-box, query-only interaction.
arXiv Detail & Related papers (2024-06-04T07:27:36Z)
- MoGU: A Framework for Enhancing Safety of Open-Sourced LLMs While Preserving Their Usability [25.750371424096436]
Large Language Models (LLMs) are increasingly deployed in various applications.
Our research finds that existing defense strategies lead LLMs to predominantly adopt a rejection-oriented stance.
We introduce the MoGU framework, designed to enhance LLMs' safety while preserving their usability.
arXiv Detail & Related papers (2024-05-23T12:19:59Z)
- Towards Efficient LLM Grounding for Embodied Multi-Agent Collaboration [70.09561665520043]
We propose a novel framework for multi-agent collaboration that introduces Reinforced Advantage feedback (ReAd) for efficient self-refinement of plans.
We provide theoretical analysis by extending advantage-weighted regression in reinforcement learning to multi-agent systems.
Experiments on Overcooked-AI and a difficult variant of RoCoBench show that ReAd surpasses baselines in success rate and significantly decreases the number of interaction steps agents require.
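For context, the single-agent advantage-weighted regression objective the summary refers to is shown below; how ReAd extends the advantage to the multi-agent setting is left to the paper and only gestured at in the comment.

```latex
% Single-agent advantage-weighted regression (AWR) objective; ReAd's multi-agent
% extension (agent-wise local advantages) is an assumption not reproduced here.
\[
  \max_{\pi}\; \mathbb{E}_{(s,a)\sim\mathcal{D}}
  \Big[\, \log \pi(a \mid s)\, \exp\!\big(A(s,a)/\beta\big) \,\Big],
  \qquad A(s,a) = Q(s,a) - V(s).
\]
```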
arXiv Detail & Related papers (2024-05-23T08:33:19Z)
- Prompt Leakage effect and defense strategies for multi-turn LLM interactions [95.33778028192593]
Leakage of system prompts may compromise intellectual property and act as adversarial reconnaissance for an attacker.
We design a unique threat model which leverages the LLM sycophancy effect and elevates the average attack success rate (ASR) from 17.7% to 86.2% in a multi-turn setting.
We measure the mitigation effect of 7 black-box defense strategies, along with finetuning an open-source model to defend against leakage attempts.
arXiv Detail & Related papers (2024-04-24T23:39:58Z)
- LLMs Can't Plan, But Can Help Planning in LLM-Modulo Frameworks [18.068035947969044]
There is considerable confusion about the role of Large Language Models (LLMs) in planning and reasoning tasks.
We argue that auto-regressive LLMs cannot, by themselves, do planning or self-verification.
We present a vision of LLM-Modulo Frameworks that combine the strengths of LLMs with external model-based verifiers.
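A minimal sketch of the generate-test loop such a framework implies, with toy stand-ins for the LLM and the external verifier; the plan format and critique handling are assumptions, not the authors' implementation.

```python
# Toy generate-test loop in the spirit of an LLM-Modulo framework: the LLM proposes
# candidate plans, an external sound verifier checks them, and verifier critiques are
# fed back into the next round. All components here are illustrative stand-ins.
from typing import List, Optional, Tuple

def llm_propose(task: str, critiques: List[str]) -> List[str]:
    """Toy 'LLM': returns canned candidate plans; a real system would prompt an LLM
    with the task plus the accumulated critiques."""
    return ["pick up block A", "stack B on A", "pick up block A; stack A on B"]

def verify(plan: str) -> Tuple[bool, str]:
    """Toy external verifier (a model-based plan validator in the real framework):
    accepts only plans that achieve the goal on(A, B)."""
    ok = "stack A on B" in plan
    return ok, "" if ok else "plan does not achieve goal on(A, B)"

def llm_modulo(task: str, max_rounds: int = 5) -> Optional[str]:
    critiques: List[str] = []
    for _ in range(max_rounds):
        for plan in llm_propose(task, critiques):
            ok, critique = verify(plan)
            if ok:
                return plan          # only verifier-certified plans are returned
            critiques.append(critique)
    return None                      # no certified plan found within the budget

print(llm_modulo("achieve on(A, B)"))
```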
arXiv Detail & Related papers (2024-02-02T14:43:18Z)
- MART: Improving LLM Safety with Multi-round Automatic Red-Teaming [72.2127916030909]
We propose a Multi-round Automatic Red-Teaming (MART) method, which incorporates both automatic adversarial prompt writing and safe response generation.
On adversarial prompt benchmarks, the violation rate of an LLM with limited safety alignment drops by up to 84.7% after 4 rounds of MART.
Notably, model helpfulness on non-adversarial prompts remains stable throughout iterations, indicating the target LLM maintains strong performance on instruction following.
arXiv Detail & Related papers (2023-11-13T19:13:29Z)
- Attack Prompt Generation for Red Teaming and Defending Large Language Models [70.157691818224]
Large language models (LLMs) are susceptible to red teaming attacks, which can induce LLMs to generate harmful content.
We propose an integrated approach that combines manual and automatic methods to economically generate high-quality attack prompts.
arXiv Detail & Related papers (2023-10-19T06:15:05Z)
- Red Teaming Language Model Detectors with Language Models [114.36392560711022]
Large language models (LLMs) present significant safety and ethical risks if exploited by malicious users.
Recent works have proposed algorithms to detect LLM-generated text and protect LLMs.
We study two types of attack strategies: 1) replacing certain words in an LLM's output with their synonyms given the context; 2) automatically searching for an instructional prompt to alter the writing style of the generation.
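A hedged sketch of the first strategy (context-aware synonym replacement): greedily keep a substitution only if it lowers the detector's score. The detector and synonym table below are toy stand-ins for the real detector and substitution models.

```python
# Toy sketch of the word-substitution attack idea: replace words with synonyms
# whenever the replacement lowers a detector's "machine-generated" score.
# The detector and synonym table are illustrative stand-ins, not the paper's models.
TOY_SYNONYMS = {
    "utilize": "use",
    "additionally": "also",
    "demonstrate": "show",
    "numerous": "many",
}

def toy_detector_score(text: str) -> float:
    """Toy detector: flags texts that use stereotypically 'LLM-ish' vocabulary."""
    flagged = sum(1 for w in TOY_SYNONYMS if w in text.lower())
    return min(1.0, 0.2 + 0.25 * flagged)

def evade(text: str) -> str:
    words = text.split()
    for i, w in enumerate(words):
        key = w.lower().strip(".,")
        if key in TOY_SYNONYMS:
            candidate = words.copy()
            candidate[i] = TOY_SYNONYMS[key]
            # keep the substitution only if it actually lowers the detector score
            if toy_detector_score(" ".join(candidate)) < toy_detector_score(" ".join(words)):
                words = candidate
    return " ".join(words)

sample = "We utilize numerous tools and additionally demonstrate strong results."
print(toy_detector_score(sample), "->", toy_detector_score(evade(sample)))
```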
arXiv Detail & Related papers (2023-05-31T10:08:37Z)
- Large Language Model Is Not a Good Few-shot Information Extractor, but a Good Reranker for Hard Samples! [43.51393135075126]
Large Language Models (LLMs) have made remarkable strides in various tasks.
We show that current advanced LLMs consistently exhibit inferior performance, higher latency, and increased budget requirements compared to fine-tuned SLMs.
We propose an adaptive filter-then-rerank paradigm to combine the strengths of LLMs and SLMs.
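A hedged sketch of what such an adaptive filter-then-rerank pipeline can look like: the SLM's confidence decides whether a sample is hard, and only hard samples are deferred to the LLM, which reranks the SLM's top-k candidate labels. The threshold, labels, and both models are toy assumptions.

```python
# Toy adaptive filter-then-rerank pipeline: a small fine-tuned model (SLM) keeps the
# easy samples it is confident about, and only low-confidence ("hard") samples are
# handed to an LLM, which reranks the SLM's top-k candidates. Both models are stand-ins.
from typing import Dict, List

def slm_predict(sample: str) -> Dict[str, float]:
    """Toy SLM: returns label probabilities (a fine-tuned extractor in the real pipeline)."""
    if "inc" in sample.lower():
        return {"PERSON": 0.48, "ORG": 0.42, "LOC": 0.10}
    return {"PERSON": 0.95, "ORG": 0.03, "LOC": 0.02}

def llm_rerank(sample: str, candidates: List[str]) -> str:
    """Toy LLM reranker: in the real pipeline this is a few-shot prompted LLM
    choosing among the SLM's top-k candidate labels."""
    return "ORG" if "inc" in sample.lower() else candidates[0]

def filter_then_rerank(sample: str, threshold: float = 0.7, k: int = 3) -> str:
    probs = slm_predict(sample)
    top = sorted(probs, key=probs.get, reverse=True)[:k]
    if probs[top[0]] >= threshold:          # easy sample: trust the cheap SLM
        return top[0]
    return llm_rerank(sample, top)          # hard sample: defer to the LLM reranker

print(filter_then_rerank("Alice went home"))      # handled by the SLM
print(filter_then_rerank("Acme Inc announced"))   # deferred to the LLM
```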
arXiv Detail & Related papers (2023-03-15T12:20:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.