Alignment-Aware Model Extraction Attacks on Large Language Models
        - URL: http://arxiv.org/abs/2409.02718v1
 - Date: Wed, 4 Sep 2024 13:54:38 GMT
 - Title: Alignment-Aware Model Extraction Attacks on Large Language Models
 - Authors: Zi Liang, Qingqing Ye, Yanyun Wang, Sen Zhang, Yaxin Xiao, Ronghua Li, Jianliang Xu, Haibo Hu, 
 - Abstract summary: We present Locality Reinforced Distillation (LoRD), a novel model extraction attack algorithm specifically for large language models (LLMs)
In particular, we design a policy-gradient-style training task, which utilizes victim models' responses as a signal to guide the crafting of preference for the local model.
LoRD can reduce query complexity while mitigating watermark protection through exploration-based stealing.
 - Score: 23.79690793366511
 - License: http://creativecommons.org/licenses/by/4.0/
 - Abstract:   Model extraction attacks (MEAs) on large language models (LLMs) have received increasing research attention lately. Existing attack methods on LLMs inherit the extraction strategies from those designed for deep neural networks (DNNs) yet neglect the inconsistency of training tasks between MEA and LLMs' alignments. As such, they result in poor attack performances. To tackle this issue, we present Locality Reinforced Distillation (LoRD), a novel model extraction attack algorithm specifically for LLMs. In particular, we design a policy-gradient-style training task, which utilizes victim models' responses as a signal to guide the crafting of preference for the local model. Theoretical analysis has shown that i) LoRD's convergence procedure in MEAs is consistent with the alignments of LLMs, and ii) LoRD can reduce query complexity while mitigating watermark protection through exploration-based stealing. Extensive experiments on domain-specific extractions demonstrate the superiority of our method by examining the extraction of various state-of-the-art commercial LLMs. 
 
       
      
        Related papers
        - Agentic Reinforced Policy Optimization [66.96989268893932]
Large-scale reinforcement learning with verifiable rewards (RLVR) has demonstrated its effectiveness in harnessing the potential of large language models (LLMs) for single-turn reasoning tasks.<n>Current RL algorithms inadequately balance the models' intrinsic long-horizon reasoning capabilities and their proficiency in multi-turn tool interactions.<n>We propose Agentic Reinforced Policy Optimization (ARPO), a novel agentic RL algorithm tailored for training multi-turn LLM-based agents.
arXiv  Detail & Related papers  (2025-07-26T07:53:11Z) - LongLLaDA: Unlocking Long Context Capabilities in Diffusion LLMs [63.580867975515474]
We present the first systematic investigation comparing the long-context performance of diffusion LLMs and traditional auto-regressive LLMs.<n>We propose LongLLaDA, a training-free method that integrates LLaDA with the NTK-based RoPE extrapolation.
arXiv  Detail & Related papers  (2025-06-17T11:45:37Z) - Vad-R1: Towards Video Anomaly Reasoning via Perception-to-Cognition   Chain-of-Thought [58.321044666612174]
Vad-R1 is an end-to-end MLLM-based framework for Video Anomaly Reasoning.<n>We design a Perception-to-Cognition Chain-of-Thought (P2C-CoT) that simulates the human process of recognizing anomalies.<n>We also propose an improved reinforcement learning algorithm AVA-GRPO, which explicitly incentivizes the anomaly reasoning capability of MLLMs.
arXiv  Detail & Related papers  (2025-05-26T12:05:16Z) - R1-Searcher: Incentivizing the Search Capability in LLMs via   Reinforcement Learning [87.30285670315334]
textbfR1-Searcher is a novel two-stage outcome-based RL approach designed to enhance the search capabilities of Large Language Models.
Our framework relies exclusively on RL, without requiring process rewards or distillation for a cold start.
Our experiments demonstrate that our method significantly outperforms previous strong RAG methods, even when compared to the closed-source GPT-4o-mini.
arXiv  Detail & Related papers  (2025-03-07T17:14:44Z) - Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM   Reasoning via Autoregressive Search [57.28671084993782]
Large language models (LLMs) have demonstrated remarkable reasoning capabilities across diverse domains.
Recent studies have shown that increasing test-time computation enhances LLMs' reasoning capabilities.
We propose a two-stage training paradigm: 1) a small-scale format tuning stage to internalize the COAT reasoning format and 2) a large-scale self-improvement stage leveraging reinforcement learning.
arXiv  Detail & Related papers  (2025-02-04T17:26:58Z) - LLMQuoter: Enhancing RAG Capabilities Through Efficient Quote Extraction   From Large Contexts [2.685668802278156]
We introduce LLMQuoter, a lightweight, distillation-based model designed to enhance Retrieval Augmented Generation (RAG)
Built on the LLaMA-3B architecture and fine-tuned with Low-Rank Adaptation (LoRA) on a 15,000-sample subset of HotpotQA, LLMQuoter adopts a "quote-first-then-answer" strategy, efficiently identifying key quotes before passing curated snippets to reasoning models.
This workflow reduces cognitive overhead and outperforms full-context approaches like Retrieval-Augmented Fine-Tuning (RAFT), achieving over 20-point accuracy gains across both small and large language
arXiv  Detail & Related papers  (2025-01-09T20:01:15Z) - Attention Tracker: Detecting Prompt Injection Attacks in LLMs [62.247841717696765]
Large Language Models (LLMs) have revolutionized various domains but remain vulnerable to prompt injection attacks.
We introduce the concept of the distraction effect, where specific attention heads shift focus from the original instruction to the injected instruction.
We propose Attention Tracker, a training-free detection method that tracks attention patterns on instruction to detect prompt injection attacks.
arXiv  Detail & Related papers  (2024-11-01T04:05:59Z) - Reinforcement Learning for Aligning Large Language Models Agents with   Interactive Environments: Quantifying and Mitigating Prompt Overfitting [40.78026627009521]
Reinforcement learning (RL) is a promising approach for aligning large language models (LLMs) knowledge with sequential decision-making tasks.
We propose a novel framework to analyze the sensitivity of LLMs to prompt formulations following RL training in a textual environment.
arXiv  Detail & Related papers  (2024-10-25T18:25:35Z) - EVOLvE: Evaluating and Optimizing LLMs For Exploration [76.66831821738927]
Large language models (LLMs) remain under-studied in scenarios requiring optimal decision-making under uncertainty.
We measure LLMs' (in)ability to make optimal decisions in bandits, a state-less reinforcement learning setting relevant to many applications.
Motivated by the existence of optimal exploration algorithms, we propose efficient ways to integrate this algorithmic knowledge into LLMs.
arXiv  Detail & Related papers  (2024-10-08T17:54:03Z) - A Fingerprint for Large Language Models [10.63985246068255]
We propose a novel black-box fingerprinting technique for large language models (LLMs)
 Experimental results indicate that the proposed technique achieves superior performance in ownership verification and robustness against PEFT attacks.
arXiv  Detail & Related papers  (2024-07-01T12:25:42Z) - Intermediate Distillation: Data-Efficient Distillation from Black-Box   LLMs for Information Retrieval [7.441679541836913]
textit Intermediate Distillation treats large language models as black boxes and distills their knowledge via an innovative LLM-ranker-retriever pipeline.
Our proposed method can significantly improve the performance of retriever models with only 1,000 training instances.
arXiv  Detail & Related papers  (2024-06-18T00:41:41Z) - Delta-CoMe: Training-Free Delta-Compression with Mixed-Precision for   Large Language Models [79.46938238953916]
Fine-tuning large language models (LLMs) to diverse applications is crucial to meet complex demands.
Recent studies suggest decomposing a fine-tuned LLM into a base model and corresponding delta weights, which are then compressed using low-rank or low-bit approaches to reduce costs.
In this work, we observe that existing low-rank and low-bit compression methods can significantly harm the model performance for task-specific fine-tuned LLMs.
arXiv  Detail & Related papers  (2024-06-13T07:57:27Z) - Defending Large Language Models Against Attacks With Residual Stream   Activation Analysis [0.0]
Large Language Models (LLMs) are vulnerable to adversarial threats.
This paper presents an innovative defensive strategy, given white box access to an LLM.
We apply a novel methodology for analyzing distinctive activation patterns in the residual streams for attack prompt classification.
arXiv  Detail & Related papers  (2024-06-05T13:06:33Z) - From Words to Actions: Unveiling the Theoretical Underpinnings of   LLM-Driven Autonomous Systems [59.40480894948944]
Large language model (LLM) empowered agents are able to solve decision-making problems in the physical world.
Under this model, the LLM Planner navigates a partially observable Markov decision process (POMDP) by iteratively generating language-based subgoals via prompting.
We prove that the pretrained LLM Planner effectively performs Bayesian aggregated imitation learning (BAIL) through in-context learning.
arXiv  Detail & Related papers  (2024-05-30T09:42:54Z) - Improve Temporal Awareness of LLMs for Sequential Recommendation [61.723928508200196]
Large language models (LLMs) have demonstrated impressive zero-shot abilities in solving a wide range of general-purpose tasks.
LLMs fall short in recognizing and utilizing temporal information, rendering poor performance in tasks that require an understanding of sequential data.
We propose three prompting strategies to exploit temporal information within historical interactions for LLM-based sequential recommendation.
arXiv  Detail & Related papers  (2024-05-05T00:21:26Z) - Unveiling the Misuse Potential of Base Large Language Models via   In-Context Learning [61.2224355547598]
Open-sourcing of large language models (LLMs) accelerates application development, innovation, and scientific progress.
Our investigation exposes a critical oversight in this belief.
By deploying carefully designed demonstrations, our research demonstrates that base LLMs could effectively interpret and execute malicious instructions.
arXiv  Detail & Related papers  (2024-04-16T13:22:54Z) - How Can LLM Guide RL? A Value-Based Approach [68.55316627400683]
Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback.
Recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities.
We develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning.
arXiv  Detail & Related papers  (2024-02-25T20:07:13Z) - Semantically Aligned Task Decomposition in Multi-Agent Reinforcement
  Learning [56.26889258704261]
We propose a novel "disentangled" decision-making method, Semantically Aligned task decomposition in MARL (SAMA)
SAMA prompts pretrained language models with chain-of-thought that can suggest potential goals, provide suitable goal decomposition and subgoal allocation as well as self-reflection-based replanning.
SAMA demonstrates considerable advantages in sample efficiency compared to state-of-the-art ASG methods.
arXiv  Detail & Related papers  (2023-05-18T10:37:54Z) - On Extracting Specialized Code Abilities from Large Language Models: A
  Feasibility Study [22.265542509143756]
We investigate the feasibility of launching imitation attacks on large language models (LLMs)
We show that attackers can train a medium-sized backbone model to replicate specialized code behaviors similar to the target LLMs.
arXiv  Detail & Related papers  (2023-03-06T10:34:41Z) 
        This list is automatically generated from the titles and abstracts of the papers in this site.
       
     
           This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.