MeaeQ: Mount Model Extraction Attacks with Efficient Queries
- URL: http://arxiv.org/abs/2310.14047v1
- Date: Sat, 21 Oct 2023 16:07:16 GMT
- Title: MeaeQ: Mount Model Extraction Attacks with Efficient Queries
- Authors: Chengwei Dai, Minxuan Lv, Kun Li, Wei Zhou
- Abstract summary: We study model extraction attacks in natural language processing (NLP).
We propose MeaeQ, a straightforward yet effective method to address these issues.
MeaeQ achieves higher functional similarity to the victim model than baselines while requiring fewer queries.
- Score: 6.1106195466129485
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study model extraction attacks in natural language processing (NLP) where
attackers aim to steal victim models by repeatedly querying the open
Application Programming Interfaces (APIs). Recent works focus on limited-query
budget settings and adopt random sampling or active learning-based sampling
strategies on publicly available, unannotated data sources. However, these
methods often result in selected queries that lack task relevance and data
diversity, leading to limited success in achieving satisfactory results with
low query costs. In this paper, we propose MeaeQ (Model extraction attack with
efficient Queries), a straightforward yet effective method to address these
issues. Specifically, we initially utilize a zero-shot sequence inference
classifier, combined with API service information, to filter task-relevant data
from a public text corpus instead of a problem domain-specific dataset.
Furthermore, we employ a clustering-based data reduction technique to obtain
representative data as queries for the attack. Extensive experiments conducted
on four benchmark datasets demonstrate that MeaeQ achieves higher functional
similarity to the victim model than baselines while requiring fewer queries.
Our code is available at https://github.com/C-W-D/MeaeQ.
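The two stages the abstract describes translate naturally into code. Below is a minimal sketch, not the authors' implementation (that is in the linked repository): the zero-shot model, the candidate labels, the TF-IDF features, and the 0.8 threshold are all illustrative assumptions standing in for the paper's actual configuration.
```python
# Hypothetical sketch of MeaeQ-style query selection; the official code is at
# https://github.com/C-W-D/MeaeQ. Model, labels, and threshold are assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import pipeline

# Step 1: zero-shot filtering of task-relevant text from a public corpus.
# The candidate labels stand in for the "API service information"
# (here: pretend the victim API is a movie-review sentiment classifier).
zero_shot = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

def filter_task_relevant(corpus, threshold=0.8):
    """Keep texts the zero-shot classifier judges task-relevant."""
    labels = ["movie review", "unrelated text"]
    kept = []
    for text in corpus:
        out = zero_shot(text, labels)
        if out["labels"][0] == "movie review" and out["scores"][0] >= threshold:
            kept.append(text)
    return kept

# Step 2: clustering-based reduction -- one representative per cluster,
# so the attacker spends the query budget on diverse, non-redundant inputs.
def reduce_to_queries(texts, n_queries=200):
    """Pick n_queries cluster representatives (assumes len(texts) >= n_queries)."""
    vecs = TfidfVectorizer(max_features=5000).fit_transform(texts).toarray()
    km = KMeans(n_clusters=n_queries, n_init=10, random_state=0).fit(vecs)
    queries = []
    for c in range(n_queries):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(vecs[members] - km.cluster_centers_[c], axis=1)
        queries.append(texts[members[np.argmin(dists)]])  # closest to centroid
    return queries
```
The point of step 2 is that, at a fixed budget, cluster representatives give the substitute model more diverse supervision than random draws from the filtered pool.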
Related papers
- Exploring Query Efficient Data Generation towards Data-free Model Stealing in Hard Label Setting [38.755154033324374]
Data-free model stealing involves replicating the functionality of a target model into a substitute model without accessing the target model's structure, parameters, or training data.
This paper presents a new data-free model stealing approach called Query Efficient Data Generation (QEDG).
We introduce two distinct loss functions to ensure the generation of sufficient samples that closely and uniformly align with the target model's decision boundary.
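The summary does not give the loss formulations, so the following PyTorch sketch shows one plausible pair consistent with the description; both terms are assumptions, not the paper's equations. One pulls generated samples toward the substitute model's decision boundary, the other pushes samples apart so they cover the boundary more uniformly.
```python
# Hypothetical losses in the spirit of QEDG's description (the exact
# formulation is in the paper; these two terms are assumptions).
import torch
import torch.nn.functional as F

def boundary_closeness_loss(logits):
    """Push samples toward the decision boundary: shrink the gap
    between the top-2 class probabilities."""
    probs = F.softmax(logits, dim=-1)
    top2 = probs.topk(2, dim=-1).values          # (batch, 2)
    return (top2[:, 0] - top2[:, 1]).mean()

def diversity_loss(features):
    """Encourage uniform coverage: penalize pairs of generated samples
    that sit too close together in feature space."""
    sim = F.cosine_similarity(features.unsqueeze(1), features.unsqueeze(0), dim=-1)
    off_diag = sim - torch.eye(sim.size(0), device=sim.device)  # zero self-sim
    return off_diag.max(dim=-1).values.mean()

# usage: total = boundary_closeness_loss(sub_model(x_gen)) + lam * diversity_loss(feats)
```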
arXiv Detail & Related papers (2024-12-18T03:03:15Z) - Leveraging Foundation Language Models (FLMs) for Automated Cohort Extraction from Large EHR Databases [50.552056536968166]
We propose and evaluate an algorithm for automating column matching on two large, popular and publicly accessible EHR databases.
Our approach achieves a high top-three accuracy of 92%, correctly matching 12 out of the 13 columns of interest, when using a small, pre-trained general purpose language model.
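One plausible reading of this entry, sketched below under stated assumptions: embed column names with a small general-purpose sentence encoder and rank candidate targets by cosine similarity, reporting the top three. The `all-MiniLM-L6-v2` model and the example columns are illustrative, not the paper's setup.
```python
# Hedged sketch of embedding-based column matching (an assumed reading of
# the entry, not the paper's algorithm).
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # a "small, general purpose" model

def top_k_matches(source_col, target_cols, k=3):
    """Rank target-database columns by embedding similarity to a source column."""
    emb = model.encode([source_col] + target_cols, normalize_embeddings=True)
    sims = emb[1:] @ emb[0]                 # cosine similarity via normalized dot
    order = np.argsort(-sims)[:k]
    return [(target_cols[i], float(sims[i])) for i in order]

print(top_k_matches("patient age at admission",
                    ["admission_age", "heart_rate", "age_years"]))
```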
arXiv Detail & Related papers (2024-12-16T06:19:35Z) - One VLM to Keep it Learning: Generation and Balancing for Data-free Continual Visual Question Answering [31.025439143093585]
Vision-Language Models (VLMs) have shown significant promise in Visual Question Answering (VQA) tasks by leveraging web-scale multimodal datasets.
These models often struggle with continual learning due to catastrophic forgetting when adapting to new tasks.
We propose the first data-free method that leverages the language generation capability of a VLM, instead of relying on external models.
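A hedged sketch of what "leveraging the VLM's own language generation" could look like: the model writes pseudo question-answer pairs for earlier tasks that are replayed while training on a new task. `vlm_generate` is an assumed image-plus-prompt interface, not an API from the paper.
```python
# Hedged sketch of VLM self-generated replay for continual VQA;
# `vlm_generate` (image + prompt -> text) is an assumed interface.
def build_replay_buffer(past_task_images, vlm_generate, per_image=2):
    """Have the VLM write pseudo QA pairs for images from earlier tasks."""
    buffer = []
    for img in past_task_images:
        for _ in range(per_image):
            q = vlm_generate(img, prompt="Ask one question about this image.")
            a = vlm_generate(img, prompt=f"Question: {q}\nAnswer briefly:")
            buffer.append((img, q, a))  # mixed into training to curb forgetting
    return buffer
```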
arXiv Detail & Related papers (2024-11-04T16:04:59Z) - Training on the Benchmark Is Not All You Need [52.01920740114261]
We propose a simple and effective data leakage detection method based on the contents of multiple-choice options.
Our method is able to work under black-box conditions without access to model training data or weights.
We evaluate the degree of data leakage of 31 mainstream open-source LLMs on four benchmark datasets.
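The detection method itself is in the paper; below is a hedged black-box probe consistent with the description: shuffle the option contents and check whether the model tracks the content or sticks to the original answer position, where position-sticky behavior suggests the benchmark was memorized. `ask_model` is a hypothetical query callback.
```python
# Hypothetical black-box probe: a model that memorized the benchmark may keep
# choosing the option in its *original slot* even after contents are shuffled.
import random

def order_sensitivity(ask_model, question, options, n_perms=6, seed=0):
    """ask_model(question, options) -> chosen option index (assumed callback).
    Returns the fraction of shuffles where the answer follows the content;
    low values (position-sticky answers) hint at training-data leakage."""
    rng = random.Random(seed)
    base = ask_model(question, options)  # answer index under the original order
    followed = 0
    for _ in range(n_perms):
        perm = list(range(len(options)))
        rng.shuffle(perm)
        choice = ask_model(question, [options[i] for i in perm])
        if perm[choice] == base:  # the model tracked the content, not the slot
            followed += 1
    return followed / n_perms
```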
arXiv Detail & Related papers (2024-09-03T11:09:44Z) - Prompt Optimization with EASE? Efficient Ordering-aware Automated Selection of Exemplars [66.823588073584]
Large language models (LLMs) have shown impressive capabilities in real-world applications.
The quality of the in-context exemplars included in the prompt greatly impacts performance.
Existing methods fail to adequately account for the impact of exemplar ordering on performance.
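To make the ordering point concrete, here is a deliberately naive baseline: a brute-force search over ordered exemplar subsets, whose factorial cost is exactly what an efficient ordering-aware selector has to avoid. `score_prompt` is a hypothetical dev-set evaluation call, not an API from the paper.
```python
# Naive ordering-aware baseline: exhaustively score every ordered k-subset.
# `score_prompt` evaluates a prompt built from the exemplar sequence (assumed).
import itertools

def best_ordered_exemplars(exemplars, k, score_prompt):
    best, best_score = None, float("-inf")
    for perm in itertools.permutations(exemplars, k):  # O(n!/(n-k)!) prompts
        score = score_prompt(list(perm))
        if score > best_score:
            best, best_score = list(perm), score
    return best, best_score
```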
arXiv Detail & Related papers (2024-05-25T08:23:05Z) - Zero-shot Retrieval: Augmenting Pre-trained Models with Search Engines [83.65380507372483]
Large pre-trained models can dramatically reduce the amount of task-specific data required to solve a problem, but they often fail to capture domain-specific nuances out of the box.
This paper shows how to leverage recent advances in NLP and multi-modal learning to augment a pre-trained model with search engine retrieval.
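The augmentation pattern reduces to a short retrieve-then-prompt loop, sketched below with assumed `search` and `generate` callables rather than any API from the paper.
```python
# Retrieve-then-prompt sketch; `search` (query -> snippet list) and `generate`
# (prompt -> text) are assumed callables.
def retrieval_augmented_answer(query, search, generate, k=3):
    context = "\n".join(search(query)[:k])  # top-k web snippets as context
    return generate(f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")
```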
arXiv Detail & Related papers (2023-11-29T05:33:28Z) - Revisiting Sparse Retrieval for Few-shot Entity Linking [33.15662306409253]
We propose an ELECTRA-based keyword extractor to denoise the mention context and construct a better query expression.
For training the extractor, we propose a distant supervision method to automatically generate training data based on overlapping tokens between mention contexts and entity descriptions.
Experimental results on the ZESHEL dataset demonstrate that the proposed method outperforms state-of-the-art models by a significant margin across all test domains.
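The distant supervision step is concrete enough to sketch directly: tokens that appear in both the mention context and the entity description become positive keyword labels for training the extractor. The example sentence and stopword list below are illustrative.
```python
# Sketch of the distant-supervision labeling the entry describes: overlapping
# tokens between mention context and entity description become keyword labels.
def distant_keyword_labels(context_tokens, description_tokens, stopwords=frozenset()):
    """Return a 0/1 label per context token: 1 if it also appears in the
    entity description (and is not a stopword)."""
    desc = {t.lower() for t in description_tokens} - stopwords
    return [1 if t.lower() in desc else 0 for t in context_tokens]

labels = distant_keyword_labels(
    ["The", "striker", "joined", "Arsenal", "in", "2003"],
    ["Arsenal", "is", "a", "London", "football", "club"],
    stopwords=frozenset({"the", "is", "a", "in"}))
# -> [0, 0, 0, 1, 0, 0]
```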
arXiv Detail & Related papers (2023-10-19T03:51:10Z) - MinPrompt: Graph-based Minimal Prompt Data Augmentation for Few-shot Question Answering [64.6741991162092]
We present MinPrompt, a minimal data augmentation framework for open-domain question answering.
We transform the raw text into a graph structure to build connections between different factual sentences.
We then apply graph algorithms to identify the minimal set of sentences needed to cover the most information in the raw text.
We generate QA pairs based on the identified sentence subset and train the model on the selected sentences to obtain the final model.
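The covering step can be approximated with a greedy set cover, sketched below; the paper works on a sentence graph, so treating each sentence as the set of facts or entities it mentions is a simplifying assumption.
```python
# Greedy stand-in for MinPrompt's minimal covering sentence set (the paper
# uses graph algorithms; greedy cover is a common approximation).
def greedy_sentence_cover(sentence_entities):
    """sentence_entities: {sentence: set of facts/entities it mentions}.
    Greedily pick sentences until every mentioned entity is covered."""
    uncovered = set().union(*sentence_entities.values())
    chosen = []
    while uncovered:
        best = max(sentence_entities,
                   key=lambda s: len(sentence_entities[s] & uncovered))
        gain = sentence_entities[best] & uncovered
        if not gain:
            break
        chosen.append(best)
        uncovered -= gain
    return chosen
```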
arXiv Detail & Related papers (2023-10-08T04:44:36Z) - Going beyond research datasets: Novel intent discovery in the industry setting [60.90117614762879]
This paper proposes methods to improve the intent discovery pipeline deployed in a large e-commerce platform.
We show the benefit of pre-training language models on in-domain data: both self-supervised and with weak supervision.
We also devise the best method to utilize the conversational structure (i.e., question and answer) of real-life datasets during fine-tuning for clustering tasks, which we call Conv.
arXiv Detail & Related papers (2023-05-09T14:21:29Z) - Combing for Credentials: Active Pattern Extraction from Smart Reply [15.097010165958027]
We investigate potential information leakage vulnerabilities in a typical Smart Reply pipeline.
We introduce a new type of active extraction attack that exploits canonical patterns in text containing sensitive data.
We show experimentally that it is possible for an adversary to extract sensitive user information present in the training data, even in realistic settings.
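A hedged illustration of the "canonical pattern" idea: send probe messages shaped like credential-bearing text and scan the suggested replies for concrete values. `smart_reply` is an assumed API, and the regex is a stand-in for the structured sensitive patterns the paper targets.
```python
# Hypothetical probe for the canonical-pattern attack; `smart_reply`
# (message -> suggested replies) is an assumed API.
import re

SENSITIVE = re.compile(r"\b\d{4,}\b")  # e.g. codes, PINs, account fragments

def extract_candidates(smart_reply, probes):
    leaks = []
    for probe in probes:  # probes crafted around credential-like phrasing
        for suggestion in smart_reply(probe):
            leaks.extend(SENSITIVE.findall(suggestion))
    return leaks
```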
arXiv Detail & Related papers (2022-07-14T05:03:56Z) - First to Possess His Statistics: Data-Free Model Extraction Attack on Tabular Data [0.0]
This paper presents a novel model extraction attack, named TEMPEST, under a practical data-free setting.
Experiments show that our attack can achieve the same level of performance as the previous attacks.
We discuss the possibility of executing TEMPEST in the real world through an experiment on a medical diagnosis task.
arXiv Detail & Related papers (2021-09-30T05:30:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.