Unmasking Database Vulnerabilities: Zero-Knowledge Schema Inference Attacks in Text-to-SQL Systems
- URL: http://arxiv.org/abs/2406.14545v2
- Date: Thu, 17 Oct 2024 15:06:23 GMT
- Title: Unmasking Database Vulnerabilities: Zero-Knowledge Schema Inference Attacks in Text-to-SQL Systems
- Authors: Đorđe Klisura, Anthony Rios,
- Abstract summary: We introduce a novel zero-knowledge framework for reconstructing the underlying database schema of text-to-generative models without any prior knowledge of the database.
We demonstrate that our method achieves high accuracy in reconstructing table names, with F1 scores of up to.99 for generative models and.78 for fine-tuned models.
We propose a simple protection mechanism for generative models and empirically show its limitations in mitigating these attacks.
- Score: 7.613758211231583
- License:
- Abstract: Text-to-SQL systems empower users to interact with databases using natural language, automatically translating queries into executable SQL code. However, their reliance on database schema information for SQL generation exposes them to significant security vulnerabilities, particularly schema inference attacks that can lead to unauthorized data access or manipulation. In this paper, we introduce a novel zero-knowledge framework for reconstructing the underlying database schema of text-to-SQL models without any prior knowledge of the database. Our approach systematically probes text-to-SQL models with specially crafted questions and leverages a surrogate GPT-4 model to interpret the outputs, effectively uncovering hidden schema elements -- including tables, columns, and data types. We demonstrate that our method achieves high accuracy in reconstructing table names, with F1 scores of up to .99 for generative models and .78 for fine-tuned models, underscoring the severity of schema leakage risks. Furthermore, we propose a simple protection mechanism for generative models and empirically show its limitations in mitigating these attacks.
Related papers
- RSL-SQL: Robust Schema Linking in Text-to-SQL Generation [51.00761167842468]
We propose a novel framework called RSL- that combines bidirectional schema linking, contextual information augmentation, binary selection strategy, and multi-turn self-correction.
benchmarks demonstrate that our approach achieves SOTA execution accuracy among open-source solutions, with 67.2% on BIRD and 87.9% on GPT-4ocorrection.
Our approach outperforms a series of GPT-4 based Text-to-Seek systems when adopting DeepSeek (much cheaper) with same intact prompts.
arXiv Detail & Related papers (2024-10-31T16:22:26Z) - Synthesizing Text-to-SQL Data from Weak and Strong LLMs [68.69270834311259]
The capability gap between open-source and closed-source large language models (LLMs) remains a challenge in text-to- tasks.
We introduce a synthetic data approach that combines data produced by larger, more powerful models with error information data generated by smaller, not well-aligned models.
arXiv Detail & Related papers (2024-08-06T15:40:32Z) - TrustSQL: Benchmarking Text-to-SQL Reliability with Penalty-Based Scoring [11.78795632771211]
We introduce a novel benchmark designed to evaluate text-to- reliability as a model's ability to correctly handle any type of input question.
We evaluate existing methods using a novel penalty-based scoring metric with two modeling approaches.
arXiv Detail & Related papers (2024-03-23T16:12:52Z) - DBCopilot: Scaling Natural Language Querying to Massive Databases [47.009638761948466]
Existing methods face scalability challenges when dealing with massive, dynamically changing databases.
This paper introduces DBCopilot, a framework that employs a compact and flexible copilot model for routing across massive databases.
arXiv Detail & Related papers (2023-12-06T12:37:28Z) - On the Security Vulnerabilities of Text-to-SQL Models [34.749129843281196]
We show that modules within six commercial applications can be manipulated to produce malicious code.
This is the first demonstration that NLP models can be exploited as attack vectors in the wild.
The aim of this work is to draw the community's attention to potential software security issues associated with NLP algorithms.
arXiv Detail & Related papers (2022-11-28T14:38:45Z) - Proton: Probing Schema Linking Information from Pre-trained Language
Models for Text-to-SQL Parsing [66.55478402233399]
We propose a framework to elicit relational structures via a probing procedure based on Poincar'e distance metric.
Compared with commonly-used rule-based methods for schema linking, we found that probing relations can robustly capture semantic correspondences.
Our framework sets new state-of-the-art performance on three benchmarks.
arXiv Detail & Related papers (2022-06-28T14:05:25Z) - UniSAr: A Unified Structure-Aware Autoregressive Language Model for
Text-to-SQL [48.21638676148253]
We present UniSAr (Unified Structure-Aware Autoregressive Language Model), which benefits from using an off-the-shelf language model.
Specifically, UniSAr extends existing autoregressive models to incorporate three non-invasive extensions to make them structure-aware.
arXiv Detail & Related papers (2022-03-15T11:02:55Z) - IGSQL: Database Schema Interaction Graph Based Neural Model for
Context-Dependent Text-to-SQL Generation [61.09660709356527]
We propose a database schema interaction graph encoder to utilize historicalal information of database schema items.
We evaluate our model on the benchmark SParC and Co datasets.
arXiv Detail & Related papers (2020-11-11T12:56:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.