VidLeaks: Membership Inference Attacks Against Text-to-Video Models
- URL: http://arxiv.org/abs/2601.11210v1
- Date: Fri, 16 Jan 2026 11:35:52 GMT
- Title: VidLeaks: Membership Inference Attacks Against Text-to-Video Models
- Authors: Li Wang, Wenyu Chen, Ning Yu, Zheng Li, Shanqing Guo
- Abstract summary: Membership inference attacks (MIAs) provide a principled tool for auditing copyright and privacy violations.
We introduce VidLeaks, a novel framework that probes sparse-temporal memorization through two complementary signals.
Our work provides the first concrete evidence that T2V models leak substantial membership information through both sparse and temporal memorization.
- Score: 17.443499650679964
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The proliferation of powerful Text-to-Video (T2V) models, trained on massive web-scale datasets, raises urgent concerns about copyright and privacy violations. Membership inference attacks (MIAs) provide a principled tool for auditing such risks, yet existing techniques - designed for static data like images or text - fail to capture the spatio-temporal complexities of video generation. In particular, they overlook the sparsity of memorization signals in keyframes and the instability introduced by stochastic temporal dynamics. In this paper, we conduct the first systematic study of MIAs against T2V models and introduce VidLeaks, a novel framework that probes sparse-temporal memorization through two complementary signals: 1) Spatial Reconstruction Fidelity (SRF), which uses a Top-K similarity to amplify spatial memorization signals from sparsely memorized keyframes, and 2) Temporal Generative Stability (TGS), which measures semantic consistency across multiple queries to capture temporal leakage. We evaluate VidLeaks under three progressively restrictive black-box settings - supervised, reference-based, and query-only. Experiments on three representative T2V models reveal severe vulnerabilities: VidLeaks achieves an AUC of 82.92% on AnimateDiff and 97.01% on InstructVideo even in the strict query-only setting, posing a realistic and exploitable privacy risk. Our work provides the first concrete evidence that T2V models leak substantial membership information through both sparse and temporal memorization, establishing a foundation for auditing video generation systems and motivating the development of new defenses. Code is available at: https://zenodo.org/records/17972831.
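The abstract names the two signals but not how they are computed. A minimal sketch, assuming CLIP-style L2-normalized embeddings, access to candidate reference frames (as in the supervised and reference-based settings), and several generations per prompt, might look like the following; every function name, parameter, and the final fusion rule is illustrative rather than taken from the released code.

```python
import numpy as np

def srf_score(gen_frames: np.ndarray, ref_frames: np.ndarray, k: int = 5) -> float:
    """Spatial Reconstruction Fidelity (illustrative sketch).

    gen_frames: (G, D) L2-normalized embeddings of generated frames.
    ref_frames: (R, D) L2-normalized embeddings of candidate frames.
    Averaging only the Top-K best matches lets a few strongly memorized
    keyframes dominate the score instead of being washed out by the
    many non-memorized frames.
    """
    sims = ref_frames @ gen_frames.T      # (R, G) cosine similarities
    best_per_ref = sims.max(axis=1)       # best reconstruction of each candidate frame
    topk = np.sort(best_per_ref)[-k:]     # keep only the K sharpest signals
    return float(topk.mean())

def tgs_score(video_embs: np.ndarray) -> float:
    """Temporal Generative Stability (illustrative sketch).

    video_embs: (Q, D) L2-normalized video-level embeddings from Q
    independent queries of the same prompt. Memorized prompts should
    regenerate semantically consistent videos despite stochastic
    temporal dynamics, so we score mean pairwise cosine similarity.
    """
    q = video_embs.shape[0]
    sims = video_embs @ video_embs.T
    return float(sims[~np.eye(q, dtype=bool)].mean())

def membership_score(gen_frames, ref_frames, video_embs, alpha=0.5) -> float:
    # Hypothetical fusion: a convex combination of the two signals;
    # higher scores suggest the sample was in the training set.
    return alpha * srf_score(gen_frames, ref_frames) + (1 - alpha) * tgs_score(video_embs)
```

In the query-only setting no reference video is available, so a score in this spirit would have to rely on signals such as TGS alone; the Top-K pooling in SRF is what operationalizes the sparsity argument made in the abstract.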
Related papers
- T2VAttack: Adversarial Attack on Text-to-Video Diffusion Models [67.13397169618624]
We introduce T2VAttack, a study of adversarial attacks on Text-to-Video (T2V) models from both semantic and temporal perspectives.
To achieve an effective and efficient attack process, we propose two adversarial attack methods: T2VAttack-S, which identifies semantically or temporally critical words in prompts and replaces them with synonyms via greedy search, and T2VAttack-I, which iteratively inserts optimized words with minimal perturbation to the prompt.
arXiv Detail & Related papers (2025-12-30T03:00:46Z)
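The summary above reduces T2VAttack-S to a greedy synonym search. Purely as an illustration of that style of loop, and not the authors' implementation, a sketch could look like this, where `synonyms` and `attack_objective` are hypothetical callables supplied by the attacker:

```python
def greedy_synonym_attack(prompt: str, synonyms, attack_objective, max_edits: int = 3) -> str:
    """Greedily replace words with synonyms to maximize an adversarial objective.

    synonyms(word) -> list of candidate replacements (hypothetical helper).
    attack_objective(prompt) -> float, higher = more degraded generation
    (hypothetical scoring of the victim T2V model's output).
    """
    words = prompt.split()
    for _ in range(max_edits):
        best_score = attack_objective(" ".join(words))
        best_edit = None
        for i, word in enumerate(words):
            for cand in synonyms(word):
                trial = words[:i] + [cand] + words[i + 1:]
                score = attack_objective(" ".join(trial))
                if score > best_score:
                    best_score, best_edit = score, (i, cand)
        if best_edit is None:
            break  # no single substitution improves the objective; stop early
        i, cand = best_edit
        words[i] = cand
    return " ".join(words)
```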
- Skeletons Speak Louder than Text: A Motion-Aware Pretraining Paradigm for Video-Based Person Re-Identification [8.135364788458423]
Multimodal pretraining has revolutionized visual understanding, but its impact on video-based person re-identification (ReID) remains underexplored.
Existing approaches often rely on video-text pairs, yet suffer from two fundamental limitations: (1) lack of genuine multimodal pretraining, and (2) text poorly captures fine-grained temporal motion.
We take a bold departure from text-based paradigms by introducing the first skeleton-driven pretraining framework for ReID.
arXiv Detail & Related papers (2025-11-17T08:59:41Z)
- Privacy Beyond Pixels: Latent Anonymization for Privacy-Preserving Video Understanding [56.369026347458835]
We introduce a novel formulation of visual privacy preservation for video foundation models that operates entirely in the latent space.
Current privacy-preservation methods based on input-pixel-level anonymization require retraining the entire utility video model.
A lightweight Anonym Adapter Module (AAM) removes private information from video features while retaining general task utility.
arXiv Detail & Related papers (2025-11-11T18:56:27Z)
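The blurb does not describe the AAM's internals. As a loose sketch of what a lightweight adapter operating on latent video features can look like (the bottleneck-plus-residual design and all dimensions below are assumptions, not the paper's architecture):

```python
import torch
import torch.nn as nn

class AnonymAdapterSketch(nn.Module):
    """Illustrative bottleneck adapter over latent video features.

    A module like this would typically be trained adversarially: a
    privacy critic should fail to recover identities from its output
    while a frozen downstream head retains its task accuracy. That
    training setup is an assumption, not taken from the paper.
    """

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # The residual path preserves general task information; the
        # bottleneck path learns a correction that suppresses private cues.
        return feats + self.up(self.act(self.down(feats)))
```

Because the adapter sits in latent space, the utility model itself can stay frozen, which is exactly the retraining cost the summary says pixel-level anonymization cannot avoid.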
- Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence [70.2803680525165]
We introduce Open-o3 Video, a non-agent framework that integrates explicit evidence into video reasoning.
The model highlights key objects and bounding boxes alongside its answers, allowing reasoning to be grounded in concrete visual observations.
On the V-STAR benchmark, Open-o3 Video achieves state-of-the-art performance, raising mAM by 14.4% and mLGM by 24.2%.
arXiv Detail & Related papers (2025-10-23T14:05:56Z)
- BadVideo: Stealthy Backdoor Attack against Text-to-Video Generation [37.055665794706336]
Text-to-video (T2V) generative models have rapidly advanced and found widespread applications across fields like entertainment, education, and marketing.
We observe that in T2V generation tasks, the generated videos often contain substantial redundant information not explicitly specified in the text prompts.
We introduce BadVideo, the first backdoor attack framework tailored for T2V generation.
arXiv Detail & Related papers (2025-04-23T17:34:48Z)
- T2VShield: Model-Agnostic Jailbreak Defense for Text-to-Video Models [88.63040835652902]
Text-to-video models are vulnerable to jailbreak attacks, where specially crafted prompts bypass safety mechanisms and lead to the generation of harmful or unsafe content.
We propose T2VShield, a comprehensive and model-agnostic defense framework designed to protect text-to-video models from jailbreak threats.
Our method systematically analyzes the input, model, and output stages to identify the limitations of existing defenses.
arXiv Detail & Related papers (2025-04-22T01:18:42Z)
- When the Future Becomes the Past: Taming Temporal Correspondence for Self-supervised Video Representation Learning [80.09819072780193]
We propose T-CoRe, a self-supervised framework that leverages Temporal Correspondence for video representation learning.
Across several downstream tasks, T-CoRe consistently delivers superior performance, demonstrating its effectiveness for video representation learning.
arXiv Detail & Related papers (2025-03-19T10:50:03Z)
- Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation [72.90144343056227]
We explore the visual representations produced by a pre-trained text-to-video (T2V) diffusion model for video understanding tasks.
We introduce a novel framework, termed "VD-IT", tailored with dedicated components built upon a fixed T2V model.
Our VD-IT achieves highly competitive results, surpassing many existing state-of-the-art methods.
arXiv Detail & Related papers (2024-03-18T17:59:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.