Active Light Modulation to Counter Manipulation of Speech Visual Content
- URL: http://arxiv.org/abs/2504.21846v1
- Date: Wed, 30 Apr 2025 17:55:24 GMT
- Title: Active Light Modulation to Counter Manipulation of Speech Visual Content
- Authors: Hadleigh Schwartz, Xiaofeng Yan, Charles J. Carver, Xia Zhou
- Abstract summary: Spotlight is a low-overhead and unobtrusive system for protecting live speech videos from falsification. Spotlight creates dynamic physical signatures at the event site and embeds them into all video recordings via imperceptible light. Prototype experiments show Spotlight achieves AUCs $\geq$ 0.99 and an overall true positive rate of 100% in detecting falsified videos.
- Score: 1.471374083774109
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: High-profile speech videos are prime targets for falsification, owing to their accessibility and influence. This work proposes Spotlight, a low-overhead and unobtrusive system for protecting live speech videos from visual falsification of speaker identity and lip and facial motion. Unlike predominant falsification detection methods operating in the digital domain, Spotlight creates dynamic physical signatures at the event site and embeds them into all video recordings via imperceptible modulated light. These physical signatures encode semantically-meaningful features unique to the speech event, including the speaker's identity and facial motion, and are cryptographically-secured to prevent spoofing. The signatures can be extracted from any video downstream and validated against the portrayed speech content to check its integrity. Key elements of Spotlight include (1) a framework for generating extremely compact (i.e., 150-bit), pose-invariant speech video features, based on locality-sensitive hashing; and (2) an optical modulation scheme that embeds >200 bps into video while remaining imperceptible both in video and live. Prototype experiments on extensive video datasets show Spotlight achieves AUCs $\geq$ 0.99 and an overall true positive rate of 100% in detecting falsified videos. Further, Spotlight is highly robust across recording conditions, video post-processing techniques, and white-box adversarial attacks on its video feature extraction methodologies.
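The abstract's first key element, compact pose-invariant video features via locality-sensitive hashing, can be illustrated with random-hyperplane LSH: a real-valued feature vector is mapped to a short binary signature whose bits mostly agree for nearby vectors, so a genuine recording reproduces nearly the same 150-bit signature while a manipulated one does not. The sketch below is a minimal illustration under assumed parameters; the 512-dimensional feature vector, fixed seed, noise model, and 15-bit tolerance are illustrative choices, not details from the paper, and Spotlight's actual feature extraction, signing, and validation pipeline is more involved (including the cryptographic securing of the signatures).

```python
import numpy as np

RNG = np.random.default_rng(seed=7)   # fixed seed so signer and verifier share the same hyperplanes
FEATURE_DIM = 512                     # assumed size of a pose-normalized facial-motion feature vector
SIGNATURE_BITS = 150                  # compact signature length quoted in the abstract

# Random hyperplanes; in practice these would be agreed upon (or derived from a key) ahead of time.
HYPERPLANES = RNG.standard_normal((SIGNATURE_BITS, FEATURE_DIM))

def lsh_signature(features: np.ndarray) -> np.ndarray:
    """Map a real-valued feature vector to a 150-bit binary signature.

    Each bit records which side of a random hyperplane the feature vector
    falls on, so nearby vectors share most bits (locality sensitivity).
    """
    return (HYPERPLANES @ features > 0).astype(np.uint8)

def hamming_distance(sig_a: np.ndarray, sig_b: np.ndarray) -> int:
    return int(np.count_nonzero(sig_a != sig_b))

def is_consistent(embedded_sig: np.ndarray, recomputed_sig: np.ndarray,
                  max_bit_flips: int = 15) -> bool:
    """Accept the video if the signature recovered from the light channel matches
    the signature recomputed from the portrayed content, up to a small tolerance
    (the threshold here is an illustrative assumption)."""
    return hamming_distance(embedded_sig, recomputed_sig) <= max_bit_flips

# Toy usage: a genuine recording reproduces nearly the same features,
# while a manipulated one drifts far enough to flip many bits.
genuine = RNG.standard_normal(FEATURE_DIM)
tampered = genuine + 2.0 * RNG.standard_normal(FEATURE_DIM)

sig_event = lsh_signature(genuine)                          # signed at the event, embedded via light
print(is_consistent(sig_event, lsh_signature(genuine)))     # True (identical features here; a real re-extraction would differ by a few bits)
print(is_consistent(sig_event, lsh_signature(tampered)))    # very likely False
```

With signatures this short, a verifier only needs the 150 recovered bits plus the shared hashing parameters to check a recording, which is what makes carrying them over a >200 bps optical channel practical.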
Related papers
- SpA2V: Harnessing Spatial Auditory Cues for Audio-driven Spatially-aware Video Generation [50.03810359300705]
SpA2V decomposes the generation process into two stages: audio-guided video planning and layout-grounded video generation. We show that SpA2V excels in generating realistic videos with semantic and spatial alignment to the input audio.
arXiv Detail & Related papers (2025-08-01T17:05:04Z)
- Noise-Coded Illumination for Forensic and Photometric Video Analysis [11.507609566604664]
We show how coding very subtle, noise-like modulations into the illumination of a scene can help combat this advantage. Our approach effectively adds a temporal watermark to any video recorded under coded illumination. We show that even when an adversary knows that our technique is being used, creating a plausible coded fake video amounts to solving a second, more difficult version of the original adversarial content creation problem. (A correlation-based verification sketch appears after this list.)
arXiv Detail & Related papers (2025-07-30T18:08:34Z)
- From Faces to Voices: Learning Hierarchical Representations for High-quality Video-to-Speech [26.67378997911053]
The objective of this study is to generate high-quality speech from silent talking face videos. We propose a novel video-to-speech system that bridges the modality gap between silent video and multi-faceted speech. Our method achieves exceptional generation quality comparable to real utterances.
arXiv Detail & Related papers (2025-03-21T09:02:38Z)
- Shushing! Let's Imagine an Authentic Speech from the Silent Video [15.426152742881365]
Vision-guided speech generation aims to produce authentic speech from facial appearance or lip motions without relying on auditory signals. Despite recent progress, existing methods struggle to achieve unified cross-modal alignment across semantics, timbre, and emotional prosody from visual cues. We introduce ImaginTalk, a novel cross-modal diffusion framework that generates faithful speech using only visual input.
arXiv Detail & Related papers (2025-03-19T06:28:17Z)
- SayAnything: Audio-Driven Lip Synchronization with Conditional Video Diffusion [78.77211425667542]
SayAnything is a conditional video diffusion framework that directly synthesizes lip movements from audio input. Our novel design effectively balances different condition signals in the latent space, enabling precise control over appearance, motion, and region-specific generation.
arXiv Detail & Related papers (2025-02-17T07:29:36Z)
- MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation [55.95148886437854]
Memory-guided EMOtion-aware diffusion (MEMO) is an end-to-end audio-driven portrait animation approach to generate talking videos.
MEMO generates more realistic talking videos across diverse image and audio types, outperforming state-of-the-art methods in overall quality, audio-lip synchronization, identity consistency, and expression-emotion alignment.
arXiv Detail & Related papers (2024-12-05T18:57:26Z)
- Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation [54.21476271127356]
Divot is a Diffusion-Powered Video Tokenizer. We present Divot-Vicuna through video-to-text autoregression and text-to-video generation.
arXiv Detail & Related papers (2024-12-05T18:53:04Z)
- Storyboard guided Alignment for Fine-grained Video Action Recognition [32.02631248389487]
Fine-grained video action recognition can be conceptualized as a video-text matching problem.
We propose a multi-granularity framework based on two observations: (i) videos with different global semantics may share similar atomic actions or appearances, and (ii) atomic actions within a video can be momentary, slow, or even non-directly related to the global video semantics.
arXiv Detail & Related papers (2024-10-18T07:40:41Z)
- Unsupervised Video Highlight Detection by Learning from Audio and Visual Recurrence [13.2968942989609]
We focus on unsupervised video highlight detection, eliminating the need for manual annotations.
Through a clustering technique, we identify pseudo-categories of videos and compute audio pseudo-highlight scores for each video.
We also compute visual pseudo-highlight scores for each video using visual features.
arXiv Detail & Related papers (2024-07-18T23:09:14Z)
- DiffV2S: Diffusion-based Video-to-Speech Synthesis with Vision-guided Speaker Embedding [52.84475402151201]
We present a vision-guided speaker embedding extractor using a self-supervised pre-trained model and prompt tuning technique.
We further develop a diffusion-based video-to-speech synthesis model, called DiffV2S, conditioned on those speaker embeddings and the visual representation extracted from the input video.
Our experimental results show that DiffV2S achieves state-of-the-art performance compared to previous video-to-speech synthesis techniques.
arXiv Detail & Related papers (2023-08-15T14:07:41Z)
- FateZero: Fusing Attentions for Zero-shot Text-based Video Editing [104.27329655124299]
We propose FateZero, a zero-shot text-based editing method on real-world videos without per-prompt training or use-specific mask.
Our method is the first to demonstrate zero-shot text-driven video style and local attribute editing from a trained text-to-image model.
arXiv Detail & Related papers (2023-03-16T17:51:13Z)
- Audio-Visual Person-of-Interest DeepFake Detection [77.04789677645682]
The aim of this work is to propose a deepfake detector that can cope with the wide variety of manipulation methods and scenarios encountered in the real world.
We leverage a contrastive learning paradigm to learn the moving-face and audio segment embeddings that are most discriminative for each identity.
Our method can detect both single-modality (audio-only, video-only) and multi-modality (audio-video) attacks, and is robust to low-quality or corrupted videos.
arXiv Detail & Related papers (2022-04-06T20:51:40Z)
- Speech2Video: Cross-Modal Distillation for Speech to Video Generation [21.757776580641902]
Speech-to-video generation can spark interesting applications in the entertainment, customer service, and human-computer interaction industries.
The challenge mainly lies in disentangling the distinct visual attributes from audio signals.
We propose a light-weight, cross-modal distillation method to extract disentangled emotional and identity information from unlabelled video inputs.
arXiv Detail & Related papers (2021-07-10T10:27:26Z)
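Returning to the Noise-Coded Illumination entry above (and, loosely, to Spotlight's own light channel): a temporal illumination watermark can be checked by correlating a video's per-frame brightness against the known pseudorandom code. The sketch below is a rough, assumption-laden illustration; the code length, 1% modulation depth, and 0.2 correlation threshold are invented for the example, and neither paper's actual modulation or detection scheme is reproduced here.

```python
import numpy as np

RNG = np.random.default_rng(seed=42)
NUM_FRAMES = 600                                   # e.g. 20 s of video at 30 fps (assumed)
CODE = RNG.choice([-1.0, 1.0], size=NUM_FRAMES)    # pseudorandom +/-1 illumination code shared with the verifier

def frame_brightness(video: np.ndarray) -> np.ndarray:
    """Per-frame mean luminance for a (frames, height, width) grayscale video."""
    return video.mean(axis=(1, 2))

def correlates_with_code(video: np.ndarray, code: np.ndarray, threshold: float = 0.2) -> bool:
    """Check whether the video's brightness fluctuations track the known code.

    The per-frame brightness is detrended (mean removed) and normalized, then
    correlated against the code; footage recorded under the coded light should
    correlate strongly, while synthesized or spliced frames should not.
    """
    signal = frame_brightness(video)
    signal = signal - signal.mean()
    norm = np.linalg.norm(signal) * np.linalg.norm(code)
    if norm == 0:
        return False
    return float(np.dot(signal, code) / norm) > threshold

# Toy usage: simulate a scene whose illumination carries the code at ~1% depth.
base_scene = RNG.uniform(0.3, 0.7, size=(NUM_FRAMES, 64, 64))
coded_video = base_scene * (1.0 + 0.01 * CODE)[:, None, None]
fake_video = RNG.uniform(0.3, 0.7, size=(NUM_FRAMES, 64, 64))   # no coded light present

print(correlates_with_code(coded_video, CODE))   # expected True
print(correlates_with_code(fake_video, CODE))    # expected False
```

In practice the modulation must stay imperceptible to viewers and survive camera noise and compression, which is where the careful code and modulation design of the real systems comes in.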