Spotlighting Partially Visible Cinematic Language for Video-to-Audio Generation via Self-distillation
- URL: http://arxiv.org/abs/2507.02271v1
- Date: Thu, 03 Jul 2025 03:23:11 GMT
- Title: Spotlighting Partially Visible Cinematic Language for Video-to-Audio Generation via Self-distillation
- Authors: Feizhen Huang, Yu Wu, Yutian Lin, Bo Du
- Abstract summary: We propose a self-distillation approach to extend V2A models to cinematic language scenarios. By simulating cinematic language variations, the student model learns to align the video features of training pairs that share the same audio-visual correspondences. Our method not only achieves impressive improvements under partial visibility across all evaluation metrics, but also enhances performance on the large-scale V2A dataset, VGGSound.
- Score: 34.67832016708788
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video-to-Audio (V2A) Generation has achieved significant progress and plays a crucial role in film and video post-production. However, current methods overlook the cinematic language, a critical component of artistic expression in filmmaking. As a result, their performance deteriorates in scenarios where Foley targets are only partially visible. To address this challenge, we propose a simple self-distillation approach to extend V2A models to cinematic language scenarios. By simulating the cinematic language variations, the student model learns to align the video features of training pairs with the same audio-visual correspondences, enabling it to effectively capture the associations between sounds and partial visual information. Our method not only achieves impressive improvements under partial visibility across all evaluation metrics, but also enhances performance on the large-scale V2A dataset, VGGSound.
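The self-distillation idea in the abstract can be sketched roughly as follows: a frozen teacher video encoder sees the full frames, while a student encoder sees frames with simulated cinematic-language variations (e.g. partial occlusion of the Foley target) and is trained to match the teacher's features. All names, the occlusion scheme, and the alignment loss below are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVideoEncoder(nn.Module):
    """Stand-in video encoder: frames (B, T, C, H, W) -> features (B, D)."""
    def __init__(self, dim=128):
        super().__init__()
        self.conv = nn.Conv2d(3, dim, kernel_size=7, stride=4)
        self.dim = dim

    def forward(self, frames):
        b, t, c, h, w = frames.shape
        x = self.conv(frames.reshape(b * t, c, h, w))   # (B*T, D, h', w')
        x = x.mean(dim=(2, 3)).reshape(b, t, self.dim)  # spatial pool -> (B, T, D)
        return x.mean(dim=1)                            # temporal pool -> (B, D)

def simulate_partial_visibility(frames, mask_frac=0.5):
    """Crudely occlude a vertical strip to mimic an off-screen Foley target."""
    masked = frames.clone()
    w = frames.shape[-1]
    masked[..., : int(w * mask_frac)] = 0.0
    return masked

teacher = TinyVideoEncoder()
student = TinyVideoEncoder()
student.load_state_dict(teacher.state_dict())  # student starts from the teacher
for p in teacher.parameters():
    p.requires_grad_(False)                    # teacher stays frozen

frames = torch.randn(2, 8, 3, 64, 64)          # (batch, time, C, H, W)
with torch.no_grad():
    target = teacher(frames)                   # features of the fully visible view
pred = student(simulate_partial_visibility(frames))
loss = 1.0 - F.cosine_similarity(pred, target).mean()  # feature-alignment loss
loss.backward()
```

The key design point is that only the student receives gradients, so it learns to produce full-view features from partial views rather than merely copying the teacher.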
Related papers
- SpA2V: Harnessing Spatial Auditory Cues for Audio-driven Spatially-aware Video Generation [50.03810359300705]
SpA2V decomposes the generation process into two stages: audio-guided video planning and layout-grounded video generation. We show that SpA2V excels in generating realistic videos with semantic and spatial alignment to the input audio.
arXiv Detail & Related papers (2025-08-01T17:05:04Z) - Semi-Supervised Audio-Visual Video Action Recognition with Audio Source Localization Guided Mixup [2.80888070977859]
We propose audio-visual SSL for video action recognition, which uses visual and audio modalities together. In experiments on the UCF-51, Kinetics-400, and VGGSound datasets, the proposed framework shows superior performance.
arXiv Detail & Related papers (2025-03-04T05:13:56Z) - From Vision to Audio and Beyond: A Unified Model for Audio-Visual Representation and Generation [17.95017332858846]
We introduce a novel framework called Vision to Audio and Beyond (VAB) to bridge the gap between audio-visual representation learning and vision-to-audio generation.
VAB uses a pre-trained audio tokenizer and an image encoder to obtain audio tokens and visual features, respectively.
Our experiments showcase the efficiency of VAB in producing high-quality audio from video, and its capability to acquire semantic audio-visual features.
arXiv Detail & Related papers (2024-09-27T20:26:34Z) - DiffV2S: Diffusion-based Video-to-Speech Synthesis with Vision-guided Speaker Embedding [52.84475402151201]
We present a vision-guided speaker embedding extractor using a self-supervised pre-trained model and prompt tuning technique.
We further develop a diffusion-based video-to-speech synthesis model, called DiffV2S, conditioned on those speaker embeddings and the visual representation extracted from the input video.
Our experimental results show that DiffV2S achieves state-of-the-art performance compared to previous video-to-speech synthesis techniques.
arXiv Detail & Related papers (2023-08-15T14:07:41Z) - AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR [79.21857972093332]
We present AVFormer, a method for augmenting audio-only models with visual information while simultaneously performing lightweight domain adaptation.
We show that these can be trained on a small amount of weakly labelled video data with minimal additional training time and parameters.
We also introduce a simple curriculum scheme during training which we show is crucial to enable the model to jointly process audio and visual information effectively.
arXiv Detail & Related papers (2023-03-29T07:24:28Z) - Fine-grained Audible Video Description [61.81122862375985]
We construct FAVDBench, the first fine-grained audible video description benchmark.
For each video clip, we first provide a one-sentence summary of the video, followed by 4-6 sentences describing the visual details and 1-2 audio-related descriptions at the end.
We demonstrate that fine-grained video descriptions can be used to generate more intricate videos than plain captions.
arXiv Detail & Related papers (2023-03-27T22:03:48Z) - Video-Guided Curriculum Learning for Spoken Video Grounding [65.49979202728167]
We introduce a new task, spoken video grounding (SVG), which aims to localize the desired video fragments from spoken language descriptions.
To rectify the discriminative phonemes and extract video-related information from noisy audio, we develop a novel video-guided curriculum learning (VGCL) strategy.
In addition, we collect the first large-scale spoken video grounding dataset based on ActivityNet.
arXiv Detail & Related papers (2022-09-01T07:47:01Z) - OWL (Observe, Watch, Listen): Localizing Actions in Egocentric Video via Audiovisual Temporal Context [58.932717614439916]
We take a deep look into the effectiveness of audio in detecting actions in egocentric videos.
We propose a transformer-based model to incorporate temporal audio-visual context.
Our approach achieves state-of-the-art performance on EPIC-KITCHENS-100.
arXiv Detail & Related papers (2022-02-10T10:50:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.