GMS-CAVP: Improving Audio-Video Correspondence with Multi-Scale Contrastive and Generative Pretraining
- URL: http://arxiv.org/abs/2601.19606v1
- Date: Tue, 27 Jan 2026 13:43:32 GMT
- Title: GMS-CAVP: Improving Audio-Video Correspondence with Multi-Scale Contrastive and Generative Pretraining
- Authors: Shentong Mo, Zehua Chen, Jun Zhu
- Abstract summary: GMS-CAVP is a novel framework that combines Multi-Scale Video-Audio Alignment and Multi-Scale Spatial-Temporal Diffusion-based pretraining objectives. First, GMS-CAVP introduces a multi-scale contrastive learning strategy that captures semantic and temporal relations across varying granularities. Second, we go beyond traditional contrastive learning by incorporating a diffusion-based generative objective, enabling modality translation and synthesis between video and audio.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in video-audio (V-A) understanding and generation have increasingly relied on joint V-A embeddings, which serve as the foundation for tasks such as cross-modal retrieval and generation. While prior methods like CAVP effectively model semantic and temporal correspondences between modalities using contrastive objectives, their performance remains suboptimal. A key limitation is the insufficient modeling of the dense, multi-scale nature of both video and audio signals: correspondences often span fine- to coarse-grained spatial-temporal structures, which are underutilized in existing frameworks. To this end, we propose GMS-CAVP, a novel framework that combines Multi-Scale Video-Audio Alignment and Multi-Scale Spatial-Temporal Diffusion-based pretraining objectives to enhance V-A correspondence modeling. First, GMS-CAVP introduces a multi-scale contrastive learning strategy that captures semantic and temporal relations across varying granularities. Second, we go beyond traditional contrastive learning by incorporating a diffusion-based generative objective, enabling modality translation and synthesis between video and audio. This unified discriminative-generative formulation facilitates deeper cross-modal understanding and paves the way for high-fidelity generation. Extensive experiments on VGGSound, AudioSet, and Panda70M demonstrate that GMS-CAVP outperforms previous methods on both generation and retrieval tasks.
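To make the abstract's two objectives concrete, below is a minimal PyTorch sketch of how a multi-scale contrastive loss and a video-conditioned diffusion loss could be combined. Everything here is an illustrative assumption rather than the authors' released code: the function names, the segment scales, the cosine noise schedule, and the `denoiser` interface are all hypothetical.

```python
# Illustrative sketch only -- NOT the authors' implementation. Shapes,
# scales, and the noise schedule are assumptions for exposition.
import torch
import torch.nn.functional as F

def multi_scale_contrastive_loss(video_feats, audio_feats,
                                 scales=(1, 2, 4), temperature=0.07):
    """Contrast matched video/audio segments at several temporal granularities.

    video_feats, audio_feats: (B, T, D) frame-level embeddings, time-aligned.
    """
    B, T, D = video_feats.shape
    losses = []
    for s in scales:
        T_trim = (T // s) * s  # drop frames so T divides evenly into s segments
        # Split the timeline into s segments and mean-pool each to one vector.
        v = video_feats[:, :T_trim].reshape(B, s, T // s, D).mean(dim=2)
        a = audio_feats[:, :T_trim].reshape(B, s, T // s, D).mean(dim=2)
        v = F.normalize(v.reshape(B * s, D), dim=-1)
        a = F.normalize(a.reshape(B * s, D), dim=-1)
        logits = (v @ a.t()) / temperature             # (B*s, B*s) similarities
        targets = torch.arange(B * s, device=logits.device)
        # Symmetric InfoNCE: video-to-audio and audio-to-video directions.
        losses.append(0.5 * (F.cross_entropy(logits, targets)
                             + F.cross_entropy(logits.t(), targets)))
    return torch.stack(losses).mean()

def diffusion_generative_loss(denoiser, audio_latents, video_cond):
    """Epsilon-prediction objective: corrupt (B, T, C) audio latents with
    Gaussian noise and train `denoiser` to predict that noise, conditioned
    on video features."""
    noise = torch.randn_like(audio_latents)
    t = torch.rand(audio_latents.size(0), device=audio_latents.device)
    alpha_bar = torch.cos(0.5 * torch.pi * t).pow(2).view(-1, 1, 1)  # cosine schedule
    noisy = alpha_bar.sqrt() * audio_latents + (1.0 - alpha_bar).sqrt() * noise
    return F.mse_loss(denoiser(noisy, t, video_cond), noise)
```

A unified discriminative-generative objective would then weight the two terms, e.g. `loss = multi_scale_contrastive_loss(v, a) + lam * diffusion_generative_loss(net, a_lat, v_cond)`, where the weight `lam` is a tunable hyperparameter (again an assumption; the abstract does not specify the weighting).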
Related papers
- JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation
Joint audio-video generation (JAVG) produces synchronized and semantically aligned sound and vision from textual descriptions. This paper presents JavisDiT++, a framework for unified modeling and optimization of JAVG. Our model achieves state-of-the-art performance with only around 1M public training samples.
arXiv Detail & Related papers (2026-02-22T12:44:28Z)
- LiveTalk: Real-Time Multimodal Interactive Video Diffusion via Improved On-Policy Distillation
Real-time video generation via diffusion is essential for building general-purpose multimodal interactive AI systems. Existing distillation methods can make the model autoregressive and reduce sampling steps to mitigate the latency of diffusion sampling. This paper targets real-time interactive video diffusion conditioned on a multimodal context, including text, image, and audio, to bridge the gap.
arXiv Detail & Related papers (2025-12-29T16:17:36Z)
- Complementary and Contrastive Learning for Audio-Visual Segmentation
We present Complementary and Contrastive Transformer (CCFormer), a novel framework adept at processing both local and global information. Our method sets new state-of-the-art benchmarks across the S4, MS3 and AVSS datasets.
arXiv Detail & Related papers (2025-10-11T06:36:59Z)
- Scalable Audio-Visual Masked Autoencoders for Efficient Affective Video Facial Analysis
Affective video facial analysis (AVFA) has emerged as a key research field for building emotion-aware intelligent systems. Masked Autoencoders (MAE) have gained momentum, with growing adaptations in audio-visual contexts. AVF-MAE++ is a family of audio-visual MAE models designed to efficiently investigate the scaling properties in AVFA.
arXiv Detail & Related papers (2025-09-29T02:53:49Z)
- MultiSoundGen: Video-to-Audio Generation for Multi-Event Scenarios via SlowFast Contrastive Audio-Visual Pretraining and Direct Preference Optimization
Current video-to-audio (V2A) methods struggle in complex multi-event scenarios. This study proposes a novel V2A framework: MultiSoundGen. It introduces direct preference optimization (DPO) into the V2A domain.
arXiv Detail & Related papers (2025-09-24T11:04:34Z)
- HuMo: Human-Centric Video Generation via Collaborative Multi-Modal Conditioning
We present HuMo, a framework for collaborative multimodal control. HuMo surpasses specialized state-of-the-art methods on individual sub-tasks.
arXiv Detail & Related papers (2025-09-10T11:54:29Z)
- MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings
MoCa is a framework for transforming pre-trained VLM backbones into effective bidirectional embedding models. MoCa consistently improves performance across the MMEB and ViDoRe-v2 benchmarks, achieving new state-of-the-art results.
arXiv Detail & Related papers (2025-06-29T06:41:00Z)
- AVadCLIP: Audio-Visual Collaboration for Robust Video Anomaly Detection
We present a novel weakly supervised framework that leverages audio-visual collaboration for robust video anomaly detection. Our framework demonstrates superior performance across multiple benchmarks, with audio integration significantly boosting anomaly detection accuracy.
arXiv Detail & Related papers (2025-04-06T13:59:16Z)
- DiffGAP: A Lightweight Diffusion Module in Contrastive Space for Bridging Cross-Model Gap
DiffGAP is a novel approach incorporating a lightweight generative module within the contrastive space. Our experimental results on the VGGSound and AudioCaps datasets demonstrate that DiffGAP significantly improves performance in video/text-audio generation and retrieval tasks.
arXiv Detail & Related papers (2025-03-15T13:24:09Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin, using lip shapes to establish accurate frame-level syllable boundaries.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) that devotes the main training parameters to multiple cross-modal attention layers.
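For intuition, here is a hypothetical sketch of what one audio-guided cross-modal attention layer could look like in PyTorch. The class name, dimensions, and layer arrangement are assumptions for illustration; the summary above does not specify the actual CMFE architecture.

```python
# Hypothetical audio-guided fusion layer, NOT the paper's CMFE design.
import torch.nn as nn

class CrossModalFusionLayer(nn.Module):
    """Audio queries attend over visual features, then the result self-attends."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, audio, visual):
        # Audio as query, visual as key/value: the audio stream guides fusion.
        fused, _ = self.cross_attn(audio, visual, visual)
        x = self.norm1(audio + fused)
        refined, _ = self.self_attn(x, x, x)
        return self.norm2(x + refined)
```

Stacking several such layers would concentrate the fusion model's capacity in cross-modal attention, consistent with the summary's description.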
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation
We propose a novel diffusion-based framework, named Diffusion Co-Speech Gesture (DiffGesture).
DiffGesture achieves state-of-the-art performance, rendering coherent gestures with better mode coverage and stronger audio correlations.
arXiv Detail & Related papers (2023-03-16T07:32:31Z)