AVoCaDO: An Audiovisual Video Captioner Driven by Temporal Orchestration
- URL: http://arxiv.org/abs/2510.10395v1
- Date: Sun, 12 Oct 2025 01:20:22 GMT
- Title: AVoCaDO: An Audiovisual Video Captioner Driven by Temporal Orchestration
- Authors: Xinlong Chen, Yue Ding, Weihong Lin, Jingyun Hua, Linli Yao, Yang Shi, Bozhou Li, Yuanxing Zhang, Qiang Liu, Pengfei Wan, Liang Wang, Tieniu Tan
- Abstract summary: We present AVoCaDO, a powerful audiovisual video captioner driven by the temporal orchestration between audio and visual modalities. Experimental results demonstrate that AVoCaDO significantly outperforms existing open-source models across four audiovisual video captioning benchmarks.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audiovisual video captioning aims to generate semantically rich descriptions with temporal alignment between visual and auditory events, thereby benefiting both video understanding and generation. In this paper, we present AVoCaDO, a powerful audiovisual video captioner driven by the temporal orchestration between audio and visual modalities. We propose a two-stage post-training pipeline: (1) AVoCaDO SFT, which fine-tunes the model on a newly curated dataset of 107K high-quality, temporally-aligned audiovisual captions; and (2) AVoCaDO GRPO, which leverages tailored reward functions to further enhance temporal coherence and dialogue accuracy while regularizing caption length and reducing collapse. Experimental results demonstrate that AVoCaDO significantly outperforms existing open-source models across four audiovisual video captioning benchmarks, and also achieves competitive performance on the VDC and DREAM-1K benchmarks under visual-only settings.
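To make stage (2) more concrete, the minimal sketch below illustrates the general GRPO recipe applied to captioning: a composite reward combining stand-ins for the temporal-coherence, dialogue-accuracy, and length terms named in the abstract, plus the group-relative advantage normalization that gives GRPO its name. The scorers and weights here are hypothetical assumptions, not the paper's actual reward functions.

```python
# Minimal GRPO-style sketch for captioning; the individual reward terms are
# hypothetical placeholders for the paper's (unpublished-here) scorers.
from statistics import mean, pstdev

def composite_reward(caption: str, target_len: int = 120,
                     coherence: float = 0.0, dialogue: float = 0.0) -> float:
    """`coherence` and `dialogue` stand in for temporal-coherence and
    dialogue-accuracy scores (assumed to lie in [0, 1]); the length term
    penalizes drifting away from an assumed target caption length."""
    length_penalty = -abs(len(caption.split()) - target_len) / target_len
    return coherence + dialogue + 0.5 * length_penalty

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Core GRPO idea: normalize each sampled caption's reward against the
    group sampled for the same video, instead of using a learned critic."""
    mu, sigma = mean(rewards), pstdev(rewards) or 1.0
    return [(r - mu) / sigma for r in rewards]

# Usage: score a group of candidate captions drawn for one video.
rewards = [composite_reward(c, coherence=s, dialogue=d)
           for c, s, d in [("a dog barks then a door slams", 0.9, 0.8),
                           ("a dog barks", 0.6, 0.7)]]
print(group_relative_advantages(rewards))
```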
Related papers
- ALIVE: Animate Your World with Lifelike Audio-Video Generation [50.693986608051716]
ALIVE is a generation model that adapts a pretrained Text-to-Video (T2V) model to Sora-style audio-video generation and animation. To support audio-visual synchronization and reference animation, we augment the popular MMDiT architecture with a joint audio-video branch. ALIVE demonstrates outstanding performance, consistently outperforming open-source models and matching or surpassing state-of-the-art commercial solutions.
arXiv Detail & Related papers (2026-02-09T14:06:03Z)
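The joint audio-video branch in the ALIVE summary above suggests the MMDiT pattern of attending jointly over concatenated modality token streams. Below is a minimal, hypothetical sketch of such a block; the dimensions, names, and residual wiring are assumptions, not ALIVE's published architecture.

```python
# Hypothetical MMDiT-style joint block: per-modality norms, one shared
# self-attention over the concatenated streams, then a split back.
import torch
import torch.nn as nn

class JointAVBlock(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.norm_v = nn.LayerNorm(dim)  # separate norms per modality,
        self.norm_a = nn.LayerNorm(dim)  # as in MMDiT-style blocks
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video: torch.Tensor, audio: torch.Tensor):
        # video: (B, Tv, dim), audio: (B, Ta, dim)
        x = torch.cat([self.norm_v(video), self.norm_a(audio)], dim=1)
        out, _ = self.attn(x, x, x)  # joint attention couples both modalities
        v, a = out.split([video.shape[1], audio.shape[1]], dim=1)
        return video + v, audio + a  # residual connections

v, a = torch.randn(2, 16, 256), torch.randn(2, 8, 256)
v2, a2 = JointAVBlock()(v, a)
```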
- Exploiting Temporal Audio-Visual Correlation Embedding for Audio-Driven One-Shot Talking Head Animation [62.218932509432314]
Inherently, the temporal relationship of adjacent audio clips is highly correlated with that of the corresponding adjacent video frames. We learn audio-visual correlations and integrate them to enhance feature representations and regularize the final generation.
arXiv Detail & Related papers (2025-04-08T07:23:28Z)
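One way to make the correlation idea above concrete: compare the cosine similarity of consecutive audio embeddings with that of consecutive video embeddings and penalize disagreement. The loss below is an illustrative assumption, not the paper's exact formulation.

```python
# Aligns the temporal similarity profile of adjacent audio features with
# that of adjacent video features; a sketch of the stated intuition only.
import torch
import torch.nn.functional as F

def temporal_correlation_loss(audio: torch.Tensor, video: torch.Tensor):
    # audio, video: (B, T, D) feature sequences aligned in time
    sim_a = F.cosine_similarity(audio[:, 1:], audio[:, :-1], dim=-1)  # (B, T-1)
    sim_v = F.cosine_similarity(video[:, 1:], video[:, :-1], dim=-1)  # (B, T-1)
    return F.mse_loss(sim_a, sim_v)  # push the two temporal profiles together

loss = temporal_correlation_loss(torch.randn(2, 10, 64), torch.randn(2, 10, 64))
```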
- From Vision to Audio and Beyond: A Unified Model for Audio-Visual Representation and Generation [17.95017332858846]
We introduce a novel framework called Vision to Audio and Beyond (VAB) to bridge the gap between audio-visual representation learning and vision-to-audio generation.
VAB uses a pre-trained audio tokenizer and an image encoder to obtain audio tokens and visual features, respectively.
Our experiments showcase the efficiency of VAB in producing high-quality audio from video, and its capability to acquire semantic audio-visual features.
arXiv Detail & Related papers (2024-09-27T20:26:34Z)
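The VAB summary describes a pretrained audio tokenizer plus an image encoder feeding a unified model. The sketch below shows one plausible shape of that interface, with placeholder modules standing in for the pretrained components; none of the sizes or layer choices are from the paper.

```python
# Illustrative VAB-style interface: discrete audio tokens and visual
# features enter one transformer, which predicts audio-token logits.
import torch
import torch.nn as nn

class VABSketch(nn.Module):
    def __init__(self, vocab: int = 1024, dim: int = 256):
        super().__init__()
        self.audio_embed = nn.Embedding(vocab + 1, dim)  # +1 for a [MASK] id
        self.visual_proj = nn.Linear(512, dim)           # assumed encoder dim
        layer = nn.TransformerEncoderLayer(dim, 4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, vocab)                # audio-token logits

    def forward(self, audio_tokens, visual_feats):
        x = torch.cat([self.visual_proj(visual_feats),
                       self.audio_embed(audio_tokens)], dim=1)
        h = self.backbone(x)
        return self.head(h[:, visual_feats.shape[1]:])  # predict audio slots

tokens = torch.randint(0, 1024, (2, 20))  # stand-in for a pretrained tokenizer
feats = torch.randn(2, 8, 512)            # stand-in for a pretrained encoder
logits = VABSketch()(tokens, feats)       # (2, 20, 1024)
```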
- Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding [36.20990265600332]
We introduce PU-VALOR, a comprehensive audio-visual dataset comprising over 114,000 pseudo-untrimmed videos with detailed temporal annotations. PU-VALOR is derived from the large-scale but coarsely annotated audio-visual dataset VALOR, through a method involving event-based video clustering. We also develop AVicuna, a model capable of aligning audio-visual events with temporal intervals and corresponding text tokens.
arXiv Detail & Related papers (2024-03-24T19:50:49Z)
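The pseudo-untrimmed construction is easy to illustrate: concatenate trimmed clips and let each caption inherit a (start, end) interval in the longer video. The toy function below shows only that interval bookkeeping; the event-based clustering that selects which clips to join is omitted.

```python
# Build temporal annotations for a pseudo-untrimmed video by concatenating
# trimmed clips; each caption keeps its interval in the stitched timeline.
def make_pseudo_untrimmed(clips):
    """clips: list of (duration_seconds, caption) in playback order."""
    t, annotations = 0.0, []
    for duration, caption in clips:
        annotations.append({"start": t, "end": t + duration, "caption": caption})
        t += duration
    return annotations  # temporal labels for the concatenated video

print(make_pseudo_untrimmed([(3.2, "a man speaks"), (5.0, "applause erupts")]))
```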
- Text-to-Audio Generation Synchronized with Videos [44.848393652233796]
We introduce a groundbreaking benchmark for Text-to-Audio generation aligned with videos, named T2AV-Bench.
We also present a simple yet effective video-aligned TTA generation model, namely T2AV.
It employs a temporal multi-head attention transformer to extract and understand temporal nuances from video data, a feat amplified by our Audio-Visual ControlNet.
arXiv Detail & Related papers (2024-03-08T22:27:38Z)
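The temporal multi-head attention ingredient can be pictured as audio-side latents attending over per-frame video features so the generated audio tracks visual timing. The snippet below is a hedged illustration of that pattern only; the Audio-Visual ControlNet is not modeled, and all shapes are assumptions.

```python
# Audio timesteps query per-frame video features via cross-attention, so
# each generated audio step can focus on the visually relevant frames.
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
audio_latents = torch.randn(2, 50, 256)  # audio timesteps as queries
frame_feats = torch.randn(2, 16, 256)    # video frames as keys/values
aligned, weights = attn(audio_latents, frame_feats, frame_feats)
# `weights` (2, 50, 16) shows which frames each audio step attends to.
```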
- Bootstrapping Audio-Visual Segmentation by Strengthening Audio Cues [75.73217916395386]
We propose a Bidirectional Audio-Visual Decoder (BAVD) with integrated bidirectional bridges.
This interaction narrows the modality imbalance, facilitating more effective learning of integrated audio-visual representations.
We also present a strategy for audio-visual frame-wise synchrony as fine-grained guidance of BAVD.
arXiv Detail & Related papers (2024-02-04T03:02:35Z)
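One plausible reading of "integrated bidirectional bridges" is cross-attention in both directions, so each modality queries the other. The sketch below implements that reading with assumed dimensions; it is not the paper's actual decoder.

```python
# Two cross-attention paths: visual features query audio cues and audio
# features query visual cues, with residual fusion on both streams.
import torch
import torch.nn as nn

class BidirectionalBridge(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio, visual):
        v_new, _ = self.a2v(visual, audio, audio)   # visual queries audio cues
        a_new, _ = self.v2a(audio, visual, visual)  # audio queries visual cues
        return audio + a_new, visual + v_new        # residual fusion

a, v = torch.randn(2, 10, 256), torch.randn(2, 49, 256)
a2, v2 = BidirectionalBridge()(a, v)
```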
- Audio-Visual Contrastive Learning with Temporal Self-Supervision [84.11385346896412]
We propose a self-supervised learning approach for videos that learns representations of both the RGB frames and the accompanying audio without human supervision.
To leverage the temporal and aural dimension inherent to videos, our method extends temporal self-supervision to the audio-visual setting.
arXiv Detail & Related papers (2023-02-15T15:00:55Z)
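Audio-visual contrastive learning typically pairs video and audio embeddings from the same clip as positives against in-batch negatives, i.e. a symmetric InfoNCE objective. The sketch below shows that standard loss; the paper's temporal self-supervision extensions are not shown here.

```python
# Symmetric InfoNCE over a batch: matching audio-video pairs sit on the
# diagonal of the similarity matrix, everything else is a negative.
import torch
import torch.nn.functional as F

def av_infonce(video_emb, audio_emb, temperature: float = 0.07):
    # video_emb, audio_emb: (B, D); row i of each comes from the same clip
    v = F.normalize(video_emb, dim=-1)
    a = F.normalize(audio_emb, dim=-1)
    logits = v @ a.t() / temperature    # (B, B) similarity matrix
    targets = torch.arange(v.shape[0])  # matching pairs on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = av_infonce(torch.randn(8, 128), torch.randn(8, 128))
```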
- AudioVisual Video Summarization [103.47766795086206]
In video summarization, existing approaches exploit only the visual information while neglecting the audio information.
We propose to jointly exploit the audio and visual information for the video summarization task, and develop an AudioVisual Recurrent Network (AVRN) to achieve this.
arXiv Detail & Related papers (2021-05-17T08:36:10Z)
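As a rough stand-in for the AVRN idea, the sketch below runs recurrent encoders over the audio and visual feature streams, concatenates their states, and scores each segment for inclusion in the summary. The layer sizes and fusion scheme are assumptions, not the published model.

```python
# Recurrent encoders per modality, late fusion, per-segment importance score.
import torch
import torch.nn as nn

class AVSummarizerSketch(nn.Module):
    def __init__(self, v_dim: int = 512, a_dim: int = 128, hidden: int = 256):
        super().__init__()
        self.v_rnn = nn.LSTM(v_dim, hidden, batch_first=True)
        self.a_rnn = nn.LSTM(a_dim, hidden, batch_first=True)
        self.score = nn.Linear(2 * hidden, 1)  # per-segment importance

    def forward(self, visual, audio):
        v, _ = self.v_rnn(visual)  # (B, T, hidden)
        a, _ = self.a_rnn(audio)   # (B, T, hidden)
        return self.score(torch.cat([v, a], dim=-1)).squeeze(-1)  # (B, T)

scores = AVSummarizerSketch()(torch.randn(2, 30, 512), torch.randn(2, 30, 128))
```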