PreFM: Online Audio-Visual Event Parsing via Predictive Future Modeling
- URL: http://arxiv.org/abs/2505.23155v1
- Date: Thu, 29 May 2025 06:46:19 GMT
- Title: PreFM: Online Audio-Visual Event Parsing via Predictive Future Modeling
- Authors: Xiao Yu, Yan Fang, Xiaojie Jin, Yao Zhao, Yunchao Wei
- Abstract summary: We introduce Online Audio-Visual Event Parsing (On-AVEP), a novel paradigm for parsing audio, visual, and audio-visual events by sequentially analyzing incoming video streams. We propose the Predictive Future Modeling framework, featuring (a) predictive multimodal future modeling to infer and integrate beneficial future audio-visual cues. Experiments show PreFM outperforms state-of-the-art methods by a large margin while using significantly fewer parameters.
- Score: 78.61911985138795
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio-visual event parsing plays a crucial role in understanding multimodal video content, but existing methods typically rely on offline processing of entire videos with huge model sizes, limiting their real-time applicability. We introduce Online Audio-Visual Event Parsing (On-AVEP), a novel paradigm for parsing audio, visual, and audio-visual events by sequentially analyzing incoming video streams. The On-AVEP task necessitates models with two key capabilities: (1) accurate online inference, to effectively distinguish events with unclear and limited context in online settings, and (2) real-time efficiency, to balance high performance with computational constraints. To cultivate these, we propose the Predictive Future Modeling (PreFM) framework, featuring (a) predictive multimodal future modeling to infer and integrate beneficial future audio-visual cues, thereby enhancing contextual understanding, and (b) modality-agnostic robust representation along with focal temporal prioritization to improve precision and generalization. Extensive experiments on the UnAV-100 and LLP datasets show PreFM outperforms state-of-the-art methods by a large margin while using significantly fewer parameters, offering an insightful approach for real-time multimodal video understanding. Code is available at https://github.com/XiaoYu-1123/PreFM.
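To make the online setting concrete, below is a minimal, hypothetical PyTorch sketch of a streaming parser that, at each incoming step, fuses the current audio/visual features with a predicted future embedding before emitting per-class event logits. All module names, feature dimensions, and the class count are illustrative assumptions, not the authors' PreFM implementation.

```python
import torch
import torch.nn as nn

class OnlineAVParser(nn.Module):
    """Illustrative online audio-visual event parser (not the official PreFM code).

    At each streaming step it (a) predicts an embedding of the near future from
    the running context and (b) classifies events from current + predicted cues.
    """

    def __init__(self, dim=256, num_classes=25):
        super().__init__()
        self.audio_proj = nn.Linear(128, dim)   # assumed audio feature size
        self.video_proj = nn.Linear(512, dim)   # assumed visual feature size
        self.context = nn.GRU(2 * dim, dim, batch_first=True)
        self.future_head = nn.Linear(dim, dim)  # infers a future multimodal embedding
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, audio_feat, video_feat, hidden=None):
        # audio_feat: (B, 128), video_feat: (B, 512) for the current step only
        fused = torch.cat([self.audio_proj(audio_feat),
                           self.video_proj(video_feat)], dim=-1)
        out, hidden = self.context(fused.unsqueeze(1), hidden)
        ctx = out.squeeze(1)
        future = self.future_head(ctx)                      # predicted future cue
        logits = self.classifier(torch.cat([ctx, future], dim=-1))
        return logits, future, hidden

model = OnlineAVParser()
logits, future, h = model(torch.randn(2, 128), torch.randn(2, 512))
print(logits.shape)  # torch.Size([2, 25])
```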
Related papers
- Engagement Prediction of Short Videos with Large Multimodal Models [46.954597097369586]
We empirically investigate the potential of large multimodal models (LMMs) for video engagement prediction. VideoLLaMA2 processes key video frames, text-based metadata, and background sound, while Qwen2.5-VL utilizes only key video frames and text-based metadata. By ensembling two types of models, our method achieves first place in the ICCV VQualA 2025 EVQA-SnapUGC Challenge on short-form video engagement prediction.
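The ensembling step can be as simple as a weighted average of the two models' per-video engagement scores; the snippet below is a generic illustration with made-up weights, not the challenge-winning pipeline.

```python
import numpy as np

def ensemble_engagement(scores_a, scores_b, w_a=0.5):
    """Weighted average of two models' per-video engagement predictions."""
    scores_a, scores_b = np.asarray(scores_a), np.asarray(scores_b)
    return w_a * scores_a + (1.0 - w_a) * scores_b

# e.g. hypothetical outputs from the VideoLLaMA2- and Qwen2.5-VL-based predictors
print(ensemble_engagement([0.62, 0.41], [0.70, 0.35], w_a=0.6))
```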
arXiv Detail & Related papers (2025-08-04T15:21:29Z)
- Fork-Merge Decoding: Enhancing Multimodal Understanding in Audio-Visual Large Language Models [13.887164304514101]
The goal of this work is to enhance balanced multimodal understanding in audio-visual large language models (AV-LLMs). In current AV-LLMs, audio and video features are typically processed jointly in the decoder. We propose Fork-Merge Decoding (FMD), a simple yet effective inference-time strategy that requires no additional training or architectural modifications.
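A rough way to picture such an inference-time fork-and-merge strategy is to run the decoder once per modality branch and then combine the results; the sketch below simplifies the merge to logit averaging and uses a hypothetical model interface, so it is not the paper's exact FMD rule.

```python
import torch

@torch.no_grad()
def fork_merge_decode_step(model, audio_tokens, video_tokens, text_tokens):
    """Run the decoder once with only audio context and once with only video
    context, then merge the next-token logits. `model` is any callable with
    the (hypothetical) keyword interface used below."""
    logits_audio = model(audio=audio_tokens, video=None, text=text_tokens)   # audio-only fork
    logits_video = model(audio=None, video=video_tokens, text=text_tokens)   # video-only fork
    return 0.5 * (logits_audio + logits_video)                               # simple averaging merge

# Toy demo with a stand-in "model" that returns random next-token logits.
toy_model = lambda audio, video, text: torch.randn(1, 32000)
print(fork_merge_decode_step(toy_model, "a", "v", "t").shape)  # torch.Size([1, 32000])
```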
arXiv Detail & Related papers (2025-05-27T08:22:56Z)
- VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model [70.25062476543091]
VITA-Audio is an end-to-end large speech model with fast audio-text token generation. The MCTP module efficiently generates multiple audio tokens within a single model forward pass. A four-stage progressive training strategy is explored to achieve model acceleration with minimal loss of speech quality.
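As a loose analogy for emitting several audio tokens from one forward pass, a multi-token prediction head can produce K sets of logits from a single decoder hidden state; the sketch below is a generic illustration under assumed dimensions, not the actual MCTP module.

```python
import torch
import torch.nn as nn

class MultiTokenHead(nn.Module):
    """Predict K next audio tokens from one hidden state (illustrative only)."""
    def __init__(self, hidden_dim=1024, vocab_size=4096, k=4):
        super().__init__()
        self.heads = nn.ModuleList([nn.Linear(hidden_dim, vocab_size) for _ in range(k)])

    def forward(self, h):                        # h: (B, hidden_dim)
        # Stack per-position logits: (B, K, vocab_size)
        return torch.stack([head(h) for head in self.heads], dim=1)

tokens = MultiTokenHead()(torch.randn(2, 1024)).argmax(-1)
print(tokens.shape)  # torch.Size([2, 4]) -> four audio tokens per forward pass
```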
arXiv Detail & Related papers (2025-05-06T17:59:53Z)
- Audio-Enhanced Vision-Language Modeling with Latent Space Broadening for High Quality Data Expansion [12.212623921747264]
Transformer-based multimodal models are widely used in industrial-scale recommendation, search, and advertising systems. We propose kNN-based Latent Space Broadening (LSB) to enhance AL efficiency, together with Vision-Language Modeling with Audio Enhancement (VLMAE). This system is deployed in production systems, leading to significant business gains.
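A plain reading of kNN-based latent-space broadening is to expand a selected sample set with its nearest unlabeled neighbors in embedding space; the snippet below sketches that idea with scikit-learn and is an assumption about the mechanism, not the production system described above.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def broaden_selection(selected_emb, pool_emb, k=5):
    """Return indices of pool items that are k-nearest neighbors of the
    already-selected embeddings (selected items may match themselves)."""
    nn_index = NearestNeighbors(n_neighbors=k).fit(pool_emb)
    _, idx = nn_index.kneighbors(selected_emb)      # shape: (n_selected, k)
    return np.unique(idx.ravel())

pool = np.random.rand(1000, 64)                     # unlabeled embedding pool
seed = pool[:10]                                    # e.g. high-uncertainty picks
print(broaden_selection(seed, pool).shape)
```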
arXiv Detail & Related papers (2025-03-21T21:55:05Z)
- Adapting to the Unknown: Training-Free Audio-Visual Event Perception with Dynamic Thresholds [72.83227312675174]
We present a model-agnostic approach to the domain of audio-visual event perception. Our approach includes a score-level fusion technique to retain richer multimodal interactions. We also present the first training-free, open-vocabulary baseline for audio-visual event perception.
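Score-level fusion in a training-free setting can be pictured as combining per-class similarity scores from an audio encoder and a visual encoder before thresholding; the sketch below uses a hypothetical mean-plus-std dynamic threshold and is illustrative, not the paper's exact rule.

```python
import numpy as np

def fuse_and_threshold(audio_scores, visual_scores, alpha=0.5):
    """Fuse per-class similarity scores from both modalities, then apply a
    simple dynamic threshold (mean + 1 std of the fused scores)."""
    fused = alpha * np.asarray(audio_scores) + (1 - alpha) * np.asarray(visual_scores)
    threshold = fused.mean() + fused.std()
    return fused, fused >= threshold

fused, active = fuse_and_threshold([0.1, 0.7, 0.2], [0.2, 0.6, 0.1])
print(active)   # only the second class exceeds the dynamic threshold
```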
arXiv Detail & Related papers (2025-03-17T20:06:48Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
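An audio-guided cross-modal fusion layer is commonly built from cross-attention in which audio features act as queries over visual keys and values; the minimal PyTorch layer below illustrates that general pattern and is not the paper's CMFE.

```python
import torch
import torch.nn as nn

class AudioGuidedCrossAttention(nn.Module):
    """Audio queries attend to visual keys/values (illustrative, not CMFE)."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio, visual):          # (B, Ta, dim), (B, Tv, dim)
        fused, _ = self.attn(query=audio, key=visual, value=visual)
        return self.norm(audio + fused)        # residual connection + norm

layer = AudioGuidedCrossAttention()
out = layer(torch.randn(2, 50, 256), torch.randn(2, 30, 256))
print(out.shape)   # torch.Size([2, 50, 256])
```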
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- Visually-Guided Sound Source Separation with Audio-Visual Predictive Coding [57.08832099075793]
Visually-guided sound source separation consists of three parts: visual feature extraction, multimodal feature fusion, and sound signal processing.
This paper presents audio-visual predictive coding (AVPC) to tackle this task in a parameter-harmonizing and more effective manner.
In addition, we develop a valid self-supervised learning strategy for AVPC via co-predicting two audio-visual representations of the same sound source.
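Co-predicting two audio-visual representations of the same sound source can be expressed as a symmetric prediction loss between the two embeddings; the function below is a generic sketch of that idea, not the exact AVPC objective.

```python
import torch
import torch.nn.functional as F

def co_prediction_loss(z_audio, z_visual, predictor_a, predictor_v):
    """Each modality's predictor regresses the other modality's (detached) embedding."""
    loss_a2v = F.mse_loss(predictor_a(z_audio), z_visual.detach())
    loss_v2a = F.mse_loss(predictor_v(z_visual), z_audio.detach())
    return loss_a2v + loss_v2a

pred_a = torch.nn.Linear(128, 128)   # illustrative predictor heads
pred_v = torch.nn.Linear(128, 128)
print(co_prediction_loss(torch.randn(4, 128), torch.randn(4, 128), pred_a, pred_v))
```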
arXiv Detail & Related papers (2023-06-19T03:10:57Z)
- AVE-CLIP: AudioCLIP-based Multi-window Temporal Transformer for Audio Visual Event Localization [14.103742565510387]
We introduce AVE-CLIP, a novel framework that integrates AudioCLIP, pre-trained on large-scale audio-visual data, with a multi-window temporal transformer.
Our method achieves state-of-the-art performance on the publicly available AVE dataset with a 5.9% mean accuracy improvement.
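Multi-window temporal modeling can be approximated by running self-attention over several local window sizes and averaging the results; the sketch below shows one simplistic way to do that and is not the AVE-CLIP architecture.

```python
import torch
import torch.nn as nn

class MultiWindowTemporal(nn.Module):
    """Apply a shared transformer layer over several temporal window sizes (illustrative)."""
    def __init__(self, dim=256, windows=(2, 4, 8)):
        super().__init__()
        self.windows = windows
        self.layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)

    def forward(self, x):                        # x: (B, T, dim), T divisible by each window
        outputs = []
        for w in self.windows:
            b, t, d = x.shape
            chunks = x.reshape(b * (t // w), w, d)          # split the time axis into windows
            outputs.append(self.layer(chunks).reshape(b, t, d))
        return torch.stack(outputs).mean(0)                 # average across window scales

out = MultiWindowTemporal()(torch.randn(2, 16, 256))
print(out.shape)   # torch.Size([2, 16, 256])
```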
arXiv Detail & Related papers (2022-10-11T00:15:45Z)