Related papers: UniAV: Unified Audio-Visual Perception for Multi-Task Video Event Localization

UniAV: Unified Audio-Visual Perception for Multi-Task Video Event Localization

URL: http://arxiv.org/abs/2404.03179v2
Date: Mon, 12 Aug 2024 03:31:57 GMT
Title: UniAV: Unified Audio-Visual Perception for Multi-Task Video Event Localization
Authors: Tiantian Geng, Teng Wang, Yanfu Zhang, Jinming Duan, Weili Guan, Feng Zheng, Ling shao,
Abstract summary: Video localization tasks aim to temporally locate specific instances in videos, including temporal action localization (TAL), sound event detection (SED) and audio-visual event localization (AVEL) We present UniAV, a Unified Audio-Visual perception network, to achieve joint learning of TAL, SED and AVEL tasks for the first time.
Score: 83.89550658314741
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Video localization tasks aim to temporally locate specific instances in videos, including temporal action localization (TAL), sound event detection (SED) and audio-visual event localization (AVEL). Existing methods over-specialize on each task, overlooking the fact that these instances often occur in the same video to form the complete video content. In this work, we present UniAV, a Unified Audio-Visual perception network, to achieve joint learning of TAL, SED and AVEL tasks for the first time. UniAV can leverage diverse data available in task-specific datasets, allowing the model to learn and share mutually beneficial knowledge across tasks and modalities. To tackle the challenges posed by substantial variations in datasets (size/domain/duration) and distinct task characteristics, we propose to uniformly encode visual and audio modalities of all videos to derive generic representations, while also designing task-specific experts to capture unique knowledge for each task. Besides, we develop a unified language-aware classifier by utilizing a pre-trained text encoder, enabling the model to flexibly detect various types of instances and previously unseen ones by simply changing prompts during inference. UniAV outperforms its single-task counterparts by a large margin with fewer parameters, achieving on-par or superior performances compared to state-of-the-art task-specific methods across ActivityNet 1.3, DESED and UnAV-100 benchmarks.

Related papers

Tracking and Segmenting Anything in Any Modality [75.32774085793498]
We propose a universal tracking and segmentation framework named SATA, which unifies a broad spectrum of tracking and segmentation subtasks with any modality input.<n> SATA demonstrates superior performance on 18 challenging tracking and segmentation benchmarks, offering a novel perspective for more generalizable video understanding.
arXiv Detail & Related papers (2025-11-22T09:09:22Z)
TimeExpert: An Expert-Guided Video LLM for Video Temporal Grounding [83.96715649130435]
We introduce TimeExpert, a Mixture-of-Experts (MoE)-based Video-LLM that effectively decomposes VTG tasks.<n>Our design choices enable precise handling of each subtask, leading to improved event modeling across diverse VTG applications.
arXiv Detail & Related papers (2025-08-03T10:03:58Z)
V$^2$Dial: Unification of Video and Visual Dialog via Multimodal Experts [44.33388344586592]
V$2$Dial is a novel expert-based model geared towards simultaneously handling image and video input data for multimodal conversational tasks. We propose to unify both tasks using a single model that for the first time jointly learns the spatial and temporal features of images and videos.
arXiv Detail & Related papers (2025-03-03T21:27:38Z)
JoVALE: Detecting Human Actions in Video Using Audiovisual and Language Contexts [8.463489896549161]
Video Action Detection (VAD) entails localizing and categorizing action instances within videos. We introduce a novel multi-modal VAD architecture, referred to as the Joint Actor-centric Visual, Audio, Language (JoVALE) JoVALE is the first VAD method to integrate audio and visual features with scene descriptive context sourced from large-capacity image captioning models.
arXiv Detail & Related papers (2024-12-18T10:51:31Z)
Video Task Decathlon: Unifying Image and Video Tasks in Autonomous Driving [85.62076860189116]
Video Task Decathlon (VTD) includes ten representative image and video tasks spanning classification, segmentation, localization, and association of objects and pixels. We develop our unified network, VTDNet, that uses a single structure and a single set of weights for all ten tasks.
arXiv Detail & Related papers (2023-09-08T16:33:27Z)
Text-to-feature diffusion for audio-visual few-shot learning [59.45164042078649]
Few-shot learning from video data is a challenging and underexplored, yet much cheaper, setup. We introduce a unified audio-visual few-shot video classification benchmark on three datasets. We show that AV-DIFF obtains state-of-the-art performance on our proposed benchmark for audio-visual few-shot learning.
arXiv Detail & Related papers (2023-09-07T17:30:36Z)
Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline [53.07236039168652]
We focus on the task of dense-localizing audio-visual events, which aims to jointly localize and recognize all audio-visual events occurring in an untrimmed video. We introduce the first Untrimmed Audio-Visual dataset, which contains 10K untrimmed videos with over 30K audio-visual events. Next, we formulate the task using a new learning-based framework, which is capable of fully integrating audio and visual modalities to localize audio-visual events with various lengths and capture dependencies between them in a single pass.
arXiv Detail & Related papers (2023-03-22T22:00:17Z)
MINOTAUR: Multi-task Video Grounding From Multimodal Queries [70.08973664126873]
We present a single, unified model for tackling query-based video understanding in long-form videos. In particular, our model can address all three tasks of the Ego4D Episodic Memory benchmark.
arXiv Detail & Related papers (2023-02-16T04:00:03Z)
Learnable Irrelevant Modality Dropout for Multimodal Action Recognition on Modality-Specific Annotated Videos [10.478479158063982]
We propose a novel framework to effectively leverage the audio modality in vision-specific annotated videos for action recognition. We build a semantic audio-video label dictionary (SAVLD) that maps each video label to its most K-relevant audio labels. We also present a new two-stream video Transformer for efficiently modeling the visual modalities.
arXiv Detail & Related papers (2022-03-06T17:31:06Z)
Audio-Visual Fusion Layers for Event Type Aware Video Recognition [86.22811405685681]
We propose a new model to address the multisensory integration problem with individual event-specific layers in a multi-task learning scheme. We show that our network is formulated with single labels, but it can output additional true multi-labels to represent the given videos.
arXiv Detail & Related papers (2022-02-12T02:56:22Z)

This list is automatically generated from the titles and abstracts of the papers in this site.