Multimodal Class-aware Semantic Enhancement Network for Audio-Visual Video Parsing
- URL: http://arxiv.org/abs/2412.11248v2
- Date: Tue, 17 Dec 2024 07:31:27 GMT
- Title: Multimodal Class-aware Semantic Enhancement Network for Audio-Visual Video Parsing
- Authors: Pengcheng Zhao, Jinxing Zhou, Yang Zhao, Dan Guo, Yanxiang Chen,
- Abstract summary: Capturing accurate event semantics for each audio/visual segment is vital.
Each segment may contain multiple events, resulting in semantically mixed holistic features.
We propose a Fine-Grained Semantic Enhancement module for encoding intra- and cross-modal relations.
- Score: 22.655045848201528
- License:
- Abstract: The Audio-Visual Video Parsing task aims to recognize and temporally localize all events occurring in either the audio or visual stream, or both. Capturing accurate event semantics for each audio/visual segment is vital. Prior works directly utilize the extracted holistic audio and visual features for intra- and cross-modal temporal interactions. However, each segment may contain multiple events, resulting in semantically mixed holistic features that can lead to semantic interference during intra- or cross-modal interactions: the event semantics of one segment may incorporate semantics of unrelated events from other segments. To address this issue, our method begins with a Class-Aware Feature Decoupling (CAFD) module, which explicitly decouples the semantically mixed features into distinct class-wise features, including multiple event-specific features and a dedicated background feature. The decoupled class-wise features enable our model to selectively aggregate useful semantics for each segment from clearly matched classes contained in other segments, preventing semantic interference from irrelevant classes. Specifically, we further design a Fine-Grained Semantic Enhancement module for encoding intra- and cross-modal relations. It comprises a Segment-wise Event Co-occurrence Modeling (SECM) block and a Local-Global Semantic Fusion (LGSF) block. The SECM exploits inter-class dependencies of concurrent events within the same timestamp with the aid of a new event co-occurrence loss. The LGSF further enhances the event semantics of each segment by incorporating relevant semantics from more informative global video features. Extensive experiments validate the effectiveness of the proposed modules and loss functions, resulting in a new state-of-the-art parsing performance.
Related papers
- Dense Audio-Visual Event Localization under Cross-Modal Consistency and Multi-Temporal Granularity Collaboration [48.57159286673662]
This paper aims to advance audio-visual scene understanding for longer, untrimmed videos.
We introduce a novel CCNet, comprising two core modules: the Cross-Modal Consistency Collaboration and the Multi-Temporal Granularity Collaboration.
Experiments on the UnAV-100 dataset validate our module design, resulting in a new state-of-the-art performance in dense audio-visual event localization.
arXiv Detail & Related papers (2024-12-17T07:43:36Z) - CACE-Net: Co-guidance Attention and Contrastive Enhancement for Effective Audio-Visual Event Localization [11.525177542345215]
We introduce CACE-Net, which differs from most existing methods that solely use audio signals to guide visual information.
We propose an audio-visual co-guidance attention mechanism that allows for adaptive bi-directional cross-modal attentional guidance.
Experiments on the AVE dataset demonstrate that CACE-Net sets a new benchmark in the audio-visual event localization task.
arXiv Detail & Related papers (2024-08-04T07:48:12Z) - Label-anticipated Event Disentanglement for Audio-Visual Video Parsing [61.08434062821899]
We introduce a new decoding paradigm, underlinelabel sunderlineemunderlineantic-based underlineprojection (LEAP)
LEAP works by iteratively projecting encoded latent features of audio/visual segments onto semantically independent label embeddings.
To facilitate the LEAP paradigm, we propose a semantic-aware optimization strategy, which includes a novel audio-visual semantic similarity loss function.
arXiv Detail & Related papers (2024-07-11T01:57:08Z) - CM-PIE: Cross-modal perception for interactive-enhanced audio-visual
video parsing [23.85763377992709]
We propose a novel interactive-enhanced cross-modal perception method(CM-PIE), which can learn fine-grained features by applying a segment-based attention module.
We show that our model offers improved parsing performance on the Look, Listen, and Parse dataset.
arXiv Detail & Related papers (2023-10-11T14:15:25Z) - Segment Everything Everywhere All at Once [124.90835636901096]
We present SEEM, a promptable and interactive model for segmenting everything everywhere all at once in an image.
We propose a novel decoding mechanism that enables diverse prompting for all types of segmentation tasks.
We conduct a comprehensive empirical study to validate the effectiveness of SEEM across diverse segmentation tasks.
arXiv Detail & Related papers (2023-04-13T17:59:40Z) - Leveraging the Video-level Semantic Consistency of Event for
Audio-visual Event Localization [8.530561069113716]
We propose a novel video-level semantic consistency guidance network for the AVE localization task.
It consists of two components: a cross-modal event representation extractor and an intra-modal semantic consistency enhancer.
We perform extensive experiments on the public AVE dataset and outperform the state-of-the-art methods in both fully- and weakly-supervised settings.
arXiv Detail & Related papers (2022-10-11T08:15:57Z) - Multi-Modulation Network for Audio-Visual Event Localization [138.14529518908736]
We study the problem of localizing audio-visual events that are both audible and visible in a video.
Existing works focus on encoding and aligning audio and visual features at the segment level.
We propose a novel MultiModulation Network (M2N) to learn the above correlation and leverage it as semantic guidance.
arXiv Detail & Related papers (2021-08-26T13:11:48Z) - CTNet: Context-based Tandem Network for Semantic Segmentation [77.4337867789772]
This work proposes a novel Context-based Tandem Network (CTNet) by interactively exploring the spatial contextual information and the channel contextual information.
To further improve the performance of the learned representations for semantic segmentation, the results of the two context modules are adaptively integrated.
arXiv Detail & Related papers (2021-04-20T07:33:11Z) - Referring Image Segmentation via Cross-Modal Progressive Comprehension [94.70482302324704]
Referring image segmentation aims at segmenting the foreground masks of the entities that can well match the description given in the natural language expression.
Previous approaches tackle this problem using implicit feature interaction and fusion between visual and linguistic modalities.
We propose a Cross-Modal Progressive (CMPC) module and a Text-Guided Feature Exchange (TGFE) module to effectively address the challenging task.
arXiv Detail & Related papers (2020-10-01T16:02:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.