Leveraging the Video-level Semantic Consistency of Event for
Audio-visual Event Localization
- URL: http://arxiv.org/abs/2210.05242v2
- Date: Fri, 20 Oct 2023 08:48:11 GMT
- Title: Leveraging the Video-level Semantic Consistency of Event for
Audio-visual Event Localization
- Authors: Yuanyuan Jiang, Jianqin Yin, Yonghao Dang
- Abstract summary: We propose a novel video-level semantic consistency guidance network for the AVE localization task.
It consists of two components: a cross-modal event representation extractor and an intra-modal semantic consistency enhancer.
We perform extensive experiments on the public AVE dataset and outperform the state-of-the-art methods in both fully- and weakly-supervised settings.
- Score: 8.530561069113716
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio-visual event (AVE) localization has attracted much attention in recent
years. Most existing methods are often limited to independently encoding and
classifying each video segment separated from the full video (which can be
regarded as the segment-level representations of events). However, they ignore
the semantic consistency of the event within the same full video (which can be
considered as the video-level representations of events). In contrast to
existing methods, we propose a novel video-level semantic consistency guidance
network for the AVE localization task. Specifically, we propose an event
semantic consistency modeling (ESCM) module to explore video-level semantic
information for semantic consistency modeling. It consists of two components: a
cross-modal event representation extractor (CERE) and an intra-modal semantic
consistency enhancer (ISCE). CERE is proposed to obtain the event semantic
information at the video level. Furthermore, ISCE takes video-level event
semantics as prior knowledge to guide the model to focus on the semantic
continuity of an event within each modality. Moreover, we propose a new
negative pair filter loss to encourage the network to filter out the irrelevant
segment pairs and a new smooth loss to further increase the gap between
different categories of events in the weakly-supervised setting. We perform
extensive experiments on the public AVE dataset and outperform the
state-of-the-art methods in both fully- and weakly-supervised settings, thus
verifying the effectiveness of our method. The code is available at
https://github.com/Bravo5542/VSCG.
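As a rough illustration of the core idea above, pooling a cross-modal, video-level event representation (the CERE role) and using it to guide segment features within each modality (the ISCE role), here is a minimal PyTorch-style sketch. The module names, pooling, and gating choices are simplifications assumed for illustration only, not the released implementation at the repository above.

```python
# Minimal PyTorch-style sketch of video-level semantic guidance.
# Shapes, names, and the mean-pool/gating choices are illustrative assumptions,
# not the authors' implementation (see https://github.com/Bravo5542/VSCG for that).
import torch
import torch.nn as nn


class VideoLevelGuidance(nn.Module):
    """Pools a cross-modal, video-level event embedding (CERE-like step) and uses it
    to re-weight segment features within each modality (ISCE-like step)."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)   # fuse audio+visual per segment
        self.gate = nn.Linear(2 * dim, 1)     # relevance of a segment to the event

    def forward(self, audio: torch.Tensor, visual: torch.Tensor):
        # audio, visual: (batch, segments, dim)
        fused = torch.tanh(self.fuse(torch.cat([audio, visual], dim=-1)))
        event = fused.mean(dim=1, keepdim=True)             # video-level event semantics

        def guide(x):
            score = self.gate(torch.cat([x, event.expand_as(x)], dim=-1))
            return x * torch.sigmoid(score)                 # emphasize event-consistent segments

        return guide(audio), guide(visual), event.squeeze(1)


# Toy usage: 2 videos, 10 one-second segments, 256-d features per modality.
a, v = torch.randn(2, 10, 256), torch.randn(2, 10, 256)
ga, gv, event = VideoLevelGuidance()(a, v)
print(ga.shape, gv.shape, event.shape)  # (2, 10, 256) (2, 10, 256) (2, 256)
```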
Related papers
- Towards Open-Vocabulary Audio-Visual Event Localization [59.23161248808759]
We introduce the Open-Vocabulary Audio-Visual Event localization problem.
This problem requires localizing audio-visual events and predicting explicit categories for both seen and unseen data at inference.
We propose the OV-AVEBench dataset, comprising 24,800 videos across 67 real-life audio-visual scenes.
arXiv Detail & Related papers (2024-11-18T04:35:20Z)
- Generating Event-oriented Attribution for Movies via Two-Stage Prefix-Enhanced Multimodal LLM [47.786978666537436]
We propose a Two-Stage Prefix-Enhanced MLLM (TSPE) approach for event attribution in movie videos.
In the local stage, we introduce an interaction-aware prefix that guides the model to focus on the relevant multimodal information within a single clip.
In the global stage, we strengthen the connections between associated events using an inferential knowledge graph.
arXiv Detail & Related papers (2024-09-14T08:30:59Z)
- CACE-Net: Co-guidance Attention and Contrastive Enhancement for Effective Audio-Visual Event Localization [11.525177542345215]
We introduce CACE-Net, which differs from most existing methods that solely use audio signals to guide visual information.
We propose an audio-visual co-guidance attention mechanism that allows for adaptive bi-directional cross-modal attentional guidance.
Experiments on the AVE dataset demonstrate that CACE-Net sets a new benchmark in the audio-visual event localization task.
arXiv Detail & Related papers (2024-08-04T07:48:12Z)
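The following is a minimal sketch of bi-directional cross-modal attention in the spirit of the co-guidance idea in the CACE-Net entry above, where each modality attends to the other. The symmetric layout, residual connections, and layer sizes are assumptions for illustration rather than CACE-Net's actual architecture.

```python
# A minimal sketch of bi-directional (co-guidance-style) cross-modal attention.
import torch
import torch.nn as nn


class BiDirectionalCrossAttention(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)  # audio guides visual
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)  # visual guides audio

    def forward(self, audio: torch.Tensor, visual: torch.Tensor):
        # audio, visual: (batch, segments, dim)
        visual_guided, _ = self.a2v(query=visual, key=audio, value=audio)
        audio_guided, _ = self.v2a(query=audio, key=visual, value=visual)
        # Residual connections keep each modality's original information.
        return audio + audio_guided, visual + visual_guided


a, v = torch.randn(2, 10, 256), torch.randn(2, 10, 256)
a_out, v_out = BiDirectionalCrossAttention()(a, v)
print(a_out.shape, v_out.shape)  # (2, 10, 256) (2, 10, 256)
```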
- Label-anticipated Event Disentanglement for Audio-Visual Video Parsing [61.08434062821899]
We introduce a new decoding paradigm, label semantic-based projection (LEAP).
LEAP works by iteratively projecting encoded latent features of audio/visual segments onto semantically independent label embeddings.
To facilitate the LEAP paradigm, we propose a semantic-aware optimization strategy, which includes a novel audio-visual semantic similarity loss function.
arXiv Detail & Related papers (2024-07-11T01:57:08Z)
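A rough sketch of the general mechanism named in the LEAP entry above: projecting segment features onto label embeddings and comparing the resulting audio and visual label distributions. The projection and loss used by LEAP differ in detail, so the temperature, KL form, and label count below are illustrative assumptions.

```python
# Illustrative projection of segment features onto label embeddings, plus a
# simple audio-visual agreement loss; not LEAP's exact formulation.
import torch
import torch.nn.functional as F

num_labels, dim = 25, 256
label_emb = F.normalize(torch.randn(num_labels, dim), dim=-1)   # learnable in practice

def project_to_labels(segments: torch.Tensor) -> torch.Tensor:
    """segments: (batch, T, dim) -> per-segment distribution over labels."""
    sims = F.normalize(segments, dim=-1) @ label_emb.t()        # cosine similarities
    return F.softmax(sims / 0.07, dim=-1)                       # temperature-scaled

audio_feat, visual_feat = torch.randn(2, 10, dim), torch.randn(2, 10, dim)
p_audio, p_visual = project_to_labels(audio_feat), project_to_labels(visual_feat)

# Encourage audio and visual segments of the same moment to agree on label semantics.
similarity_loss = F.kl_div(p_audio.log(), p_visual, reduction="batchmean")
print(p_audio.shape, similarity_loss.item())                    # (2, 10, 25), scalar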
- Event-aware Video Corpus Moment Retrieval [79.48249428428802]
Video Corpus Moment Retrieval (VCMR) is a practical video retrieval task focused on identifying a specific moment within a vast corpus of untrimmed videos.
Existing methods for VCMR typically rely on frame-aware video retrieval, calculating similarities between the query and video frames to rank videos.
We propose EventFormer, a model that explicitly utilizes events within videos as fundamental units for video retrieval.
arXiv Detail & Related papers (2024-02-21T06:55:20Z)
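The toy sketch below illustrates the general idea from the EventFormer entry above of scoring events (groups of contiguous frames) rather than individual frames against a query. The fixed-length grouping and cosine scoring are placeholders, not EventFormer's event reasoning.

```python
# Treat contiguous frame groups as "events" and rank videos by their best event.
import torch
import torch.nn.functional as F

def event_level_scores(frame_feats: torch.Tensor, query: torch.Tensor, event_len: int = 4):
    """frame_feats: (num_frames, dim); query: (dim,). Returns one score per event."""
    num_events = frame_feats.shape[0] // event_len
    events = frame_feats[: num_events * event_len].reshape(num_events, event_len, -1).mean(1)
    return F.cosine_similarity(events, query.unsqueeze(0), dim=-1)

# Rank a toy corpus of 3 untrimmed videos by their best-matching event.
query = torch.randn(256)
corpus = [torch.randn(32, 256), torch.randn(20, 256), torch.randn(48, 256)]
ranking = sorted(range(len(corpus)),
                 key=lambda i: event_level_scores(corpus[i], query).max().item(),
                 reverse=True)
print(ranking)  # video indices ordered by event-level relevance to the query
```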
- Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos [57.830865926459914]
We propose a vision-language learning framework for untrimmed videos, which automatically detects informative events.
Instead of coarse-level video-language alignments, we present two dual pretext tasks to encourage fine-grained segment-level alignments.
Our framework is easily extended to tasks covering visually-grounded language understanding and generation.
arXiv Detail & Related papers (2023-03-11T11:00:16Z)
- Multi-Modulation Network for Audio-Visual Event Localization [138.14529518908736]
We study the problem of localizing audio-visual events that are both audible and visible in a video.
Existing works focus on encoding and aligning audio and visual features at the segment level.
We propose a novel MultiModulation Network (M2N) to learn the above correlation and leverage it as semantic guidance.
arXiv Detail & Related papers (2021-08-26T13:11:48Z)
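As one simple way to read "modulation as semantic guidance" in the M2N entry above, the sketch below applies a FiLM-style scale-and-shift conditioned on a pooled guidance vector. M2N's actual multi-modulation scheme is more elaborate, so treat this only as an assumed illustration.

```python
# Modulating segment features with a semantic guidance vector (FiLM-style sketch).
import torch
import torch.nn as nn

class SemanticModulation(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.to_scale = nn.Linear(dim, dim)
        self.to_shift = nn.Linear(dim, dim)

    def forward(self, segments: torch.Tensor, guidance: torch.Tensor):
        # segments: (batch, T, dim); guidance: (batch, dim), e.g. pooled video-level semantics.
        scale = torch.sigmoid(self.to_scale(guidance)).unsqueeze(1)
        shift = self.to_shift(guidance).unsqueeze(1)
        return segments * scale + shift

segments, guidance = torch.randn(2, 10, 256), torch.randn(2, 256)
print(SemanticModulation()(segments, guidance).shape)  # (2, 10, 256)
```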
- EAN: Event Adaptive Network for Enhanced Action Recognition [66.81780707955852]
We propose a unified action recognition framework to investigate the dynamic nature of video content.
First, when extracting local cues, we generate dynamic-scale spatio-temporal kernels to adaptively fit diverse events.
Second, to accurately aggregate these cues into a global video representation, we propose to mine the interactions only among a few selected foreground objects by a Transformer.
arXiv Detail & Related papers (2021-07-22T15:57:18Z)
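To make the "interactions among a few selected foreground objects" step in the EAN entry above concrete, here is a simplified sketch that keeps the top-k scored tokens and runs a small Transformer encoder over them. The scoring rule and sizes are assumptions, and the dynamic-scale kernels are omitted.

```python
# Selecting a few foreground tokens and modeling only their interactions.
import torch
import torch.nn as nn

class SelectedTokenInteraction(nn.Module):
    def __init__(self, dim: int = 256, k: int = 8):
        super().__init__()
        self.k = k
        self.score = nn.Linear(dim, 1)                        # crude foreground score
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, tokens: torch.Tensor):
        # tokens: (batch, N, dim) object/region tokens from a video clip.
        scores = self.score(tokens).squeeze(-1)               # (batch, N)
        idx = scores.topk(self.k, dim=1).indices              # keep top-k tokens only
        selected = torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))
        interacted = self.encoder(selected)                   # model interactions among them
        return interacted.mean(dim=1)                         # clip-level representation

print(SelectedTokenInteraction()(torch.randn(2, 50, 256)).shape)  # (2, 256)
```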
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.