MM-Pyramid: Multimodal Pyramid Attentional Network for Audio-Visual
Event Localization and Video Parsing
- URL: http://arxiv.org/abs/2111.12374v1
- Date: Wed, 24 Nov 2021 09:47:26 GMT
- Title: MM-Pyramid: Multimodal Pyramid Attentional Network for Audio-Visual
Event Localization and Video Parsing
- Authors: Jiashuo Yu, Ying Cheng, Rui-Wei Zhao, Rui Feng, Yuejie Zhang
- Abstract summary: We present a Multimodal Pyramid Attentional Network (MM-Pyramid) that captures and integrates multi-level temporal features for audio-visual event localization and audio-visual video parsing.
We also design an adaptive semantic fusion module, which leverages a unit-level attention block and a selective fusion block to integrate pyramid features interactively.
- Score: 7.977954561853929
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recognizing and localizing events in videos is a fundamental task for video
understanding. Since events may occur in auditory and visual modalities,
multimodal detailed perception is essential for complete scene comprehension.
Most previous works attempted to analyze videos from a holistic perspective.
However, they do not consider semantic information at multiple scales, which
makes it difficult for the model to localize events of various lengths. In this paper,
we present a Multimodal Pyramid Attentional Network (MM-Pyramid) that captures
and integrates multi-level temporal features for audio-visual event
localization and audio-visual video parsing. Specifically, we first propose the
attentive feature pyramid module. This module captures temporal pyramid
features via several stacked pyramid units, each of which is composed of a
fixed-size attention block and a dilated convolution block. We also design an
adaptive semantic fusion module, which leverages a unit-level attention block
and a selective fusion block to integrate pyramid features interactively.
Extensive experiments on audio-visual event localization and weakly-supervised
audio-visual video parsing tasks verify the effectiveness of our approach.
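To make the pyramid idea concrete, below is a minimal PyTorch-style sketch of the two components named in the abstract: a pyramid unit that pairs a dilated temporal convolution with fixed-size (windowed) self-attention, and a unit-level attention that selectively fuses the per-level outputs. All class names, the window size, and the dilation schedule are illustrative assumptions; this is not the authors' implementation.

```python
# Hypothetical sketch of the attentive feature pyramid idea (not the MM-Pyramid release code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class PyramidUnit(nn.Module):
    """One pyramid level (assumed design): dilated Conv1d + fixed-size windowed self-attention."""

    def __init__(self, dim, dilation, window=8, heads=4):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=3,
                              padding=dilation, dilation=dilation)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.window = window

    def forward(self, x):                                  # x: (B, T, D)
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)   # dilated temporal conv, keeps length T
        B, T, D = h.shape
        pad = (-T) % self.window                           # pad so T splits into fixed-size windows
        h = F.pad(h, (0, 0, 0, pad))
        w = h.reshape(-1, self.window, D)                  # attention within each window only
        w, _ = self.attn(w, w, w)
        h = w.reshape(B, T + pad, D)[:, :T]
        return self.norm(x + h)                            # residual + layer norm


class AttentiveFeaturePyramid(nn.Module):
    """Stack pyramid units with growing dilation, then fuse levels with unit-level attention."""

    def __init__(self, dim, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.units = nn.ModuleList(PyramidUnit(dim, d) for d in dilations)
        self.level_score = nn.Linear(dim, 1)               # scores each pyramid level per time step

    def forward(self, x):                                  # x: (B, T, D) segment-level features
        levels, h = [], x
        for unit in self.units:
            h = unit(h)
            levels.append(h)
        stack = torch.stack(levels, dim=2)                 # (B, T, L, D)
        weights = torch.softmax(self.level_score(stack), dim=2)
        return (weights * stack).sum(dim=2)                # selectively fused multi-level features


if __name__ == "__main__":
    feats = torch.randn(2, 40, 256)                        # e.g. 40 one-second segments, 256-d features
    fused = AttentiveFeaturePyramid(256)(feats)
    print(fused.shape)                                     # torch.Size([2, 40, 256])
```

The actual model operates on separate audio and visual streams and adds a selective fusion block; the sketch only illustrates the multi-scale temporal processing on a single feature stream.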
Related papers
- Locality-aware Cross-modal Correspondence Learning for Dense Audio-Visual Events Localization [50.122441710500055]
Dense-localization Audio-Visual Events (DAVE) aims to identify time boundaries and corresponding categories for events that can be heard and seen concurrently in an untrimmed video.
Existing methods typically encode audio and visual representation separately without any explicit cross-modal alignment constraint.
We present LOCO, a Locality-aware cross-modal Correspondence learning framework for DAVE.
arXiv Detail & Related papers (2024-09-12T11:54:25Z) - Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding [33.85362137961572]
We introduce PU-VALOR, a comprehensive audio-visual dataset comprising over 114,000 pseudo-untrimmed videos with detailed temporal annotations.
PU-VALOR is derived from the large-scale but coarse-annotated audio-visual dataset VALOR, through a subtle method involving event-based video clustering.
We develop AVicuna, a model capable of aligning audio-visual events with temporal intervals and corresponding text tokens.
arXiv Detail & Related papers (2024-03-24T19:50:49Z) - Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities [67.89368528234394]
One of the main challenges of multimodal learning is the need to combine heterogeneous modalities.
Video and audio are obtained at much higher rates than text and are roughly aligned in time.
Our approach achieves the state-of-the-art on well established multimodal benchmarks, outperforming much larger models.
arXiv Detail & Related papers (2023-11-09T19:15:12Z) - Fine-grained Audio-Visual Joint Representations for Multimodal Large
Language Models [25.660343393359565]
This paper proposes a fine-grained audio-visual joint representation (FAVOR) learning framework for multimodal large language models (LLMs).
FAVOR simultaneously perceives speech and audio events in the audio input stream and images or videos in the visual input stream at the frame level.
An interactive demo of FAVOR is available at https://github.com/BriansIDP/AudioVisualLLM.git, and the training code and model checkpoints will be released soon.
arXiv Detail & Related papers (2023-10-09T17:00:20Z) - Multi-Scale Attention for Audio Question Answering [9.254814692650523]
Audio question answering (AQA) acts as a widely used proxy task for exploring scene understanding.
Existing methods mostly extend the structures of the visual question answering task to audio ones in a simple pattern.
We present a Multi-scale Window Attention Fusion Model (MWAFM) consisting of an asynchronous hybrid attention module and a multi-scale window attention module.
arXiv Detail & Related papers (2023-05-29T10:06:58Z) - Efficient End-to-End Video Question Answering with Pyramidal Multimodal
Transformer [13.71165050314854]
We present a new method for end-to-end Video Question Answering (VideoQA).
We achieve this with a pyramidal multimodal transformer (PMT) model, which simply incorporates a learnable word embedding layer.
We demonstrate better or on-par performance with high computational efficiency against state-of-the-art methods on five VideoQA benchmarks.
arXiv Detail & Related papers (2023-02-04T09:14:18Z) - Temporal Pyramid Transformer with Multimodal Interaction for Video
Question Answering [13.805714443766236]
Video question answering (VideoQA) is challenging given its multimodal combination of visual understanding and natural language understanding.
This paper proposes a novel Temporal Pyramid Transformer (TPT) model with multimodal interaction for VideoQA.
arXiv Detail & Related papers (2021-09-10T08:31:58Z) - Spoken Moments: Learning Joint Audio-Visual Representations from Video
Descriptions [75.77044856100349]
We present the Spoken Moments dataset of 500k spoken captions each attributed to a unique short video depicting a broad range of different events.
We show that our AMM approach consistently improves our results and that models trained on our Spoken Moments dataset generalize better than those trained on other video-caption datasets.
arXiv Detail & Related papers (2021-05-10T16:30:46Z) - Unified Multisensory Perception: Weakly-Supervised Audio-Visual Video
Parsing [48.87278703876147]
A new problem, named audio-visual video parsing, aims to parse a video into temporal event segments and label them as audible, visible, or both.
We propose a novel hybrid attention network to explore unimodal and cross-modal temporal contexts simultaneously.
Experimental results show that the challenging audio-visual video parsing can be achieved even with only video-level weak labels.
arXiv Detail & Related papers (2020-07-21T01:53:31Z) - Dynamic Graph Representation Learning for Video Dialog via Multi-Modal
Shuffled Transformers [89.00926092864368]
We present a semantics-controlled multi-modal shuffled Transformer reasoning framework for the audio-visual scene aware dialog task.
We also present a novel dynamic scene graph representation learning pipeline that consists of an intra-frame reasoning layer producing semantic graph representations for every frame.
Our results demonstrate state-of-the-art performances on all evaluation metrics.
arXiv Detail & Related papers (2020-07-08T02:00:22Z) - HERO: Hierarchical Encoder for Video+Language Omni-representation
Pre-training [75.55823420847759]
We present HERO, a novel framework for large-scale video+language omni-representation learning.
HERO encodes multimodal inputs in a hierarchical structure, where local context of a video frame is captured by a Cross-modal Transformer.
HERO is jointly trained on HowTo100M and large-scale TV datasets to gain deep understanding of complex social dynamics with multi-character interactions.
arXiv Detail & Related papers (2020-05-01T03:49:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.