Audio-Visual Fusion Layers for Event Type Aware Video Recognition
- URL: http://arxiv.org/abs/2202.05961v1
- Date: Sat, 12 Feb 2022 02:56:22 GMT
- Title: Audio-Visual Fusion Layers for Event Type Aware Video Recognition
- Authors: Arda Senocak, Junsik Kim, Tae-Hyun Oh, Hyeonggon Ryu, Dingzeyu Li, In So Kweon
- Abstract summary: We propose a new model to address the multisensory integration problem with individual event-specific layers in a multi-task learning scheme.
We show that, although our network is formulated with single labels, it can output additional true multi-labels to represent the given videos.
- Score: 86.22811405685681
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The human brain is continuously inundated with multisensory information and its complex interactions coming from the outside world at any given moment. Such information is automatically analyzed by binding or segregating it in our brain. While this task might seem effortless for human brains, it is extremely challenging to build a machine that can perform similar tasks, since complex interactions cannot be handled by a single type of integration and instead require more sophisticated approaches. In this paper, we propose a new model to address the multisensory integration problem with individual event-specific layers in a multi-task learning scheme. Unlike previous works where a single type of fusion is used, we design event-specific layers to deal with different audio-visual relationship tasks, enabling different ways of forming audio-visual representations. Experimental results show that our event-specific layers can discover unique properties of the audio-visual relationships in the videos. Moreover, although our network is formulated with single labels, it can output additional true multi-labels to represent the given videos. We demonstrate that our proposed framework also exposes the modality bias of the video data in a category-wise and dataset-wise manner on popular benchmark datasets.
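To make the idea of event-specific fusion layers concrete, below is a minimal sketch of how such layers could sit on top of shared audio and visual features in a multi-task setup. It is an illustration only: the module names, feature dimensions, and the two example event-type heads are assumptions, not the authors' released implementation.

```python
# Minimal sketch of event-specific audio-visual fusion layers in a
# multi-task setup. Dimensions, number of event types, and the specific
# fusion choices are illustrative assumptions, not the paper's exact design.
import torch
import torch.nn as nn


class EventSpecificFusion(nn.Module):
    """One fusion layer per event type, each free to combine audio and
    visual features in its own way (e.g. concatenation vs. gating)."""

    def __init__(self, dim: int = 512, num_classes: int = 28):
        super().__init__()
        # Shared projections of the unimodal backbone features.
        self.audio_proj = nn.Linear(128, dim)
        self.visual_proj = nn.Linear(2048, dim)
        # Event-specific fusion layers: each learns a different
        # audio-visual relationship and feeds its own classifier head.
        self.concat_fusion = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
        self.gated_fusion = nn.Linear(dim, dim)  # audio gates the visual stream
        self.heads = nn.ModuleDict({
            "audio_visual": nn.Linear(dim, num_classes),
            "visual_dominant": nn.Linear(dim, num_classes),
        })

    def forward(self, audio_feat, visual_feat):
        a = self.audio_proj(audio_feat)    # (B, dim)
        v = self.visual_proj(visual_feat)  # (B, dim)
        fused_av = self.concat_fusion(torch.cat([a, v], dim=-1))
        fused_vd = v * torch.sigmoid(self.gated_fusion(a))
        # Multi-task outputs: one prediction per event-specific layer.
        return {
            "audio_visual": self.heads["audio_visual"](fused_av),
            "visual_dominant": self.heads["visual_dominant"](fused_vd),
        }


model = EventSpecificFusion()
logits = model(torch.randn(4, 128), torch.randn(4, 2048))
print({name: out.shape for name, out in logits.items()})
```

Training in this sketch would still use only the single provided label per video, while at inference the per-head predictions can be read out jointly as the additional multi-labels the abstract describes.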
Related papers
- UniAV: Unified Audio-Visual Perception for Multi-Task Video Event Localization [83.89550658314741]
Video localization tasks aim to temporally locate specific instances in videos, including temporal action localization (TAL), sound event detection (SED), and audio-visual event localization (AVEL).
We present UniAV, a Unified Audio-Visual perception network, to achieve joint learning of TAL, SED and AVEL tasks for the first time.
arXiv Detail & Related papers (2024-04-04T03:28:57Z)
- Multimodal Group Emotion Recognition In-the-wild Using Privacy-Compliant Features [0.0]
Group-level emotion recognition can be useful in many fields, including social robotics, conversational agents, e-coaching, and learning analytics.
This paper explores privacy-compliant group-level emotion recognition "in-the-wild" within the EmotiW Challenge 2023.
arXiv Detail & Related papers (2023-12-06T08:58:11Z)
- MINOTAUR: Multi-task Video Grounding From Multimodal Queries [70.08973664126873]
We present a single, unified model for tackling query-based video understanding in long-form videos.
In particular, our model can address all three tasks of the Ego4D Episodic Memory benchmark.
arXiv Detail & Related papers (2023-02-16T04:00:03Z)
- Multi-level Attention Fusion Network for Audio-visual Event Recognition [6.767885381740951]
Event classification is inherently sequential and multimodal.
Deep neural models need to dynamically focus on the most relevant time window and/or modality of a video.
We propose the Multi-level Attention Fusion network (MAFnet), an architecture that can dynamically fuse visual and audio information for event recognition (see the attention-fusion sketch after this list).
arXiv Detail & Related papers (2021-06-12T10:24:52Z)
- Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning [17.6311804187027]
An underlying correlation between audio and visual events can be utilized as free supervised information to train a neural network.
We propose a novel self-supervised framework with a co-attention mechanism to learn generic cross-modal representations from unlabelled videos.
Experiments show that our model achieves state-of-the-art performance on the pretext task while having fewer parameters than existing methods.
arXiv Detail & Related papers (2020-08-13T10:08:12Z)
- Unified Multisensory Perception: Weakly-Supervised Audio-Visual Video Parsing [48.87278703876147]
A new problem, named audio-visual video parsing, aims to parse a video into temporal event segments and label them as audible, visible, or both.
We propose a novel hybrid attention network to explore unimodal and cross-modal temporal contexts simultaneously.
Experimental results show that the challenging audio-visual video parsing task can be tackled even with only video-level weak labels.
arXiv Detail & Related papers (2020-07-21T01:53:31Z)
- Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers [89.00926092864368]
We present a semantics-controlled multi-modal shuffled Transformer reasoning framework for the audio-visual scene-aware dialog task.
We also present a novel dynamic scene graph representation learning pipeline that consists of an intra-frame reasoning layer producing semantic graph representations for every frame.
Our results demonstrate state-of-the-art performance on all evaluation metrics.
arXiv Detail & Related papers (2020-07-08T02:00:22Z)
- Self-Supervised MultiModal Versatile Networks [76.19886740072808]
We learn representations using self-supervision by leveraging three modalities naturally present in videos: visual, audio, and language streams.
We demonstrate how such networks trained on large collections of unlabelled video data can be applied to video, video-text, image, and audio tasks.
arXiv Detail & Related papers (2020-06-29T17:50:23Z)
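Several of the related papers above (for example MAFnet and the hybrid attention network for audio-visual video parsing) rely on attention over time and across modalities. The sketch below is a generic illustration of that pattern, not a reimplementation of any cited method; the cross-attention layout, shapes, and the learned modality gate are all assumptions.

```python
# Generic sketch of attention-based audio-visual fusion over time, in the
# spirit of the attention/co-attention approaches listed above; all names
# and dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class TemporalAVAttentionFusion(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 4, num_classes: int = 28):
        super().__init__()
        # Cross-modal attention: each modality attends to the other over time.
        self.a2v = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Modality weighting: a learned gate decides which stream to trust.
        self.gate = nn.Sequential(nn.Linear(2 * dim, 2), nn.Softmax(dim=-1))
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, audio, visual):
        # audio, visual: (B, T, dim) sequences of per-segment features.
        v_ctx, _ = self.a2v(query=visual, key=audio, value=audio)
        a_ctx, _ = self.v2a(query=audio, key=visual, value=visual)
        pooled = torch.cat([a_ctx.mean(dim=1), v_ctx.mean(dim=1)], dim=-1)
        w = self.gate(pooled)  # (B, 2) per-video modality weights
        fused = w[:, :1] * a_ctx.mean(dim=1) + w[:, 1:] * v_ctx.mean(dim=1)
        return self.classifier(fused)


fusion = TemporalAVAttentionFusion()
out = fusion(torch.randn(2, 10, 256), torch.randn(2, 10, 256))
print(out.shape)  # torch.Size([2, 28])
```

In the actual papers, the attention direction, pooling, and modality weighting are each tuned to the task (event recognition, parsing, or self-supervised pretraining); this sketch only shows the shared high-level structure.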