Visually-aware Acoustic Event Detection using Heterogeneous Graphs
- URL: http://arxiv.org/abs/2207.07935v1
- Date: Sat, 16 Jul 2022 13:09:25 GMT
- Title: Visually-aware Acoustic Event Detection using Heterogeneous Graphs
- Authors: Amir Shirian, Krishna Somandepalli, Victor Sanchez, Tanaya Guha
- Abstract summary: Perception of auditory events is inherently multimodal, relying on both audio and visual cues.
We employ heterogeneous graphs to capture the spatial and temporal relationships between the modalities.
We show efficient modelling of intra- and inter-modality relationships at both spatial and temporal scales.
- Score: 39.90352230010103
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Perception of auditory events is inherently multimodal, relying on both
audio and visual cues. A large number of existing multimodal approaches process
each modality using modality-specific models and then fuse the embeddings to
encode the joint information. In contrast, we employ heterogeneous graphs to
explicitly capture the spatial and temporal relationships between the
modalities and represent detailed information about the underlying signal. We
use heterogeneous graphs to address the task of visually-aware acoustic event
classification, as they offer a compact, efficient and scalable way to
represent data. Through heterogeneous graphs, we show efficient modelling of
intra- and inter-modality relationships at both spatial and temporal scales.
Our model can easily be adapted to different scales of events through relevant
hyperparameters. Experiments on AudioSet, a large benchmark, show that our
model achieves state-of-the-art performance.
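To make the construction concrete, below is a minimal sketch of how an audiovisual clip could be encoded as a heterogeneous graph, assuming PyTorch Geometric's HeteroData API; the feature dimensions, node types and edge rules are illustrative assumptions, not the authors' implementation.

```python
import torch
from torch_geometric.data import HeteroData

num_frames = 10            # one node per time step and modality (assumption)
audio_dim, video_dim = 128, 512

data = HeteroData()
data['audio'].x = torch.randn(num_frames, audio_dim)  # e.g. log-mel embeddings
data['video'].x = torch.randn(num_frames, video_dim)  # e.g. CNN frame features

# Intra-modality temporal edges: connect consecutive time steps.
src = torch.arange(num_frames - 1)
temporal = torch.stack([src, src + 1])
data['audio', 'temporal', 'audio'].edge_index = temporal
data['video', 'temporal', 'video'].edge_index = temporal

# Inter-modality edges: link co-occurring audio and video nodes.
idx = torch.arange(num_frames)
data['audio', 'co_occurs', 'video'].edge_index = torch.stack([idx, idx])

print(data)
```

Each modality contributes its own node type; temporal edges capture within-stream dynamics, while the co-occurrence edges let information flow between modalities.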
Related papers
- Graph-Dictionary Signal Model for Sparse Representations of Multivariate Data [49.77103348208835]
We define a novel Graph-Dictionary signal model, where a finite set of graphs characterizes relationships in data distribution through a weighted sum of their Laplacians.
We propose a framework to infer the graph dictionary representation from observed data, along with a bilinear generalization of the primal-dual splitting algorithm to solve the learning problem.
We exploit graph-dictionary representations in a motor imagery decoding task on brain activity data, where we classify imagined motion better than standard methods.
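The underlying signal model, a graph written as a weighted sum of dictionary Laplacians L = sum_k w_k L_k, can be sketched in a few lines; the dictionary atoms and weights below are toy values rather than quantities inferred from data.

```python
import numpy as np

def laplacian(adj: np.ndarray) -> np.ndarray:
    """Combinatorial graph Laplacian L = D - A."""
    return np.diag(adj.sum(axis=1)) - adj

rng = np.random.default_rng(0)
n, k = 5, 3

# Dictionary of k Laplacians built from random symmetric adjacencies.
atoms = []
for _ in range(k):
    upper = np.triu(rng.integers(0, 2, size=(n, n)), 1)
    atoms.append(laplacian(upper + upper.T))

weights = rng.dirichlet(np.ones(k))   # nonnegative weights summing to one
L = sum(w * atom for w, atom in zip(weights, atoms))
print(L)
```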
arXiv Detail & Related papers (2024-11-08T17:40:43Z)
- TimeGraphs: Graph-based Temporal Reasoning [64.18083371645956]
TimeGraphs is a novel approach that characterizes dynamic interactions as a hierarchical temporal graph.
Our approach models the interactions using a compact graph-based representation, enabling adaptive reasoning across diverse time scales.
We evaluate TimeGraphs on multiple datasets with complex, dynamic agent interactions, including a football simulator, the Resistance game, and the MOMA human activity dataset.
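As a rough, purely illustrative sketch of hierarchical temporal reasoning, the snippet below pools per-time-step node features into coarser levels; the averaging rule is a hypothetical stand-in for TimeGraphs' actual graph construction.

```python
import torch

snapshots = [torch.randn(6, 16) for _ in range(8)]  # node features per step

def pool_pairs(levels):
    """Merge adjacent time steps by averaging their node features."""
    return [(levels[i] + levels[i + 1]) / 2
            for i in range(0, len(levels) - 1, 2)]

hierarchy = [snapshots]
while len(hierarchy[-1]) > 1:
    hierarchy.append(pool_pairs(hierarchy[-1]))

print([len(level) for level in hierarchy])  # e.g. [8, 4, 2, 1]
```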
arXiv Detail & Related papers (2024-01-06T06:26:49Z)
- Unified and Dynamic Graph for Temporal Character Grouping in Long Videos [31.192044026127032]
Video temporal character grouping locates appearing moments of major characters within a video according to their identities.
Recent works have evolved from unsupervised clustering to graph-based supervised clustering.
We present a unified and dynamic graph (UniDG) framework for temporal character grouping.
arXiv Detail & Related papers (2023-08-27T13:22:55Z)
- Heterogeneous Graph Learning for Acoustic Event Classification [22.526665796655205]
Graphs for audiovisual data are constructed manually which is difficult and sub-optimal.
We develop a new model, heterogeneous graph crossmodal network (HGCN) that learns the crossmodal edges.
Our proposed model can adapt to various spatial and temporal scales owing to its parametric construction, while the learnable crossmodal edges effectively connect the relevant nodes.
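What a learnable crossmodal edge might look like can be sketched as follows: a small MLP scores each audio-video node pair, and the sigmoid of the score acts as a soft edge weight. The module and its architecture are assumptions for illustration, not HGCN's actual design.

```python
import torch
import torch.nn as nn

class CrossModalEdge(nn.Module):
    """Scores every audio-video node pair to produce soft edge weights."""
    def __init__(self, a_dim: int, v_dim: int, hidden: int = 64):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(a_dim + v_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, audio: torch.Tensor, video: torch.Tensor) -> torch.Tensor:
        # audio: [Na, a_dim], video: [Nv, v_dim] -> weights: [Na, Nv]
        pairs = torch.cat([
            audio.unsqueeze(1).expand(-1, video.size(0), -1),
            video.unsqueeze(0).expand(audio.size(0), -1, -1)], dim=-1)
        return torch.sigmoid(self.scorer(pairs)).squeeze(-1)

edges = CrossModalEdge(128, 512)(torch.randn(10, 128), torch.randn(10, 512))
print(edges.shape)  # torch.Size([10, 10])
```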
arXiv Detail & Related papers (2023-03-05T13:06:53Z)
- DyTed: Disentangled Representation Learning for Discrete-time Dynamic Graph [59.583555454424]
We propose a novel disenTangled representation learning framework for discrete-time Dynamic graphs, namely DyTed.
We specially design a temporal-clips contrastive learning task together with a structure contrastive learning to effectively identify the time-invariant and time-varying representations respectively.
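For intuition, here is a standard InfoNCE-style contrastive loss of the kind such clip-level objectives build on; DyTed's actual temporal-clips and structure losses differ in their details.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, temperature=0.1):
    """anchor, positive: [B, D]; row i of each forms the positive pair."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.t() / temperature   # [B, B] cosine-similarity matrix
    labels = torch.arange(a.size(0))   # diagonal entries are the positives
    return F.cross_entropy(logits, labels)

loss = info_nce(torch.randn(8, 32), torch.randn(8, 32))
print(loss.item())
```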
arXiv Detail & Related papers (2022-10-19T14:34:12Z)
- Representing Videos as Discriminative Sub-graphs for Action Recognition [165.54738402505194]
We introduce a new design of sub-graphs to represent and encode the discriminative patterns of each action in the videos.
We present the MUlti-scale Sub-graph LEarning (MUSLE) framework that builds space-time graphs and clusters them into compact sub-graphs on each scale.
arXiv Detail & Related papers (2022-01-11T16:15:25Z)
- Learning Spatial-Temporal Graphs for Active Speaker Detection [26.45877018368872]
SPELL is a framework that learns long-range multimodal graphs to encode the inter-modal relationship between audio and visual data.
We first construct a graph from a video so that each node corresponds to one person.
We demonstrate that learning graph-based representation, owing to its explicit spatial and temporal structure, significantly improves the overall performance.
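A minimal sketch of this construction, with the exact edge rules assumed for illustration: each person detection becomes a node, same-identity detections in nearby frames receive temporal edges, and detections sharing a frame receive spatial edges.

```python
from itertools import combinations

# (frame, person_id) for a toy set of detections
detections = [(0, 'A'), (0, 'B'), (1, 'A'), (1, 'B'), (2, 'A')]

edges = set()
for (i, (fi, pi)), (j, (fj, pj)) in combinations(enumerate(detections), 2):
    if pi == pj and abs(fi - fj) <= 1:
        edges.add((i, j))    # temporal edge: same person, adjacent frames
    if fi == fj:
        edges.add((i, j))    # spatial edge: same frame

print(sorted(edges))
```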
arXiv Detail & Related papers (2021-12-02T18:29:07Z)
- Attention Bottlenecks for Multimodal Fusion [90.75885715478054]
Machine perception models are typically modality-specific and optimised for unimodal benchmarks.
We introduce a novel transformer based architecture that uses 'fusion bottlenecks' for modality fusion at multiple layers.
We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks.
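The bottleneck idea can be sketched with off-the-shelf transformer layers: a handful of shared tokens form the only channel through which the audio and video token streams exchange information. Layer sizes and the averaging update below are illustrative assumptions.

```python
import torch
import torch.nn as nn

d, n_btl = 64, 4
enc_audio = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
enc_video = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)

audio = torch.randn(1, 20, d)       # audio tokens
video = torch.randn(1, 30, d)       # video tokens
btl = torch.randn(1, n_btl, d)      # shared bottleneck tokens

# Each stream is encoded together with the bottleneck tokens only; the
# streams never attend to each other directly.
out_a = enc_audio(torch.cat([audio, btl], dim=1))
out_v = enc_video(torch.cat([video, btl], dim=1))
audio, btl_a = out_a[:, :20], out_a[:, 20:]
video, btl_v = out_v[:, :30], out_v[:, 30:]
btl = (btl_a + btl_v) / 2           # reconcile the two bottleneck updates
print(audio.shape, video.shape, btl.shape)
```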
arXiv Detail & Related papers (2021-06-30T22:44:12Z)
- Graph Pattern Loss based Diversified Attention Network for Cross-Modal Retrieval [10.420129873840578]
Cross-modal retrieval aims to enable flexible retrieval experience by combining multimedia data such as image, video, text, and audio.
A core aim of unsupervised approaches is to mine the correlations among different object representations to achieve satisfactory retrieval performance without requiring expensive labels.
We propose a Graph Pattern Loss based Diversified Attention Network (GPLDAN) for unsupervised cross-modal retrieval.
arXiv Detail & Related papers (2021-06-25T10:53:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.