Visually-aware Acoustic Event Detection using Heterogeneous Graphs
- URL: http://arxiv.org/abs/2207.07935v1
- Date: Sat, 16 Jul 2022 13:09:25 GMT
- Title: Visually-aware Acoustic Event Detection using Heterogeneous Graphs
- Authors: Amir Shirian, Krishna Somandepalli, Victor Sanchez, Tanaya Guha
- Abstract summary: Perception of auditory events is inherently multimodal, relying on both audio and visual cues.
We employ heterogeneous graphs to capture the spatial and temporal relationships between the modalities.
We show efficient modelling of intra- and inter-modality relationships at both spatial and temporal scales.
- Score: 39.90352230010103
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Perception of auditory events is inherently multimodal, relying on both
audio and visual cues. A large number of existing multimodal approaches process
each modality using modality-specific models and then fuse the embeddings to
encode the joint information. In contrast, we employ heterogeneous graphs to
explicitly capture the spatial and temporal relationships between the
modalities and represent detailed information about the underlying signal. We
use heterogeneous graphs to address the task of visually-aware acoustic event
classification, as they offer a compact, efficient and scalable way to
represent data. Through heterogeneous graphs, we show efficient modelling of
intra- and inter-modality relationships at both spatial and temporal scales.
Our model can easily be adapted to different scales of events through relevant
hyperparameters. Experiments on AudioSet, a large benchmark, show that our
model achieves state-of-the-art performance.
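To make the construction concrete, below is a minimal sketch of how an audiovisual clip could be encoded as a heterogeneous graph, assuming PyTorch Geometric's HeteroData API; the feature dimensions, node types and edge rules are illustrative assumptions, not the authors' implementation.

```python
import torch
from torch_geometric.data import HeteroData

num_frames = 10            # one node per time step and modality (assumption)
audio_dim, video_dim = 128, 512

data = HeteroData()
data['audio'].x = torch.randn(num_frames, audio_dim)  # e.g. log-mel embeddings
data['video'].x = torch.randn(num_frames, video_dim)  # e.g. CNN frame features

# Intra-modality temporal edges: connect consecutive time steps.
src = torch.arange(num_frames - 1)
temporal = torch.stack([src, src + 1])
data['audio', 'temporal', 'audio'].edge_index = temporal
data['video', 'temporal', 'video'].edge_index = temporal

# Inter-modality edges: link co-occurring audio and video nodes.
idx = torch.arange(num_frames)
data['audio', 'co_occurs', 'video'].edge_index = torch.stack([idx, idx])

print(data)
```

Each modality contributes its own node type; temporal edges capture within-stream dynamics, while the co-occurrence edges let information flow between modalities.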
Related papers
- Graph-Dictionary Signal Model for Sparse Representations of Multivariate Data [49.77103348208835]
We define a novel Graph-Dictionary signal model, where a finite set of graphs characterizes relationships in data distribution through a weighted sum of their Laplacians.
We propose a framework to infer the graph dictionary representation from observed data, along with a bilinear generalization of the primal-dual splitting algorithm to solve the learning problem.
We exploit graph-dictionary representations in a motor imagery decoding task on brain activity data, where we classify imagined motion better than standard methods.
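The underlying signal model, a graph written as a weighted sum of dictionary Laplacians L = sum_k w_k L_k, can be sketched in a few lines; the dictionary atoms and weights below are toy values rather than quantities inferred from data.

```python
import numpy as np

def laplacian(adj: np.ndarray) -> np.ndarray:
    """Combinatorial graph Laplacian L = D - A."""
    return np.diag(adj.sum(axis=1)) - adj

rng = np.random.default_rng(0)
n, k = 5, 3

# Dictionary of k Laplacians built from random symmetric adjacencies.
atoms = []
for _ in range(k):
    upper = np.triu(rng.integers(0, 2, size=(n, n)), 1)
    atoms.append(laplacian(upper + upper.T))

weights = rng.dirichlet(np.ones(k))   # nonnegative weights summing to one
L = sum(w * atom for w, atom in zip(weights, atoms))
print(L)
```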
arXiv Detail & Related papers (2024-11-08T17:40:43Z)
- TimeGraphs: Graph-based Temporal Reasoning [64.18083371645956]
TimeGraphs is a novel approach that characterizes dynamic interactions as a hierarchical temporal graph.
Our approach models the interactions using a compact graph-based representation, enabling adaptive reasoning across diverse time scales.
We evaluate TimeGraphs on multiple datasets with complex, dynamic agent interactions, including a football simulator, the Resistance game, and the MOMA human activity dataset.
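As a rough, purely illustrative sketch of hierarchical temporal reasoning, the snippet below pools per-time-step node features into coarser levels; the averaging rule is a hypothetical stand-in for TimeGraphs' actual graph construction.

```python
import torch

snapshots = [torch.randn(6, 16) for _ in range(8)]  # node features per step

def pool_pairs(levels):
    """Merge adjacent time steps by averaging their node features."""
    return [(levels[i] + levels[i + 1]) / 2
            for i in range(0, len(levels) - 1, 2)]

hierarchy = [snapshots]
while len(hierarchy[-1]) > 1:
    hierarchy.append(pool_pairs(hierarchy[-1]))

print([len(level) for level in hierarchy])  # e.g. [8, 4, 2, 1]
```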
arXiv Detail & Related papers (2024-01-06T06:26:49Z)
- Unified and Dynamic Graph for Temporal Character Grouping in Long Videos [31.192044026127032]
Video temporal character grouping locates appearing moments of major characters within a video according to their identities.
Recent works have evolved from unsupervised clustering to graph-based supervised clustering.
We present a unified and dynamic graph (UniDG) framework for temporal character grouping.
arXiv Detail & Related papers (2023-08-27T13:22:55Z)
- Heterogeneous Graph Learning for Acoustic Event Classification [22.526665796655205]
Graphs for audiovisual data are constructed manually which is difficult and sub-optimal.
We develop a new model, heterogeneous graph crossmodal network (HGCN) that learns the crossmodal edges.
Our proposed model can adapt to various spatial and temporal scales owing to its parametric construction, while the learnable crossmodal edges effectively connect the relevant nodes.
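What a learnable crossmodal edge might look like can be sketched as follows: a small MLP scores each audio-video node pair, and the sigmoid of the score acts as a soft edge weight. The module and its architecture are assumptions for illustration, not HGCN's actual design.

```python
import torch
import torch.nn as nn

class CrossModalEdge(nn.Module):
    """Scores every audio-video node pair to produce soft edge weights."""
    def __init__(self, a_dim: int, v_dim: int, hidden: int = 64):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(a_dim + v_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, audio: torch.Tensor, video: torch.Tensor) -> torch.Tensor:
        # audio: [Na, a_dim], video: [Nv, v_dim] -> weights: [Na, Nv]
        pairs = torch.cat([
            audio.unsqueeze(1).expand(-1, video.size(0), -1),
            video.unsqueeze(0).expand(audio.size(0), -1, -1)], dim=-1)
        return torch.sigmoid(self.scorer(pairs)).squeeze(-1)

edges = CrossModalEdge(128, 512)(torch.randn(10, 128), torch.randn(10, 512))
print(edges.shape)  # torch.Size([10, 10])
```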
arXiv Detail & Related papers (2023-03-05T13:06:53Z)
- DyTed: Disentangled Representation Learning for Discrete-time Dynamic Graph [59.583555454424]
We propose a novel disenTangled representation learning framework for discrete-time Dynamic graphs, namely DyTed.
We specially design a temporal-clips contrastive learning task together with a structure contrastive learning to effectively identify the time-invariant and time-varying representations respectively.
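For intuition, here is a standard InfoNCE-style contrastive loss of the kind such clip-level objectives build on; DyTed's actual temporal-clips and structure losses differ in their details.

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, temperature=0.1):
    """anchor, positive: [B, D]; row i of each forms the positive pair."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    logits = a @ p.t() / temperature   # [B, B] cosine-similarity matrix
    labels = torch.arange(a.size(0))   # diagonal entries are the positives
    return F.cross_entropy(logits, labels)

loss = info_nce(torch.randn(8, 32), torch.randn(8, 32))
print(loss.item())
```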
arXiv Detail & Related papers (2022-10-19T14:34:12Z)
- Representing Videos as Discriminative Sub-graphs for Action Recognition [165.54738402505194]
We introduce a new design of sub-graphs to represent and encode the discriminative patterns of each action in the videos.
We present the MUlti-scale Sub-graph LEarning (MUSLE) framework that builds space-time graphs and clusters them into compact sub-graphs on each scale.
arXiv Detail & Related papers (2022-01-11T16:15:25Z)
- Learning Spatial-Temporal Graphs for Active Speaker Detection [26.45877018368872]
SPELL is a framework that learns long-range multimodal graphs to encode the inter-modal relationship between audio and visual data.
We first construct a graph from a video so that each node corresponds to one person.
We demonstrate that learning graph-based representation, owing to its explicit spatial and temporal structure, significantly improves the overall performance.
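A minimal sketch of this construction, with the exact edge rules assumed for illustration: each person detection becomes a node, same-identity detections in nearby frames receive temporal edges, and detections sharing a frame receive spatial edges.

```python
from itertools import combinations

# (frame, person_id) for a toy set of detections
detections = [(0, 'A'), (0, 'B'), (1, 'A'), (1, 'B'), (2, 'A')]

edges = set()
for (i, (fi, pi)), (j, (fj, pj)) in combinations(enumerate(detections), 2):
    if pi == pj and abs(fi - fj) <= 1:
        edges.add((i, j))    # temporal edge: same person, adjacent frames
    if fi == fj:
        edges.add((i, j))    # spatial edge: same frame

print(sorted(edges))
```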
arXiv Detail & Related papers (2021-12-02T18:29:07Z)
- Attention Bottlenecks for Multimodal Fusion [90.75885715478054]
Machine perception models are typically modality-specific and optimised for unimodal benchmarks.
We introduce a novel transformer based architecture that uses 'fusion bottlenecks' for modality fusion at multiple layers.
We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks.
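The bottleneck idea can be sketched with off-the-shelf transformer layers: a handful of shared tokens form the only channel through which the audio and video token streams exchange information. Layer sizes and the averaging update below are illustrative assumptions.

```python
import torch
import torch.nn as nn

d, n_btl = 64, 4
enc_audio = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
enc_video = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)

audio = torch.randn(1, 20, d)       # audio tokens
video = torch.randn(1, 30, d)       # video tokens
btl = torch.randn(1, n_btl, d)      # shared bottleneck tokens

# Each stream is encoded together with the bottleneck tokens only; the
# streams never attend to each other directly.
out_a = enc_audio(torch.cat([audio, btl], dim=1))
out_v = enc_video(torch.cat([video, btl], dim=1))
audio, btl_a = out_a[:, :20], out_a[:, 20:]
video, btl_v = out_v[:, :30], out_v[:, 30:]
btl = (btl_a + btl_v) / 2           # reconcile the two bottleneck updates
print(audio.shape, video.shape, btl.shape)
```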
arXiv Detail & Related papers (2021-06-30T22:44:12Z)
- Graph Pattern Loss based Diversified Attention Network for Cross-Modal Retrieval [10.420129873840578]
Cross-modal retrieval aims to enable flexible retrieval experience by combining multimedia data such as image, video, text, and audio.
A core aim of unsupervised approaches is to mine the correlations among different object representations to achieve satisfactory retrieval performance without requiring expensive labels.
We propose a Graph Pattern Loss based Diversified Attention Network (GPLDAN) for unsupervised cross-modal retrieval.
arXiv Detail & Related papers (2021-06-25T10:53:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.