Learning Spatial-Temporal Graphs for Active Speaker Detection
- URL: http://arxiv.org/abs/2112.01479v2
- Date: Fri, 3 Dec 2021 19:41:06 GMT
- Title: Learning Spatial-Temporal Graphs for Active Speaker Detection
- Authors: Sourya Roy, Kyle Min, Subarna Tripathi, Tanaya Guha and Somdeb
Majumdar
- Abstract summary: SPELL is a framework that learns long-range multimodal graphs to encode the inter-modal relationship between audio and visual data.
We first construct a graph from a video so that each node corresponds to one person.
We demonstrate that learning a graph-based representation, owing to its explicit spatial and temporal structure, significantly improves the overall performance.
- Score: 26.45877018368872
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We address the problem of active speaker detection through a new framework,
called SPELL, that learns long-range multimodal graphs to encode the
inter-modal relationship between audio and visual data. We cast active speaker
detection as a node classification task that is aware of longer-term
dependencies. We first construct a graph from a video so that each node
corresponds to one person. Nodes representing the same identity share edges
between them within a defined temporal window. Nodes within the same video
frame are also connected to encode inter-person interactions. Through extensive
experiments on the AVA-ActiveSpeaker dataset, we demonstrate that learning a
graph-based representation, owing to its explicit spatial and temporal
structure, significantly improves the overall performance. SPELL outperforms
several relevant baselines and performs on par with state-of-the-art models
while requiring an order of magnitude lower computation cost.
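To make the construction concrete, the following is a minimal sketch of building such a graph from per-frame face detections. It illustrates the description above rather than reproducing the authors' implementation; the (timestamp, person_id) node encoding and the 0.9-second window are assumptions.

```python
import itertools

def build_speaker_graph(detections, temporal_window=0.9):
    """Build a SPELL-style graph from face detections.

    Each (timestamp, person_id) tuple in `detections` is one face crop
    and becomes one node. Edges follow the abstract: temporal edges
    connect the same identity within `temporal_window` seconds, and
    spatial edges connect different people in the same frame.
    """
    nodes = list(range(len(detections)))
    edges = set()
    for i, j in itertools.combinations(nodes, 2):
        t_i, id_i = detections[i]
        t_j, id_j = detections[j]
        if id_i == id_j and abs(t_i - t_j) <= temporal_window:
            edges.add((i, j))  # temporal edge: same person, nearby in time
        elif id_i != id_j and t_i == t_j:
            edges.add((i, j))  # spatial edge: inter-person, same frame
    return nodes, sorted(edges)

# Two people ("A", "B") visible in three frames at 0.0s, 0.5s, 1.0s.
dets = [(0.0, "A"), (0.0, "B"), (0.5, "A"), (0.5, "B"), (1.0, "A"), (1.0, "B")]
nodes, edges = build_speaker_graph(dets)
print(len(nodes), "nodes,", len(edges), "edges")
```

Active speaker detection is then binary node classification on this graph: each node carries that person's audio-visual features, and a lightweight GNN predicts whether the person is speaking.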
Related papers
- VrdONE: One-stage Video Visual Relation Detection [30.983521962897477]
Video Visual Relation Detection (VidVRD) focuses on understanding how entities interact over time and space in videos.
Traditional methods for VidVRD, challenged by its complexity, typically split the task into two parts: one for identifying which relations are present and another for determining their temporal boundaries.
We propose VrdONE, a streamlined yet efficacious one-stage model for VidVRD.
arXiv Detail & Related papers (2024-08-18T08:38:20Z)
- Dynamic Graph Message Passing Networks for Visual Recognition [112.49513303433606]
Modelling long-range dependencies is critical for scene understanding tasks in computer vision.
A fully-connected graph is beneficial for such modelling, but its computational overhead is prohibitive.
We propose a dynamic graph message passing network, that significantly reduces the computational complexity.
arXiv Detail & Related papers (2022-09-20T14:41:37Z)
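The complexity trade-off described above can be shown with a toy sketch: one mean-aggregation message-passing step in which each node receives messages from only k sampled nodes instead of all N, cutting the cost from O(N²d) to O(Nkd). The uniform random sampling here is a stand-in for the paper's learned, content-adaptive sampling, and the function name is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sampled_message_passing(x, k=8):
    """One sparse message-passing round over node features x of shape (N, d).

    Instead of aggregating over all N nodes (the fully-connected case,
    O(N^2 d)), each node averages messages from k sampled nodes
    (O(N k d)) and applies a residual update.
    """
    n, _ = x.shape
    neighbors = rng.integers(0, n, size=(n, k))  # k neighbor ids per node
    messages = x[neighbors].mean(axis=1)         # aggregate sampled messages
    return x + messages                          # residual update

x = rng.standard_normal((1000, 64)).astype(np.float32)
print(sampled_message_passing(x).shape)  # (1000, 64)
```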
- Visually-aware Acoustic Event Detection using Heterogeneous Graphs [39.90352230010103]
Perception of auditory events is inherently multimodal, relying on both audio and visual cues.
We employ heterogeneous graphs to capture the spatial and temporal relationships between the modalities.
We show efficient modelling of intra- and inter-modality relationships at both spatial and temporal scales.
arXiv Detail & Related papers (2022-07-16T13:09:25Z)
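As a companion illustration for the heterogeneous-graph idea above, this sketch builds a toy graph with one audio node and one visual node per time step and three edge types: intra-modal temporal edges within a small window, and inter-modal edges at the same step. The node and edge schema and the window size are illustrative assumptions, not the paper's exact construction.

```python
from collections import defaultdict

def build_av_graph(n_steps, window=1):
    """Toy heterogeneous audio-visual graph over n_steps time steps.

    Nodes are ("a", t) for audio and ("v", t) for visual. Edge types:
    "audio-visual" links the modalities at the same step, while
    "audio-audio" and "visual-visual" link neighbors within `window`.
    """
    edges = defaultdict(list)
    for t in range(n_steps):
        edges["audio-visual"].append((("a", t), ("v", t)))
        for dt in range(1, window + 1):
            if t + dt < n_steps:
                edges["audio-audio"].append((("a", t), ("a", t + dt)))
                edges["visual-visual"].append((("v", t), ("v", t + dt)))
    return dict(edges)

for etype, elist in build_av_graph(4).items():
    print(etype, elist)
```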
- Learning Long-Term Spatial-Temporal Graphs for Active Speaker Detection [21.512786675773675]
Active speaker detection in videos with multiple speakers is a challenging task.
We present SPELL, a novel spatial-temporal graph learning framework.
SPELL is able to reason over long temporal contexts for all nodes without relying on computationally expensive fully connected graph neural networks.
arXiv Detail & Related papers (2022-07-15T23:43:17Z)
- Multi-Modal Interaction Graph Convolutional Network for Temporal Language Localization in Videos [55.52369116870822]
This paper focuses on tackling the problem of temporal language localization in videos.
It aims to identify the start and end points of a moment described by a natural language sentence in an untrimmed video.
arXiv Detail & Related papers (2021-10-12T14:59:25Z)
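To ground the task above, here is a bare-bones baseline: score every candidate (start, end) span by the cosine similarity between its mean-pooled clip features and a sentence embedding, and return the best span. This sketches the problem setup only, not the paper's multi-modal interaction graph convolutional network; all names and shapes are hypothetical.

```python
import numpy as np

def localize_moment(clip_feats, query_feat, max_len=8):
    """Return the (start, end) clip span whose pooled features best
    match the query embedding, by cosine similarity over all spans
    up to max_len clips long."""
    q = query_feat / np.linalg.norm(query_feat)
    n = len(clip_feats)
    best, best_score = (0, 0), -np.inf
    for s in range(n):
        for e in range(s, min(n, s + max_len)):
            span = clip_feats[s:e + 1].mean(axis=0)      # pool the span
            score = float(span @ q) / (np.linalg.norm(span) + 1e-8)
            if score > best_score:
                best, best_score = (s, e), score
    return best, best_score

rng = np.random.default_rng(1)
clips = rng.standard_normal((32, 128))   # per-clip video features
query = rng.standard_normal(128)         # sentence embedding
print(localize_moment(clips, query))
```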
- Modelling Neighbor Relation in Joint Space-Time Graph for Video Correspondence Learning [53.74240452117145]
This paper presents a self-supervised method for learning reliable visual correspondence from unlabeled videos.
We formulate the correspondence as finding paths in a joint space-time graph, where nodes are grid patches sampled from frames, and are linked by two types of edges.
Our learned representation outperforms the state-of-the-art self-supervised methods on a variety of visual tasks.
arXiv Detail & Related papers (2021-09-28T05:40:01Z)
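The joint space-time graph above can be sketched directly: nodes are (frame, row, col) grid patches, with the two edge types the summary mentions, namely spatial edges between neighboring patches within a frame and temporal edges to patches in the next frame. The grid size, 4-neighborhood, and dense frame-to-frame linking are assumptions for illustration; correspondence then amounts to finding high-affinity paths along the temporal edges.

```python
def build_space_time_graph(n_frames, grid=4):
    """Toy joint space-time graph over grid-patch nodes (frame, row, col).

    Spatial edges connect 4-neighborhood patches inside each frame;
    temporal edges densely connect each patch to every patch of the
    next frame.
    """
    spatial, temporal = [], []
    for f in range(n_frames):
        for r in range(grid):
            for c in range(grid):
                if r + 1 < grid:
                    spatial.append(((f, r, c), (f, r + 1, c)))
                if c + 1 < grid:
                    spatial.append(((f, r, c), (f, r, c + 1)))
                if f + 1 < n_frames:
                    for r2 in range(grid):
                        for c2 in range(grid):
                            temporal.append(((f, r, c), (f + 1, r2, c2)))
    return spatial, temporal

spatial, temporal = build_space_time_graph(3)
print(len(spatial), "spatial edges,", len(temporal), "temporal edges")
```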
- Spatio-Temporal Interaction Graph Parsing Networks for Human-Object Interaction Recognition [55.7731053128204]
In a given video-based Human-Object Interaction scene, modeling the spatio-temporal relationship between humans and objects is an important cue for understanding the contextual information presented in the video.
With effective spatio-temporal relationship modeling, it is possible not only to uncover contextual information in each frame but also to directly capture inter-time dependencies.
Full use of appearance features, spatial location, and semantic information is also key to improving video-based Human-Object Interaction recognition performance.
arXiv Detail & Related papers (2021-08-19T11:57:27Z)
- A Graph-based Interactive Reasoning for Human-Object Interaction Detection [71.50535113279551]
We present a novel graph-based interactive reasoning model called Interactive Graph (abbr. in-Graph) to infer HOIs.
We construct a new framework to assemble in-Graph models for detecting HOIs, namely in-GraphNet.
Our framework is end-to-end trainable and free from costly annotations like human pose.
arXiv Detail & Related papers (2020-07-14T09:29:03Z)
- Co-Saliency Spatio-Temporal Interaction Network for Person Re-Identification in Videos [85.6430597108455]
We propose a novel Co-Saliency Spatio-Temporal Interaction Network (CSTNet) for person re-identification in videos.
It captures the common salient foreground regions among video frames and explores the spatial-temporal long-range context interdependency from such regions.
Multiple spatial-temporal interaction modules within CSTNet are proposed, which exploit the spatial and temporal long-range context interdependencies of such features and the spatial-temporal information correlation.
arXiv Detail & Related papers (2020-04-10T10:23:58Z)