Learning Long-Term Spatial-Temporal Graphs for Active Speaker Detection
- URL: http://arxiv.org/abs/2207.07783v2
- Date: Tue, 19 Jul 2022 01:30:35 GMT
- Title: Learning Long-Term Spatial-Temporal Graphs for Active Speaker Detection
- Authors: Kyle Min, Sourya Roy, Subarna Tripathi, Tanaya Guha, Somdeb Majumdar
- Abstract summary: Active speaker detection in videos with multiple speakers is a challenging task.
We present SPELL, a novel spatial-temporal graph learning framework.
SPELL is able to reason over long temporal contexts for all nodes without relying on computationally expensive fully connected graph neural networks.
- Score: 21.512786675773675
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Active speaker detection (ASD) in videos with multiple speakers is a
challenging task as it requires learning effective audiovisual features and
spatial-temporal correlations over long temporal windows. In this paper, we
present SPELL, a novel spatial-temporal graph learning framework that can solve
complex tasks such as ASD. To this end, each person in a video frame is first
encoded in a unique node for that frame. Nodes corresponding to a single person
across frames are connected to encode their temporal dynamics. Nodes within a
frame are also connected to encode inter-person relationships. Thus, SPELL
reduces ASD to a node classification task. Importantly, SPELL is able to reason
over long temporal contexts for all nodes without relying on computationally
expensive fully connected graph neural networks. Through extensive experiments
on the AVA-ActiveSpeaker dataset, we demonstrate that learning graph-based
representations can significantly improve active speaker detection
performance owing to their explicit spatial and temporal structure. SPELL
outperforms all previous state-of-the-art approaches while requiring
significantly lower memory and computational resources. Our code is publicly
available at https://github.com/SRA2/SPELL
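The graph construction described in the abstract lends itself to a short sketch. The following is a minimal, hypothetical illustration using PyTorch Geometric; the detections format, feature embeddings, and helper names are assumptions for illustration, not the authors' implementation (see the repository above for that).

```python
# Minimal sketch of SPELL-style graph construction (assumptions: the
# detections format and features are illustrative, not the authors' code).
import itertools
import torch
from torch_geometric.data import Data

def build_graph(detections):
    """detections: list of (frame_idx, person_id, feature) triples covering
    one temporal window; `feature` is a 1-D audiovisual embedding tensor."""
    node_ids, node_feats = {}, []
    for frame, person, feat in detections:
        node_ids[(frame, person)] = len(node_feats)
        node_feats.append(feat)

    edges = []
    # Temporal edges: link the same person across consecutive appearances.
    frames_of = {}
    for frame, person, _ in detections:
        frames_of.setdefault(person, []).append(frame)
    for person, frames in frames_of.items():
        seq = sorted(frames)
        for f0, f1 in zip(seq, seq[1:]):
            edges.append((node_ids[(f0, person)], node_ids[(f1, person)]))

    # Spatial edges: link all people who co-occur in the same frame.
    people_in = {}
    for frame, person, _ in detections:
        people_in.setdefault(frame, []).append(person)
    for frame, people in people_in.items():
        for p0, p1 in itertools.combinations(people, 2):
            edges.append((node_ids[(frame, p0)], node_ids[(frame, p1)]))

    edges += [(j, i) for i, j in edges]  # make the graph undirected
    edge_index = torch.tensor(edges, dtype=torch.long).t().contiguous()
    # ASD is now binary node classification: speaking vs. not speaking.
    return Data(x=torch.stack(node_feats), edge_index=edge_index)
```

Because each node only connects to its own temporal neighbors and its frame-mates, the graph stays sparse, which is how long temporal windows can be covered without a fully connected (and hence expensive) graph network.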
Related papers
- Unified Static and Dynamic Network: Efficient Temporal Filtering for Video Grounding [56.315932539150324]
We design a Unified Static and Dynamic Network (UniSDNet) to learn the semantic association between the video and text/audio queries.
Our UniSDNet is applicable to both Natural Language Video Grounding (NLVG) and Spoken Language Video Grounding (SLVG) tasks.
arXiv Detail & Related papers (2024-03-21T06:53:40Z)
- Temporal Aggregation and Propagation Graph Neural Networks for Dynamic Representation [67.26422477327179]
Temporal graphs exhibit dynamic interactions between nodes over continuous time.
We propose a novel method of temporal graph convolution with the whole neighborhood.
Our proposed TAP-GNN outperforms existing temporal graph methods by a large margin in terms of both predictive performance and online inference latency.
arXiv Detail & Related papers (2023-04-15T08:17:18Z)
- End-to-End Integration of Speech Separation and Voice Activity Detection for Low-Latency Diarization of Telephone Conversations [13.020158123538138]
Speech separation guided diarization (SSGD) performs diarization by first separating the speakers and then applying voice activity detection (VAD) on each separated stream.
We consider three state-of-the-art speech separation (SSep) algorithms and study their performance in online and offline scenarios.
We show that our best model achieves 8.8% DER on CALLHOME, which outperforms the current state-of-the-art end-to-end neural diarization model.
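The separate-then-detect pipeline this summary describes is easy to outline. Below is a hypothetical sketch: `separate` and `vad` stand in for the SSep and VAD models studied in the paper, and the frame hop is an assumption.

```python
# Schematic of speech-separation-guided diarization (SSGD): separate the
# mixture first, then run VAD on each separated stream. `separate` and
# `vad` are hypothetical stand-ins, not the paper's models.
import numpy as np

def ssgd_diarize(mixture, separate, vad, frame_hop=0.01):
    """mixture: 1-D waveform; separate: waveform -> list of per-speaker
    waveforms; vad: waveform -> boolean array of frame-level decisions."""
    segments = []
    for spk, stream in enumerate(separate(mixture)):
        speech = vad(stream)
        # Convert runs of speech frames into (speaker, start, end) segments.
        edges = np.flatnonzero(np.diff(np.r_[0, speech.astype(int), 0]))
        for on, off in zip(edges[::2], edges[1::2]):
            segments.append((spk, on * frame_hop, off * frame_hop))
    return sorted(segments, key=lambda s: s[1])
```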
arXiv Detail & Related papers (2023-03-21T16:33:56Z)
- TPGNN: Learning High-order Information in Dynamic Graphs via Temporal Propagation [7.616789069832552]
We propose a temporal propagation-based graph neural network, namely TPGNN.
The propagator passes messages from the anchor node to its temporal neighbors within $k$ hops, and then simultaneously updates the states of the neighborhoods.
To prevent over-smoothing, the model compels the messages from $n$-hop neighbors to update the $n$-hop memory vector preserved on the anchor.
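A toy sketch of the hop-separated memories this summary describes, as one way to picture the mechanism; the shapes, the GRU update, and all names here are assumptions, not the TPGNN implementation.

```python
# Each node keeps one memory vector per hop distance; messages coming from
# n-hop neighbors update only the n-th memory, so information from different
# hops never mixes (the stated defense against over-smoothing).
import torch

class HopMemories(torch.nn.Module):
    def __init__(self, num_nodes, k_hops, dim):
        super().__init__()
        # One memory vector per (node, hop) pair.
        self.register_buffer("memory", torch.zeros(num_nodes, k_hops, dim))
        self.update = torch.nn.GRUCell(dim, dim)

    def propagate(self, anchor, messages_by_hop):
        # messages_by_hop[n]: aggregated message from the anchor's
        # (n+1)-hop temporal neighbors, shape (dim,).
        with torch.no_grad():  # memories are state, updated out-of-band
            for n, msg in enumerate(messages_by_hop):
                old = self.memory[anchor, n].unsqueeze(0)
                self.memory[anchor, n] = self.update(msg.unsqueeze(0), old)[0]
```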
arXiv Detail & Related papers (2022-10-03T18:39:07Z)
- End-to-End Active Speaker Detection [58.7097258722291]
We propose an end-to-end training network where feature learning and contextual predictions are jointly learned.
We also introduce intertemporal graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem.
Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the-art performance.
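One way to read "splitting the message passing" is to run a separate graph convolution per context source and then fuse the results. The sketch below illustrates that reading; the layer choice (GCN) and the additive fusion are assumptions, not the paper's iGNN design.

```python
# Split message passing: one convolution over temporal edges (same speaker
# across time), another over spatial edges (different speakers in the same
# frame), fused by summation. Layer and fusion choices are illustrative.
import torch
from torch_geometric.nn import GCNConv

class SplitContextBlock(torch.nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.temporal_conv = GCNConv(dim, dim)
        self.spatial_conv = GCNConv(dim, dim)

    def forward(self, x, temporal_edges, spatial_edges):
        # Each context source gets its own message-passing pass...
        h_t = torch.relu(self.temporal_conv(x, temporal_edges))
        h_s = torch.relu(self.spatial_conv(x, spatial_edges))
        # ...and the aggregated features feed the speaking/not-speaking head.
        return h_t + h_s
```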
arXiv Detail & Related papers (2022-03-27T08:55:28Z)
- Learning Spatial-Temporal Graphs for Active Speaker Detection [26.45877018368872]
SPELL is a framework that learns long-range multimodal graphs to encode the inter-modal relationship between audio and visual data.
We first construct a graph from a video so that each node corresponds to one person.
We demonstrate that learning graph-based representations, owing to their explicit spatial and temporal structure, significantly improves the overall performance.
arXiv Detail & Related papers (2021-12-02T18:29:07Z)
- ST-HOI: A Spatial-Temporal Baseline for Human-Object Interaction Detection in Videos [91.29436920371003]
We propose a simple yet effective architecture named Spatial-Temporal HOI Detection (ST-HOI).
We use temporal information such as human and object trajectories, correctly-localized visual features, and spatial-temporal masking pose features.
We construct a new video HOI benchmark dubbed VidHOI where our proposed approach serves as a solid baseline.
arXiv Detail & Related papers (2021-05-25T07:54:35Z)
- Spatial-Temporal Correlation and Topology Learning for Person Re-Identification in Videos [78.45050529204701]
We propose a novel framework to pursue discriminative and robust representation by modeling cross-scale spatial-temporal correlation.
CTL utilizes a CNN backbone and a key-points estimator to extract semantic local features from the human body.
It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and the physical connections of the human body.
arXiv Detail & Related papers (2021-04-15T14:32:12Z)
- Co-Saliency Spatio-Temporal Interaction Network for Person Re-Identification in Videos [85.6430597108455]
We propose a novel Co-Saliency Spatio-Temporal Interaction Network (CSTNet) for person re-identification in videos.
It captures the common salient foreground regions among video frames and explores the spatial-temporal long-range context interdependency from such regions.
Multiple spatial-temporal interaction modules within CSTNet are proposed to exploit the long-range spatial and temporal context interdependencies of these features and their spatial-temporal correlation.
arXiv Detail & Related papers (2020-04-10T10:23:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.