REACT: Recognize Every Action Everywhere All At Once
- URL: http://arxiv.org/abs/2312.00188v1
- Date: Mon, 27 Nov 2023 20:48:54 GMT
- Title: REACT: Recognize Every Action Everywhere All At Once
- Authors: Naga VS Raviteja Chappa, Pha Nguyen, Page Daniel Dobbs and Khoa Luu
- Abstract summary: Group Activity Recognition (GAR) is a fundamental problem in computer vision, with diverse applications in sports video analysis, surveillance, and social scene understanding.
We present REACT, an architecture inspired by the transformer encoder-decoder model.
Our method outperforms state-of-the-art GAR approaches in extensive experiments, demonstrating superior accuracy in recognizing and understanding group activities.
- Score: 8.10024991952397
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Group Activity Recognition (GAR) is a fundamental problem in computer vision,
with diverse applications in sports video analysis, video surveillance, and
social scene understanding. Unlike conventional action recognition, GAR aims to
classify the actions of a group of individuals as a whole, requiring a deep
understanding of their interactions and spatiotemporal relationships. To
address the challenges in GAR, we present REACT (Recognize Every Action
Everywhere All At Once), a novel architecture
inspired by the transformer encoder-decoder model explicitly designed to model
complex contextual relationships within videos, including multi-modality and
spatio-temporal features. Our architecture features a cutting-edge
Vision-Language Encoder block for integrated temporal, spatial, and multi-modal
interaction modeling. This component efficiently encodes spatiotemporal
interactions, even with sparsely sampled frames, and recovers essential local
information. Our Action Decoder Block refines the joint understanding of text
and video data, allowing us to precisely retrieve bounding boxes, enhancing the
link between semantics and visual reality. At the core, our Actor Fusion Block
orchestrates a fusion of actor-specific data and textual features, striking a
balance between specificity and context. Our method outperforms
state-of-the-art GAR approaches in extensive experiments, demonstrating
superior accuracy in recognizing and understanding group activities. Our
architecture's potential extends to diverse real-world applications, offering
empirical evidence of its performance gains. This work significantly advances
the field of group activity recognition, providing a robust framework for
nuanced scene comprehension.
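As a rough illustration of the encoder-decoder design described above, the following PyTorch-style sketch wires together a vision-language encoder over frame and text tokens, an action decoder driven by learned queries that emits group-activity logits and bounding boxes, and a simple actor fusion step. This is a minimal sketch under assumed shapes and module choices; the names, dimensions, and fusion details are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch only: a minimal transformer encoder-decoder for group
# activity recognition in the spirit of REACT. All module names, dimensions,
# and fusion choices are assumptions for exposition, not the authors' code.
import torch
import torch.nn as nn

class ToyGARModel(nn.Module):
    def __init__(self, d_model=256, n_heads=8, n_layers=2,
                 num_activities=8, num_queries=12):
        super().__init__()
        # Vision-Language Encoder: jointly encodes sparse frame tokens and text tokens.
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.vl_encoder = nn.TransformerEncoder(enc_layer, n_layers)
        # Action Decoder: learned queries attend to the fused video-text memory.
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.action_decoder = nn.TransformerDecoder(dec_layer, n_layers)
        self.queries = nn.Parameter(torch.randn(num_queries, d_model))
        # Actor Fusion: blends per-actor features with pooled text features.
        self.actor_fusion = nn.Linear(2 * d_model, d_model)
        # Prediction heads: group-activity label and per-query bounding boxes.
        self.activity_head = nn.Linear(d_model, num_activities)
        self.box_head = nn.Linear(d_model, 4)

    def forward(self, frame_tokens, text_tokens, actor_feats):
        # frame_tokens: (B, T*P, D) spatiotemporal patch features from sparse frames
        # text_tokens:  (B, L, D)   embedded action/label prompts
        # actor_feats:  (B, N, D)   RoI-pooled per-actor features
        memory = self.vl_encoder(torch.cat([frame_tokens, text_tokens], dim=1))
        text_ctx = text_tokens.mean(dim=1, keepdim=True).expand_as(actor_feats)
        fused_actors = self.actor_fusion(torch.cat([actor_feats, text_ctx], dim=-1))
        memory = torch.cat([memory, fused_actors], dim=1)
        q = self.queries.unsqueeze(0).expand(frame_tokens.size(0), -1, -1)
        dec = self.action_decoder(q, memory)
        group_logits = self.activity_head(dec.mean(dim=1))  # (B, num_activities)
        boxes = self.box_head(dec).sigmoid()                 # (B, num_queries, 4)
        return group_logits, boxes
```

In practice, frame_tokens would come from a video backbone over sparsely sampled frames and text_tokens from a language encoder over activity prompts; here they are simply assumed to be feature tensors of matching width.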
Related papers
- Unified Framework with Consistency across Modalities for Human Activity Recognition [14.639249548669756]
We propose a comprehensive framework for robust video-based human activity recognition.
A key contribution is the introduction of a novel query machine, called COMPUTER.
Our approach demonstrates superior performance when compared with state-of-the-art methods.
arXiv Detail & Related papers (2024-09-04T02:25:10Z) - Spatial-Temporal Knowledge-Embedded Transformer for Video Scene Graph Generation [64.85974098314344]
Video scene graph generation (VidSGG) aims to identify objects in visual scenes and infer their relationships for a given video.
Inherently, object pairs and their relationships enjoy spatial co-occurrence correlations within each image and temporal consistency/transition correlations across different images.
We propose a spatial-temporal knowledge-embedded transformer (STKET) that incorporates prior spatial-temporal knowledge into the multi-head cross-attention mechanism (a toy sketch of this idea appears after the related-papers list below).
arXiv Detail & Related papers (2023-09-23T02:40:28Z) - Cross-Video Contextual Knowledge Exploration and Exploitation for Ambiguity Reduction in Weakly Supervised Temporal Action Localization [23.94629999419033]
Weakly supervised temporal action localization (WSTAL) aims to localize actions in untrimmed videos using video-level labels.
Our work addresses this from a novel perspective, by exploring and exploiting the cross-video contextual knowledge within the dataset.
Our method outperforms the state-of-the-art methods, and can be easily plugged into other WSTAL methods.
arXiv Detail & Related papers (2023-08-24T07:19:59Z) - SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation [35.063881868130075]
This paper studies referring video object segmentation (RVOS) by boosting video-level visual-linguistic alignment.
We propose Semantic-assisted Object Cluster (SOC), which aggregates video content and textual guidance for unified temporal modeling and cross-modal alignment.
We conduct extensive experiments on popular RVOS benchmarks, and our method outperforms state-of-the-art competitors on all benchmarks by a remarkable margin.
arXiv Detail & Related papers (2023-05-26T15:13:44Z) - ArK: Augmented Reality with Knowledge Interactive Emergent Ability [115.72679420999535]
We develop an infinite agent that learns to transfer knowledge memory from general foundation models to novel domains.
The heart of our approach is an emerging mechanism, dubbed Augmented Reality with Knowledge Inference Interaction (ArK).
We show that our ArK approach, combined with large foundation models, significantly improves the quality of generated 2D/3D scenes.
arXiv Detail & Related papers (2023-05-01T17:57:01Z) - Exploring Intra- and Inter-Video Relation for Surgical Semantic Scene Segmentation [58.74791043631219]
We propose a novel framework STswinCL that explores the complementary intra- and inter-video relations to boost segmentation performance.
We extensively validate our approach on two public surgical video benchmarks, including EndoVis18 Challenge and CaDIS dataset.
Experimental results demonstrate the promising performance of our method, which consistently exceeds previous state-of-the-art approaches.
arXiv Detail & Related papers (2022-03-29T05:52:23Z) - COMPOSER: Compositional Learning of Group Activity in Videos [33.526331969279106]
Group Activity Recognition (GAR) detects the activity performed by a group of actors in a short video clip.
We propose COMPOSER, a Multiscale Transformer based architecture that performs attention-based reasoning over tokens at each scale.
COMPOSER achieves a new state-of-the-art 94.5% accuracy with the keypoint-only modality.
arXiv Detail & Related papers (2021-12-11T01:25:46Z) - Multi-Granularity Reference-Aided Attentive Feature Aggregation for Video-based Person Re-identification [98.7585431239291]
Video-based person re-identification aims at matching the same person across video clips.
In this paper, we propose an attentive feature aggregation module, namely Multi-Granularity Reference-Attentive Feature aggregation module MG-RAFA.
Our framework achieves state-of-the-art performance on three benchmark datasets.
arXiv Detail & Related papers (2020-03-27T03:49:21Z) - Cascaded Human-Object Interaction Recognition [175.60439054047043]
We introduce a cascade architecture for a multi-stage, coarse-to-fine HOI understanding.
At each stage, an instance localization network progressively refines HOI proposals and feeds them into an interaction recognition network.
With our carefully-designed human-centric relation features, these two modules work collaboratively towards effective interaction understanding.
arXiv Detail & Related papers (2020-03-09T17:05:04Z) - See More, Know More: Unsupervised Video Object Segmentation with Co-Attention Siamese Networks [184.4379622593225]
We introduce a novel network, called CO-attention Siamese Network (COSNet), to address the unsupervised video object segmentation task.
We emphasize the importance of inherent correlation among video frames and incorporate a global co-attention mechanism.
We propose a unified and end-to-end trainable framework where different co-attention variants can be derived for mining the rich context within videos.
arXiv Detail & Related papers (2020-01-19T11:10:39Z)
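To make the cross-attention idea referenced in the STKET entry above concrete, the following toy sketch adds learned spatial and temporal prior embeddings to the keys and values of a multi-head cross-attention layer before object-pair queries attend to them. The module name, prior tables, and shapes are purely hypothetical placeholders, not the STKET authors' implementation.

```python
# Illustrative sketch only: injecting learned spatial-temporal "knowledge"
# embeddings into multi-head cross-attention, loosely in the spirit of the
# STKET entry above. Names and shapes are assumptions, not the paper's code.
import torch
import torch.nn as nn

class KnowledgeCrossAttention(nn.Module):
    def __init__(self, d_model=256, n_heads=8, num_spatial=16, num_temporal=16):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Learned priors standing in for spatial co-occurrence and temporal
        # transition knowledge (hypothetical placeholders here).
        self.spatial_prior = nn.Embedding(num_spatial, d_model)
        self.temporal_prior = nn.Embedding(num_temporal, d_model)

    def forward(self, pair_queries, context, spatial_ids, temporal_ids):
        # pair_queries: (B, Q, D) object-pair representations
        # context:      (B, K, D) per-frame visual tokens
        # spatial_ids / temporal_ids: (B, K) indices into the prior tables
        prior = self.spatial_prior(spatial_ids) + self.temporal_prior(temporal_ids)
        kv = context + prior                      # knowledge-conditioned keys/values
        out, _ = self.attn(pair_queries, kv, kv)  # cross-attention
        return out
```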