Referring Atomic Video Action Recognition
- URL: http://arxiv.org/abs/2407.01872v2
- Date: Wed, 10 Jul 2024 23:52:05 GMT
- Title: Referring Atomic Video Action Recognition
- Authors: Kunyu Peng, Jia Fu, Kailun Yang, Di Wen, Yufan Chen, Ruiping Liu, Junwei Zheng, Jiaming Zhang, M. Saquib Sarfraz, Rainer Stiefelhagen, Alina Roitberg
- Abstract summary: We introduce a new task called Referring Atomic Video Action Recognition.
We focus on recognizing the correct atomic action of a specific individual, guided by text.
We present the RefAVA dataset, containing 36,630 instances with manually annotated textual descriptions.
- Score: 40.85071733730557
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a new task called Referring Atomic Video Action Recognition (RAVAR), aimed at identifying atomic actions of a particular person based on a textual description and the video data of this person. This task differs from traditional action recognition and localization, where predictions are delivered for all present individuals. In contrast, we focus on recognizing the correct atomic action of a specific individual, guided by text. To explore this task, we present the RefAVA dataset, containing 36,630 instances with manually annotated textual descriptions of the individuals. To establish a strong initial benchmark, we implement and validate baselines from various domains, e.g., atomic action localization, video question answering, and text-video retrieval. Since these existing methods underperform on RAVAR, we introduce RefAtomNet -- a novel cross-stream attention-driven method specialized for the unique challenges of RAVAR: the need to interpret a textual referring expression for the targeted individual, utilize this reference to guide the spatial localization, and harvest the prediction of the atomic actions for the referring person. The key ingredients are: (1) a multi-stream architecture that connects video, text, and a new location-semantic stream, and (2) cross-stream agent attention fusion and agent token fusion, which amplify the most relevant information across these streams and consistently surpass standard attention-based fusion on RAVAR. Extensive experiments demonstrate the effectiveness of RefAtomNet and its building blocks for recognizing the action of the described individual. The dataset and code will be made publicly available at https://github.com/KPeng9510/RAVAR.
Related papers
- RefAtomNet++: Advancing Referring Atomic Video Action Recognition using Semantic Retrieval based Multi-Trajectory Mamba [86.47790050206306]
RefAVA++ comprises >2.9 million frames and >75.1k annotated persons in total.
RefAtomNet++ advances cross-modal token aggregation through a multi-hierarchical semantic-aligned cross-attention mechanism.
Experiments show that RefAtomNet++ establishes new state-of-the-art results.
arXiv Detail & Related papers (2025-10-18T10:41:19Z) - Beyond Label Semantics: Language-Guided Action Anatomy for Few-shot Action Recognition [16.07037171149096]
Few-shot action recognition (FSAR) aims to classify human actions in videos with only a small number of samples labeled per category.
We propose Language-Guided Action Anatomy (LGA), a novel framework that goes beyond label semantics.
For text, we prompt an off-the-shelf LLM to anatomize labels into sequences of atomic action descriptions.
For videos, a Visual Anatomy Module segments actions into atomic video phases to capture the sequential structure of actions.
arXiv Detail & Related papers (2025-07-22T07:16:25Z) - Beyond Object Categories: Multi-Attribute Reference Understanding for Visual Grounding [10.04904999444546]
Referring expression comprehension aims at achieving object localization based on natural language descriptions.
Existing REC approaches are constrained by object category descriptions and single-attribute intention descriptions.
We propose Multi-ref EC, a novel framework that integrates state descriptions, derived intentions, and embodied gestures to locate target objects.
arXiv Detail & Related papers (2025-03-25T00:59:58Z) - Boosting Weakly-Supervised Referring Image Segmentation via Progressive Comprehension [40.21084218601082]
This paper focuses on a challenging setup where target localization is learned directly from image-text pairs.
We propose a novel Progressive Network (PCNet) to leverage target-related textual cues for progressively localizing the target object.
Our method outperforms SOTA methods on three common benchmarks.
arXiv Detail & Related papers (2024-10-02T13:30:32Z) - Spatio-Temporal Context Prompting for Zero-Shot Action Detection [13.22912547389941]
We propose a method which can effectively leverage the rich knowledge of visual-language models to perform Person-Context Interaction.
To address the challenge of recognizing distinct actions by multiple people at the same timestamp, we design the Interest Token Spotting mechanism.
Our method achieves superior results compared to previous approaches and can be further extended to multi-action videos.
arXiv Detail & Related papers (2024-08-28T17:59:05Z) - JARViS: Detecting Actions in Video Using Unified Actor-Scene Context Relation Modeling [8.463489896549161]
Video action detection (VAD) is a formidable task that involves the localization and classification of actions within the spatial and temporal dimensions of a video clip.
We propose a two-stage VAD framework called Joint Actor-scene context Relation modeling (JARViS).
JARViS consolidates cross-modal action semantics distributed globally across spatial and temporal dimensions using Transformer attention.
arXiv Detail & Related papers (2024-08-07T08:08:08Z) - Open-Vocabulary Spatio-Temporal Action Detection [59.91046192096296]
Open-vocabulary spatio-temporal action detection (OV-STAD) is an important fine-grained video understanding task.
OV-STAD requires training a model on a limited set of base classes with box and label supervision.
To better adapt the holistic VLM for the fine-grained action detection task, we carefully fine-tune it on the localized video region-text pairs.
arXiv Detail & Related papers (2024-05-17T14:52:47Z) - Weakly Supervised Open-Vocabulary Object Detection [31.605276665964787]
We propose a novel weakly supervised open-vocabulary object detection framework, namely WSOVOD, to extend traditional WSOD.
To achieve this, we explore three vital strategies, including dataset-level feature adaptation, image-level salient object localization, and region-level vision-language alignment.
arXiv Detail & Related papers (2023-12-19T18:59:53Z) - Free-Form Composition Networks for Egocentric Action Recognition [97.02439848145359]
We propose a free-form composition network (FFCN) that can simultaneously learn disentangled verb, preposition, and noun representations.
The proposed FFCN can directly generate new training data samples for rare classes and hence significantly improves action recognition performance.
arXiv Detail & Related papers (2023-07-13T02:22:09Z) - What, when, and where? -- Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions [55.574102714832456]
Spatio-temporal grounding describes the task of localizing events in space and time.
Models for this task are usually trained with human-annotated sentences and bounding box supervision.
We combine local representation learning, which focuses on fine-grained spatial information, with a global representation that captures higher-level representations.
arXiv Detail & Related papers (2023-03-29T19:38:23Z) - Referring Multi-Object Tracking [78.63827591797124]
We propose a new and general referring understanding task, termed referring multi-object tracking (RMOT).
Its core idea is to employ a language expression as a semantic cue to guide the prediction of multi-object tracking.
To the best of our knowledge, it is the first work to achieve an arbitrary number of referent object predictions in videos.
arXiv Detail & Related papers (2023-03-06T18:50:06Z) - AXM-Net: Cross-Modal Context Sharing Attention Network for Person Re-ID [20.700750237972155]
Cross-modal person re-identification (Re-ID) is critical for modern video surveillance systems.
The key challenge is to align inter-modality representations according to the semantic information present for a person while ignoring background information.
We present AXM-Net, a novel CNN based architecture designed for learning semantically aligned visual and textual representations.
arXiv Detail & Related papers (2021-01-19T16:06:39Z) - Local-Global Video-Text Interactions for Temporal Grounding [77.5114709695216]
This paper addresses the problem of text-to-video temporal grounding, which aims to identify the time interval in a video semantically relevant to a text query.
We tackle this problem using a novel regression-based model that learns to extract a collection of mid-level features for semantic phrases in a text query.
The proposed method effectively predicts the target time interval by exploiting contextual information from local to global.
arXiv Detail & Related papers (2020-04-16T08:10:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.