RefAtomNet++: Advancing Referring Atomic Video Action Recognition using Semantic Retrieval based Multi-Trajectory Mamba
- URL: http://arxiv.org/abs/2510.16444v1
- Date: Sat, 18 Oct 2025 10:41:19 GMT
- Title: RefAtomNet++: Advancing Referring Atomic Video Action Recognition using Semantic Retrieval based Multi-Trajectory Mamba
- Authors: Kunyu Peng, Di Wen, Jia Fu, Jiamin Wu, Kailun Yang, Junwei Zheng, Ruiping Liu, Yufan Chen, Yuqian Fu, Danda Pani Paudel, Luc Van Gool, Rainer Stiefelhagen
- Abstract summary: RefAVA++ comprises >2.9 million frames and >75.1k annotated persons in total. RefAtomNet++ advances cross-modal token aggregation through a multi-hierarchical semantic-aligned cross-attention mechanism. Experiments show that RefAtomNet++ establishes new state-of-the-art results.
- Score: 86.47790050206306
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Referring Atomic Video Action Recognition (RAVAR) aims to recognize fine-grained, atomic-level actions of a specific person of interest conditioned on natural language descriptions. Distinct from conventional action recognition and detection tasks, RAVAR emphasizes precise language-guided action understanding, which is particularly critical for interactive human action analysis in complex multi-person scenarios. In this work, we extend our previously introduced RefAVA dataset to RefAVA++, which comprises >2.9 million frames and >75.1k annotated persons in total. We benchmark this dataset using baselines from multiple related domains, including atomic action localization, video question answering, and text-video retrieval, as well as our earlier model, RefAtomNet. Although RefAtomNet surpasses other baselines by incorporating agent attention to highlight salient features, its ability to align and retrieve cross-modal information remains limited, leading to suboptimal performance in localizing the target person and predicting fine-grained actions. To overcome the aforementioned limitations, we introduce RefAtomNet++, a novel framework that advances cross-modal token aggregation through a multi-hierarchical semantic-aligned cross-attention mechanism combined with multi-trajectory Mamba modeling at the partial-keyword, scene-attribute, and holistic-sentence levels. In particular, scanning trajectories are constructed by dynamically selecting the nearest visual spatial tokens at each timestep for both partial-keyword and scene-attribute levels. Moreover, we design a multi-hierarchical semantic-aligned cross-attention strategy, enabling more effective aggregation of spatial and temporal tokens across different semantic hierarchies. Experiments show that RefAtomNet++ establishes new state-of-the-art results. The dataset and code are released at https://github.com/KPeng9510/refAVA2.
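The abstract's trajectory construction (dynamically selecting the nearest visual spatial token at each timestep for a given semantic level, then scanning the resulting sequence with a Mamba branch) can be illustrated with a minimal sketch. This is not the official RefAtomNet++ code (which is released at the repository linked above); the function name `build_trajectory`, the tensor shapes, and the use of cosine similarity as the "nearest" criterion are all illustrative assumptions.

```python
# Minimal sketch (PyTorch) of nearest-token trajectory construction: at each
# timestep, select the visual spatial token closest to a semantic query
# (e.g., a partial-keyword or scene-attribute embedding). Shapes and the
# similarity metric are assumptions, not the official RefAtomNet++ API.
import torch
import torch.nn.functional as F

def build_trajectory(visual_tokens: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
    """Select one visual token per timestep by nearest-neighbor matching.

    visual_tokens: (T, N, D) spatial tokens for T timesteps, N tokens each.
    query:         (D,) semantic query embedding for one hierarchy level.
    Returns:       (T, D) ordered token sequence, i.e., one scanning
                   trajectory to feed a sequence model such as Mamba.
    """
    # Cosine similarity between the query and every spatial token.
    sims = torch.einsum(
        "tnd,d->tn",
        F.normalize(visual_tokens, dim=-1),
        F.normalize(query, dim=0),
    )
    nearest = sims.argmax(dim=1)  # (T,) one token index per timestep
    return visual_tokens[torch.arange(visual_tokens.size(0)), nearest]

# One trajectory per semantic level (partial-keyword, scene-attribute, ...);
# each sequence would then be scanned by its own branch, giving the
# "multi-trajectory" modeling described in the abstract.
T, N, D = 16, 196, 256
tokens = torch.randn(T, N, D)
keyword_q, attribute_q = torch.randn(D), torch.randn(D)
trajectories = [build_trajectory(tokens, q) for q in (keyword_q, attribute_q)]
```

Under these assumptions, each semantic level yields its own ordered token sequence, so the sequence model sees a language-conditioned scan order rather than a fixed raster scan.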
Related papers
- Boosting Point-supervised Temporal Action Localization via Text Refinement and Alignment [66.80402022104074]
We propose a Text Refinement and Alignment (TRA) framework that utilizes semantically rich textual features from visual descriptions to complement the visual features. This is achieved by designing two new modules for the original point-supervised framework: a Point-based Text Refinement module (PTR) and a Point-based Multimodal Alignment module (PMA).
arXiv Detail & Related papers (2026-02-01T14:35:46Z) - J-ORA: A Framework and Multimodal Dataset for Japanese Object Identification, Reference, Action Prediction in Robot Perception [55.8311080124569]
J-ORA is a novel dataset that bridges the gap in robot perception by providing detailed object attribute annotations. It supports three critical perception tasks: object identification, reference resolution, and next-action prediction.
arXiv Detail & Related papers (2025-10-13T04:53:46Z) - Prototype-Aware Multimodal Alignment for Open-Vocabulary Visual Grounding [11.244257545057508]
Prototype-Aware Multimodal Learning (PAML) is an innovative framework that addresses imperfect alignment between visual and linguistic modalities, insufficient cross-modal feature fusion, and ineffective utilization of semantic prototype information. Our framework shows competitive performance in standard scenes while achieving state-of-the-art results in open-vocabulary scenes.
arXiv Detail & Related papers (2025-09-08T02:27:10Z) - VideoMolmo: Spatio-Temporal Grounding Meets Pointing [66.19964563104385]
VideoMolmo is a model tailored for fine-grained pointing in video sequences. A novel temporal mask fusion module employs SAM2 for bidirectional point propagation. To evaluate the generalization of VideoMolmo, we introduce VPoMolS-temporal, a challenging out-of-distribution benchmark spanning five real-world scenarios.
arXiv Detail & Related papers (2025-06-05T17:59:29Z) - Beyond Object Categories: Multi-Attribute Reference Understanding for Visual Grounding [10.04904999444546]
Referring expression comprehension (REC) aims to achieve object localization based on natural language descriptions. Existing REC approaches are constrained by object category descriptions and single-attribute intention descriptions. We propose Multi-ref EC, a novel framework that integrates state descriptions, derived intentions, and embodied gestures to locate target objects.
arXiv Detail & Related papers (2025-03-25T00:59:58Z) - ObjectRelator: Enabling Cross-View Object Relation Understanding Across Ego-Centric and Exo-Centric Perspectives [109.11714588441511]
The Ego-Exo object correspondence task aims to understand object relations across ego-exo perspectives through segmentation. PSALM, a recently proposed segmentation method, stands out as a notable exception with its demonstrated zero-shot ability on this task. We propose ObjectRelator, a novel approach featuring two key modules: Multimodal Condition Fusion and SSL-based Cross-View Object Alignment.
arXiv Detail & Related papers (2024-11-28T12:01:03Z) - Spatio-Temporal Context Prompting for Zero-Shot Action Detection [13.22912547389941]
We propose a method which can effectively leverage the rich knowledge of visual-language models to perform Person-Context Interaction. To address the challenge of recognizing distinct actions by multiple people at the same timestamp, we design the Interest Token Spotting mechanism. Our method achieves superior results compared to previous approaches and can be further extended to multi-action videos.
arXiv Detail & Related papers (2024-08-28T17:59:05Z) - Referring Atomic Video Action Recognition [40.85071733730557]
We introduce a new task called Referring Atomic Video Action Recognition.
We focus on recognizing the correct atomic action of a specific individual, guided by text.
We present the RefAVA dataset, containing 36,630 instances with manually annotated textual descriptions.
arXiv Detail & Related papers (2024-07-02T01:13:05Z) - Bootstrapping Referring Multi-Object Tracking [27.77514740607812]
We introduce a new and general referring understanding task, termed referring multi-object tracking (RMOT). Its core idea is to employ a language expression as a semantic cue to guide the prediction of multi-object tracking. To efficiently generate high-quality annotations, we propose a semi-automatic labeling pipeline that formulates a total of 9,758 language prompts.
arXiv Detail & Related papers (2024-06-07T16:02:10Z) - Semantics-Aware Dynamic Localization and Refinement for Referring Image Segmentation [102.25240608024063]
Referring image segmentation segments an image based on a language expression.
We develop an algorithm that shifts from being localization-centric to segmentation-centric.
Compared to its counterparts, our method is more versatile yet effective.
arXiv Detail & Related papers (2023-03-11T08:42:40Z) - Target-Aware Object Discovery and Association for Unsupervised Video Multi-Object Segmentation [79.6596425920849]
This paper addresses the task of unsupervised video multi-object segmentation.
We introduce a novel approach for more accurate and efficient spatio-temporal segmentation.
We evaluate the proposed approach on DAVIS-17 and YouTube-VIS, and the results demonstrate that it outperforms state-of-the-art methods in both segmentation accuracy and inference speed.
arXiv Detail & Related papers (2021-04-10T14:39:44Z)