Hierarchical Relation-augmented Representation Generalization for Few-shot Action Recognition
- URL: http://arxiv.org/abs/2504.10079v2
- Date: Sat, 09 Aug 2025 15:05:25 GMT
- Title: Hierarchical Relation-augmented Representation Generalization for Few-shot Action Recognition
- Authors: Hongyu Qu, Ling Xing, Jiachao Zhang, Rui Yan, Yazhou Yao, Xiangbo Shu,
- Abstract summary: Few-shot action recognition aims to recognize novel action categories with few exemplars. Existing methods typically learn frame-level representations for each video by designing inter-frame temporal modeling strategies. We propose HR2G-shot, a Hierarchical Relation-augmented Representation Generalization framework for FSAR.
- Score: 43.84348967231349
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Few-shot action recognition (FSAR) aims to recognize novel action categories with few exemplars. Existing methods typically learn frame-level representations for each video by designing inter-frame temporal modeling strategies or inter-video interaction at the coarse video-level granularity. However, they treat each episode task in isolation and neglect fine-grained temporal relation modeling between videos, thus failing to capture shared fine-grained temporal patterns across videos and reuse temporal knowledge from historical tasks. In light of this, we propose HR2G-shot, a Hierarchical Relation-augmented Representation Generalization framework for FSAR, which unifies three types of relation modeling (inter-frame, inter-video, and inter-task) to learn task-specific temporal patterns from a holistic view. Going beyond conducting inter-frame temporal interactions, we further devise two components to respectively explore inter-video and inter-task relationships: i) Inter-video Semantic Correlation (ISC) performs cross-video frame-level interactions in a fine-grained manner, thereby capturing task-specific query features and enhancing both intra-class consistency and inter-class separability; ii) Inter-task Knowledge Transfer (IKT) retrieves and aggregates relevant temporal knowledge from the bank, which stores diverse temporal patterns from historical episode tasks. Extensive experiments on five benchmarks show that HR2G-shot outperforms current top-leading FSAR methods.
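To make the two components more concrete, the following is a minimal, hypothetical sketch of how cross-video frame-level correlation (ISC) and knowledge-bank retrieval (IKT) could be wired together. The function names, tensor shapes, attention form, and top-k aggregation are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def inter_video_semantic_correlation(query, support):
    """Assumed form of ISC: refine query-video frame features by attending
    to frame features of all support videos in the current episode.

    query:   [T, D]    frame features of one query video
    support: [N, T, D] frame features of the N support videos
    returns: [T, D]    task-specific query features
    """
    keys = support.reshape(-1, support.size(-1))                       # [N*T, D]
    attn = F.softmax(query @ keys.t() / query.size(-1) ** 0.5, dim=-1) # [T, N*T]
    context = attn @ keys                                              # [T, D]
    return query + context                                             # residual refinement

def inter_task_knowledge_transfer(video_repr, bank, top_k=5):
    """Assumed form of IKT: retrieve the most similar temporal patterns from
    a bank filled during historical episode tasks and aggregate them into
    the current representation.

    video_repr: [D]    pooled temporal representation of the current video
    bank:       [M, D] stored temporal patterns
    """
    sims = F.cosine_similarity(video_repr.unsqueeze(0), bank, dim=-1)  # [M]
    vals, idx = sims.topk(min(top_k, bank.size(0)))
    weights = F.softmax(vals, dim=-1)
    retrieved = (weights.unsqueeze(-1) * bank[idx]).sum(dim=0)         # [D]
    return video_repr + retrieved

# Toy usage with random features (a frame-level backbone would produce these).
T, D, N, M = 8, 64, 5, 100
query_frames = torch.randn(T, D)
support_frames = torch.randn(N, T, D)
knowledge_bank = torch.randn(M, D)

refined = inter_video_semantic_correlation(query_frames, support_frames)
pooled = refined.mean(dim=0)                       # simple temporal pooling
final = inter_task_knowledge_transfer(pooled, knowledge_bank)
print(final.shape)                                 # torch.Size([64])
```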
Related papers
- Enhanced Partially Relevant Video Retrieval through Inter- and Intra-Sample Analysis with Coherence Prediction [18.24629930062925]
Partially Relevant Video Retrieval aims to retrieve the target video that is partially relevant to a text query. Existing methods coarsely align paired videos and text queries to construct the semantic space. We propose a novel PRVR framework to systematically exploit inter-sample correlation and intra-sample redundancy.
arXiv Detail & Related papers (2025-04-28T09:52:46Z) - AsyReC: A Multimodal Graph-based Framework for Spatio-Temporal Asymmetric Dyadic Relationship Classification [8.516886985159928]
Dyadic social relationships are shaped by shared spatial and temporal experiences. Current computational methods for modeling these relationships face three major challenges. We propose AsyReC, a multimodal graph-based framework for asymmetric dyadic relationship classification.
arXiv Detail & Related papers (2025-04-07T12:52:23Z) - DreamRelation: Relation-Centric Video Customization [33.65405972817795]
Video customization refers to the creation of personalized videos that depict user-specified relations between two subjects. While existing methods can personalize subject appearances and motions, they still struggle with complex video customization. We propose DreamRelation, a novel approach that learns the target relation from a small set of videos, leveraging two key components: Decoupling Learning and Dynamics Enhancement.
arXiv Detail & Related papers (2025-03-10T17:58:03Z) - VrdONE: One-stage Video Visual Relation Detection [30.983521962897477]
Video Visual Relation Detection (VidVRD) focuses on understanding how entities interact over time and space in videos.
Traditional methods for VidVRD, challenged by its complexity, typically split the task into two parts: one for identifying which relations are present and another for determining their temporal boundaries.
We propose VrdONE, a streamlined yet efficacious one-stage model for VidVRD.
arXiv Detail & Related papers (2024-08-18T08:38:20Z) - REACT: Recognize Every Action Everywhere All At Once [8.10024991952397]
Group Activity Recognition (GAR) is a fundamental problem in computer vision, with diverse applications in sports analysis, surveillance, and social scene understanding.
We present REACT, an architecture inspired by the transformer encoder-decoder model.
Our method outperforms state-of-the-art GAR approaches in extensive experiments, demonstrating superior accuracy in recognizing and understanding group activities.
arXiv Detail & Related papers (2023-11-27T20:48:54Z) - Spatial-Temporal Knowledge-Embedded Transformer for Video Scene Graph Generation [64.85974098314344]
Video scene graph generation (VidSGG) aims to identify objects in visual scenes and infer their relationships for a given video.
Inherently, object pairs and their relationships enjoy spatial co-occurrence correlations within each image and temporal consistency/transition correlations across different images.
We propose a spatial-temporal knowledge-embedded transformer (STKET) that incorporates the prior spatial-temporal knowledge into the multi-head cross-attention mechanism.
arXiv Detail & Related papers (2023-09-23T02:40:28Z) - Relational Temporal Graph Reasoning for Dual-task Dialogue Language Understanding [39.76268402567324]
Dual-task dialog language understanding aims to tackle two correlative dialog language understanding tasks simultaneously via their inherent correlations.
We put forward a new framework, whose core is relational temporal graph reasoning.
Our models outperform state-of-the-art models by a large margin.
arXiv Detail & Related papers (2023-06-15T13:19:08Z) - Weakly-Supervised Spatio-Temporal Anomaly Detection in Surveillance Video [128.41392860714635]
We introduce Weakly-Supervised Spatio-Temporal Anomaly Detection (WSSTAD) in surveillance video.
WSSTAD aims to localize a spatio-temporal tube (i.e., a sequence of bounding boxes at consecutive times) that encloses an abnormal event.
We propose a dual-branch network that takes as input proposals at multiple granularities in both the spatial and temporal domains.
arXiv Detail & Related papers (2021-08-09T06:11:14Z) - Modeling long-term interactions to enhance action recognition [81.09859029964323]
We propose a new approach to understand actions in egocentric videos that exploits the semantics of object interactions at both frame and temporal levels.
We use a region-based approach that takes as input a primary region roughly corresponding to the user hands and a set of secondary regions potentially corresponding to the interacting objects.
The proposed approach outperforms the state-of-the-art in terms of action recognition on standard benchmarks.
arXiv Detail & Related papers (2021-04-23T10:08:15Z) - Unified Graph Structured Models for Video Understanding [93.72081456202672]
We propose a message passing graph neural network that explicitly models spatio-temporal relations.
We show how our method is able to more effectively model relationships between relevant entities in the scene.
arXiv Detail & Related papers (2021-03-29T14:37:35Z) - Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos [76.21297023629589]
We propose a novel method for learning pairwise modality interactions in order to better exploit complementary information for each pair of modalities in videos.
Our method achieves state-of-the-art performance on four standard benchmark datasets.
arXiv Detail & Related papers (2020-07-28T12:40:59Z) - Cascaded Human-Object Interaction Recognition [175.60439054047043]
We introduce a cascade architecture for multi-stage, coarse-to-fine HOI understanding.
At each stage, an instance localization network progressively refines HOI proposals and feeds them into an interaction recognition network.
With our carefully-designed human-centric relation features, these two modules work collaboratively towards effective interaction understanding.
arXiv Detail & Related papers (2020-03-09T17:05:04Z)