Trokens: Semantic-Aware Relational Trajectory Tokens for Few-Shot Action Recognition
- URL: http://arxiv.org/abs/2508.03695v1
- Date: Tue, 05 Aug 2025 17:59:58 GMT
- Title: Trokens: Semantic-Aware Relational Trajectory Tokens for Few-Shot Action Recognition
- Authors: Pulkit Kumar, Shuaiyi Huang, Matthew Walmer, Sai Saketh Rambhatla, Abhinav Shrivastava
- Abstract summary: Trokens is a novel approach that transforms trajectory points into semantic-aware relational tokens for action recognition. We develop a motion modeling framework that captures both intra-trajectory dynamics through the Histogram of Oriented Displacements (HoD) and inter-trajectory relationships to model complex action patterns. Our approach effectively combines these trajectory tokens with semantic features to enhance appearance features with motion information, achieving state-of-the-art performance across six diverse few-shot action recognition benchmarks.
- Score: 36.662223760818584
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video understanding requires effective modeling of both motion and appearance information, particularly for few-shot action recognition. While recent advances in point tracking have been shown to improve few-shot action recognition, two fundamental challenges persist: selecting informative points to track and effectively modeling their motion patterns. We present Trokens, a novel approach that transforms trajectory points into semantic-aware relational tokens for action recognition. First, we introduce a semantic-aware sampling strategy to adaptively distribute tracking points based on object scale and semantic relevance. Second, we develop a motion modeling framework that captures both intra-trajectory dynamics through the Histogram of Oriented Displacements (HoD) and inter-trajectory relationships to model complex action patterns. Our approach effectively combines these trajectory tokens with semantic features to enhance appearance features with motion information, achieving state-of-the-art performance across six diverse few-shot action recognition benchmarks: Something-Something-V2 (both full and small splits), Kinetics, UCF101, HMDB51, and FineGym. For project page see https://trokens-iccv25.github.io
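As a concrete illustration of the intra-trajectory descriptor named in the abstract, the sketch below computes a Histogram of Oriented Displacements for a single tracked point. The abstract does not specify the binning, weighting, or normalization scheme, so the eight orientation bins, magnitude weighting, and L1 normalization here are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def histogram_of_oriented_displacements(trajectory, num_bins=8):
    """Illustrative HoD sketch: bin the orientations of successive displacement
    vectors along one point trajectory, weighted by displacement magnitude.

    trajectory: (T, 2) array of (x, y) point positions over T frames.
    Returns a length-`num_bins` descriptor (assumed L1-normalized).
    """
    displacements = np.diff(trajectory, axis=0)                    # (T-1, 2) frame-to-frame motion
    angles = np.arctan2(displacements[:, 1], displacements[:, 0])  # orientations in [-pi, pi]
    magnitudes = np.linalg.norm(displacements, axis=1)
    # Quantize orientations into num_bins bins covering the full circle.
    bin_edges = np.linspace(-np.pi, np.pi, num_bins + 1)
    hist, _ = np.histogram(angles, bins=bin_edges, weights=magnitudes)
    # Normalize so the descriptor is comparable across trajectory lengths/speeds.
    total = hist.sum()
    return hist / total if total > 0 else hist

# Example: a point that drifts right then up puts mass in two orientation bins.
traj = np.array([[0, 0], [1, 0], [2, 0], [2, 1], [2, 2]], dtype=float)
print(histogram_of_oriented_displacements(traj, num_bins=8))
```

The inter-trajectory modeling described in the abstract would then relate such per-trajectory descriptors across tracked points (e.g., as tokens in a relational module); that stage is outside the scope of this sketch.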
Related papers
- SynMotion: Semantic-Visual Adaptation for Motion Customized Video Generation [56.90807453045657]
SynMotion is a motion-customized video generation model that jointly leverages semantic guidance and visual adaptation. At the semantic level, we introduce a dual-embedding semantic comprehension mechanism which disentangles subject and motion representations. At the visual level, we integrate efficient motion adapters into a pre-trained video generation model to enhance motion fidelity and temporal coherence.
arXiv Detail & Related papers (2025-06-30T10:09:32Z) - Feature Hallucination for Self-supervised Action Recognition [37.20267786858476]
We propose a deep translational action recognition framework that enhances recognition accuracy by jointly predicting action concepts and auxiliary features from RGB video frames. Our framework achieves state-of-the-art performance on multiple benchmarks, including Kinetics-400, Kinetics-600, and Something-Something V2, demonstrating its effectiveness in capturing fine-grained action dynamics.
arXiv Detail & Related papers (2025-06-25T11:50:23Z) - Learning Appearance and Motion Cues for Panoptic Tracking [13.062016289815057]
Panoptic tracking enables pixel-level scene interpretation of videos by integrating instance tracking in panoptic segmentation. We propose a novel approach for panoptic tracking that simultaneously captures general semantic information and instance-specific appearance and motion features.
arXiv Detail & Related papers (2025-03-12T09:32:29Z) - Trajectory-aligned Space-time Tokens for Few-shot Action Recognition [34.97285458776108]
We build trajectory-aligned tokens (TATs) that capture motion and appearance information.
This approach significantly reduces the data requirements while retaining essential information.
We demonstrate state-of-the-art results on few-shot action recognition across multiple datasets.
arXiv Detail & Related papers (2024-07-25T17:59:31Z) - Multi-view Action Recognition via Directed Gromov-Wasserstein Discrepancy [12.257725479880458]
Action recognition has become one of the most popular research topics in computer vision.
We propose a multi-view attention consistency method that computes the similarity between two attentions from two different views of the action videos.
Our approach applies the idea of Neural Radiance Field to implicitly render the features from novel views when training on single-view datasets.
arXiv Detail & Related papers (2024-05-02T14:43:21Z) - Single-Shot and Multi-Shot Feature Learning for Multi-Object Tracking [55.13878429987136]
We propose a simple yet effective two-stage feature learning paradigm to jointly learn single-shot and multi-shot features for different targets.
Our method has achieved significant improvements on MOT17 and MOT20 datasets while reaching state-of-the-art performance on DanceTrack dataset.
arXiv Detail & Related papers (2023-11-17T08:17:49Z) - MotionTrack: Learning Motion Predictor for Multiple Object Tracking [68.68339102749358]
We introduce a novel motion-based tracker, MotionTrack, centered around a learnable motion predictor.
Our experimental results demonstrate that MotionTrack yields state-of-the-art performance on datasets such as Dancetrack and SportsMOT.
arXiv Detail & Related papers (2023-06-05T04:24:11Z) - DOAD: Decoupled One Stage Action Detection Network [77.14883592642782]
Localizing people and recognizing their actions from videos is a challenging task towards high-level video understanding.
Existing methods are mostly two-stage based, with one stage for person bounding box generation and the other stage for action recognition.
We present a decoupled one-stage network, dubbed DOAD, to improve the efficiency of spatio-temporal action detection.
arXiv Detail & Related papers (2023-04-01T08:06:43Z) - Object Discovery from Motion-Guided Tokens [50.988525184497334]
We augment the auto-encoder representation learning framework with motion-guidance and mid-level feature tokenization.
Our approach enables the emergence of interpretable object-specific mid-level features.
arXiv Detail & Related papers (2023-03-27T19:14:00Z) - A Graph-based Interactive Reasoning for Human-Object Interaction Detection [71.50535113279551]
We present a novel graph-based interactive reasoning model called Interactive Graph (abbr. in-Graph) to infer HOIs.
We construct a new framework to assemble in-Graph models for detecting HOIs, namely in-GraphNet.
Our framework is end-to-end trainable and free from costly annotations like human pose.
arXiv Detail & Related papers (2020-07-14T09:29:03Z)