Capturing Rich Behavior Representations: A Dynamic Action Semantic-Aware Graph Transformer for Video Captioning
- URL: http://arxiv.org/abs/2502.13754v1
- Date: Wed, 19 Feb 2025 14:16:47 GMT
- Title: Capturing Rich Behavior Representations: A Dynamic Action Semantic-Aware Graph Transformer for Video Captioning
- Authors: Caihua Liu, Xu Li, Wenjing Xue, Wei Tang, Xia Feng
- Abstract summary: Existing video captioning methods merely provide shallow or simplistic representations of object behaviors. We propose a dynamic action semantic-aware graph transformer to comprehensively capture the essence of object behavior.
- Score: 13.411096520754507
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing video captioning methods provide only shallow or simplistic representations of object behaviors, resulting in superficial and ambiguous descriptions. Object behavior, however, is dynamic and complex. To comprehensively capture its essence, we propose a dynamic action semantic-aware graph transformer. First, a multi-scale temporal modeling module is designed to flexibly learn long- and short-term latent action features. It not only acquires latent action features across time scales, but also considers local latent action details, enhancing the coherence and sensitivity of latent action representations. Second, a visual-action semantic-aware module is proposed to adaptively capture semantic representations related to object behavior, enhancing the richness and accuracy of action representations. By harnessing the collaborative efforts of these two modules, we can acquire rich behavior representations to generate human-like natural descriptions. Finally, these rich behavior representations, together with object representations, are used to construct a temporal objects-action graph, which is fed into a graph transformer to model the complex temporal dependencies between objects and actions. To avoid added complexity at inference time, the objects' behavioral knowledge is distilled into a simpler network through knowledge distillation. Experimental results on the MSVD and MSR-VTT datasets demonstrate that the proposed method achieves significant performance improvements across multiple metrics.
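As a concrete illustration of the final stage described above, the sketch below (PyTorch) treats per-frame object features and latent-action features as nodes of a temporal objects-action graph and runs them through a transformer layer. The class name, tensor shapes, and the dense-attention simplification are our assumptions for illustration, not the authors' released implementation.

```python
# Minimal sketch of the objects-action graph transformer stage: per-frame
# object features and latent-action features become nodes processed jointly
# by a transformer layer. Shapes and names are illustrative assumptions.
import torch
import torch.nn as nn

class TemporalObjectActionGraph(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)

    def forward(self, object_feats, action_feats):
        # object_feats: (B, T, N, D) N object nodes per frame
        # action_feats: (B, T, D)    one latent-action node per frame
        B, T, N, D = object_feats.shape
        nodes = torch.cat(
            [object_feats.reshape(B, T * N, D), action_feats], dim=1)
        # Full self-attention approximates message passing on a densely
        # connected objects-action graph; a sparser graph would instead be
        # imposed through an attention mask.
        return self.encoder(nodes)

feats = TemporalObjectActionGraph()(torch.randn(2, 8, 5, 256),
                                    torch.randn(2, 8, 256))
print(feats.shape)  # torch.Size([2, 48, 256]): 8*5 object + 8 action nodes
```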
Related papers
- Hierarchical Action Learning for Weakly-Supervised Action Segmentation [43.688046710022626]
We propose the Hierarchical Action Learning (HAL) model for weakly-supervised action segmentation. Our approach introduces a hierarchical causal data generation process, where a high-level latent action governs the dynamics of low-level visual features. Experimental results show that the HAL model significantly outperforms existing methods for weakly-supervised action segmentation.
arXiv Detail & Related papers (2026-02-27T18:48:22Z) - Precise Action-to-Video Generation Through Visual Action Prompts [62.951609704196485]
Action-driven video generation faces a precision-generality trade-off. Agent-centric action signals provide precision at the cost of cross-domain transferability. We "render" actions into precise visual prompts as domain-agnostic representations.
arXiv Detail & Related papers (2025-08-18T17:12:28Z) - Semantic Item Graph Enhancement for Multimodal Recommendation [49.66272783945571]
Multimodal recommendation systems have attracted increasing attention for the performance gains they achieve by leveraging items' multimodal information. Prior methods often build modality-specific item-item semantic graphs from raw modality features. These semantic graphs suffer from semantic deficiencies, including insufficient modeling of collaborative signals among items.
arXiv Detail & Related papers (2025-08-08T09:20:50Z) - Trokens: Semantic-Aware Relational Trajectory Tokens for Few-Shot Action Recognition [36.662223760818584]
Trokens is a novel approach that transforms trajectory points into semantic-aware relational tokens for action recognition. We develop a motion modeling framework that captures both intra-trajectory dynamics, through the Histogram of Oriented Displacements (HoD), and inter-trajectory relationships, to model complex action patterns. Our approach effectively combines these trajectory tokens with semantic features to enhance appearance features with motion information, achieving state-of-the-art performance across six diverse few-shot action recognition benchmarks.
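The Histogram of Oriented Displacements (HoD) named in this abstract can be made concrete with a short sketch: the orientation of successive trajectory displacements is binned, weighted by displacement magnitude. The bin count, weighting, and normalization below are our assumptions; the paper's descriptor may differ in detail.

```python
# Sketch of a Histogram of Oriented Displacements (HoD) descriptor for one
# point trajectory; parameters are illustrative assumptions.
import numpy as np

def hod(trajectory, n_bins=8):
    # trajectory: (T, 2) array of (x, y) track points over time
    disp = np.diff(trajectory, axis=0)            # (T-1, 2) displacements
    angles = np.arctan2(disp[:, 1], disp[:, 0])   # orientations in [-pi, pi]
    mags = np.linalg.norm(disp, axis=1)           # displacement magnitudes
    hist, _ = np.histogram(angles, bins=n_bins,
                           range=(-np.pi, np.pi), weights=mags)
    return hist / (hist.sum() + 1e-8)             # normalized descriptor

print(hod(np.cumsum(np.random.randn(16, 2), axis=0)))
```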
arXiv Detail & Related papers (2025-08-05T17:59:58Z) - LLM-enhanced Action-aware Multi-modal Prompt Tuning for Image-Text Matching [25.883546163390957]
We endow CLIP with fine-grained action-level understanding by incorporating action-related external knowledge generated by large language models (LLMs). We propose an adaptive interaction module to aggregate attentive visual features conditioned on action-aware prompted knowledge for establishing discriminative and action-aware visual representations.
arXiv Detail & Related papers (2025-06-30T03:49:08Z) - A Grammatical Compositional Model for Video Action Detection [24.546886938243393]
We present a novel Grammatical Compositional Model (GCM) for action detection based on typical And-Or graphs.
Our model exploits the intrinsic structures and latent relationships of actions in a hierarchical manner to harness both the compositionality of grammar models and the capability of expressing rich features of DNNs.
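To make the compositional idea concrete, here is a toy And-Or graph evaluation: an And node requires all of its children (combined conservatively with min), while an Or node selects the best alternative. The "answer phone" decomposition is invented for illustration; the GCM grounds such graphs in DNN features.

```python
# Toy And-Or graph scoring: And = min over children, Or = max over children.
# The action decomposition below is an invented example.
def score(node, leaf_scores):
    kind, children = node[0], node[1:]
    if kind == "leaf":
        return leaf_scores[children[0]]
    vals = [score(child, leaf_scores) for child in children]
    return min(vals) if kind == "and" else max(vals)

graph = ("and",
         ("leaf", "reach_phone"),
         ("or", ("leaf", "lift_left_hand"), ("leaf", "lift_right_hand")))
print(score(graph, {"reach_phone": 0.9,
                    "lift_left_hand": 0.2,
                    "lift_right_hand": 0.7}))  # 0.7
```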
arXiv Detail & Related papers (2023-10-04T15:24:00Z) - ACID: Action-Conditional Implicit Visual Dynamics for Deformable Object Manipulation [135.10594078615952]
We introduce ACID, an action-conditional visual dynamics model for volumetric deformable objects.
The accompanying benchmark contains over 17,000 action trajectories covering six types of plush toys and 78 variants.
Our model achieves the best performance in geometry, correspondence, and dynamics predictions.
arXiv Detail & Related papers (2022-03-14T04:56:55Z) - Slow-Fast Visual Tempo Learning for Video-based Action Recognition [78.3820439082979]
Action visual tempo characterizes the dynamics and the temporal scale of an action.
Previous methods capture the visual tempo either by sampling raw videos with multiple rates, or by hierarchically sampling backbone features.
We propose a Temporal Correlation Module (TCM) that extracts action visual tempo from low-level backbone features at a single layer.
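A minimal sketch of the underlying temporal-correlation computation follows: tempo cues are read from how strongly single-layer backbone features at one frame correlate with the next. The full TCM is richer (multi-branch aggregation, learned fusion); the shapes here are illustrative.

```python
# Core temporal-correlation computation, sketched: cosine similarity between
# consecutive frames' single-layer backbone features at each spatial position.
import torch
import torch.nn.functional as F

def temporal_correlation(feats):
    # feats: (B, T, C, H, W) features from a single backbone layer
    a = F.normalize(feats[:, :-1].flatten(3), dim=2)  # frames t:   (B,T-1,C,HW)
    b = F.normalize(feats[:, 1:].flatten(3), dim=2)   # frames t+1: (B,T-1,C,HW)
    return (a * b).sum(dim=2)                         # (B, T-1, HW)

print(temporal_correlation(torch.randn(2, 8, 64, 7, 7)).shape)  # (2, 7, 49)
```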
arXiv Detail & Related papers (2022-02-24T14:20:04Z) - Object-Region Video Transformers [100.23380634952083]
We present Object-Region Video Transformers (ORViT), an object-centric approach that extends video transformer layers with object representations.
Our ORViT block consists of two object-level streams: appearance and dynamics.
We show strong improvement in performance across all tasks and datasets considered, demonstrating the value of a model that incorporates object representations into a transformer architecture.
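A rough sketch of two object-level streams fused back into patch tokens might look as follows: an appearance stream attends from patch tokens to per-object features, and a dynamics stream encodes box coordinates. Layer sizes and the fusion-by-addition are simplifying assumptions, not the ORViT code.

```python
# Sketch of appearance + dynamics object streams folded into patch tokens;
# all sizes and the additive fusion are illustrative assumptions.
import torch
import torch.nn as nn

class ObjectStreams(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.appearance = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.dynamics = nn.Linear(4, dim)  # box (x, y, w, h) -> motion token

    def forward(self, patches, obj_feats, boxes):
        # patches: (B, P, D), obj_feats: (B, N, D), boxes: (B, N, 4)
        app, _ = self.appearance(patches, obj_feats, obj_feats)
        dyn = self.dynamics(boxes).mean(dim=1, keepdim=True)  # pooled motion
        return patches + app + dyn  # both streams folded into patch tokens

out = ObjectStreams()(torch.randn(2, 49, 256),
                      torch.randn(2, 4, 256),
                      torch.rand(2, 4, 4))
print(out.shape)  # torch.Size([2, 49, 256])
```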
arXiv Detail & Related papers (2021-10-13T17:51:46Z) - HyperDynamics: Meta-Learning Object and Agent Dynamics with Hypernetworks [18.892883695539002]
HyperDynamics is a dynamics meta-learning framework that generates parameters of neural dynamics models.
It outperforms existing models that adapt to environment variations by learning dynamics over high-dimensional visual observations.
We show our method matches the performance of an ensemble of separately trained experts, while also being able to generalize well to unseen environment variations at test time.
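The hypernetwork idea admits a compact sketch: a small network maps an environment or agent embedding to the full weights of a target dynamics model. The linear target model and all dimensions below are assumptions chosen for brevity.

```python
# Compact hypernetwork sketch: context embedding -> weights of a (linear)
# dynamics model predicting the next state. Dimensions are assumptions.
import torch
import torch.nn as nn

class HyperDynamicsSketch(nn.Module):
    def __init__(self, ctx_dim=32, state_dim=8, action_dim=2):
        super().__init__()
        self.in_dim = state_dim + action_dim
        self.out_dim = state_dim
        # hypernetwork: context -> flattened weight matrix + bias
        self.hyper = nn.Linear(ctx_dim,
                               self.in_dim * self.out_dim + self.out_dim)

    def forward(self, context, state, action):
        params = self.hyper(context)
        W = params[:, :self.in_dim * self.out_dim].view(
            -1, self.out_dim, self.in_dim)
        b = params[:, self.in_dim * self.out_dim:]
        x = torch.cat([state, action], dim=-1).unsqueeze(-1)
        return (W @ x).squeeze(-1) + b  # predicted next state

m = HyperDynamicsSketch()
print(m(torch.randn(4, 32), torch.randn(4, 8), torch.randn(4, 2)).shape)
# torch.Size([4, 8])
```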
arXiv Detail & Related papers (2021-03-17T04:48:43Z) - Self-Supervised Representation Learning from Flow Equivariance [97.13056332559526]
We present a new self-supervised representation learning framework that can be deployed directly on a video stream of complex scenes.
Our representations, learned from high-resolution raw video, can be readily used for downstream tasks on static images.
arXiv Detail & Related papers (2021-01-16T23:44:09Z) - Progressive Self-Guided Loss for Salient Object Detection [102.35488902433896]
We present a progressive self-guided loss function to facilitate deep learning-based salient object detection in images.
Our framework takes advantage of adaptively aggregated multi-scale features to locate and detect salient objects effectively.
arXiv Detail & Related papers (2021-01-07T07:33:38Z) - Learning to Represent Action Values as a Hypergraph on the Action
Vertices [17.811355496708728]
Action-value estimation is a critical component of reinforcement learning (RL) methods.
We conjecture that leveraging the structure of multi-dimensional action spaces is a key ingredient for learning good representations of action.
We show the effectiveness of our approach on a myriad of domains: illustrative prediction problems under minimal confounding effects, Atari 2600 games, and discretised physical control benchmarks.
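A loose sketch of the core idea: rather than one value head over every joint action, estimate per-dimension action values and combine them. The additive factorization below is a simplification; the paper generalizes it with hypergraph structure over the action vertices.

```python
# Factored action values for a multi-dimensional action space: one small head
# per action dimension. The additive combination is a simplification of the
# paper's hypergraph formulation.
import torch
import torch.nn as nn

class FactoredQ(nn.Module):
    def __init__(self, state_dim=16, dims=(3, 3, 2)):
        super().__init__()
        self.heads = nn.ModuleList(nn.Linear(state_dim, n) for n in dims)

    def forward(self, state):
        # a joint action's value is the sum of its coordinates' values
        # under this additive model
        return [head(state) for head in self.heads]

qs = FactoredQ()(torch.randn(5, 16))
print([tuple(q.shape) for q in qs])  # [(5, 3), (5, 3), (5, 2)]
```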
arXiv Detail & Related papers (2020-10-28T00:19:13Z) - Inferring Temporal Compositions of Actions Using Probabilistic Automata [61.09176771931052]
We propose to express temporal compositions of actions as semantic regular expressions and derive an inference framework using probabilistic automata.
Our approach is different from existing works that either predict long-range complex activities as unordered sets of atomic actions, or retrieve videos using natural language sentences.
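As a toy illustration of this inference style, the sketch below folds per-frame action probabilities through a hand-built two-state probabilistic automaton accepting "walk, then jump". The machine and probabilities are invented; the paper compiles its automata from semantic regular expressions over atomic actions.

```python
# Toy probabilistic-automaton inference: per-frame action probabilities are
# folded through the machine's transitions (forward algorithm). The 2-state
# automaton and all numbers are invented for illustration.
import numpy as np

# states: 0 = still walking, 1 = jump seen (accepting)
# actions: 0 = walk, 1 = jump; transition[s, a] = next state
transition = np.array([[0, 1],
                       [1, 1]])

def sequence_score(frame_probs):
    # frame_probs: (T, 2) per-frame probabilities of (walk, jump)
    alpha = np.array([1.0, 0.0])  # all probability mass starts in state 0
    for p in frame_probs:
        new = np.zeros(2)
        for s in range(2):
            for a in range(2):
                new[transition[s, a]] += alpha[s] * p[a]
        alpha = new
    return alpha[1]  # mass that reached the accepting state

probs = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8]])
print(sequence_score(probs))  # probability the clip matches "walk then jump"
```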
arXiv Detail & Related papers (2020-04-28T00:15:26Z)