Related papers: Building a Multi-modal Spatiotemporal Expert for Zero-shot Action Recognition with CLIP

Building a Multi-modal Spatiotemporal Expert for Zero-shot Action Recognition with CLIP

URL: http://arxiv.org/abs/2412.09895v2
Date: Sun, 09 Feb 2025 12:42:37 GMT
Title: Building a Multi-modal Spatiotemporal Expert for Zero-shot Action Recognition with CLIP
Authors: Yating Yu, Congqi Cao, Yueran Zhang, Qinyi Lv, Lingtong Min, Yanning Zhang,
Abstract summary: We propose a novel CLI framework to comprehend multi-temporal dynamics.<n>For the vision side, we propose an efficient Dynamic Crossshot Attention.<n>For the semantic side, we conduct text augmentation by constructing an Action Knowledge Graph.
Score: 34.88916568947695
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Zero-shot action recognition (ZSAR) requires collaborative multi-modal spatiotemporal understanding. However, finetuning CLIP directly for ZSAR yields suboptimal performance, given its inherent constraints in capturing essential temporal dynamics from both vision and text perspectives, especially when encountering novel actions with fine-grained spatiotemporal discrepancies. In this work, we propose Spatiotemporal Dynamic Duo (STDD), a novel CLIP-based framework to comprehend multi-modal spatiotemporal dynamics synergistically. For the vision side, we propose an efficient Space-time Cross Attention, which captures spatiotemporal dynamics flexibly with simple yet effective operations applied before and after spatial attention, without adding additional parameters or increasing computational complexity. For the semantic side, we conduct spatiotemporal text augmentation by comprehensively constructing an Action Semantic Knowledge Graph (ASKG) to derive nuanced text prompts. The ASKG elaborates on static and dynamic concepts and their interrelations, based on the idea of decomposing actions into spatial appearances and temporal motions. During the training phase, the frame-level video representations are meticulously aligned with prompt-level nuanced text representations, which are concurrently regulated by the video representations from the frozen CLIP to enhance generalizability. Extensive experiments validate the effectiveness of our approach, which consistently surpasses state-of-the-art approaches on popular video benchmarks (i.e., Kinetics-600, UCF101, and HMDB51) under challenging ZSAR settings.

Related papers

Identity-Preserving Text-to-Video Generation Guided by Simple yet Effective Spatial-Temporal Decoupled Representations [66.97034863216892]
Identity-preserving text-to-video (IPT2V) generation aims to create high-fidelity videos with consistent human identity.<n>Current end-to-end frameworks suffer a critical spatial-temporal trade-off.<n>We propose a simple yet effective spatial-temporal decoupled framework that decomposes representations into spatial features for layouts and temporal features for motion dynamics.
arXiv Detail & Related papers (2025-07-07T06:54:44Z)
StPR: Spatiotemporal Preservation and Routing for Exemplar-Free Video Class-Incremental Learning [51.003833566279006]
Class-Incremental Learning (CIL) seeks to develop models that continuously learn new action categories over time without previously acquired knowledge.<n>Existing approaches either rely on forgetting, raising concerns over memory and privacy, or adapt static image-based methods that neglect temporal modeling.<n>We propose a unified and exemplar-free VCIL framework that explicitly disentangles and preserves information.
arXiv Detail & Related papers (2025-05-20T06:46:51Z)
STOP: Integrated Spatial-Temporal Dynamic Prompting for Video Understanding [48.12128042470839]
We propose an integrated Spatial-TempOral dynamic Prompting (STOP) model. It consists of two complementary modules, the intra-frame spatial prompting and inter-frame temporal prompting. STOP consistently achieves superior performance against state-of-the-art methods.
arXiv Detail & Related papers (2025-03-20T09:16:20Z)
EvolvingGS: High-Fidelity Streamable Volumetric Video via Evolving 3D Gaussian Representation [14.402479944396665]
We introduce EvolvingGS, a two-stage strategy that first deforms the Gaussian model to align with the target frame, and then refines it with minimal point addition/subtraction. Owing to the flexibility of the incrementally evolving representation, our method outperforms existing approaches in terms of both per-frame and temporal quality metrics. Our method significantly advances the state-of-the-art in dynamic scene reconstruction, particularly for extended sequences with complex human performances.
arXiv Detail & Related papers (2025-03-07T06:01:07Z)
ARN-LSTM: A Multi-Stream Fusion Model for Skeleton-based Action Recognition [5.86850933017833]
ARN-LSTM architecture is designed to address the challenge of simultaneously capturing spatial motion and temporal dynamics in action sequences.<n>Our proposed model integrates joint, motion, and temporal information through a multi-stream fusion architecture.
arXiv Detail & Related papers (2024-11-04T03:29:51Z)
Low-Light Video Enhancement via Spatial-Temporal Consistent Illumination and Reflection Decomposition [68.6707284662443]
Low-Light Video Enhancement (LLVE) seeks to restore dynamic and static scenes plagued by severe invisibility and noise. One critical aspect is formulating a consistency constraint specifically for temporal-spatial illumination and appearance enhanced versions. We present an innovative video Retinex-based decomposition strategy that operates without the need for explicit supervision.
arXiv Detail & Related papers (2024-05-24T15:56:40Z)
MA-FSAR: Multimodal Adaptation of CLIP for Few-Shot Action Recognition [41.78245303513613]
We introduce MA-FSAR, a framework that employs the Fine-Tuning (PEFT) technique to enhance the CLIP visual encoder in terms of action-related temporal and semantic representations. In addition to these token-level designs, we propose a prototype-level text-guided construction module to further enrich the temporal and semantic characteristics of video prototypes.
arXiv Detail & Related papers (2023-08-03T04:17:25Z)
Implicit Temporal Modeling with Learnable Alignment for Video Recognition [95.82093301212964]
We propose a novel Implicit Learnable Alignment (ILA) method, which minimizes the temporal modeling effort while achieving incredibly high performance. ILA achieves a top-1 accuracy of 88.7% on Kinetics-400 with much fewer FLOPs compared with Swin-L and ViViT-H.
arXiv Detail & Related papers (2023-04-20T17:11:01Z)
Leaping Into Memories: Space-Time Deep Feature Synthesis [93.10032043225362]
We propose LEAPS, an architecture-independent method for synthesizing videos from internal models. We quantitatively and qualitatively evaluate the applicability of LEAPS by inverting a range of architectures convolutional attention-based on Kinetics-400.
arXiv Detail & Related papers (2023-03-17T12:55:22Z)
Spatiotemporal Decouple-and-Squeeze Contrastive Learning for Semi-Supervised Skeleton-based Action Recognition [12.601122522537459]
We propose a novel Stemporal Decouple Contrastive Learning (SDS-CL) framework to learn more abundant representations of skeleton-based actions. We present a new Temporal-squeezing Loss (STL), a new Temporal-squeezing Loss (TSL), and the Global-contrasting Loss (GL) to contrast the spatial-squeezing joint and motion features at the frame level, temporal-squeezing joint and motion features at the joint level, as well as global joint and motion features at the skeleton level.
arXiv Detail & Related papers (2023-02-05T06:52:25Z)
Learning Self-Similarity in Space and Time as Generalized Motion for Action Recognition [42.175450800733785]
We propose a rich motion representation based on video self-similarity (STSS) We leverage the whole volume of STSSS and let our model learn to extract an effective motion representation from it. The proposed neural block, dubbed SELFY, can be easily inserted into neural architectures and trained end-to-end without additional supervision.
arXiv Detail & Related papers (2021-02-14T07:32:55Z)
Learn to cycle: Time-consistent feature discovery for action recognition [83.43682368129072]
Generalizing over temporal variations is a prerequisite for effective action recognition in videos. We introduce Squeeze Re Temporal Gates (SRTG), an approach that favors temporal activations with potential variations. We show consistent improvement when using SRTPG blocks, with only a minimal increase in the number of GFLOs.
arXiv Detail & Related papers (2020-06-15T09:36:28Z)
A Spatial-Temporal Attentive Network with Spatial Continuity for Trajectory Prediction [74.00750936752418]
We propose a novel model named spatial-temporal attentive network with spatial continuity (STAN-SC) First, spatial-temporal attention mechanism is presented to explore the most useful and important information. Second, we conduct a joint feature sequence based on the sequence and instant state information to make the generative trajectories keep spatial continuity.
arXiv Detail & Related papers (2020-03-13T04:35:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.