Related papers: D$^2$ST-Adapter: Disentangled-and-Deformable Spatio-Temporal Adapter for Few-shot Action Recognition

D$^2$ST-Adapter: Disentangled-and-Deformable Spatio-Temporal Adapter for Few-shot Action Recognition

URL: http://arxiv.org/abs/2312.01431v3
Date: Sat, 20 Apr 2024 14:15:36 GMT
Title: D$^2$ST-Adapter: Disentangled-and-Deformable Spatio-Temporal Adapter for Few-shot Action Recognition
Authors: Wenjie Pei, Qizhong Tan, Guangming Lu, Jiandong Tian,
Abstract summary: Adapting large pre-trained image models to few-shot action recognition has proven to be an effective strategy for learning robust feature extractors. We present the Disentangled-and-Deformable Spatio-Temporal Adapter (D$2$ST-Adapter), which is a novel tuning framework well-suited for few-shot action recognition.
Score: 60.84084172829169
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Adapting large pre-trained image models to few-shot action recognition has proven to be an effective and efficient strategy for learning robust feature extractors, which is essential for few-shot learning. Typical fine-tuning based adaptation paradigm is prone to overfitting in the few-shot learning scenarios and offers little modeling flexibility for learning temporal features in video data. In this work we present the Disentangled-and-Deformable Spatio-Temporal Adapter (D$^2$ST-Adapter), which is a novel adapter tuning framework well-suited for few-shot action recognition due to lightweight design and low parameter-learning overhead. It is designed in a dual-pathway architecture to encode spatial and temporal features in a disentangled manner. In particular, we devise the anisotropic Deformable Spatio-Temporal Attention module as the core component of D$^2$ST-Adapter, which can be tailored with anisotropic sampling densities along spatial and temporal domains to learn spatial and temporal features specifically in corresponding pathways, allowing our D$^2$ST-Adapter to encode features in a global view in 3D spatio-temporal space while maintaining a lightweight design. Extensive experiments with instantiations of our method on both pre-trained ResNet and ViT demonstrate the superiority of our method over state-of-the-art methods for few-shot action recognition. Our method is particularly well-suited to challenging scenarios where temporal dynamics are critical for action recognition.

Related papers

Terrain Sensing with Smartphone Structured Light: 2D Dynamic Time Warping for Grid Pattern Matching [0.0]
A smartphone-based structured-light system projects a grid pattern onto the ground and reconstructs local terrain unevenness from a single handheld device.<n>We propose a topology-constrained two-dimensional dynamic time warping (2D-DTW) algorithm that performs column-wise alignment under a global grid consistency constraint.<n>We demonstrate our 2D-DTW formulation can be used not only for terrain sensing but also as a general tool for matching structured grid patterns in image processing scenarios.
arXiv Detail & Related papers (2025-11-29T15:09:55Z)
Identity-Preserving Text-to-Video Generation Guided by Simple yet Effective Spatial-Temporal Decoupled Representations [131.33758144860988]
Identity-preserving text-to-video (IPT2V) generation aims to create high-fidelity videos with consistent human identity.<n>Current end-to-end frameworks suffer a critical spatial-temporal trade-off.<n>We propose a simple yet effective spatial-temporal decoupled framework that decomposes representations into spatial features for layouts and temporal features for motion dynamics.
arXiv Detail & Related papers (2025-07-07T06:54:44Z)
EVA02-AT: Egocentric Video-Language Understanding with Spatial-Temporal Rotary Positional Embeddings and Symmetric Optimization [17.622013322533423]
We introduce EVA02-AT, a suite of EVA02-based video-language foundation models tailored to egocentric video understanding tasks.<n> EVA02-AT efficiently transfers an image-based CLIP model into a unified video encoder via a single-stage pretraining.<n>We introduce the Symmetric Multi-Similarity (SMS) loss and a novel training framework that advances all soft labels for both positive and negative pairs.
arXiv Detail & Related papers (2025-06-17T09:51:51Z)
UniSTD: Towards Unified Spatio-Temporal Learning across Diverse Disciplines [64.84631333071728]
We introduce bfUnistage, a unified Transformer-based framework fortemporal modeling. Our work demonstrates that a task-specific vision-text can build a generalizable model fortemporal learning. We also introduce a temporal module to incorporate temporal dynamics explicitly.
arXiv Detail & Related papers (2025-03-26T17:33:23Z)
Graph and Skipped Transformer: Exploiting Spatial and Temporal Modeling Capacities for Efficient 3D Human Pose Estimation [36.93661496405653]
We take a global approach to exploit Transformer-temporal information with a concise Graph and Skipped Transformer architecture. Specifically, in 3D pose stage, coarse-grained body parts are deployed to construct a fully data-driven adaptive model. Experiments are conducted on Human3.6M, MPI-INF-3DHP and Human-Eva benchmarks.
arXiv Detail & Related papers (2024-07-03T10:42:09Z)
A-SDM: Accelerating Stable Diffusion through Model Assembly and Feature Inheritance Strategies [51.7643024367548]
Stable Diffusion Model is a prevalent and effective model for text-to-image (T2I) and image-to-image (I2I) generation. This study focuses on reducing redundant computation in SDM and optimizing the model through both tuning and tuning-free methods.
arXiv Detail & Related papers (2024-05-31T21:47:05Z)
Dynamic 3D Point Cloud Sequences as 2D Videos [81.46246338686478]
3D point cloud sequences serve as one of the most common and practical representation modalities of real-world environments. We propose a novel generic representation called textitStructured Point Cloud Videos (SPCVs) SPCVs re-organizes a point cloud sequence as a 2D video with spatial smoothness and temporal consistency, where the pixel values correspond to the 3D coordinates of points.
arXiv Detail & Related papers (2024-03-02T08:18:57Z)
Deepfake Detection: Leveraging the Power of 2D and 3D CNN Ensembles [0.0]
This work presents an innovative approach to validate video content. The methodology blends advanced 2-dimensional and 3-dimensional Convolutional Neural Networks. Experimental validation underscores the effectiveness of this strategy, showcasing its potential in countering deepfakes generation.
arXiv Detail & Related papers (2023-10-25T06:00:37Z)
ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video [15.952896909797728]
Adapting image models to the video domain has emerged as an efficient paradigm for solving video recognition tasks. Recent research is shifting its focus toward parameter-efficient image-to-video adaptation. We present a new adaptation paradigm (ZeroI2V) to transfer the image transformers to video recognition tasks.
arXiv Detail & Related papers (2023-10-02T16:41:20Z)
Disentangling Spatial and Temporal Learning for Efficient Image-to-Video Transfer Learning [59.26623999209235]
We present DiST, which disentangles the learning of spatial and temporal aspects of videos. The disentangled learning in DiST is highly efficient because it avoids the back-propagation of massive pre-trained parameters. Extensive experiments on five benchmarks show that DiST delivers better performance than existing state-of-the-art methods by convincing gaps.
arXiv Detail & Related papers (2023-09-14T17:58:33Z)
Deeply-Coupled Convolution-Transformer with Spatial-temporal Complementary Learning for Video-based Person Re-identification [91.56939957189505]
We propose a novel spatial-temporal complementary learning framework named Deeply-Coupled Convolution-Transformer (DCCT) for high-performance video-based person Re-ID. Our framework could attain better performances than most state-of-the-art methods.
arXiv Detail & Related papers (2023-04-27T12:16:44Z)
Dual-path Adaptation from Image to Video Transformers [62.056751480114784]
We efficiently transfer the surpassing representation power of the vision foundation models, such as ViT and Swin, for video understanding with only a few trainable parameters. We propose a novel DualPath adaptation separated into spatial and temporal adaptation paths, where a lightweight bottleneck adapter is employed in each transformer block.
arXiv Detail & Related papers (2023-03-17T09:37:07Z)
Gait Recognition in the Wild with Multi-hop Temporal Switch [81.35245014397759]
gait recognition in the wild is a more practical problem that has attracted the attention of the community of multimedia and computer vision. This paper presents a novel multi-hop temporal switch method to achieve effective temporal modeling of gait patterns in real-world scenes.
arXiv Detail & Related papers (2022-09-01T10:46:09Z)
Motion-aware Memory Network for Fast Video Salient Object Detection [15.967509480432266]
We design a space-time memory (STM)-based network, which extracts useful temporal information of the current frame from adjacent frames as the temporal branch of VSOD. In the encoding stage, we generate high-level temporal features by using high-level features from the current and its adjacent frames. In the decoding stage, we propose an effective fusion strategy for spatial and temporal branches. The proposed model does not require optical flow or other preprocessing, and can reach a speed of nearly 100 FPS during inference.
arXiv Detail & Related papers (2022-08-01T15:56:19Z)
Exploring Temporal Coherence for More General Video Face Forgery Detection [22.003901822221227]
We propose a novel end-to-end framework, which consists of two major stages. The first stage is a fully temporal convolution network (FTCN), while maintaining the temporal convolution kernel size unchanged. The second stage is a Temporal Transformer network, which aims to explore the long-term temporal coherence.
arXiv Detail & Related papers (2021-08-15T08:45:37Z)
Adaptive Latent Space Tuning for Non-Stationary Distributions [62.997667081978825]
We present a method for adaptive tuning of the low-dimensional latent space of deep encoder-decoder style CNNs. We demonstrate our approach for predicting the properties of a time-varying charged particle beam in a particle accelerator.
arXiv Detail & Related papers (2021-05-08T03:50:45Z)
STH: Spatio-Temporal Hybrid Convolution for Efficient Action Recognition [39.58542259261567]
We present a novel S-Temporal Hybrid Network (STH) which simultaneously encodes spatial and temporal video information with a small parameter. Such a design enables efficient-temporal modeling and maintains a small model scale. STH enjoys performance superiority over 3D CNNs while maintaining an even smaller parameter cost than 2D CNNs.
arXiv Detail & Related papers (2020-03-18T04:46:30Z)

This list is automatically generated from the titles and abstracts of the papers in this site.