Implicit Temporal Modeling with Learnable Alignment for Video
Recognition
- URL: http://arxiv.org/abs/2304.10465v2
- Date: Tue, 15 Aug 2023 08:04:00 GMT
- Title: Implicit Temporal Modeling with Learnable Alignment for Video
Recognition
- Authors: Shuyuan Tu, Qi Dai, Zuxuan Wu, Zhi-Qi Cheng, Han Hu, Yu-Gang Jiang
- Abstract summary: We propose a novel Implicit Learnable Alignment (ILA) method, which minimizes the temporal modeling effort while achieving incredibly high performance.
ILA achieves a top-1 accuracy of 88.7% on Kinetics-400 with far fewer FLOPs than Swin-L and ViViT-H.
- Score: 95.82093301212964
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastive language-image pretraining (CLIP) has demonstrated remarkable
success in various image tasks. However, how to extend CLIP with effective
temporal modeling is still an open and crucial problem. Existing factorized or
joint spatial-temporal modeling trades off between the efficiency and
performance. While modeling temporal information within a straight-through
tube is widely adopted in the literature, we find that simple frame alignment
already provides enough essence without temporal attention. To this end, in
this paper, we propose a novel Implicit Learnable Alignment (ILA) method,
which minimizes
the temporal modeling effort while achieving incredibly high performance.
Specifically, for a frame pair, an interactive point is predicted in each
frame, serving as a mutual-information-rich region. By enhancing the features
around the interactive point, two frames are implicitly aligned. The aligned
features are then pooled into a single token, which is leveraged in the
subsequent spatial self-attention. Our method allows eliminating the costly or
insufficient temporal self-attention in video. Extensive experiments on
benchmarks demonstrate the superiority and generality of our module.
In particular, the proposed ILA achieves a top-1 accuracy of 88.7% on
Kinetics-400 with far fewer FLOPs than Swin-L and ViViT-H. Code is
released at https://github.com/Francis-Rings/ILA .
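Below is a minimal, self-contained sketch of the alignment idea described in the abstract: for a frame pair, predict an interactive point per frame, enhance the features around it, and pool the aligned features into a single token. Module names, shapes, and layer choices here are illustrative assumptions, not the authors' design; the actual implementation is in the linked repository.

```python
# Minimal sketch of the implicit alignment idea described in the abstract.
# Shapes, layer choices, and the module name are assumptions for illustration;
# the authors' actual implementation lives in the linked repository.
import torch
import torch.nn as nn


class InteractivePointAlign(nn.Module):
    """For a pair of frames, predict an 'interactive point' heatmap per frame,
    enhance features around that point, and pool them into a single alignment
    token for use in the subsequent spatial self-attention."""

    def __init__(self, dim: int):
        super().__init__()
        # Predict one heatmap per frame from the concatenated frame pair.
        self.point_predictor = nn.Conv2d(2 * dim, 2, kernel_size=3, padding=1)

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        # feat_a, feat_b: (B, C, H, W) features of two neighboring frames.
        B, C, H, W = feat_a.shape
        pair = torch.cat([feat_a, feat_b], dim=1)          # (B, 2C, H, W)
        heat = self.point_predictor(pair)                   # (B, 2, H, W)
        heat = heat.flatten(2).softmax(dim=-1).view(B, 2, H, W)

        # Enhance features around each frame's interactive point via its
        # heatmap, which implicitly aligns the two frames.
        enhanced_a = feat_a * (1.0 + heat[:, 0:1])          # (B, C, H, W)
        enhanced_b = feat_b * (1.0 + heat[:, 1:2])

        # Pool the aligned features into a single token per frame pair.
        token = (enhanced_a + enhanced_b).flatten(2).mean(dim=-1)  # (B, C)
        return token


if __name__ == "__main__":
    align = InteractivePointAlign(dim=64)
    f_a, f_b = torch.randn(2, 64, 14, 14), torch.randn(2, 64, 14, 14)
    print(align(f_a, f_b).shape)  # torch.Size([2, 64])
```

In a full model, the returned token would be appended to each frame's patch tokens before the spatial self-attention block, replacing explicit temporal self-attention as described in the abstract.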
Related papers
- SOAP: Enhancing Spatio-Temporal Relation and Motion Information Capturing for Few-Shot Action Recognition [18.542942459854867]
Large numbers of video samples are continuously required for traditional data-driven research.
We propose a novel plug-and-play architecture for action recognition, called SOAP, in this paper.
SOAP-Net achieves new state-of-the-art performance across well-known benchmarks such as SthSthV2, Kinetics, UCF101, and HMDB51.
arXiv Detail & Related papers (2024-07-23T09:45:25Z)
- FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation [85.29772293776395]
We introduce FRESCO, which combines intra-frame correspondence with inter-frame correspondence to establish a more robust spatial-temporal constraint.
This enhancement ensures a more consistent transformation of semantically similar content across frames.
Our approach involves an explicit update of features to achieve high spatial-temporal consistency with the input video.
arXiv Detail & Related papers (2024-03-19T17:59:18Z)
- Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring [82.84513669453744]
Image-text pretrained models, e.g., CLIP, have shown impressive general multi-modal knowledge learned from large-scale image-text data pairs.
We revisit temporal modeling in the context of image-to-video knowledge transferring.
We present a simple and effective temporal modeling mechanism extending CLIP model to diverse video tasks.
arXiv Detail & Related papers (2023-01-26T14:12:02Z)
- Alignment-guided Temporal Attention for Video Action Recognition [18.5171795689609]
We show that frame-by-frame alignments have the potential to increase the mutual information between frame representations.
We propose Alignment-guided Temporal Attention (ATA) to extend 1-dimensional temporal attention with parameter-free patch-level alignments between neighboring frames; a generic sketch of such alignment follows this entry.
arXiv Detail & Related papers (2022-09-30T23:10:47Z)
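The sketch below illustrates one way to realize parameter-free patch-level alignment between neighboring frames, using nearest-neighbor matching by cosine similarity. The function name and the exact matching rule are assumptions for illustration and may differ from ATA's actual procedure.

```python
# Generic parameter-free patch-level alignment between neighboring frames via
# nearest-neighbor cosine-similarity matching; illustrative, not ATA's exact rule.
import torch
import torch.nn.functional as F


def align_patches(curr: torch.Tensor, prev: torch.Tensor) -> torch.Tensor:
    """curr, prev: (B, N, C) patch tokens of two neighboring frames.
    Returns prev's tokens reordered so that token i best matches curr's token i."""
    sim = F.normalize(curr, dim=-1) @ F.normalize(prev, dim=-1).transpose(1, 2)  # (B, N, N)
    idx = sim.argmax(dim=-1)                                    # (B, N) best match in prev
    aligned_prev = torch.gather(prev, 1, idx.unsqueeze(-1).expand(-1, -1, prev.size(-1)))
    return aligned_prev


if __name__ == "__main__":
    curr, prev = torch.randn(2, 196, 768), torch.randn(2, 196, 768)
    aligned = align_patches(curr, prev)
    # Temporal attention can then operate along the aligned patch "tubes",
    # e.g. torch.stack([aligned, curr], dim=2).
    print(aligned.shape)  # torch.Size([2, 196, 768])
```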
- KORSAL: Key-point Detection based Online Real-Time Spatio-Temporal Action Localization [0.9507070656654633]
Real-time and online action localization in a video is a critical yet highly challenging problem.
Recent attempts achieve this by using computationally intensive 3D CNN architectures or highly redundant two-stream architectures with optical flow.
We propose utilizing fast and efficient key-point based bounding box prediction to spatially localize actions.
Our model achieves a frame rate of 41.8 FPS, which is a 10.7% improvement over contemporary real-time methods.
arXiv Detail & Related papers (2021-11-05T08:39:36Z)
- TimeLens: Event-based Video Frame Interpolation [54.28139783383213]
We introduce Time Lens, a novel method that leverages the advantages of both synthesis-based and flow-based approaches.
We show up to a 5.21 dB improvement in PSNR over state-of-the-art frame-based and event-based methods.
arXiv Detail & Related papers (2021-06-14T10:33:47Z)
- Approximated Bilinear Modules for Temporal Modeling [116.6506871576514]
Two-layer subnets in CNNs can be converted to temporal bilinear modules by adding an auxiliary-branch sampling; a generic sketch of a factorized temporal bilinear module follows this entry.
Our models can outperform most state-of-the-art methods on SomethingSomething v1 and v2 datasets without pretraining.
arXiv Detail & Related papers (2020-07-25T09:07:35Z)
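As a generic illustration of a factorized (low-rank) temporal bilinear interaction between a frame and its temporal neighbor, the sketch below takes an element-wise product of two linear projections. The class name, rank, and shift-based neighbor sampling are assumptions and not necessarily the paper's exact module.

```python
# Generic low-rank temporal bilinear interaction between a frame's features and
# its temporal neighbor; illustrative only, not the paper's exact module.
import torch
import torch.nn as nn


class TemporalBilinear(nn.Module):
    def __init__(self, dim: int, rank: int = 64):
        super().__init__()
        self.proj_main = nn.Linear(dim, rank)   # main branch
        self.proj_aux = nn.Linear(dim, rank)    # auxiliary branch (shifted frames)
        self.proj_out = nn.Linear(rank, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C) per-frame features.
        x_shift = torch.roll(x, shifts=1, dims=1)              # temporal neighbor
        bilinear = self.proj_main(x) * self.proj_aux(x_shift)  # low-rank bilinear term
        return x + self.proj_out(bilinear)                     # residual connection


if __name__ == "__main__":
    m = TemporalBilinear(dim=256)
    print(m(torch.randn(2, 8, 256)).shape)  # torch.Size([2, 8, 256])
```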
- All at Once: Temporally Adaptive Multi-Frame Interpolation with Advanced Motion Modeling [52.425236515695914]
State-of-the-art methods are iterative solutions that interpolate one frame at a time.
This work introduces a true multi-frame interpolator.
It utilizes a pyramidal style network in the temporal domain to complete the multi-frame task in one shot.
arXiv Detail & Related papers (2020-07-23T02:34:39Z)