Taylor Videos for Action Recognition
- URL: http://arxiv.org/abs/2402.03019v4
- Date: Fri, 10 May 2024 14:45:56 GMT
- Title: Taylor Videos for Action Recognition
- Authors: Lei Wang, Xiuyuan Yuan, Tom Gedeon, Liang Zheng
- Abstract summary: Taylor video is a new video format that highlights the dominant motions in each of its frames, named Taylor frames.
Taylor video is named after Taylor series, which approximates a function at a given point using important terms.
We show that Taylor videos are effective inputs to popular architectures including 2D CNNs, 3D CNNs, and transformers.
- Score: 15.728388101131056
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Effectively extracting motions from video is a critical and long-standing problem for action recognition. This problem is very challenging because motions (i) do not have an explicit form, (ii) have various concepts such as displacement, velocity, and acceleration, and (iii) often contain noise caused by unstable pixels. Addressing these challenges, we propose the Taylor video, a new video format that highlights the dominant motions (e.g., a waving hand) in each of its frames, named Taylor frames. Taylor video is named after the Taylor series, which approximates a function at a given point using important terms. In the scenario of videos, we define an implicit motion-extraction function which aims to extract motions from a video temporal block. In this block, using the frames, the difference frames, and higher-order difference frames, we perform Taylor expansion to approximate this function at the starting frame. We show that the summation of the higher-order terms in the Taylor series gives us dominant motion patterns, where static objects and small, unstable motions are removed. Experimentally, we show that Taylor videos are effective inputs to popular architectures including 2D CNNs, 3D CNNs, and transformers. When used individually, Taylor videos yield competitive action recognition accuracy compared to RGB videos and optical flow. When fused with RGB or optical flow videos, further accuracy improvement is achieved. Additionally, we apply Taylor video computation to human skeleton sequences, resulting in Taylor skeleton sequences that outperform the use of original skeletons for skeleton-based action recognition.
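As a rough, hedged illustration of the idea described in the abstract, the Python sketch below computes higher-order temporal difference frames for a temporal block and sums them with 1/k! Taylor-style weights into a single motion frame. The block length, number of terms, grayscale input, and absence of normalisation are illustrative assumptions, not the paper's exact recipe.

```python
import math

import numpy as np


def taylor_frame(block: np.ndarray, num_terms: int = 3) -> np.ndarray:
    """Hedged sketch of a Taylor-frame computation (not the paper's exact recipe).

    `block` holds a temporal block of grayscale frames with shape (T, H, W).
    The k-th order forward differences at the starting frame stand in for the
    derivatives of the implicit motion-extraction function, and the 1/k!-weighted
    sum of these higher-order terms is returned as a single Taylor frame.
    """
    diffs = block.astype(np.float32)
    taylor = np.zeros_like(diffs[0])
    for k in range(1, num_terms + 1):
        diffs = np.diff(diffs, axis=0)  # k-th order temporal differences
        if diffs.shape[0] == 0:         # block too short for this order
            break
        taylor += diffs[0] / math.factorial(k)
    return taylor


# Example: a temporal block of 8 frames of size 112x112.
frames = np.random.rand(8, 112, 112).astype(np.float32)
print(taylor_frame(frames).shape)  # (112, 112)
```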
Related papers
- Unfolding Videos Dynamics via Taylor Expansion [5.723852805622308]
We present a new self-supervised dynamics learning strategy for videos: Video Time-Differentiation for Instance Discrimination (ViDiDi)
ViDiDi observes different aspects of a video through various orders of temporal derivatives of its frame sequence.
ViDiDi learns a single neural network that encodes a video and its temporal derivatives into consistent embeddings.
arXiv Detail & Related papers (2024-09-04T01:41:09Z) - Motion meets Attention: Video Motion Prompts [34.429192862783054]
We propose a modified Sigmoid function with learnable slope and shift parameters as an attention mechanism to modulate motion signals from frame differencing maps.
This approach generates a sequence of attention maps that enhance the processing of motion-related video content.
We show that our lightweight, plug-and-play motion prompt layer seamlessly integrates into models such as SlowFast, X3D, and TimeSformer (a minimal sketch of this sigmoid attention idea is given after the related-papers list).
arXiv Detail & Related papers (2024-07-03T14:59:46Z) - Splatter a Video: Video Gaussian Representation for Versatile Processing [48.9887736125712]
Video representation is crucial for various downstream tasks, such as tracking, depth prediction, segmentation, view synthesis, and editing.
We introduce a novel explicit 3D representation, the video Gaussian representation, which embeds a video into 3D Gaussians.
It has been proven effective in numerous video processing tasks, including tracking, consistent video depth and feature refinement, motion and appearance editing, and stereoscopic video generation.
arXiv Detail & Related papers (2024-06-19T22:20:03Z) - VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models [58.93124686141781]
Video Motion Customization (VMC) is a novel one-shot tuning approach crafted to adapt temporal attention layers within video diffusion models.
Our approach introduces a novel motion distillation objective using residual vectors between consecutive frames as a motion reference.
We validate our method against state-of-the-art video generative models across diverse real-world motions and contexts.
arXiv Detail & Related papers (2023-12-01T06:50:11Z) - Learning Trajectory-Aware Transformer for Video Super-Resolution [50.49396123016185]
Video super-resolution aims to restore a sequence of high-resolution (HR) frames from their low-resolution (LR) counterparts.
Existing approaches usually align and aggregate video frames from only a limited number of adjacent frames.
We propose a novel Trajectory-aware Transformer for Video Super-Resolution (TTVSR).
arXiv Detail & Related papers (2022-04-08T03:37:39Z) - Taylor Swift: Taylor Driven Temporal Modeling for Swift Future Frame Prediction [22.57791389884491]
We introduce TayloSwiftNet, a novel convolutional neural network that learns to estimate the higher order terms of the Taylor series for a given input video.
TayloSwiftNet can swiftly predict any desired future frame in just one forward pass and change the temporal resolution on-the-fly.
arXiv Detail & Related papers (2021-10-27T12:46:17Z) - NeuralDiff: Segmenting 3D objects that move in egocentric videos [92.95176458079047]
We study the problem of decomposing the observed 3D scene into a static background and a dynamic foreground.
This task is reminiscent of the classic background subtraction problem, but is significantly harder because all parts of the scene, static and dynamic, generate a large apparent motion.
In particular, we consider egocentric videos and further separate the dynamic component into objects and the actor that observes and moves them.
arXiv Detail & Related papers (2021-10-19T12:51:35Z) - Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers [77.52828273633646]
We present a new drop-in block for video transformers that aggregates information along implicitly determined motion paths.
We also propose a new method to address the quadratic dependence of computation and memory on the input size.
We obtain state-of-the-art results on the Kinetics, Something-Something V2, and Epic-Kitchens datasets.
arXiv Detail & Related papers (2021-06-09T21:16:05Z) - Taylor saves for later: disentanglement for video prediction using Taylor representation [5.658571172210811]
We propose a two-branch seq-to-seq deep model to disentangle the Taylor feature and the residual feature in video frames.
TaylorCell can expand the video frames' high-dimensional features into a finite Taylor series to describe the latent laws.
A memory correction unit (MCU) distills information from all past frames to correct the Taylor feature predicted by the Taylor prediction unit (TPU).
arXiv Detail & Related papers (2021-05-24T01:59:21Z) - Enhancing Unsupervised Video Representation Learning by Decoupling the Scene and the Motion [86.56202610716504]
Action categories are highly correlated with the scene in which an action happens, which tends to make the model degrade to a solution that encodes only scene information.
We propose to decouple the scene and the motion (DSM) with two simple operations, so that the model pays more attention to the motion information.
arXiv Detail & Related papers (2020-09-12T09:54:11Z)
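For the "Motion meets Attention: Video Motion Prompts" entry above, the hedged PyTorch sketch below applies a sigmoid with learnable slope and shift parameters to frame-differencing maps, producing attention maps that modulate the video. The class name, tensor layout, and the way the attention is applied back onto the frames are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class MotionPromptAttention(nn.Module):
    """Hedged sketch: learnable-sigmoid attention over frame-differencing maps."""

    def __init__(self, slope: float = 1.0, shift: float = 0.0):
        super().__init__()
        # Learnable slope and shift of the sigmoid (initial values are assumptions).
        self.slope = nn.Parameter(torch.tensor(slope))
        self.shift = nn.Parameter(torch.tensor(shift))

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        # video: (B, T, C, H, W). Frame differences serve as a cheap motion signal.
        diff = video[:, 1:] - video[:, :-1]
        motion = diff.abs().mean(dim=2, keepdim=True)                  # (B, T-1, 1, H, W)
        attention = torch.sigmoid(self.slope * (motion - self.shift))  # attention maps
        # Modulate the later frames with the motion attention maps (an assumed choice).
        return video[:, 1:] * attention


# Example: a batch of 2 clips with 8 RGB frames of size 56x56.
video = torch.rand(2, 8, 3, 56, 56)
out = MotionPromptAttention()(video)
print(out.shape)  # torch.Size([2, 7, 3, 56, 56])
```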