Capturing Temporal Information in a Single Frame: Channel Sampling
Strategies for Action Recognition
- URL: http://arxiv.org/abs/2201.10394v1
- Date: Tue, 25 Jan 2022 15:24:37 GMT
- Title: Capturing Temporal Information in a Single Frame: Channel Sampling
Strategies for Action Recognition
- Authors: Kiyoon Kim, Shreyank N Gowda, Oisin Mac Aodha, Laura Sevilla-Lara
- Abstract summary: We address the problem of capturing temporal information for video classification in 2D networks, without increasing computational cost.
We propose a novel sampling strategy, where we re-order the channels of the input video, to capture short-term frame-to-frame changes.
Our sampling strategies do not require training from scratch and do not increase the computational cost of training and testing.
- Score: 19.220288614585147
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We address the problem of capturing temporal information for video
classification in 2D networks, without increasing computational cost. Existing
approaches focus on modifying the architecture of 2D networks (e.g. by
including filters in the temporal dimension to turn them into 3D networks, or
using optical flow, etc.), which increases computation cost. Instead, we
propose a novel sampling strategy, where we re-order the channels of the input
video, to capture short-term frame-to-frame changes. We observe that without
bells and whistles, the proposed sampling strategy improves performance on
multiple architectures (e.g. TSN, TRN, and TSM) and datasets (CATER,
Something-Something-V1 and V2), up to 24% over the baseline of using the
standard video input. In addition, our sampling strategies do not require
training from scratch and do not increase the computational cost of training
and testing. Given the generality of the results and the flexibility of the
approach, we hope this can be widely useful to the video understanding
community. Code is available at https://github.com/kiyoon/PyVideoAI.
Related papers
- EPS: Efficient Patch Sampling for Video Overfitting in Deep Super-Resolution Model Training [15.684865589513597]
We propose an efficient patch sampling method named EPS for video SR network overfitting.
Our method reduces the number of patches for the training to 4% to 25%, depending on the resolution and number of clusters.
Compared to the state-of-the-art patch sampling method, EMT, our approach achieves an 83% decrease in overall run time.
arXiv Detail & Related papers (2024-11-25T12:01:57Z) - TCCT-Net: Two-Stream Network Architecture for Fast and Efficient Engagement Estimation via Behavioral Feature Signals [58.865901821451295]
We present a novel two-stream feature fusion "Tensor-Convolution and Convolution-Transformer Network" (TCCT-Net) architecture.
To better learn the meaningful patterns in the temporal-spatial domain, we design a "CT" stream that integrates a hybrid convolutional-transformer.
In parallel, to efficiently extract rich patterns from the temporal-frequency domain, we introduce a "TC" stream that uses Continuous Wavelet Transform (CWT) to represent information in a 2D tensor form.
arXiv Detail & Related papers (2024-04-15T06:01:48Z) - A Simple Recipe for Contrastively Pre-training Video-First Encoders
Beyond 16 Frames [54.90226700939778]
We build on the common paradigm of transferring large-scale, image--text models to video via shallow temporal fusion.
We expose two limitations to the approach: (1) decreased spatial capabilities, likely due to poor video--language alignment in standard video datasets, and (2) higher memory consumption, bottlenecking the number of frames that can be processed.
arXiv Detail & Related papers (2023-12-12T16:10:19Z) - Semi-supervised 3D Video Information Retrieval with Deep Neural Network
and Bi-directional Dynamic-time Warping Algorithm [14.39527406033429]
The proposed algorithm is designed to handle large video datasets and retrieve the most related videos to a given inquiry video clip.
We split both the candidate and the inquiry videos into a sequence of clips and convert each clip to a representation vector using an autoencoder-backed deep neural network.
We then calculate a similarity measure between the sequences of embedding vectors using a bi-directional dynamic time-warping method.
arXiv Detail & Related papers (2023-09-03T03:10:18Z) - NSNet: Non-saliency Suppression Sampler for Efficient Video Recognition [89.84188594758588]
A novel Non-saliency Suppression Network (NSNet) is proposed to suppress the responses of non-salient frames.
NSNet achieves the state-of-the-art accuracy-efficiency trade-off and presents a significantly faster (2.44.3x) practical inference speed than state-of-the-art methods.
arXiv Detail & Related papers (2022-07-21T09:41:22Z) - Efficient Meta-Tuning for Content-aware Neural Video Delivery [40.3731358963689]
We present Efficient Meta-Tuning (EMT) to reduce the computational cost.
EMT adapts a meta-learned model to the first chunk of the input video.
We propose a novel sampling strategy to extract the most challenging patches from video frames.
arXiv Detail & Related papers (2022-07-20T06:47:10Z) - Optical-Flow-Reuse-Based Bidirectional Recurrent Network for Space-Time
Video Super-Resolution [52.899234731501075]
Space-time video super-resolution (ST-VSR) simultaneously increases the spatial resolution and frame rate for a given video.
Existing methods typically suffer from difficulties in how to efficiently leverage information from a large range of neighboring frames.
We propose a coarse-to-fine bidirectional recurrent neural network instead of using ConvLSTM to leverage knowledge between adjacent frames.
arXiv Detail & Related papers (2021-10-13T15:21:30Z) - Dual-view Snapshot Compressive Imaging via Optical Flow Aided Recurrent
Neural Network [14.796204921975733]
Dual-view snapshot compressive imaging (SCI) aims to capture videos from two field-of-views (FoVs) in a single snapshot.
It is challenging for existing model-based decoding algorithms to reconstruct each individual scene.
We propose an optical flow-aided recurrent neural network for dual video SCI systems, which provides high-quality decoding in seconds.
arXiv Detail & Related papers (2021-09-11T14:24:44Z) - Temporal Bilinear Encoding Network of Audio-Visual Features at Low
Sampling Rates [7.1273332508471725]
This paper aims to exploit audio-visual information in video classification with a 1 frame per second sampling rate.
We propose Temporal Bilinear Networks (TBEN) for encoding both audio and visual long range temporal information.
arXiv Detail & Related papers (2020-12-18T14:59:34Z) - A Real-time Action Representation with Temporal Encoding and Deep
Compression [115.3739774920845]
We propose a new real-time convolutional architecture, called Temporal Convolutional 3D Network (T-C3D), for action representation.
T-C3D learns video action representations in a hierarchical multi-granularity manner while obtaining a high process speed.
Our method achieves clear improvements on UCF101 action recognition benchmark against state-of-the-art real-time methods by 5.4% in terms of accuracy and 2 times faster in terms of inference speed with a less than 5MB storage model.
arXiv Detail & Related papers (2020-06-17T06:30:43Z) - Zooming Slow-Mo: Fast and Accurate One-Stage Space-Time Video
Super-Resolution [95.26202278535543]
A simple solution is to split it into two sub-tasks: video frame (VFI) and video super-resolution (VSR)
temporalsynthesis and spatial super-resolution are intra-related in this task.
We propose a one-stage space-time video super-resolution framework, which directly synthesizes an HR slow-motion video from an LFR, LR video.
arXiv Detail & Related papers (2020-02-26T16:59:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.