Learning spatio-temporal representations with temporal squeeze pooling
- URL: http://arxiv.org/abs/2002.04685v2
- Date: Thu, 13 Jan 2022 00:47:40 GMT
- Title: Learning spatio-temporal representations with temporal squeeze pooling
- Authors: Guoxi Huang and Adrian G. Bors
- Abstract summary: We propose a new video representation learning method, named Temporal Squeeze (TS) pooling, which can extract the essential movement information from a long sequence of video frames and map it into a small set of images, named Squeezed Images.
The resulting Squeezed Images contain the essential movement information from the video frames, corresponding to the optimization of the video classification task.
We evaluate our architecture on two video classification benchmarks, and the results achieved are compared to the state-of-the-art.
- Score: 11.746833714322154
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a new video representation learning method, named
Temporal Squeeze (TS) pooling, which can extract the essential movement
information from a long sequence of video frames and map it into a small set of
images, named Squeezed Images. By embedding Temporal Squeeze pooling as a
layer into off-the-shelf Convolutional Neural Networks (CNNs), we design a new
video classification model, named the Temporal Squeeze Network (TeSNet). The
resulting Squeezed Images contain the essential movement information from the
video frames, corresponding to the optimization of the video classification
task. We evaluate our architecture on two video classification benchmarks and
compare the results achieved to the state-of-the-art.
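The abstract describes mapping a long frame sequence into a few Squeezed Images via a learned pooling layer. The paper's exact formulation is not given here; assuming each squeezed image is a learned convex combination of the input frames (with the mixing weights trained end-to-end in the real model), a minimal NumPy sketch of the pooling step might look like this:

```python
import numpy as np

def temporal_squeeze_pool(frames, logits):
    """Squeeze T video frames into K 'squeezed images'.

    frames: (T, H, W, C) array of video frames.
    logits: (K, T) mixing logits; learned in the real model,
            random here purely for illustration.
    Returns a (K, H, W, C) array, each output a convex
    combination of the input frames.
    """
    # Softmax over the temporal axis so each row of weights sums to 1.
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    # Weighted temporal sum: (K, T) @ (T, H*W*C) -> (K, H*W*C).
    T = frames.shape[0]
    squeezed = w @ frames.reshape(T, -1)
    return squeezed.reshape((logits.shape[0],) + frames.shape[1:])

rng = np.random.default_rng(0)
frames = rng.random((16, 8, 8, 3))   # 16 frames of 8x8 RGB
logits = rng.normal(size=(2, 16))    # squeeze 16 frames into 2 images
out = temporal_squeeze_pool(frames, logits)
print(out.shape)  # (2, 8, 8, 3)
```

In TeSNet this pooling would sit as a layer inside a CNN, with the mixing weights optimized jointly with the classification loss; the hypothetical `temporal_squeeze_pool` above only illustrates the many-frames-to-few-images mapping.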
Related papers
- Temporal-Spatial Processing of Event Camera Data via Delay-Loop Reservoir Neural Network [0.11309478649967238]
We study a conjecture motivated by our previous study of video processing with delay loop reservoir neural network.
In this paper, we will exploit this new finding to guide our design of a delay-loop reservoir neural network for event camera classification.
arXiv Detail & Related papers (2024-02-12T16:24:13Z)
- RTQ: Rethinking Video-language Understanding Based on Image-text Model [55.278942477715084]
Video-language understanding presents unique challenges due to the inclusion of highly complex semantic details.
We propose a novel framework called RTQ, which addresses these challenges simultaneously.
Our model demonstrates outstanding performance even in the absence of video-language pre-training.
arXiv Detail & Related papers (2023-12-01T04:51:01Z)
- Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring [82.84513669453744]
Image-text pretrained models, e.g., CLIP, have shown impressive general multi-modal knowledge learned from large-scale image-text data pairs.
We revisit temporal modeling in the context of image-to-video knowledge transferring.
We present a simple and effective temporal modeling mechanism that extends the CLIP model to diverse video tasks.
arXiv Detail & Related papers (2023-01-26T14:12:02Z)
- Representation Recycling for Streaming Video Analysis [19.068248496174903]
StreamDEQ aims to infer frame-wise representations on videos with minimal per-frame computation.
We show that StreamDEQ is able to recover near-optimal representations in a few frames' time and maintain an up-to-date representation throughout the video duration.
arXiv Detail & Related papers (2022-04-28T13:35:14Z)
- Frame-wise Action Representations for Long Videos via Sequence Contrastive Learning [44.412145665354736]
We introduce a novel contrastive action representation learning framework to learn frame-wise action representations.
Inspired by the recent progress of self-supervised learning, we present a novel sequence contrastive loss (SCL) applied on two correlated views.
Our approach also shows outstanding performance on video alignment and fine-grained frame retrieval tasks.
arXiv Detail & Related papers (2022-03-28T17:59:54Z)
- TCGL: Temporal Contrastive Graph for Self-supervised Video Representation Learning [79.77010271213695]
We propose a novel video self-supervised learning framework named Temporal Contrastive Graph Learning (TCGL).
Our TCGL integrates prior knowledge about frame and snippet orders into graph structures, i.e., the intra-/inter-snippet Temporal Contrastive Graphs (TCG).
To generate supervisory signals for unlabeled videos, we introduce an Adaptive Snippet Order Prediction (ASOP) module.
arXiv Detail & Related papers (2021-12-07T09:27:56Z)
- Video Content Classification using Deep Learning [0.0]
This paper presents a model that combines a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN).
The model can identify the type of video content and classify it into categories such as animation, gaming, natural content, and flat content.
arXiv Detail & Related papers (2021-11-27T04:36:17Z)
- Coarse-Fine Networks for Temporal Activity Detection in Videos [45.03545172714305]
We introduce 'Coarse-Fine Networks', a two-stream architecture that benefits from different abstractions of temporal resolution to learn better video representations for long-term motion.
We show that our method can outperform the state-of-the-art for action detection on public datasets with a significantly reduced compute and memory footprint.
arXiv Detail & Related papers (2021-03-01T20:48:01Z)
- Temporal Context Aggregation for Video Retrieval with Contrastive Learning [81.12514007044456]
We propose TCA, a video representation learning framework that incorporates long-range temporal information between frame-level features.
The proposed method shows a significant performance advantage (17% mAP on FIVR-200K) over state-of-the-art methods with video-level features.
arXiv Detail & Related papers (2020-08-04T05:24:20Z)
- Temporally Distributed Networks for Fast Video Semantic Segmentation [64.5330491940425]
TDNet is a temporally distributed network designed for fast and accurate video semantic segmentation.
We observe that features extracted from a certain high-level layer of a deep CNN can be approximated by composing features extracted from several shallower sub-networks.
Experiments on Cityscapes, CamVid, and NYUD-v2 demonstrate that our method achieves state-of-the-art accuracy with significantly faster speed and lower latency.
arXiv Detail & Related papers (2020-04-03T22:43:32Z)
- Fine-Grained Instance-Level Sketch-Based Video Retrieval [159.12935292432743]
We propose a novel cross-modal retrieval problem of fine-grained instance-level sketch-based video retrieval (FG-SBVR).
Compared with sketch-based still image retrieval, and coarse-grained category-level video retrieval, this is more challenging as both visual appearance and motion need to be simultaneously matched at a fine-grained level.
We show that this model significantly outperforms a number of existing state-of-the-art models designed for video analysis.
arXiv Detail & Related papers (2020-02-21T18:28:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.