Temporal-attentive Covariance Pooling Networks for Video Recognition
- URL: http://arxiv.org/abs/2110.14381v2
- Date: Thu, 28 Oct 2021 01:49:03 GMT
- Title: Temporal-attentive Covariance Pooling Networks for Video Recognition
- Authors: Zilin Gao, Qilong Wang, Bingbing Zhang, Qinghua Hu, Peihua Li
- Abstract summary: Existing video architectures usually generate a global representation by using a simple global average pooling (GAP) method.
This paper proposes a Temporal-attentive Covariance Pooling (TCP), inserted at the end of deep architectures, to produce powerful video representations.
Our TCP is model-agnostic and can be flexibly integrated into any video architectures, resulting in TCPNet for effective video recognition.
- Score: 52.853765492522655
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: For the video recognition task, a global representation summarizing the whole
contents of the video snippets plays an important role in the final
performance. However, existing video architectures usually generate it by using
a simple global average pooling (GAP) method, which has limited ability to
capture the complex dynamics of videos. For the image recognition task, there is
evidence showing that covariance pooling has stronger representation ability
than GAP. Unfortunately, such plain covariance pooling used in image
recognition is an orderless representation, which cannot model the spatio-temporal
structure inherent in videos. Therefore, this paper proposes a
Temporal-attentive Covariance Pooling (TCP), inserted at the end of deep
architectures, to produce powerful video representations. Specifically, our TCP
first develops a temporal attention module to adaptively calibrate
spatio-temporal features for the succeeding covariance pooling, approximately
producing attentive covariance representations. Then, a temporal covariance
pooling performs temporal pooling of the attentive covariance representations
to characterize both the intra-frame correlations and the inter-frame
cross-correlations of the calibrated features. As such, the proposed TCP can
capture complex temporal dynamics. Finally, a fast matrix power normalization
is introduced to exploit the geometry of covariance representations. Note that our
TCP is model-agnostic and can be flexibly integrated into any video
architecture, resulting in TCPNet for effective video recognition. Extensive
experiments on six benchmarks (e.g., Kinetics, Something-Something V1
and Charades) using various video architectures show that our TCPNet is clearly
superior to its counterparts, while having strong generalization ability. The
source code is publicly available.
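
The abstract above outlines a three-stage pipeline: temporal attention calibrates the spatio-temporal features, temporal covariance pooling aggregates intra-frame and inter-frame (cross-)covariances, and a fast matrix power normalization exploits the geometry of the result. The following PyTorch sketch shows one plausible way these stages could fit together; it is written from the abstract alone, not from the authors' released code, and the attention design, the pairwise channel-stacking used to expose inter-frame cross-correlations, and the Newton-Schulz square-root normalization are all illustrative assumptions.

```python
# Minimal sketch of a temporal-attentive covariance pooling layer, written from
# the abstract only; it is NOT the authors' implementation. All module and
# function names below are illustrative.
import torch
import torch.nn as nn


def newton_schulz_sqrt(cov, num_iters=5, eps=1e-6):
    """Approximate matrix square root via coupled Newton-Schulz iterations,
    a common fast form of matrix power normalization."""
    d = cov.size(-1)
    norm = torch.linalg.matrix_norm(cov, keepdim=True) + eps  # pre-normalize for convergence
    A = cov / norm
    Y = A
    Z = torch.eye(d, device=cov.device).expand_as(A)
    I = torch.eye(d, device=cov.device).expand_as(A)
    for _ in range(num_iters):
        T = 0.5 * (3.0 * I - Z @ Y)
        Y, Z = Y @ T, T @ Z
    return Y * norm.sqrt()                                    # undo the pre-normalization


class TemporalAttentiveCovariancePooling(nn.Module):
    """Sketch: temporal attention -> temporal covariance pooling ->
    matrix power normalization, replacing GAP at the end of a backbone."""

    def __init__(self, channels):
        super().__init__()
        # Assumed, simplified temporal attention: per-frame channel weights
        # predicted from spatially pooled features.
        self.attn = nn.Sequential(
            nn.Linear(channels, channels // 4),
            nn.ReLU(inplace=True),
            nn.Linear(channels // 4, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # x: (B, T, C, H, W) spatio-temporal features from a video backbone (T >= 2)
        B, T, C, H, W = x.shape
        w = self.attn(x.mean(dim=(-2, -1)))            # (B, T, C) attention weights
        x = x * w.unsqueeze(-1).unsqueeze(-1)          # calibrate the features
        feats = x.flatten(-2, -1)                      # (B, T, C, N) with N = H*W
        feats = feats - feats.mean(dim=-1, keepdim=True)
        # Stack each frame with its successor along channels so the Gram matrix
        # holds intra-frame covariances (diagonal blocks) and inter-frame
        # cross-covariances (off-diagonal blocks); then pool over time.
        pairs = torch.cat([feats[:, :-1], feats[:, 1:]], dim=2)  # (B, T-1, 2C, N)
        cov = pairs @ pairs.transpose(-1, -2) / (H * W - 1)      # (B, T-1, 2C, 2C)
        cov = cov.mean(dim=1)                          # temporal pooling -> (B, 2C, 2C)
        cov = cov + 1e-5 * torch.eye(2 * C, device=cov.device)   # keep it well-conditioned
        return newton_schulz_sqrt(cov)                 # fast matrix power normalization
```

In a TCPNet-style head, the (flattened upper triangle of the) normalized covariance would replace the GAP vector as the classifier input; the paper's exact attention design and covariance formulation may differ from this sketch.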
Related papers
- Self-Supervised Video Representation Learning via Latent Time Navigation [12.721647696921865]
Self-supervised video representation learning aims at maximizing similarity between different temporal segments of one video.
We propose Latent Time Navigation (LTN) to capture fine-grained motions.
Our experimental analysis suggests that learning video representations by LTN consistently improves performance of action classification.
arXiv Detail & Related papers (2023-05-10T20:06:17Z)
- Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation is that the temporal boundary of the query-guided activity should be consistently predicted.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z)
- Deeply-Coupled Convolution-Transformer with Spatial-temporal Complementary Learning for Video-based Person Re-identification [91.56939957189505]
We propose a novel spatial-temporal complementary learning framework named Deeply-Coupled Convolution-Transformer (DCCT) for high-performance video-based person Re-ID.
Our framework can attain better performance than most state-of-the-art methods.
arXiv Detail & Related papers (2023-04-27T12:16:44Z)
- Neighbor Correspondence Matching for Flow-based Video Frame Synthesis [90.14161060260012]
We introduce a neighbor correspondence matching (NCM) algorithm for flow-based frame synthesis.
NCM is performed in a current-frame-agnostic fashion to establish multi-scale correspondences in the spatial-temporal neighborhoods of each pixel.
A coarse-scale module is designed to leverage neighbor correspondences to capture large motion, while the fine-scale module is more efficient and speeds up the estimation process.
arXiv Detail & Related papers (2022-07-14T09:17:00Z)
- Cross-Architecture Self-supervised Video Representation Learning [42.267775859095664]
We present a new cross-architecture contrastive learning framework for self-supervised video representation learning.
We introduce a temporal self-supervised learning module that explicitly predicts the edit distance between two video sequences.
We evaluate our method on the tasks of video retrieval and action recognition on UCF101 and HMDB51 datasets.
arXiv Detail & Related papers (2022-05-26T12:41:19Z)
- Representation Recycling for Streaming Video Analysis [19.068248496174903]
StreamDEQ aims to infer frame-wise representations on videos with minimal per-frame computation.
We show that StreamDEQ is able to recover near-optimal representations in a few frames' time and maintain an up-to-date representation throughout the video duration.
arXiv Detail & Related papers (2022-04-28T13:35:14Z)
- Group Contextualization for Video Recognition [80.3842253625557]
Group contextualization (GC) embeds features with four different kinds of contexts in parallel.
GC can boost the performance of 2D CNNs (e.g., TSN) and TSM to a level comparable to state-of-the-art video networks.
arXiv Detail & Related papers (2022-03-18T01:49:40Z)
- Video Is Graph: Structured Graph Module for Video Action Recognition [34.918667614077805]
We transform a video sequence into a graph to obtain direct long-term dependencies among temporal frames.
In particular, SGM divides the neighbors of each node into several temporal regions so as to extract global structural information.
The reported performance and analysis demonstrate that SGM achieves outstanding precision with lower computational complexity.
arXiv Detail & Related papers (2021-10-12T11:27:29Z)
- TAM: Temporal Adaptive Module for Video Recognition [60.83208364110288]
The temporal adaptive module (TAM) generates video-specific temporal kernels based on its own feature map (a hedged sketch of this idea follows this list).
Experiments on Kinetics-400 and Something-Something datasets demonstrate that our TAM outperforms other temporal modeling methods consistently.
arXiv Detail & Related papers (2020-05-14T08:22:45Z)
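
For the TAM entry above, the summary only states that video-specific temporal kernels are generated from the feature map itself. Purely as a hedged illustration of that idea (not the published TAM architecture), a per-video temporal kernel could be predicted from globally pooled features and applied as a depthwise temporal convolution; the class name, kernel size, and predictor layout below are hypothetical.

```python
# Hypothetical sketch of a video-adaptive temporal kernel, loosely inspired by
# the one-line TAM summary above; this is NOT the published TAM module.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DynamicTemporalKernel(nn.Module):
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.kernel_size = kernel_size
        # Predict one temporal kernel per video from its pooled feature map.
        self.predict = nn.Sequential(
            nn.Linear(channels, channels // 4),
            nn.ReLU(inplace=True),
            nn.Linear(channels // 4, kernel_size),
            nn.Softmax(dim=-1),
        )

    def forward(self, x):
        # x: (B, C, T, H, W) features from a per-frame backbone
        B, C, T, H, W = x.shape
        ctx = x.mean(dim=(2, 3, 4))                    # (B, C) video-level context
        k = self.predict(ctx)                          # (B, K) video-specific kernel
        # Share each video's kernel across its channels and spatial positions,
        # then convolve along the temporal axis with a grouped 1D convolution.
        signal = x.permute(0, 1, 3, 4, 2).reshape(1, B * C * H * W, T)
        weight = k.view(B, 1, 1, self.kernel_size)
        weight = weight.expand(B, C * H * W, 1, self.kernel_size)
        weight = weight.reshape(B * C * H * W, 1, self.kernel_size)
        out = F.conv1d(signal, weight, padding=self.kernel_size // 2,
                       groups=B * C * H * W)
        return out.reshape(B, C, H, W, T).permute(0, 1, 4, 2, 3)
```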
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.