Group Contextualization for Video Recognition
- URL: http://arxiv.org/abs/2203.09694v1
- Date: Fri, 18 Mar 2022 01:49:40 GMT
- Title: Group Contextualization for Video Recognition
- Authors: Yanbin Hao, Hao Zhang, Chong-Wah Ngo and Xiangnan He
- Abstract summary: Group contextualization (GC) embeds features with four different kinds of contexts in parallel.
GC can boost the performance of 2D-CNNs (e.g., TSN and TSM) to a level comparable to state-of-the-art video networks.
- Score: 80.3842253625557
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning discriminative representation from the complex spatio-temporal
dynamic space is essential for video recognition. On top of those stylized
spatio-temporal computational units, further refining the learnt feature with
axial contexts is demonstrated to be promising in achieving this goal. However,
previous works generally focus on a single kind of context for calibrating entire
feature channels and can hardly handle diverse video activities. The problem can
be tackled with pair-wise spatio-temporal attentions that recompute feature
responses with cross-axis contexts, but only at the
expense of heavy computations. In this paper, we propose an efficient feature
refinement method that decomposes the feature channels into several groups and
separately refines them with different axial contexts in parallel. We refer to
this lightweight feature calibration as group contextualization (GC).
Specifically, we design a family of efficient element-wise calibrators, i.e.,
ECal-G/S/T/L, where their axial contexts are information dynamics aggregated
from other axes either globally or locally, to contextualize feature channel
groups. The GC module can be densely plugged into each residual layer of the
off-the-shelf video networks. With little computational overhead, consistent
improvements are observed when GC is plugged into different networks. By utilizing
calibrators to embed features with four different kinds of contexts in parallel,
the learnt representation is expected to be more resilient to diverse types of
activities. On videos with rich temporal variations, empirically GC can boost
the performance of 2D-CNN (e.g., TSN and TSM) to a level comparable to the
state-of-the-art video networks. Code is available at
https://github.com/haoyanbin918/Group-Contextualization.
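As a reading aid, below is a minimal, hypothetical PyTorch sketch of the grouping-and-parallel-calibration idea the abstract describes: feature channels are chunked into groups and each group is gated by a context aggregated along a different axis. The module and calibrator names (GroupContextualization, GlobalCalibrator, TemporalCalibrator) and their designs are simplified stand-ins invented for illustration, not the paper's exact ECal-G/S/T/L modules; see the linked repository for the actual implementation.

```python
# Illustrative sketch only: channel groups calibrated in parallel by different
# axial contexts. The calibrators below are simplified stand-ins, not the
# paper's ECal-G/S/T/L designs.
import torch
import torch.nn as nn


class GlobalCalibrator(nn.Module):
    """Gate channels with a context pooled over all spatio-temporal axes."""
    def __init__(self, channels):
        super().__init__()
        self.fc = nn.Conv3d(channels, channels, kernel_size=1)

    def forward(self, x):  # x: (N, C, T, H, W)
        ctx = x.mean(dim=(2, 3, 4), keepdim=True)  # global context per channel
        return x * torch.sigmoid(self.fc(ctx))     # element-wise calibration


class TemporalCalibrator(nn.Module):
    """Gate features with dynamics aggregated along the temporal axis."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):  # x: (N, C, T, H, W)
        n, c, t, h, w = x.shape
        ctx = x.mean(dim=(3, 4))              # spatially pooled: (N, C, T)
        gate = torch.sigmoid(self.conv(ctx))  # local temporal context
        return x * gate.view(n, c, t, 1, 1)


class GroupContextualization(nn.Module):
    """Split channels into groups; refine each group with a different context."""
    def __init__(self, channels, num_groups=4):
        super().__init__()
        assert channels % num_groups == 0
        gc = channels // num_groups
        # Stand-in calibrators; the paper uses four distinct ECal variants.
        self.calibrators = nn.ModuleList(
            [GlobalCalibrator(gc), TemporalCalibrator(gc),
             GlobalCalibrator(gc), TemporalCalibrator(gc)]
        )

    def forward(self, x):  # x: (N, C, T, H, W)
        groups = torch.chunk(x, len(self.calibrators), dim=1)
        out = [cal(g) for cal, g in zip(self.calibrators, groups)]
        return torch.cat(out, dim=1)  # recombine the calibrated groups


# Example: calibrate a residual-layer feature map of shape (N, C, T, H, W).
feat = torch.randn(2, 64, 8, 14, 14)
print(GroupContextualization(64)(feat).shape)  # torch.Size([2, 64, 8, 14, 14])
```

In this sketch the GC module preserves the input shape, so it can be inserted after any residual layer without changing the surrounding architecture, which is how the abstract describes plugging GC into off-the-shelf video networks.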
Related papers
- SIGMA:Sinkhorn-Guided Masked Video Modeling [69.31715194419091]
Sinkhorn-guided Masked Video Modeling (SIGMA) is a novel video pretraining method.
We distribute features of space-time tubes evenly across a limited number of learnable clusters.
Experimental results on ten datasets validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations.
arXiv Detail & Related papers (2024-07-22T08:04:09Z) - SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation [35.063881868130075]
This paper studies referring video object segmentation (RVOS) by boosting video-level visual-linguistic alignment.
We propose Semantic-assisted Object Cluster (SOC), which aggregates video content and textual guidance for unified temporal modeling and cross-modal alignment.
We conduct extensive experiments on popular RVOS benchmarks, and our method outperforms state-of-the-art competitors on all benchmarks by a remarkable margin.
arXiv Detail & Related papers (2023-05-26T15:13:44Z) - Learning Temporal Distribution and Spatial Correlation Towards Universal Moving Object Segmentation [8.807766029291901]
We propose a method called Learning Temporal Distribution and Spatial Correlation (LTS) that has the potential to be a general solution for universal moving object segmentation.
In the proposed approach, the distribution from temporal pixels is first learned by our Defect Iterative Distribution Learning (DIDL) network for scene-independent segmentation.
The proposed approach performs well for almost all videos from diverse and complex natural scenes with fixed parameters.
arXiv Detail & Related papers (2023-04-19T20:03:09Z) - Attention in Attention: Modeling Context Correlation for Efficient Video Classification [47.938500236792244]
This paper proposes an efficient attention-in-attention (AIA) method for focus-wise feature refinement.
We instantiate video feature contexts as dynamics aggregated along a specific axis with global average pooling operations.
All the computational operations in attention units act on the pooled dimension, which results in very little increase in computational cost.
arXiv Detail & Related papers (2022-04-20T08:37:52Z) - Exploring Intra- and Inter-Video Relation for Surgical Semantic Scene Segmentation [58.74791043631219]
We propose a novel framework STswinCL that explores the complementary intra- and inter-video relations to boost segmentation performance.
We extensively validate our approach on two public surgical video benchmarks, including EndoVis18 Challenge and CaDIS dataset.
Experimental results demonstrate the promising performance of our method, which consistently exceeds previous state-of-the-art approaches.
arXiv Detail & Related papers (2022-03-29T05:52:23Z) - Temporal-attentive Covariance Pooling Networks for Video Recognition [52.853765492522655]
Existing video architectures usually generate a global representation by using a simple global average pooling (GAP) method.
This paper proposes a Temporal-attentive Covariance Pooling (TCP), inserted at the end of deep architectures, to produce powerful video representations.
Our TCP is model-agnostic and can be flexibly integrated into any video architectures, resulting in TCPNet for effective video recognition.
arXiv Detail & Related papers (2021-10-27T12:31:29Z) - T2VLAD: Global-Local Sequence Alignment for Text-Video Retrieval [59.990432265734384]
Text-video retrieval is a challenging task that aims to search relevant video contents based on natural language descriptions.
Most existing methods only consider the global cross-modal similarity and overlook the local details.
In this paper, we design an efficient global-local alignment method.
We achieve consistent improvements on three standard text-video retrieval benchmarks and outperform the state-of-the-art by a clear margin.
arXiv Detail & Related papers (2021-04-20T15:26:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.