GCF-Net: Gated Clip Fusion Network for Video Action Recognition
- URL: http://arxiv.org/abs/2102.01285v1
- Date: Tue, 2 Feb 2021 03:51:55 GMT
- Title: GCF-Net: Gated Clip Fusion Network for Video Action Recognition
- Authors: Jenhao Hsiao and Jiawei Chen and Chiuman Ho
- Abstract summary: We introduce the Gated Clip Fusion Network (GCF-Net) for video action recognition.
GCF-Net explicitly models the inter-dependencies between video clips to strengthen the receptive field of local clip descriptors.
On a large benchmark dataset (Kinetics-600), the proposed GCF-Net elevates the accuracy of existing action classifiers by 11.49%.
- Score: 11.945392734711056
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, most of the accuracy gains for video action recognition have come from newly designed CNN architectures (e.g., 3D-CNNs). These models are trained by applying a deep CNN to a single clip of fixed temporal length. Since each video segment is processed by the 3D-CNN module separately, the corresponding clip descriptor is local and the inter-clip relationships are inherently implicit. The common method of directly averaging the clip-level outputs as a video-level prediction is prone to failure due to the lack of a mechanism that can extract and integrate relevant information to represent the video.
In this paper, we introduce the Gated Clip Fusion Network (GCF-Net), which can greatly boost existing video action classifiers at the cost of a tiny computational overhead. The GCF-Net explicitly models the inter-dependencies between video clips to strengthen the receptive field of local clip descriptors. Furthermore, the importance of each clip to an action event is calculated, and a relevant subset of clips is selected accordingly for video-level analysis. On a large benchmark dataset (Kinetics-600), the proposed GCF-Net elevates the accuracy of existing action classifiers by 11.49% (based on the central clip) and 3.67% (based on densely sampled clips), respectively.
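For intuition, the fusion idea described in the abstract can be pictured as a small module that sits on top of per-clip descriptors from an off-the-shelf 3D-CNN backbone. The sketch below is a minimal, hypothetical illustration in PyTorch, not the authors' implementation: the multi-head self-attention, the sigmoid gate, the feature dimension (2048), and the use of soft gating in place of an explicit clip-subset selection are all assumptions made for illustration.

```python
# Minimal sketch (not the paper's implementation): fuse per-clip descriptors
# produced by a 3D-CNN backbone. Inter-clip dependencies are modeled with
# self-attention, and a per-clip gate weights each clip's contribution to the
# video-level prediction. Dimensions and module choices are assumptions.
import torch
import torch.nn as nn

class GatedClipFusion(nn.Module):
    def __init__(self, dim=2048, num_classes=600, heads=4):
        super().__init__()
        # Self-attention strengthens each local clip descriptor with context
        # from the other clips of the same video.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # A lightweight gate scores the relevance of each clip to the action.
        self.gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, clip_feats):              # (B, N_clips, dim)
        ctx, _ = self.attn(clip_feats, clip_feats, clip_feats)
        ctx = ctx + clip_feats                  # residual: keep local evidence
        gates = self.gate(ctx)                  # (B, N_clips, 1), clip relevance
        video_feat = (gates * ctx).sum(1) / gates.sum(1).clamp(min=1e-6)
        return self.classifier(video_feat)      # video-level logits

# Usage: descriptors from e.g. 16 sampled clips of one Kinetics-600 video.
feats = torch.randn(2, 16, 2048)
logits = GatedClipFusion()(feats)               # (2, 600)
```

In this sketch the soft gates only approximate the relevance-based clip selection described above; taking a hard top-k over the gate scores would correspond more literally to selecting a subset of clips for the video-level analysis.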
Related papers
- CSTA: CNN-based Spatiotemporal Attention for Video Summarization [0.24578723416255752]
We propose a CNN-based SpatioTemporal Attention (CSTA) method that stacks each feature of frames from a single video to form image-like frame representations.
Our methodology relies on CNN to comprehend the inter and intra-frame relations and to find crucial attributes in videos by exploiting its ability to learn absolute positions within images.
arXiv Detail & Related papers (2024-05-20T09:38:37Z)
- GMMFormer: Gaussian-Mixture-Model Based Transformer for Efficient Partially Relevant Video Retrieval [59.47258928867802]
Given a text query, partially relevant video retrieval (PRVR) seeks to find videos containing pertinent moments in a database.
This paper proposes GMMFormer, a Gaussian-Mixture-Model based Transformer which models clip representations implicitly.
Experiments on three large-scale video datasets demonstrate the superiority and efficiency of GMMFormer.
arXiv Detail & Related papers (2023-10-08T15:04:50Z)
- Spatio-temporal Co-attention Fusion Network for Video Splicing Localization [2.3838507844983248]
A three-stream network is used as encoder to capture manipulation traces across multiple frames.
A lightweight multilayer perceptron (MLP) decoder is adopted to yield a pixel-level tampering localization map.
A new large-scale video splicing dataset is created for training the SCFNet.
arXiv Detail & Related papers (2023-09-18T04:46:30Z)
- Per-Clip Video Object Segmentation [110.08925274049409]
Recently, memory-based approaches have shown promising results on semi-supervised video object segmentation.
We treat video object segmentation as clip-wise mask propagation.
We propose a new method tailored for per-clip inference.
arXiv Detail & Related papers (2022-08-03T09:02:29Z)
- Video K-Net: A Simple, Strong, and Unified Baseline for Video Segmentation [85.08156742410527]
Video K-Net is a framework for end-to-end video panoptic segmentation.
It unifies image segmentation via a group of learnable kernels.
Video K-Net learns to simultaneously segment and track "things" and "stuff".
arXiv Detail & Related papers (2022-04-10T11:24:47Z)
- Action Keypoint Network for Efficient Video Recognition [63.48422805355741]
This paper proposes to integrate temporal and spatial selection into an Action Keypoint Network (AK-Net).
AK-Net selects some informative points scattered in arbitrary-shaped regions as a set of action keypoints and then transforms the video recognition into point cloud classification.
Experimental results show that AK-Net can consistently improve the efficiency and performance of baseline methods on several video recognition benchmarks.
arXiv Detail & Related papers (2022-01-17T09:35:34Z)
- Skimming and Scanning for Untrimmed Video Action Recognition [44.70501912319826]
Untrimmed videos have redundant and diverse clips containing contextual information.
We propose a simple yet effective clip-level solution based on skim-scan techniques.
Our solution surpasses the state-of-the-art performance in terms of both accuracy and efficiency.
arXiv Detail & Related papers (2021-04-21T12:23:44Z)
- Beyond Short Clips: End-to-End Video-Level Learning with Collaborative Memories [56.91664227337115]
We introduce a collaborative memory mechanism that encodes information across multiple sampled clips of a video at each training iteration.
This enables the learning of long-range dependencies beyond a single clip.
Our proposed framework is end-to-end trainable and significantly improves the accuracy of video classification at a negligible computational overhead.
arXiv Detail & Related papers (2021-04-02T18:59:09Z)
- Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling [98.41300980759577]
A canonical approach to video-and-language learning dictates a neural model to learn from offline-extracted dense video features.
We propose a generic framework ClipBERT that enables affordable end-to-end learning for video-and-language tasks.
Experiments on text-to-video retrieval and video question answering on six datasets demonstrate that ClipBERT outperforms existing methods.
arXiv Detail & Related papers (2021-02-11T18:50:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.