Learnable Sampling 3D Convolution for Video Enhancement and Action
Recognition
- URL: http://arxiv.org/abs/2011.10974v1
- Date: Sun, 22 Nov 2020 09:20:49 GMT
- Title: Learnable Sampling 3D Convolution for Video Enhancement and Action
Recognition
- Authors: Shuyang Gu, Jianmin Bao, Dong Chen
- Abstract summary: We introduce a new module, learnable sampling 3D convolution (LS3D-Conv), to improve the capability of 3D convolution.
We add learnable 2D offsets to 3D convolution that determine the sampling locations on spatial feature maps across frames.
The experiments on video interpolation, video super-resolution, video denoising, and action recognition demonstrate the effectiveness of our approach.
- Score: 24.220358793070965
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A key challenge in video enhancement and action recognition is to fuse useful
information from neighboring frames. Recent works suggest establishing accurate
correspondences between neighboring frames before fusing temporal information.
However, the generated results heavily depend on the quality of correspondence
estimation. In this paper, we propose a more robust solution: sampling and
fusing multi-level features across neighboring frames to generate the results.
Based on this idea, we introduce a new module to improve the capability of 3D
convolution, namely, learnable sampling 3D convolution (LS3D-Conv). We add
learnable 2D offsets to 3D convolution that sample locations on the spatial
feature maps across frames. The offsets can be learned for specific tasks.
LS3D-Conv can flexibly replace the 3D convolution layers in existing 3D
networks, yielding new architectures that learn sampling at multiple feature
levels. The experiments on video
interpolation, video super-resolution, video denoising, and action recognition
demonstrate the effectiveness of our approach.
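To make the mechanism concrete, below is a minimal PyTorch sketch of a 3D convolution with learnable 2D sampling offsets, in the spirit of the LS3D-Conv described above. It is not the authors' implementation: the offset-prediction head, the decomposition into per-tap 2D convolutions, and the clamped border handling are all assumptions chosen to keep the example short and runnable.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LS3DConvSketch(nn.Module):
    """Sketch of a 3D convolution with learnable 2D sampling offsets.

    A (K, 3, 3) 3D convolution is decomposed into K per-frame 2D
    convolutions (one per temporal tap). Each tap's input frames are
    first warped by a predicted 2D offset field, so the kernel samples
    learned locations on the spatial feature maps of neighboring frames.
    With zero offsets this reduces to an ordinary 3D convolution.
    """

    def __init__(self, in_ch, out_ch, t_kernel=3):
        super().__init__()
        self.t_kernel = t_kernel
        # One 2D conv per temporal tap; summing the taps' outputs is
        # equivalent to a (t_kernel, 3, 3) 3D convolution.
        self.taps = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=(k == 0))
            for k in range(t_kernel))
        # Hypothetical offset head: one (dx, dy) field per tap and frame.
        self.offset_head = nn.Conv3d(in_ch, 2 * t_kernel, 3, padding=1)
        nn.init.zeros_(self.offset_head.weight)  # identity sampling at init
        nn.init.zeros_(self.offset_head.bias)

    def forward(self, x):
        # x: (N, C, T, H, W) feature volume.
        n, c, t, h, w = x.shape
        off = self.offset_head(x).view(n, self.t_kernel, 2, t, h, w)

        # Base sampling grid in grid_sample's [-1, 1] (x, y) convention.
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h, device=x.device),
            torch.linspace(-1, 1, w, device=x.device), indexing="ij")
        base = torch.stack((xs, ys), dim=-1)  # (H, W, 2)

        out = 0.0
        for k in range(self.t_kernel):
            # Temporal tap: neighboring frames, clamped at clip borders.
            shift = k - self.t_kernel // 2
            idx = (torch.arange(t, device=x.device) + shift).clamp(0, t - 1)
            frames = x[:, :, idx]  # (N, C, T, H, W)
            # Pixel offsets -> normalized grid_sample coordinates.
            dx = off[:, k, 0] * 2.0 / max(w - 1, 1)  # (N, T, H, W)
            dy = off[:, k, 1] * 2.0 / max(h - 1, 1)
            grid = base + torch.stack((dx, dy), dim=-1)  # (N, T, H, W, 2)
            warped = F.grid_sample(
                frames.permute(0, 2, 1, 3, 4).reshape(n * t, c, h, w),
                grid.reshape(n * t, h, w, 2), align_corners=True)
            out = out + self.taps[k](warped)  # accumulate over taps

        # (N*T, out_ch, H, W) -> (N, out_ch, T, H, W)
        return out.view(n, t, -1, h, w).permute(0, 2, 1, 3, 4)
```

Because the module keeps the (N, C, T, H, W) interface of nn.Conv3d, a call such as LS3DConvSketch(64, 64) could stand in for nn.Conv3d(64, 64, 3, padding=1) in an existing 3D network, matching the abstract's claim that LS3D-Conv can flexibly replace standard 3D convolution layers.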
Related papers
- ULIP: Learning a Unified Representation of Language, Images, and Point
Clouds for 3D Understanding [110.07170245531464]
Current 3D models are limited by datasets with a small amount of annotated data and a pre-defined set of categories.
Recent advances have shown that similar problems can be significantly alleviated by employing knowledge from other modalities, such as language.
We learn a unified representation of images, texts, and 3D point clouds by pre-training with object triplets from the three modalities.
arXiv Detail & Related papers (2022-12-10T01:34:47Z)
- 3D-CSL: self-supervised 3D context similarity learning for Near-Duplicate
Video Retrieval [17.69904571043164]
We introduce 3D-CSL, a compact pipeline for Near-Duplicate Video Retrieval (NDVR).
We propose a two-stage self-supervised similarity learning strategy to optimize the network.
Our method achieves state-of-the-art performance on clip-level NDVR.
arXiv Detail & Related papers (2022-11-10T05:51:08Z)
- Focal Sparse Convolutional Networks for 3D Object Detection [121.45950754511021]
We introduce two new modules to enhance the capability of Sparse CNNs.
They are focal sparse convolution (Focals Conv) and its multi-modal variant, focal sparse convolution with fusion.
For the first time, we show that spatially learnable sparsity in sparse convolution is essential for sophisticated 3D object detection.
arXiv Detail & Related papers (2022-04-26T17:34:10Z)
- Exploring Optical-Flow-Guided Motion and Detection-Based Appearance for
Temporal Sentence Grounding [61.57847727651068]
Temporal sentence grounding aims to localize a target segment in an untrimmed video semantically according to a given sentence query.
Most previous works focus on learning frame-level features of each whole frame in the entire video and directly matching them with the textual information.
We propose a novel Motion- and Appearance-guided 3D Semantic Reasoning Network (MA3SRN), which incorporates optical-flow-guided motion-aware, detection-based appearance-aware, and 3D-aware object-level features.
arXiv Detail & Related papers (2022-03-06T13:57:09Z)
- 2D or not 2D? Adaptive 3D Convolution Selection for Efficient Video
Recognition [84.697097472401]
We introduce Ada3D, a conditional computation framework that learns instance-specific 3D usage policies to determine frames and convolution layers to be used in a 3D network (see the gating sketch after this list).
We demonstrate that our method achieves similar accuracies to state-of-the-art 3D models while requiring 20%-50% less computation across different datasets.
arXiv Detail & Related papers (2020-12-29T21:40:38Z)
- Making a Case for 3D Convolutions for Object Segmentation in Videos [16.167397418720483]
We show that 3D convolutional networks can be effectively applied to dense video prediction tasks such as salient object segmentation.
We propose a 3D decoder architecture that comprises novel 3D Global Convolution layers and 3D Refinement modules.
Our approach outperforms existing state-of-the-art methods by a large margin on the DAVIS'16 Unsupervised, FBMS and ViSal benchmarks.
arXiv Detail & Related papers (2020-08-26T12:24:23Z)
- Appearance-Preserving 3D Convolution for Video-based Person
Re-identification [61.677153482995564]
We propose Appearance-Preserving 3D Convolution (AP3D), which is composed of two components: an Appearance-Preserving Module (APM) and a 3D convolution kernel.
It is easy to combine AP3D with existing 3D ConvNets by simply replacing the original 3D convolution kernels with AP3Ds.
arXiv Detail & Related papers (2020-07-16T16:21:34Z)
- Lightweight Multi-View 3D Pose Estimation through Camera-Disentangled
Representation [57.11299763566534]
We present a solution to recover 3D pose from multi-view images captured with spatially calibrated cameras.
We exploit 3D geometry to fuse input images into a unified latent representation of pose, which is disentangled from camera view-points.
Our architecture then conditions the learned representation on camera projection operators to produce accurate per-view 2D detections.
arXiv Detail & Related papers (2020-04-05T12:52:29Z)
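The Ada3D entry above describes conditional computation: a policy decides, per input clip, which frames and which 3D convolution layers to run. As a rough illustration only, here is a minimal layer-gating sketch; the module names, the pooled policy head, and the hard-threshold gating are assumptions, not the paper's documented design.

```python
import torch
import torch.nn as nn


class GatedBackboneSketch(nn.Module):
    """Instance-conditional gating over 3D conv layers (illustrative only)."""

    def __init__(self, channels=64, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Conv3d(channels, channels, 3, padding=1)
            for _ in range(num_layers))
        # Policy head: pooled clip statistics -> one keep/skip logit per layer.
        self.policy = nn.Linear(channels, num_layers)

    def forward(self, x):
        # x: (N, C, T, H, W)
        logits = self.policy(x.mean(dim=(2, 3, 4)))  # (N, num_layers)
        # Hard gates for clarity; training would need a differentiable
        # relaxation (e.g. Gumbel-softmax) -- an assumption, not Ada3D's
        # documented recipe.
        gates = (torch.sigmoid(logits) > 0.5).float()
        for i, layer in enumerate(self.layers):
            g = gates[:, i].view(-1, 1, 1, 1, 1)
            x = g * layer(x) + (1.0 - g) * x  # skip the layer when g == 0
        return x
```

Gating frames would work the same way, with one logit per input frame zeroing out frames before the backbone runs.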