V4D: 4D Convolutional Neural Networks for Video-level Representation
Learning
- URL: http://arxiv.org/abs/2002.07442v1
- Date: Tue, 18 Feb 2020 09:27:41 GMT
- Title: V4D: 4D Convolutional Neural Networks for Video-level Representation
Learning
- Authors: Shiwen Zhang and Sheng Guo and Weilin Huang and Matthew R. Scott and
Limin Wang
- Abstract summary: Most 3D CNNs for video representation learning are clip-based, and thus do not consider video-level temporal evolution of features.
We propose Video-level 4D Convolutional Neural Networks, or V4D, to model long-range representation with 4D convolutions.
V4D achieves excellent results, surpassing recent 3D CNNs by a large margin.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most existing 3D CNNs for video representation learning are clip-based
methods, and thus do not consider video-level temporal evolution of
spatio-temporal features. In this paper, we propose Video-level 4D
Convolutional Neural Networks, referred to as V4D, to model the evolution of
long-range spatio-temporal representation with 4D convolutions, and at the same
time, to preserve strong 3D spatio-temporal representation with residual
connections. Specifically, we design a new 4D residual block able to capture
inter-clip interactions, which could enhance the representation power of the
original clip-level 3D CNNs. The 4D residual blocks can be easily integrated
into the existing 3D CNNs to perform long-range modeling hierarchically. We
further introduce the training and inference methods for the proposed V4D.
Extensive experiments are conducted on three video recognition benchmarks,
where V4D achieves excellent results, surpassing recent 3D CNNs by a large
margin.
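
The core idea of the 4D residual block described above, capturing inter-clip interactions while preserving the clip-level representation through a residual connection, can be illustrated with a minimal sketch. This is not the authors' implementation: it flattens each clip's 3D spatio-temporal features into a plain vector and applies a learned kernel over clip offsets (a 1D convolution along the clip axis, standing in for the full 4D convolution), with the offsets, zero padding, and `weights` structure all being illustrative assumptions.

```python
def residual_4d_block(clips, weights):
    """Sketch of a V4D-style residual block over the clip dimension.

    clips:   list of per-clip feature vectors (each a list of floats),
             standing in for per-clip 3D spatio-temporal features.
    weights: dict mapping clip offsets to scalar kernel weights,
             e.g. {-1: 0.25, 0: 0.5, 1: 0.25} (assumed, for illustration).
    """
    num_clips = len(clips)
    dim = len(clips[0])
    out = []
    for u in range(num_clips):
        # Aggregate features from neighboring clips (inter-clip interaction).
        agg = [0.0] * dim
        for offset, w in weights.items():
            v = u + offset
            if 0 <= v < num_clips:  # zero padding outside the video
                for d in range(dim):
                    agg[d] += w * clips[v][d]
        # Residual connection preserves the original clip-level features.
        out.append([clips[u][d] + agg[d] for d in range(dim)])
    return out
```

In the paper's formulation the kernel also spans space and intra-clip time (a true 4D convolution) and the block is inserted into a 3D CNN backbone; this sketch only conveys the inter-clip aggregation plus residual structure.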
Related papers
- Stereo4D: Learning How Things Move in 3D from Internet Stereo Videos [76.07894127235058]
We present a system for mining high-quality 4D reconstructions from internet stereoscopic, wide-angle videos.
We use this method to generate large-scale data in the form of world-consistent, pseudo-metric 3D point clouds.
We demonstrate the utility of this data by training a variant of DUSt3R to predict structure and 3D motion from real-world image pairs.
arXiv Detail & Related papers (2024-12-12T18:59:54Z)
- Diffusion4D: Fast Spatial-temporal Consistent 4D Generation via Video Diffusion Models [116.31344506738816]
We present a novel framework, Diffusion4D, for efficient and scalable 4D content generation.
We develop a 4D-aware video diffusion model capable of synthesizing orbital views of dynamic 3D assets.
Our method surpasses prior state-of-the-art techniques in terms of generation efficiency and 4D geometry consistency.
arXiv Detail & Related papers (2024-05-26T17:47:34Z)
- Efficient4D: Fast Dynamic 3D Object Generation from a Single-view Video [42.10482273572879]
We propose an efficient video-to-4D object generation framework called Efficient4D.
It generates high-quality spacetime-consistent images under different camera views, and then uses them as labeled data.
Experiments on both synthetic and real videos show that Efficient4D offers a remarkable 10-fold increase in speed.
arXiv Detail & Related papers (2024-01-16T18:58:36Z) - 4DGen: Grounded 4D Content Generation with Spatial-temporal Consistency [118.15258850780417]
We present 4DGen, a novel framework for grounded 4D content creation.
Our pipeline facilitates controllable 4D generation, enabling users to specify the motion via monocular video or adopt image-to-video generations.
Compared to existing video-to-4D baselines, our approach yields superior results in faithfully reconstructing input signals.
arXiv Detail & Related papers (2023-12-28T18:53:39Z) - F4D: Factorized 4D Convolutional Neural Network for Efficient
Video-level Representation Learning [4.123763595394021]
Most existing 3D convolutional neural network (CNN)-based methods for video-level representation learning are clip-based.
We propose a factorized 4D CNN architecture with attention (F4D) that is capable of learning more effective, finer-grained, long-term temporal video representations.
arXiv Detail & Related papers (2023-11-28T19:21:57Z) - Consistent4D: Consistent 360{\deg} Dynamic Object Generation from
Monocular Video [15.621374353364468]
Consistent4D is a novel approach for generating 4D dynamic objects from uncalibrated monocular videos.
We cast the 360-degree dynamic object reconstruction as a 4D generation problem, eliminating the need for tedious multi-view data collection and camera calibration.
arXiv Detail & Related papers (2023-11-06T03:26:43Z) - Learning Parallel Dense Correspondence from Spatio-Temporal Descriptors
for Efficient and Robust 4D Reconstruction [43.60322886598972]
This paper focuses on the task of 4D shape reconstruction from a sequence of point clouds.
We present a novel pipeline to learn a temporal evolution of the 3D human shape through capturing continuous transformation functions among cross-frame occupancy fields.
arXiv Detail & Related papers (2021-03-30T13:36:03Z) - Learning Compositional Representation for 4D Captures with Neural ODE [72.56606274691033]
We introduce a compositional representation for 4D captures, that disentangles shape, initial state, and motion respectively.
To model the motion, a neural Ordinary Differential Equation (ODE) is trained to update the initial state conditioned on the learned motion code.
A decoder takes the shape code and the updated pose code to reconstruct 4D captures at each time stamp.
arXiv Detail & Related papers (2021-03-15T10:55:55Z)
- A Real-time Action Representation with Temporal Encoding and Deep Compression [115.3739774920845]
We propose a new real-time convolutional architecture, called Temporal Convolutional 3D Network (T-C3D), for action representation.
T-C3D learns video action representations in a hierarchical multi-granularity manner while obtaining a high process speed.
Our method achieves clear improvements on the UCF101 action recognition benchmark over state-of-the-art real-time methods: 5.4% higher accuracy and 2x faster inference, with a storage model under 5 MB.
arXiv Detail & Related papers (2020-06-17T06:30:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.