Is 3D Convolution with 5D Tensors Really Necessary for Video Analysis?
- URL: http://arxiv.org/abs/2407.16514v1
- Date: Tue, 23 Jul 2024 14:30:51 GMT
- Title: Is 3D Convolution with 5D Tensors Really Necessary for Video Analysis?
- Authors: Habib Hajimolahoseini, Walid Ahmed, Austin Wen, Yang Liu
- Abstract summary: We present several novel techniques for implementing 3D convolutional blocks using 2D and/or 1D convolutions with only 4D and/or 3D tensors.
Our motivation is that 3D convolutions with 5D tensors are computationally expensive and they may not be supported by some of the edge devices used in real-time applications such as robots.
- Score: 4.817356884702073
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In this paper, we present a comprehensive study and propose several novel techniques for implementing 3D convolutional blocks using 2D and/or 1D convolutions with only 4D and/or 3D tensors. Our motivation is that 3D convolutions with 5D tensors are computationally very expensive and may not be supported by some of the edge devices used in real-time applications such as robots. Existing approaches mitigate this by splitting the 3D kernels into spatial and temporal domains, but they still use 3D convolutions with 5D tensors in their implementations. We resolve this issue by introducing appropriate 4D/3D tensor reshaping as well as new combination techniques for the spatial and temporal splits (see the sketch below). The proposed implementation methods show significant improvements in both efficiency and accuracy. The experimental results confirm that the proposed spatio-temporal processing structure outperforms the original model in speed and accuracy while using only 4D tensors and fewer parameters.
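A minimal sketch of this factorization idea (not the authors' released implementation): the spatial part runs as a 2D convolution with the frames folded into the batch dimension (4D tensors), and the temporal part runs as a 1D convolution with the spatial locations folded into the batch dimension (3D tensors), so no Conv3d and no 5D tensor is ever used. The class name, the particular reshape/permute order, and the kernel sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Factorised3DBlock(nn.Module):
    """Spatial 2D conv + temporal 1D conv; no Conv3d and no 5D tensors (a sketch)."""

    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.spatial = nn.Conv2d(in_ch, out_ch, kernel_size=k, padding=k // 2)
        self.temporal = nn.Conv1d(out_ch, out_ch, kernel_size=k, padding=k // 2)

    def forward(self, x, n, t):
        # x: a video batch already flattened to 4D, shape (N*T, C, H, W)
        y = self.spatial(x)                                   # (N*T, C', H, W)
        _, c2, h, w = y.shape
        # fold the spatial positions into the batch so the 1D kernel slides over T
        y = y.reshape(n, t, c2, h * w)                        # 4D
        y = y.permute(0, 3, 2, 1).reshape(n * h * w, c2, t)   # 3D: (N*H*W, C', T)
        y = self.temporal(y)
        # restore the flattened 4D video layout (N*T, C', H, W)
        y = y.reshape(n, h * w, c2, t)
        y = y.permute(0, 3, 2, 1).reshape(n * t, c2, h, w)
        return y

n, t = 2, 8
frames = torch.randn(n * t, 3, 32, 32)            # 4D input, never 5D
out = Factorised3DBlock(3, 16)(frames, n, t)
print(out.shape)                                   # torch.Size([16, 16, 32, 32])
```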
Related papers
- Any2Point: Empowering Any-modality Large Models for Efficient 3D Understanding [83.63231467746598]
We introduce Any2Point, a parameter-efficient method to empower any-modality large models (vision, language, audio) for 3D understanding.
We propose a 3D-to-any (1D or 2D) virtual projection strategy that correlates the input 3D points to the original 1D or 2D positions within the source modality.
arXiv Detail & Related papers (2024-04-11T17:59:45Z)
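The "virtual projection" summarized above can be illustrated with a generic pinhole projection that maps each 3D point to a 2D pixel position, which a pretrained 2D model's positional embeddings could then be indexed by. Any2Point's actual strategy (multiple virtual views, 1D as well as 2D targets) is more involved; the function below is only a hedged toy sketch with made-up parameters.

```python
import numpy as np

def virtual_project_to_2d(points, focal=1.0, image_size=224):
    """Toy pinhole 'virtual projection' (NOT Any2Point's exact strategy):
    map 3D points (M, 3) to integer pixel coordinates (M, 2) that could
    index a 2D model's positional embeddings."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    z = np.clip(z, 1e-6, None)                 # avoid division by zero
    u = focal * x / z                          # pinhole projection
    v = focal * y / z
    # normalise to [0, image_size) pixel coordinates
    u = (u - u.min()) / (u.max() - u.min() + 1e-8) * (image_size - 1)
    v = (v - v.min()) / (v.max() - v.min() + 1e-8) * (image_size - 1)
    return np.stack([u, v], axis=1).round().astype(int)

pts = np.random.rand(1024, 3) + np.array([0.0, 0.0, 1.0])   # points in front of the camera
print(virtual_project_to_2d(pts)[:3])
```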
- Hash3D: Training-free Acceleration for 3D Generation [72.88137795439407]
Hash3D is a universal, training-free acceleration method for 3D generation.
By hashing and reusing feature maps across neighboring timesteps and camera angles, Hash3D avoids a substantial amount of redundant computation.
Surprisingly, this feature-sharing mechanism not only speeds up generation but also enhances the smoothness and view consistency of the synthesized 3D objects (a simplified cache is sketched below).
arXiv Detail & Related papers (2024-04-09T07:49:30Z)
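A simplified illustration of the feature-sharing idea, assuming a cache keyed by quantized (timestep, camera-angle) buckets; Hash3D's actual grid-based hashing and interpolation is more sophisticated, and the class name and bucket sizes below are illustrative only.

```python
import numpy as np

class FeatureCache:
    """Reuse feature maps across nearby timesteps / camera angles
    (a toy stand-in for Hash3D-style training-free feature sharing)."""

    def __init__(self, t_bin=50, angle_bin=10.0):
        self.t_bin, self.angle_bin, self.store = t_bin, angle_bin, {}

    def _key(self, timestep, azimuth_deg):
        # quantise so that neighbouring timesteps / angles land in the same bucket
        return (timestep // self.t_bin, round(azimuth_deg / self.angle_bin))

    def get_or_compute(self, timestep, azimuth_deg, compute_fn):
        key = self._key(timestep, azimuth_deg)
        if key not in self.store:                  # cache miss: run the expensive op
            self.store[key] = compute_fn()
        return self.store[key]

cache = FeatureCache()
expensive = lambda: np.random.randn(64, 32, 32)    # stand-in for a costly feature map
f1 = cache.get_or_compute(timestep=980, azimuth_deg=31.0, compute_fn=expensive)
f2 = cache.get_or_compute(timestep=975, azimuth_deg=33.0, compute_fn=expensive)
print(f1 is f2)    # True: the second call reuses the cached feature map
```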
- LN3Diff: Scalable Latent Neural Fields Diffusion for Speedy 3D Generation [73.36690511083894]
This paper introduces LN3Diff, a novel framework that provides a unified 3D diffusion pipeline.
Our approach harnesses a 3D-aware architecture and a variational autoencoder to encode the input image into a structured, compact 3D latent space.
It achieves state-of-the-art performance on ShapeNet for 3D generation and demonstrates superior performance in monocular 3D reconstruction and conditional 3D generation.
arXiv Detail & Related papers (2024-03-18T17:54:34Z)
- PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm [114.47216525866435]
We introduce a novel universal 3D pre-training framework designed to facilitate the acquisition of efficient 3D representation.
For the first time, PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor benchmarks, demonstrating its effectiveness.
arXiv Detail & Related papers (2023-10-12T17:59:57Z)
- Smaller3d: Smaller Models for 3D Semantic Segmentation Using Minkowski Engine and Knowledge Distillation Methods [0.0]
This paper proposes the application of knowledge distillation techniques, especially for sparse tensors in 3D deep learning, to reduce model sizes while maintaining performance.
We analyze and propose different loss functions, including standard methods and combinations of various losses, to match the performance of state-of-the-art sparse convolutional networks (one common soft/hard loss combination is sketched below).
arXiv Detail & Related papers (2023-05-04T22:19:25Z)
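As a hedged example of combining losses for distillation (not necessarily one of the combinations studied in the paper), the sketch below mixes the standard soft-target KL term with the hard-label cross-entropy on toy per-point segmentation logits.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Combine a soft-target loss from the teacher with the usual hard-label loss."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)                                  # rescale gradients back to label scale
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# toy per-point semantic segmentation logits: (num_points, num_classes)
s = torch.randn(100, 20, requires_grad=True)
t = torch.randn(100, 20)
y = torch.randint(0, 20, (100,))
print(distillation_loss(s, t, y).item())
```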
- TR3D: Towards Real-Time Indoor 3D Object Detection [6.215404942415161]
TR3D is a fully-convolutional 3D object detection model trained end-to-end.
To take advantage of both point cloud and RGB inputs, we introduce an early fusion of 2D and 3D features.
Our model with early feature fusion, which we refer to as TR3D+FF, outperforms existing 3D object detection approaches on the SUN RGB-D dataset.
arXiv Detail & Related papers (2023-02-06T15:25:50Z)
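A generic sketch of what early fusion of 2D and 3D features can look like: each point's 3D feature is concatenated with the 2D backbone feature sampled at the pixel the point projects to. TR3D+FF's actual fusion module may differ; shapes and names here are illustrative assumptions.

```python
import torch

def early_fusion(point_feats, point_pixels, image_feats):
    """Concatenate each point's 3D feature with the 2D image feature sampled
    at the pixel the point projects to (a generic early-fusion sketch)."""
    # point_feats:  (M, C3)    per-point features
    # point_pixels: (M, 2)     integer (u, v) pixel coordinates of each point
    # image_feats:  (C2, H, W) feature map from a 2D backbone
    u, v = point_pixels[:, 0], point_pixels[:, 1]
    sampled = image_feats[:, v, u].T                   # (M, C2) gathered per point
    return torch.cat([point_feats, sampled], dim=1)    # (M, C3 + C2)

pts3d = torch.randn(500, 64)
pix = torch.randint(0, 56, (500, 2))
img = torch.randn(128, 56, 56)
print(early_fusion(pts3d, pix, img).shape)             # torch.Size([500, 192])
```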
- T4DT: Tensorizing Time for Learning Temporal 3D Visual Data [19.418308324435916]
We show that low-rank tensor compression yields an extremely compact representation for storing and querying time-varying signed distance functions (a toy low-rank example is sketched below).
Unlike existing iterative learning-based approaches like DeepSDF and NeRF, our method uses a closed-form algorithm with theoretical guarantees.
arXiv Detail & Related papers (2022-08-02T12:57:08Z)
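The paper builds on low-rank tensor decompositions; as a much simpler stand-in, the toy example below compresses a time-varying SDF grid with a truncated SVD and queries any (frame, voxel) value in closed form from the stored factors. The grid sizes and rank are arbitrary, and random data will not compress as well as a real SDF sequence.

```python
import numpy as np

# Toy time-varying SDF grid: T frames of an X*Y*Z signed-distance volume (random stand-in).
T, X, Y, Z, rank = 16, 32, 32, 32, 8
sdf = np.random.randn(T, X, Y, Z).astype(np.float32)

# Low-rank compression: flatten space and keep the top-`rank` singular triplets.
M = sdf.reshape(T, -1)                        # (T, X*Y*Z)
U, S, Vt = np.linalg.svd(M, full_matrices=False)
U, S, Vt = U[:, :rank], S[:rank], Vt[:rank]   # compressed factors

def query(t, x, y, z):
    """Closed-form lookup of the compressed SDF value at frame t, voxel (x, y, z)."""
    col = (x * Y + y) * Z + z                  # flattened spatial index
    return float(U[t] @ (S * Vt[:, col]))

print(sdf[3, 5, 6, 7], query(3, 5, 6, 7))      # original vs. rank-8 approximation
full = T * X * Y * Z
compressed = U.size + S.size + Vt.size
print(f"compression ratio: {full / compressed:.1f}x")
```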
- Asymmetric 3D Context Fusion for Universal Lesion Detection [55.61873234187917]
3D networks are strong in 3D context yet lack supervised pretraining.
Existing 3D context fusion operators are designed to be spatially symmetric, performing identical operations (such as convolutions) on each 2D slice.
We propose a novel asymmetric 3D context fusion operator (A3D), which uses different weights to fuse 3D context from different 2D slices.
arXiv Detail & Related papers (2021-09-17T16:25:10Z)
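A minimal illustration of the "different weights per slice" idea: a learnable mixing matrix combines the D slices with a distinct weight vector per output slice, in contrast to a depth-wise convolution that applies identical weights to every slice. A3D's actual operator is more elaborate; the module name and shapes below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AsymmetricDepthFusion(nn.Module):
    """Fuse context across the D slices of a volume with slice-specific weights,
    rather than sharing one kernel across all slices (a toy A3D-style sketch)."""

    def __init__(self, num_slices):
        super().__init__()
        # (D, D) mixing matrix: output slice i gets its own weighted sum of all slices
        self.mix = nn.Parameter(torch.eye(num_slices) + 0.01 * torch.randn(num_slices, num_slices))

    def forward(self, x):
        # x: (N, C, D, H, W) stack of per-slice 2D feature maps
        # einsum mixes along D with a different weight vector per output slice
        return torch.einsum("od,ncdhw->ncohw", self.mix, x)

x = torch.randn(1, 16, 12, 64, 64)            # 12 slices of 2D features
print(AsymmetricDepthFusion(12)(x).shape)     # torch.Size([1, 16, 12, 64, 64])
```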
- FCOS3D: Fully Convolutional One-Stage Monocular 3D Object Detection [78.00922683083776]
It is non-trivial to make a general 2D detector work in this 3D task.
In this technical report, we study this problem with a practical solution built on a fully convolutional single-stage detector.
Our solution achieves 1st place among all vision-only methods in the NeurIPS 2020 nuScenes 3D detection challenge.
arXiv Detail & Related papers (2021-04-22T09:35:35Z)