Flow4D: Leveraging 4D Voxel Network for LiDAR Scene Flow Estimation
- URL: http://arxiv.org/abs/2407.07995v1
- Date: Wed, 10 Jul 2024 18:55:43 GMT
- Title: Flow4D: Leveraging 4D Voxel Network for LiDAR Scene Flow Estimation
- Authors: Jaeyeul Kim, Jungwan Woo, Ukcheol Shin, Jean Oh, Sunghoon Im
- Abstract summary: Flow4D temporally fuses multiple point clouds after the 3D intra-voxel feature encoder.
The Spatio-Temporal Decomposition Block (STDB) combines 3D and 1D convolutions instead of using heavy 4D convolutions.
Flow4D achieves a 45.9% higher performance compared to the state-of-the-art while running in real-time.
- Score: 20.904903264632733
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding the motion states of the surrounding environment is critical for safe autonomous driving. These motion states can be accurately derived from scene flow, which captures the three-dimensional motion field of points. Existing LiDAR scene flow methods extract spatial features from each point cloud and then fuse them channel-wise, resulting in the implicit extraction of spatio-temporal features. Furthermore, they utilize 2D Bird's Eye View and process only two frames, missing crucial spatial information along the Z-axis and the broader temporal context, leading to suboptimal performance. To address these limitations, we propose Flow4D, which temporally fuses multiple point clouds after the 3D intra-voxel feature encoder, enabling more explicit extraction of spatio-temporal features through a 4D voxel network. However, while using 4D convolution improves performance, it significantly increases the computational load. For further efficiency, we introduce the Spatio-Temporal Decomposition Block (STDB), which combines 3D and 1D convolutions instead of using heavy 4D convolution. In addition, Flow4D further improves performance by using five frames to take advantage of richer temporal information. As a result, the proposed method achieves a 45.9% higher performance compared to the state-of-the-art while running in real-time, and won 1st place in the 2024 Argoverse 2 Scene Flow Challenge. The code is available at https://github.com/dgist-cvlab/Flow4D.
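The paper's efficiency idea, replacing a heavy 4D convolution over (T, X, Y, Z) with a 3D convolution over the spatial axes plus a 1D convolution over the temporal axis, can be pictured with a short PyTorch-style sketch. This is not the authors' released implementation (which operates on sparse voxel features); the dense-tensor layout, channel counts, and residual fusion below are illustrative assumptions.

```python
import torch
import torch.nn as nn


class STDBSketch(nn.Module):
    """Toy spatio-temporal decomposition: 3D spatial conv + 1D temporal conv."""

    def __init__(self, channels: int, k_spatial: int = 3, k_temporal: int = 3):
        super().__init__()
        # 3D convolution over the spatial axes (X, Y, Z), shared across all frames.
        self.spatial = nn.Conv3d(channels, channels, k_spatial, padding=k_spatial // 2)
        # 1D convolution over the temporal axis T, shared across all voxels.
        self.temporal = nn.Conv1d(channels, channels, k_temporal, padding=k_temporal // 2)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, X, Y, Z) -- a batched, dense 4D voxel feature volume.
        b, c, t, dx, dy, dz = x.shape

        # Spatial branch: fold T into the batch so the 3D conv runs per frame.
        s = x.permute(0, 2, 1, 3, 4, 5).reshape(b * t, c, dx, dy, dz)
        s = self.spatial(s)
        s = s.reshape(b, t, c, dx, dy, dz).permute(0, 2, 1, 3, 4, 5)

        # Temporal branch: fold X, Y, Z into the batch so the 1D conv runs over T per voxel.
        v = x.permute(0, 3, 4, 5, 1, 2).reshape(b * dx * dy * dz, c, t)
        v = self.temporal(v)
        v = v.reshape(b, dx, dy, dz, c, t).permute(0, 4, 5, 1, 2, 3)

        # Fuse the two branches with a residual connection instead of one 4D convolution.
        return self.act(x + s + v)
```

Factorizing the kernel this way costs roughly k^3 + k weights per channel pair instead of k^4, which is why stacking such blocks over a five-frame voxel volume can remain real-time while still mixing information across frames.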
Related papers
- S4D: Streaming 4D Real-World Reconstruction with Gaussians and 3D Control Points [30.46796069720543]
We introduce a novel approach for streaming 4D real-world reconstruction utilizing discrete 3D control points.
This method physically models local rays and establishes a motion-decoupling coordinate system.
By effectively merging traditional graphics with learnable pipelines, it provides a robust and efficient local 6-degrees-of-freedom (6 DoF) motion representation.
arXiv Detail & Related papers (2024-08-23T12:51:49Z)
- MAMBA4D: Efficient Long-Sequence Point Cloud Video Understanding with Disentangled Spatial-Temporal State Space Models [14.024240637175216]
We propose a novel 4D point cloud video understanding backbone based on the recently advanced State Space Models (SSMs).
Specifically, our backbone begins by disentangling space and time in raw 4D geometries, and then establishing semantic-temporal videos.
Our method achieves an 87.5% memory reduction, a 5.36x speedup, and much higher accuracy (up to +104%) compared with transformer-based counterparts on MS3D.
arXiv Detail & Related papers (2024-05-23T09:08:09Z)
- DeFlow: Decoder of Scene Flow Network in Autonomous Driving [19.486167661795797]
Scene flow estimation determines a scene's 3D motion field by predicting the motion of points in the scene.
Many networks that take large-scale point clouds as input use voxelization to create a pseudo-image and run in real time.
Our paper introduces DeFlow, which enables a transition from voxel-based features to point features using Gated Recurrent Unit (GRU) refinement.
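The voxel-to-point idea can be pictured roughly as: gather the coarse voxel feature at each point's location, then iteratively refine a per-point flow with a GRU. The sketch below is a hypothetical illustration of that pattern; the feature dimensions, iteration count, and residual flow head are assumptions, not DeFlow's actual decoder.

```python
import torch
import torch.nn as nn


class VoxelToPointGRURefiner(nn.Module):
    """Hypothetical GRU refinement from gathered voxel features to per-point flow."""

    def __init__(self, feat_dim: int = 64, hidden_dim: int = 64, iters: int = 4):
        super().__init__()
        self.iters = iters
        self.gru = nn.GRUCell(input_size=feat_dim + 3, hidden_size=hidden_dim)
        self.flow_head = nn.Linear(hidden_dim, 3)  # predicts a residual 3D flow update

    def forward(self, point_feats: torch.Tensor, init_flow: torch.Tensor) -> torch.Tensor:
        # point_feats: (N, feat_dim) -- voxel features gathered at each point's voxel index.
        # init_flow:   (N, 3)        -- coarse per-point flow copied from the voxel grid.
        flow = init_flow
        hidden = torch.zeros(point_feats.size(0), self.gru.hidden_size, device=point_feats.device)
        for _ in range(self.iters):
            # Condition the recurrent update on the point's feature and its current flow estimate.
            hidden = self.gru(torch.cat([point_feats, flow], dim=-1), hidden)
            flow = flow + self.flow_head(hidden)  # residual refinement at every iteration
        return flow
```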
arXiv Detail & Related papers (2024-01-29T12:47:55Z)
- Motion2VecSets: 4D Latent Vector Set Diffusion for Non-rigid Shape Reconstruction and Tracking [52.393359791978035]
Motion2VecSets is a 4D diffusion model for dynamic surface reconstruction from point cloud sequences.
We parameterize 4D dynamics with latent sets instead of using global latent codes.
For more temporally-coherent object tracking, we synchronously denoise deformation latent sets and exchange information across multiple frames.
arXiv Detail & Related papers (2024-01-12T15:05:08Z)
- 4DGen: Grounded 4D Content Generation with Spatial-temporal Consistency [118.15258850780417]
This work introduces 4DGen, a novel framework for grounded 4D content creation.
We identify static 3D assets and monocular video sequences as key components in constructing the 4D content.
Our pipeline facilitates conditional 4D generation, enabling users to specify geometry (3D assets) and motion (monocular videos).
arXiv Detail & Related papers (2023-12-28T18:53:39Z)
- X4D-SceneFormer: Enhanced Scene Understanding on 4D Point Cloud Videos through Cross-modal Knowledge Transfer [28.719098240737605]
We propose a novel cross-modal knowledge transfer framework, called X4D-SceneFormer.
It enhances 4D-Scene understanding by transferring texture priors from RGB sequences using a Transformer architecture with temporal relationship mining.
Experiments demonstrate the superior performance of our framework on various 4D point cloud video understanding tasks.
arXiv Detail & Related papers (2023-12-12T15:48:12Z)
- NeRFPlayer: A Streamable Dynamic Scene Representation with Decomposed Neural Radiance Fields [99.57774680640581]
We present an efficient framework capable of fast reconstruction, compact modeling, and streamable rendering.
We propose to decompose the 4D space according to temporal characteristics. Points in the 4D space are associated with probabilities belonging to three categories: static, deforming, and new areas.
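A toy reading of that decomposition: predict a per-point probability over the three categories and use it to gate three specialized representations. The sketch below only illustrates this gating pattern; the MLPs, dimensions, and blending scheme are assumptions rather than NeRFPlayer's actual architecture.

```python
import torch
import torch.nn as nn


class DecomposedField(nn.Module):
    """Toy probability-gated mixture of static / deforming / new-area fields."""

    def __init__(self, in_dim: int = 4, hidden: int = 64, out_dim: int = 4):
        super().__init__()
        # One small MLP per category: static, deforming, and newly appearing areas.
        self.fields = nn.ModuleList([
            nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))
            for _ in range(3)
        ])
        # Decomposition head: per-point probabilities over the three categories.
        self.decomp = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, 3))

    def forward(self, xyzt: torch.Tensor) -> torch.Tensor:
        # xyzt: (N, 4) -- sampled 4D locations (x, y, z, t).
        probs = torch.softmax(self.decomp(xyzt), dim=-1)            # (N, 3)
        outs = torch.stack([f(xyzt) for f in self.fields], dim=1)   # (N, 3, out_dim)
        # Probability-weighted mixture of the three per-category outputs.
        return (probs.unsqueeze(-1) * outs).sum(dim=1)              # (N, out_dim)
```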
arXiv Detail & Related papers (2022-10-28T07:11:05Z)
- Learning Spatial and Temporal Variations for 4D Point Cloud Segmentation [0.39373541926236766]
We argue that the temporal information across the frames provides crucial knowledge for 3D scene perception.
We design a temporal variation-aware module and a temporal voxel-point refiner to capture the temporal variation in the 4D point cloud.
arXiv Detail & Related papers (2022-07-11T07:36:26Z)
- DS-Net: Dynamic Spatiotemporal Network for Video Salient Object Detection [78.04869214450963]
We propose a novel dynamic spatiotemporal network (DS-Net) for more effective fusion of spatial and temporal information.
We show that the proposed method achieves superior performance compared to state-of-the-art algorithms.
arXiv Detail & Related papers (2020-12-09T06:42:30Z)
- A Real-time Action Representation with Temporal Encoding and Deep Compression [115.3739774920845]
We propose a new real-time convolutional architecture, called Temporal Convolutional 3D Network (T-C3D), for action representation.
T-C3D learns video action representations in a hierarchical multi-granularity manner while obtaining a high process speed.
Our method achieves clear improvements over state-of-the-art real-time methods on the UCF101 action recognition benchmark: 5.4% higher accuracy and 2 times faster inference, with a model requiring less than 5MB of storage.
arXiv Detail & Related papers (2020-06-17T06:30:43Z)
- V4D: 4D Convolutional Neural Networks for Video-level Representation Learning [58.548331848942865]
Most 3D CNNs for video representation learning are clip-based, and thus do not consider the video-level temporal evolution of features.
We propose Video-level 4D Convolutional Neural Networks, or V4D, to model long-range representations with 4D convolutions.
V4D achieves excellent results, surpassing recent 3D CNNs by a large margin.
arXiv Detail & Related papers (2020-02-18T09:27:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences.