X4D-SceneFormer: Enhanced Scene Understanding on 4D Point Cloud Videos
through Cross-modal Knowledge Transfer
- URL: http://arxiv.org/abs/2312.07378v1
- Date: Tue, 12 Dec 2023 15:48:12 GMT
- Title: X4D-SceneFormer: Enhanced Scene Understanding on 4D Point Cloud Videos
through Cross-modal Knowledge Transfer
- Authors: Linglin Jing, Ying Xue, Xu Yan, Chaoda Zheng, Dong Wang, Ruimao Zhang,
Zhigang Wang, Hui Fang, Bin Zhao, Zhen Li
- Abstract summary: We propose a novel cross-modal knowledge transfer framework, called X4D-SceneFormer.
It enhances 4D-Scene understanding by transferring texture priors from RGB sequences using a Transformer architecture with temporal relationship mining.
Experiments demonstrate the superior performance of our framework on various 4D point cloud video understanding tasks.
- Score: 28.719098240737605
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The field of 4D point cloud understanding is rapidly developing with the goal
of analyzing dynamic 3D point cloud sequences. However, it remains a
challenging task due to the sparsity and lack of texture in point clouds.
Moreover, the irregularity of point cloud poses a difficulty in aligning
temporal information within video sequences. To address these issues, we
propose a novel cross-modal knowledge transfer framework, called
X4D-SceneFormer. This framework enhances 4D-Scene understanding by transferring
texture priors from RGB sequences using a Transformer architecture with
temporal relationship mining. Specifically, the framework is designed with a
dual-branch architecture, consisting of a 4D point cloud transformer and a
Gradient-aware Image Transformer (GIT). During training, we employ multiple
knowledge transfer techniques, including temporal consistency losses and masked
self-attention, to strengthen the knowledge transfer between modalities. This
leads to enhanced performance during inference using single-modal 4D point
cloud inputs. Extensive experiments demonstrate the superior performance of our
framework on various 4D point cloud video understanding tasks, including action
recognition, action segmentation, and semantic segmentation. Our method ranks
1st on the HOI4D challenge (http://www.hoi4d.top/), achieving 85.3% (+7.9%)
accuracy for 4D action segmentation and 47.3% (+5.0%) mIoU for 4D semantic
segmentation, outperforming the previous state of the art by a large margin.
We release the code at
https://github.com/jinglinglingling/X4D
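- Illustrative sketch (not the released X4D-SceneFormer code): the snippet below only conveys the general idea described in the abstract, i.e., a point-cloud branch trained alongside an RGB image branch with a cross-modal feature-alignment loss and a simple temporal-consistency term, so that only the point-cloud branch is needed at inference. All module names, feature shapes, and loss weights are assumptions for illustration, not the authors' implementation.

# Minimal sketch of cross-modal knowledge transfer with a temporal-consistency term.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PointBranch(nn.Module):
    """Stand-in for the 4D point cloud transformer (used alone at inference)."""
    def __init__(self, in_dim=3, feat_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU(),
                                     nn.Linear(feat_dim, feat_dim))

    def forward(self, points):                   # points: (B, T, N, 3)
        return self.encoder(points).mean(dim=2)  # per-frame features: (B, T, C)

class ImageBranch(nn.Module):
    """Stand-in for the image branch (e.g. a Gradient-aware Image Transformer),
    used only during training to provide texture priors."""
    def __init__(self, in_dim=3 * 64 * 64, feat_dim=256):
        super().__init__()
        self.encoder = nn.Linear(in_dim, feat_dim)

    def forward(self, frames):                   # frames: (B, T, 3, H, W)
        return self.encoder(frames.flatten(2))   # per-frame features: (B, T, C)

def transfer_loss(point_feat, image_feat):
    # Cross-modal alignment: pull point features toward (detached) image features.
    cross_modal = F.mse_loss(point_feat, image_feat.detach())
    # Temporal consistency: encourage smooth per-frame point features.
    temporal = F.mse_loss(point_feat[:, 1:], point_feat[:, :-1])
    return cross_modal + 0.1 * temporal          # 0.1 is an arbitrary weight

# Training uses both branches; inference would use only the point branch.
points = torch.randn(2, 8, 1024, 3)              # synthetic point cloud video
frames = torch.randn(2, 8, 3, 64, 64)            # synthetic RGB sequence
loss = transfer_loss(PointBranch()(points), ImageBranch()(frames))
loss.backward()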
Related papers
- Flow4D: Leveraging 4D Voxel Network for LiDAR Scene Flow Estimation [20.904903264632733]
Flow4D temporally fuses multiple point clouds after the 3D intra-voxel feature encoder.
The Spatio-Temporal Decomposition Block (STDB) combines 3D and 1D convolutions instead of using heavy 4D convolutions.
Flow4D achieves a 45.9% higher performance compared to the state-of-the-art while running in real-time.
arXiv Detail & Related papers (2024-07-10T18:55:43Z) - GFlow: Recovering 4D World from Monocular Video [58.63051670458107]
We introduce GFlow, a framework that lifts a video (3D) to a 4D explicit representation, entailing a flow of Gaussian splatting through space and time.
GFlow first clusters the scene into still and moving parts, then applies a sequential optimization process.
GFlow transcends the boundaries of mere 4D reconstruction.
arXiv Detail & Related papers (2024-05-28T17:59:22Z) - EG4D: Explicit Generation of 4D Object without Score Distillation [105.63506584772331]
EG4D is a novel framework that generates high-quality and consistent 4D assets without score distillation.
Our framework outperforms the baselines in generation quality by a considerable margin.
arXiv Detail & Related papers (2024-05-28T12:47:22Z) - MAMBA4D: Efficient Long-Sequence Point Cloud Video Understanding with Disentangled Spatial-Temporal State Space Models [14.024240637175216]
We propose a novel 4D point cloud video understanding backbone based on the recently advanced State Space Models (SSMs)
Specifically, our backbone begins by disentangling space and time in raw 4D geometries, and then establishing semantic-temporal videos.
Our method has an 87.5% memory reduction, 5.36 times speedup, and much higher accuracy (up to +104%) compared with transformer-based counterparts MS3D.
arXiv Detail & Related papers (2024-05-23T09:08:09Z) - VG4D: Vision-Language Model Goes 4D Video Recognition [34.98194339741201]
Vision-Language Models (VLM) pre-trained on web-scale text-image datasets can learn fine-grained visual concepts.
We propose the Vision-Language Models Goes 4D (VG4D) framework to transfer VLM knowledge from visual-text pre-trained models to a 4D point cloud network.
arXiv Detail & Related papers (2024-04-17T17:54:49Z) - Comp4D: LLM-Guided Compositional 4D Scene Generation [65.5810466788355]
We present Comp4D, a novel framework for Compositional 4D Generation.
Unlike conventional methods that generate a singular 4D representation of the entire scene, Comp4D innovatively constructs each 4D object within the scene separately.
Our method employs a compositional score distillation technique guided by the pre-defined trajectories.
arXiv Detail & Related papers (2024-03-25T17:55:52Z) - 4DGen: Grounded 4D Content Generation with Spatial-temporal Consistency [118.15258850780417]
This work introduces 4DGen, a novel framework for grounded 4D content creation.
We identify static 3D assets and monocular video sequences as key components in constructing the 4D content.
Our pipeline facilitates conditional 4D generation, enabling users to specify geometry (3D assets) and motion (monocular videos)
arXiv Detail & Related papers (2023-12-28T18:53:39Z) - Make-It-4D: Synthesizing a Consistent Long-Term Dynamic Scene Video from
a Single Image [59.18564636990079]
We study the problem of synthesizing a long-term dynamic video from only a single image.
Existing methods either hallucinate inconsistent perpetual views or struggle with long camera trajectories.
We present Make-It-4D, a novel method that can generate a consistent long-term dynamic video from a single image.
arXiv Detail & Related papers (2023-08-20T12:53:50Z) - Complete-to-Partial 4D Distillation for Self-Supervised Point Cloud
Sequence Representation Learning [14.033085586047799]
This paper proposes a new 4D self-supervised pre-training method called Complete-to-Partial 4D Distillation.
Our key idea is to formulate 4D self-supervised representation learning as a teacher-student knowledge distillation framework.
Experiments show that this approach significantly outperforms previous pre-training approaches on a wide range of 4D point cloud sequence understanding tasks.
arXiv Detail & Related papers (2022-12-10T16:26:19Z) - X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D
Dense Captioning [71.36623596807122]
3D dense captioning aims to describe individual objects by natural language in 3D scenes, where 3D scenes are usually represented as RGB-D scans or point clouds.
In this study, we investigate cross-modal knowledge transfer using a Transformer for 3D dense captioning, X-Trans2Cap, to effectively boost the performance of single-modal 3D captioning.
arXiv Detail & Related papers (2022-03-02T03:35:37Z)