X4D-SceneFormer: Enhanced Scene Understanding on 4D Point Cloud Videos
through Cross-modal Knowledge Transfer
- URL: http://arxiv.org/abs/2312.07378v1
- Date: Tue, 12 Dec 2023 15:48:12 GMT
- Title: X4D-SceneFormer: Enhanced Scene Understanding on 4D Point Cloud Videos
through Cross-modal Knowledge Transfer
- Authors: Linglin Jing, Ying Xue, Xu Yan, Chaoda Zheng, Dong Wang, Ruimao Zhang,
Zhigang Wang, Hui Fang, Bin Zhao, Zhen Li
- Abstract summary: We propose a novel cross-modal knowledge transfer framework, called X4D-SceneFormer.
It enhances 4D-Scene understanding by transferring texture priors from RGB sequences using a Transformer architecture with temporal relationship mining.
Experiments demonstrate the superior performance of our framework on various 4D point cloud video understanding tasks.
- Score: 28.719098240737605
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The field of 4D point cloud understanding is rapidly developing with the goal
of analyzing dynamic 3D point cloud sequences. However, it remains a
challenging task due to the sparsity and lack of texture in point clouds.
Moreover, the irregularity of point clouds makes it difficult to align
temporal information within video sequences. To address these issues, we
propose a novel cross-modal knowledge transfer framework, called
X4D-SceneFormer. This framework enhances 4D-Scene understanding by transferring
texture priors from RGB sequences using a Transformer architecture with
temporal relationship mining. Specifically, the framework is designed with a
dual-branch architecture, consisting of a 4D point cloud transformer and a
Gradient-aware Image Transformer (GIT). During training, we employ multiple
knowledge transfer techniques, including temporal consistency losses and masked
self-attention, to strengthen the knowledge transfer between modalities. This
leads to enhanced performance during inference using single-modal 4D point
cloud inputs. Extensive experiments demonstrate the superior performance of our
framework on various 4D point cloud video understanding tasks, including action
recognition, action segmentation, and semantic segmentation. Our results rank
1st on the HOI4D challenge (http://www.hoi4d.top/), with 85.3% (+7.9%) accuracy
for 4D action segmentation and 47.3% (+5.0%) mIoU for 4D semantic
segmentation, outperforming previous
state-of-the-art by a large margin. We release the code at
https://github.com/jinglinglingling/X4D
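
The following is a minimal, self-contained PyTorch sketch of the cross-modal training idea described in the abstract: an RGB branch provides texture-rich per-frame features that supervise a point cloud branch through a feature-alignment loss plus a simple temporal consistency term, so that only the point cloud branch is needed at inference. The module names (PointBranch, ImageBranch), shapes, loss weights, and the exact form of the temporal term are illustrative assumptions, not the released X4D-SceneFormer implementation; the paper's masked self-attention transfer is omitted for brevity.

# Sketch of dual-branch cross-modal knowledge transfer (assumptions noted above).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PointBranch(nn.Module):
    """Simplified stand-in for the 4D point cloud transformer (hypothetical)."""
    def __init__(self, in_dim=3, feat_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                 nn.Linear(128, feat_dim))

    def forward(self, pts):               # pts: (B, T, N, 3)
        f = self.mlp(pts)                 # per-point features
        return f.mean(dim=2)              # (B, T, feat_dim) per-frame features

class ImageBranch(nn.Module):
    """Simplified CNN stand-in for the RGB branch (the paper uses a
    Gradient-aware Image Transformer); hypothetical."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.proj = nn.Linear(32, feat_dim)

    def forward(self, imgs):              # imgs: (B, T, 3, H, W)
        B, T = imgs.shape[:2]
        x = self.conv(imgs.flatten(0, 1)).flatten(1)   # (B*T, 32)
        return self.proj(x).view(B, T, -1)             # (B, T, feat_dim)

def cross_modal_losses(pc_feat, img_feat):
    """Feature alignment plus a simple temporal consistency term.

    The temporal term encourages frame-to-frame feature changes in the point
    cloud branch to match those of the RGB branch (one plausible reading of
    the paper's 'temporal consistency losses')."""
    align = F.mse_loss(pc_feat, img_feat.detach())
    d_pc = pc_feat[:, 1:] - pc_feat[:, :-1]
    d_img = img_feat[:, 1:] - img_feat[:, :-1]
    temporal = F.mse_loss(d_pc, d_img.detach())
    return align + 0.5 * temporal         # loss weight is an arbitrary assumption

if __name__ == "__main__":
    pts = torch.randn(2, 4, 1024, 3)      # (batch, frames, points, xyz)
    imgs = torch.randn(2, 4, 3, 64, 64)   # paired RGB frames
    loss = cross_modal_losses(PointBranch()(pts), ImageBranch()(imgs))
    loss.backward()                       # only the point branch is kept at test time
    print(float(loss))

At inference, the ImageBranch and both losses would be dropped entirely, and predictions would come from the point cloud branch alone, matching the single-modal inference setting described above.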
Related papers
- Tex4D: Zero-shot 4D Scene Texturing with Video Diffusion Models [54.35214051961381]
3D meshes are widely used in computer vision and graphics for their efficiency in animation and minimal memory use in movies, games, AR, and VR.
However, creating temporally consistent and realistic textures for meshes remains labor-intensive for professional artists.
We present Tex4D, which integrates inherent geometry from mesh sequences with video diffusion models to produce consistent textures.
arXiv Detail & Related papers (2024-10-14T17:59:59Z) - Trans4D: Realistic Geometry-Aware Transition for Compositional Text-to-4D Synthesis [60.853577108780414]
Existing 4D generation methods can generate high-quality 4D objects or scenes based on user-friendly conditions.
We propose Trans4D, a novel text-to-4D synthesis framework that enables realistic complex scene transitions.
In experiments, Trans4D consistently outperforms existing state-of-the-art methods in generating 4D scenes with accurate and high-quality transitions.
arXiv Detail & Related papers (2024-10-09T17:56:03Z) - CT4D: Consistent Text-to-4D Generation with Animatable Meshes [53.897244823604346]
We present a novel framework, coined CT4D, which directly operates on animatable meshes for generating consistent 4D content from arbitrary user-supplied prompts.
Our framework incorporates a unique Generate-Refine-Animate (GRA) algorithm to enhance the creation of text-aligned meshes.
Our experimental results, both qualitative and quantitative, demonstrate that our CT4D framework surpasses existing text-to-4D techniques in maintaining interframe consistency and preserving global geometry.
arXiv Detail & Related papers (2024-08-15T14:41:34Z) - Flow4D: Leveraging 4D Voxel Network for LiDAR Scene Flow Estimation [20.904903264632733]
Flow4D temporally fuses multiple point clouds after the 3D intra-voxel feature encoder.
The Spatio-Temporal Decomposition Block (STDB) combines 3D and 1D convolutions instead of using heavy 4D convolutions.
Flow4D achieves 45.9% higher performance than the state of the art while running in real time.
arXiv Detail & Related papers (2024-07-10T18:55:43Z) - VG4D: Vision-Language Model Goes 4D Video Recognition [34.98194339741201]
Vision-Language Models (VLMs) pre-trained on web-scale text-image datasets can learn fine-grained visual concepts.
We propose the Vision-Language Models Goes 4D (VG4D) framework to transfer VLM knowledge from visual-text pre-trained models to a 4D point cloud network.
arXiv Detail & Related papers (2024-04-17T17:54:49Z) - Comp4D: LLM-Guided Compositional 4D Scene Generation [65.5810466788355]
We present Comp4D, a novel framework for Compositional 4D Generation.
Unlike conventional methods that generate a singular 4D representation of the entire scene, Comp4D innovatively constructs each 4D object within the scene separately.
Our method employs a compositional score distillation technique guided by the pre-defined trajectories.
arXiv Detail & Related papers (2024-03-25T17:55:52Z) - 4DGen: Grounded 4D Content Generation with Spatial-temporal Consistency [118.15258850780417]
This work introduces 4DGen, a novel framework for grounded 4D content creation.
We identify static 3D assets and monocular video sequences as key components in constructing the 4D content.
Our pipeline facilitates conditional 4D generation, enabling users to specify geometry (3D assets) and motion (monocular videos).
arXiv Detail & Related papers (2023-12-28T18:53:39Z) - Complete-to-Partial 4D Distillation for Self-Supervised Point Cloud
Sequence Representation Learning [14.033085586047799]
This paper proposes a new 4D self-supervised pre-training method called Complete-to-Partial 4D Distillation.
Our key idea is to formulate 4D self-supervised representation learning as a teacher-student knowledge distillation framework.
Experiments show that this approach significantly outperforms previous pre-training approaches on a wide range of 4D point cloud sequence understanding tasks.
arXiv Detail & Related papers (2022-12-10T16:26:19Z) - X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D
Dense Captioning [71.36623596807122]
3D dense captioning aims to describe individual objects in 3D scenes with natural language, where 3D scenes are usually represented as RGB-D scans or point clouds.
In this study, we investigate cross-modal knowledge transfer using a Transformer for 3D dense captioning, X-Trans2Cap, to effectively boost the performance of single-modal 3D captioning.
arXiv Detail & Related papers (2022-03-02T03:35:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.