X4D-SceneFormer: Enhanced Scene Understanding on 4D Point Cloud Videos
through Cross-modal Knowledge Transfer
- URL: http://arxiv.org/abs/2312.07378v1
- Date: Tue, 12 Dec 2023 15:48:12 GMT
- Title: X4D-SceneFormer: Enhanced Scene Understanding on 4D Point Cloud Videos
through Cross-modal Knowledge Transfer
- Authors: Linglin Jing, Ying Xue, Xu Yan, Chaoda Zheng, Dong Wang, Ruimao Zhang,
Zhigang Wang, Hui Fang, Bin Zhao, Zhen Li
- Abstract summary: We propose a novel cross-modal knowledge transfer framework, called X4D-SceneFormer.
It enhances 4D-Scene understanding by transferring texture priors from RGB sequences using a Transformer architecture with temporal relationship mining.
Experiments demonstrate the superior performance of our framework on various 4D point cloud video understanding tasks.
- Score: 28.719098240737605
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The field of 4D point cloud understanding is rapidly developing with the goal
of analyzing dynamic 3D point cloud sequences. However, it remains a
challenging task due to the sparsity and lack of texture in point clouds.
Moreover, the irregularity of point cloud poses a difficulty in aligning
temporal information within video sequences. To address these issues, we
propose a novel cross-modal knowledge transfer framework, called
X4D-SceneFormer. This framework enhances 4D-Scene understanding by transferring
texture priors from RGB sequences using a Transformer architecture with
temporal relationship mining. Specifically, the framework is designed with a
dual-branch architecture, consisting of a 4D point cloud transformer and a
Gradient-aware Image Transformer (GIT). During training, we employ multiple
knowledge transfer techniques, including temporal consistency losses and masked
self-attention, to strengthen the knowledge transfer between modalities. This
leads to enhanced performance during inference using single-modal 4D point
cloud inputs. Extensive experiments demonstrate the superior performance of our
framework on various 4D point cloud video understanding tasks, including action
recognition, action segmentation, and semantic segmentation. Our method ranks
1st on the HOI4D challenge (http://www.hoi4d.top/), achieving 85.3% (+7.9%)
accuracy for 4D action segmentation and 47.3% (+5.0%) mIoU for 4D semantic
segmentation, outperforming the previous state of the art by a large margin.
We release the code at
https://github.com/jinglinglingling/X4D
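- Illustrative sketch (not the released X4D-SceneFormer code): the snippet below only conveys the general idea described in the abstract, i.e., a point-cloud branch trained alongside an RGB image branch with a cross-modal feature-alignment loss and a simple temporal-consistency term, so that only the point-cloud branch is needed at inference. All module names, feature shapes, and loss weights are assumptions for illustration, not the authors' implementation.

# Minimal sketch of cross-modal knowledge transfer with a temporal-consistency term.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PointBranch(nn.Module):
    """Stand-in for the 4D point cloud transformer (used alone at inference)."""
    def __init__(self, in_dim=3, feat_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU(),
                                     nn.Linear(feat_dim, feat_dim))

    def forward(self, points):                   # points: (B, T, N, 3)
        return self.encoder(points).mean(dim=2)  # per-frame features: (B, T, C)

class ImageBranch(nn.Module):
    """Stand-in for the image branch (e.g. a Gradient-aware Image Transformer),
    used only during training to provide texture priors."""
    def __init__(self, in_dim=3 * 64 * 64, feat_dim=256):
        super().__init__()
        self.encoder = nn.Linear(in_dim, feat_dim)

    def forward(self, frames):                   # frames: (B, T, 3, H, W)
        return self.encoder(frames.flatten(2))   # per-frame features: (B, T, C)

def transfer_loss(point_feat, image_feat):
    # Cross-modal alignment: pull point features toward (detached) image features.
    cross_modal = F.mse_loss(point_feat, image_feat.detach())
    # Temporal consistency: encourage smooth per-frame point features.
    temporal = F.mse_loss(point_feat[:, 1:], point_feat[:, :-1])
    return cross_modal + 0.1 * temporal          # 0.1 is an arbitrary weight

# Training uses both branches; inference would use only the point branch.
points = torch.randn(2, 8, 1024, 3)              # synthetic point cloud video
frames = torch.randn(2, 8, 3, 64, 64)            # synthetic RGB sequence
loss = transfer_loss(PointBranch()(points), ImageBranch()(frames))
loss.backward()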
Related papers
- Flow4D: Leveraging 4D Voxel Network for LiDAR Scene Flow Estimation [20.904903264632733]
Flow4D temporally fuses multiple point clouds after the 3D intra-voxel feature encoder.
The Spatio-Temporal Decomposition Block (STDB) combines 3D and 1D convolutions instead of using heavy 4D convolutions.
Flow4D achieves a 45.9% higher performance compared to the state-of-the-art while running in real-time.
arXiv Detail & Related papers (2024-07-10T18:55:43Z) - GFlow: Recovering 4D World from Monocular Video [58.63051670458107]
We introduce GFlow, a framework that lifts a video (3D) to a 4D explicit representation, entailing a flow of Gaussian splatting through space and time.
GFlow first clusters the scene into still and moving parts, then applies a sequential optimization process.
GFlow transcends the boundaries of mere 4D reconstruction.
arXiv Detail & Related papers (2024-05-28T17:59:22Z) - EG4D: Explicit Generation of 4D Object without Score Distillation [105.63506584772331]
EG4D is a novel framework that generates high-quality and consistent 4D assets without score distillation.
Our framework outperforms the baselines in generation quality by a considerable margin.
arXiv Detail & Related papers (2024-05-28T12:47:22Z) - MAMBA4D: Efficient Long-Sequence Point Cloud Video Understanding with Disentangled Spatial-Temporal State Space Models [14.024240637175216]
We propose a novel 4D point cloud video understanding backbone based on the recently advanced State Space Models (SSMs)
Specifically, our backbone begins by disentangling space and time in raw 4D geometries, and then establishing semantic-temporal videos.
Our method has an 87.5% memory reduction, 5.36 times speedup, and much higher accuracy (up to +104%) compared with transformer-based counterparts MS3D.
arXiv Detail & Related papers (2024-05-23T09:08:09Z) - VG4D: Vision-Language Model Goes 4D Video Recognition [34.98194339741201]
Vision-Language Models (VLM) pre-trained on web-scale text-image datasets can learn fine-grained visual concepts.
We propose the Vision-Language Models Goes 4D (VG4D) framework to transfer VLM knowledge from visual-text pre-trained models to a 4D point cloud network.
arXiv Detail & Related papers (2024-04-17T17:54:49Z) - Comp4D: LLM-Guided Compositional 4D Scene Generation [65.5810466788355]
We present Comp4D, a novel framework for Compositional 4D Generation.
Unlike conventional methods that generate a singular 4D representation of the entire scene, Comp4D innovatively constructs each 4D object within the scene separately.
Our method employs a compositional score distillation technique guided by the pre-defined trajectories.
arXiv Detail & Related papers (2024-03-25T17:55:52Z) - 4DGen: Grounded 4D Content Generation with Spatial-temporal Consistency [118.15258850780417]
This work introduces 4DGen, a novel framework for grounded 4D content creation.
We identify static 3D assets and monocular video sequences as key components in constructing the 4D content.
Our pipeline facilitates conditional 4D generation, enabling users to specify geometry (3D assets) and motion (monocular videos)
arXiv Detail & Related papers (2023-12-28T18:53:39Z) - Make-It-4D: Synthesizing a Consistent Long-Term Dynamic Scene Video from
a Single Image [59.18564636990079]
We study the problem of synthesizing a long-term dynamic video from only a single image.
Existing methods either hallucinate inconsistent perpetual views or struggle with long camera trajectories.
We present Make-It-4D, a novel method that can generate a consistent long-term dynamic video from a single image.
arXiv Detail & Related papers (2023-08-20T12:53:50Z) - Complete-to-Partial 4D Distillation for Self-Supervised Point Cloud
Sequence Representation Learning [14.033085586047799]
This paper proposes a new 4D self-supervised pre-training method called Complete-to-Partial 4D Distillation.
Our key idea is to formulate 4D self-supervised representation learning as a teacher-student knowledge distillation framework.
Experiments show that this approach significantly outperforms previous pre-training approaches on a wide range of 4D point cloud sequence understanding tasks.
arXiv Detail & Related papers (2022-12-10T16:26:19Z) - X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D
Dense Captioning [71.36623596807122]
3D dense captioning aims to describe individual objects by natural language in 3D scenes, where 3D scenes are usually represented as RGB-D scans or point clouds.
In this study, we investigate cross-modal knowledge transfer using a Transformer for 3D dense captioning, X-Trans2Cap, to effectively boost the performance of single-modal 3D captioning.
arXiv Detail & Related papers (2022-03-02T03:35:37Z)