VG4D: Vision-Language Model Goes 4D Video Recognition
- URL: http://arxiv.org/abs/2404.11605v1
- Date: Wed, 17 Apr 2024 17:54:49 GMT
- Title: VG4D: Vision-Language Model Goes 4D Video Recognition
- Authors: Zhichao Deng, Xiangtai Li, Xia Li, Yunhai Tong, Shen Zhao, Mengyuan Liu
- Abstract summary: Vision-Language Models (VLM) pre-trained on web-scale text-image datasets can learn fine-grained visual concepts.
We propose the Vision-Language Models Goes 4D (VG4D) framework to transfer VLM knowledge from visual-text pre-trained models to a 4D point cloud network.
- Score: 34.98194339741201
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding the real world through point cloud video is a crucial aspect of robotics and autonomous driving systems. However, prevailing methods for 4D point cloud recognition are limited by sensor resolution, which leads to a lack of detailed information. Recent advances have shown that Vision-Language Models (VLM) pre-trained on web-scale text-image datasets can learn fine-grained visual concepts that can be transferred to various downstream tasks. However, effectively integrating VLM into the domain of 4D point clouds remains an unresolved problem. In this work, we propose the Vision-Language Models Goes 4D (VG4D) framework to transfer VLM knowledge from visual-text pre-trained models to a 4D point cloud network. Our approach aligns the 4D encoder's representation with a VLM that has learned a shared visual-text space from training on large-scale image-text pairs. By transferring the knowledge of the VLM to the 4D encoder and combining it with the VLM, our VG4D achieves improved recognition performance. To enhance the 4D encoder, we modernize the classic dynamic point cloud backbone and propose an improved version of PSTNet, im-PSTNet, which can efficiently model point cloud videos. Experiments demonstrate that our method achieves state-of-the-art performance for action recognition on both the NTU RGB+D 60 and NTU RGB+D 120 datasets. Code is available at https://github.com/Shark0-0/VG4D.
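The abstract describes aligning the 4D encoder's representation with the VLM's shared visual-text space and then combining the two models for recognition. Purely as an illustration of that kind of alignment objective, and under the assumption of a CLIP-style contrastive formulation that the abstract does not spell out, the sketch below pairs a toy point cloud video encoder with frozen per-class text embeddings. All names here (Toy4DEncoder, alignment_loss, the feature sizes) are hypothetical stand-ins; the authors' actual implementation is in the linked repository.
```python
# Minimal sketch of aligning a 4D point cloud encoder with a frozen VLM's
# text embeddings via a CLIP-style contrastive loss (hypothetical stand-ins,
# not the released VG4D implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Toy4DEncoder(nn.Module):
    """Placeholder for an im-PSTNet-style encoder:
    (B, T, N, 3) point cloud video -> (B, D) pooled clip-level feature."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, dim))

    def forward(self, pts: torch.Tensor) -> torch.Tensor:
        feat = self.mlp(pts)           # (B, T, N, D) per-point features
        return feat.mean(dim=(1, 2))   # global average pool over time and points

def alignment_loss(pc_feat, text_feat, labels, temperature: float = 0.07):
    """Cross-entropy over cosine similarities between 4D features (B, D)
    and frozen per-class VLM text embeddings (C, D)."""
    pc_feat = F.normalize(pc_feat, dim=-1)
    text_feat = F.normalize(text_feat, dim=-1)
    logits = pc_feat @ text_feat.t() / temperature   # (B, C) similarity logits
    return F.cross_entropy(logits, labels)

if __name__ == "__main__":
    B, T, N, D, C = 4, 8, 256, 512, 60               # e.g. 60 action classes (NTU RGB+D 60)
    encoder = Toy4DEncoder(dim=D)
    points = torch.randn(B, T, N, 3)                 # random stand-in point cloud video
    text_emb = torch.randn(C, D)                     # stand-in for frozen VLM text features
    labels = torch.randint(0, C, (B,))
    loss = alignment_loss(encoder(points), text_emb, labels)
    print(f"alignment loss: {loss.item():.4f}")
```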
Related papers
- VLM-Grounder: A VLM Agent for Zero-Shot 3D Visual Grounding [57.04804711488706]
3D visual grounding is crucial for robots, requiring integration of natural language and 3D scene understanding.
We present VLM-Grounder, a novel framework using vision-language models (VLMs) for zero-shot 3D visual grounding based solely on 2D images.
arXiv Detail & Related papers (2024-10-17T17:59:55Z)
- How Well Can Vision Language Models See Image Details? [53.036922527685064]
We introduce a pixel value prediction task to explore "How Well Can Vision Language Models See Image Details?"
Our research reveals that incorporating pixel value prediction as one of the VLM pre-training tasks and vision encoder adaptation markedly boosts VLM performance on downstream image-language understanding tasks.
arXiv Detail & Related papers (2024-08-07T17:59:40Z)
- MIM4D: Masked Modeling with Multi-View Video for Autonomous Driving Representation Learning [38.6654451726187]
MIM4D is a novel pre-training paradigm based on dual masked image modeling (MIM).
It constructs pseudo-3D features using continuous scene flow and projects them onto a 2D plane for supervision.
It achieves state-of-the-art performance on the nuScenes dataset for visual representation learning in autonomous driving.
arXiv Detail & Related papers (2024-03-13T17:58:00Z)
- X4D-SceneFormer: Enhanced Scene Understanding on 4D Point Cloud Videos through Cross-modal Knowledge Transfer [28.719098240737605]
We propose a novel cross-modal knowledge transfer framework, called X4D-SceneFormer.
It enhances 4D-Scene understanding by transferring texture priors from RGB sequences using a Transformer architecture with temporal relationship mining.
Experiments demonstrate the superior performance of our framework on various 4D point cloud video understanding tasks.
arXiv Detail & Related papers (2023-12-12T15:48:12Z)
- Multi-view Vision-Prompt Fusion Network: Can 2D Pre-trained Model Boost 3D Point Cloud Data-scarce Learning? [38.06639044139636]
This work proposes a novel Multi-view Vision-Prompt Fusion Network (MvNet) for few-shot 3D point cloud classification.
MvNet achieves new state-of-the-art performance for 3D few-shot point cloud image classification.
arXiv Detail & Related papers (2023-04-20T11:39:41Z)
- Text-To-4D Dynamic Scene Generation [111.89517759596345]
We present MAV3D (Make-A-Video3D), a method for generating three-dimensional dynamic scenes from text descriptions.
Our approach uses a 4D dynamic Neural Radiance Field (NeRF), which is optimized for scene appearance, density, and motion consistency.
The dynamic video output generated from the provided text can be viewed from any camera location and angle, and can be composited into any 3D environment.
arXiv Detail & Related papers (2023-01-26T18:14:32Z)
- 4D-Net for Learned Multi-Modal Alignment [87.58354992455891]
We present 4D-Net, a 3D object detection approach, which utilizes 3D Point Cloud and RGB sensing information, both in time.
We are able to incorporate the 4D information by performing a novel connection learning across various feature representations and levels of abstraction, as well as by observing geometric constraints.
arXiv Detail & Related papers (2021-09-02T16:35:00Z)
- V4D: 4D Convolutional Neural Networks for Video-level Representation Learning [58.548331848942865]
Most 3D CNNs for video representation learning are clip-based, and thus do not consider the video-level temporal evolution of features.
We propose Video-level 4D Convolutional Neural Networks, or V4D, to model long-range representation with 4D convolutions; a generic sketch of a 4D convolution follows this list.
V4D achieves excellent results, surpassing recent 3D CNNs by a large margin.
arXiv Detail & Related papers (2020-02-18T09:27:41Z)
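The V4D entry above builds its video-level representations on 4D convolutions over stacked clips. As a rough illustration only, not the V4D authors' implementation, the sketch below shows one way a 4D convolution over a (batch, channels, clips, time, height, width) tensor can be assembled from per-offset 3D convolutions in PyTorch; the Conv4d module and its parameters are hypothetical.
```python
# Generic sketch of a 4D convolution over (B, C, U, T, H, W) tensors, built
# from one 3D convolution per offset along the clip dimension U.
# Illustrative only; not the V4D authors' implementation.
import torch
import torch.nn as nn

class Conv4d(nn.Module):
    def __init__(self, c_in: int, c_out: int, k_u: int = 3, k_thw: int = 3):
        super().__init__()
        self.pad_u = k_u // 2
        # One 3D kernel per offset along U (biases omitted so the sum of
        # per-offset outputs is an exact zero-bias 4D convolution).
        self.convs = nn.ModuleList(
            nn.Conv3d(c_in, c_out, k_thw, padding=k_thw // 2, bias=False)
            for _ in range(k_u)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, C, U, T, H, W = x.shape
        pad = x.new_zeros(B, C, self.pad_u, T, H, W)
        xp = torch.cat([pad, x, pad], dim=2)          # zero-pad along the clip axis
        out = 0
        for i, conv in enumerate(self.convs):
            xi = xp[:, :, i:i + U]                    # shifted clip window (B, C, U, T, H, W)
            xi = xi.permute(0, 2, 1, 3, 4, 5).reshape(B * U, C, T, H, W)
            yi = conv(xi).view(B, U, -1, T, H, W)     # (B, U, C_out, T, H, W)
            out = out + yi.permute(0, 2, 1, 3, 4, 5)  # accumulate as (B, C_out, U, T, H, W)
        return out

if __name__ == "__main__":
    x = torch.randn(2, 8, 4, 6, 16, 16)               # (B, C, U clips, T, H, W)
    y = Conv4d(8, 16)(x)
    print(y.shape)                                     # torch.Size([2, 16, 4, 6, 16, 16])
```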