LidarCLIP or: How I Learned to Talk to Point Clouds
- URL: http://arxiv.org/abs/2212.06858v3
- Date: Tue, 2 May 2023 13:53:40 GMT
- Title: LidarCLIP or: How I Learned to Talk to Point Clouds
- Authors: Georg Hess, Adam Tonderski, Christoffer Petersson, Kalle Åström,
Lennart Svensson
- Abstract summary: LidarCLIP is a mapping from automotive point clouds to a pre-existing CLIP embedding space.
We show the effectiveness of LidarCLIP by demonstrating that lidar-based retrieval is on par with image-based retrieval.
We also explore zero-shot classification and show that LidarCLIP outperforms existing attempts to use CLIP for point clouds by a large margin.
- Score: 3.0623865942628594
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Research connecting text and images has recently seen several breakthroughs,
with models like CLIP, DALL-E 2, and Stable Diffusion. However, the connection
between text and other visual modalities, such as lidar data, has received less
attention, prohibited by the lack of text-lidar datasets. In this work, we
propose LidarCLIP, a mapping from automotive point clouds to a pre-existing
CLIP embedding space. Using image-lidar pairs, we supervise a point cloud
encoder with the image CLIP embeddings, effectively relating text and lidar
data with the image domain as an intermediary. We show the effectiveness of
LidarCLIP by demonstrating that lidar-based retrieval is generally on par with
image-based retrieval, but with complementary strengths and weaknesses. By
combining image and lidar features, we improve upon both single-modality
methods and enable a targeted search for challenging detection scenarios under
adverse sensor conditions. We also explore zero-shot classification and show
that LidarCLIP outperforms existing attempts to use CLIP for point clouds by a
large margin. Finally, we leverage our compatibility with CLIP to explore a
range of applications, such as point cloud captioning and lidar-to-image
generation, without any additional training. Code and pre-trained models are
available at https://github.com/atonderski/lidarclip.
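The supervision scheme described in the abstract (training a point cloud encoder to match frozen CLIP image embeddings on paired image-lidar data) can be illustrated with a minimal sketch. The code below is a hypothetical PyTorch illustration, not the authors' implementation: the placeholder PointCloudEncoder, the cosine-similarity loss, and the training-step interface are assumptions, and the actual architecture and training details are in the paper and the linked repository.

```python
# Hypothetical sketch of a LidarCLIP-style training objective:
# a point cloud encoder is trained so that its embedding of a lidar
# sweep matches the frozen CLIP embedding of the paired camera image.
# The encoder, loss choice, and step interface are illustrative
# assumptions, not the authors' implementation.
import torch
import torch.nn as nn
import clip  # OpenAI CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
clip_model.eval()  # CLIP stays frozen; only the lidar encoder is trained


class PointCloudEncoder(nn.Module):
    """Placeholder lidar encoder mapping (B, N, 4) point clouds
    (x, y, z, intensity) into the CLIP embedding space."""

    def __init__(self, embed_dim: int = 512):
        super().__init__()
        self.point_mlp = nn.Sequential(
            nn.Linear(4, 256), nn.ReLU(), nn.Linear(256, 512)
        )
        self.head = nn.Linear(512, embed_dim)

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        feats = self.point_mlp(points)      # per-point features
        pooled = feats.max(dim=1).values    # global max-pool over points
        return self.head(pooled)            # (B, embed_dim)


lidar_encoder = PointCloudEncoder().to(device)
optimizer = torch.optim.AdamW(lidar_encoder.parameters(), lr=1e-4)


def training_step(images: torch.Tensor, points: torch.Tensor) -> torch.Tensor:
    """One step on a batch of paired (preprocessed) camera images and lidar point clouds."""
    with torch.no_grad():
        target = clip_model.encode_image(images.to(device)).float()
    pred = lidar_encoder(points.to(device))
    # Pull the lidar embedding toward the frozen image embedding;
    # a cosine-similarity loss is one natural choice.
    loss = 1.0 - nn.functional.cosine_similarity(pred, target, dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss
```

Once such an encoder is trained, text-to-lidar retrieval works exactly as in CLIP: a query is embedded with clip_model.encode_text and lidar embeddings are ranked by cosine similarity.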
Related papers
- Is CLIP ideal? No. Can we fix it? Yes! [30.71718499767702]
Contrastive Language-Image Pre-Training is a popular method for learning multimodal latent spaces with well-organized semantics.
Despite its wide range of applications, CLIP's latent space is known to fail at handling complex visual-textual interactions.
We propose Dense Cosine Similarity Maps (DCSMs) as a principled and interpretable scoring method for CLIP-like models.
arXiv Detail & Related papers (2025-03-10T23:42:04Z) - Self-Calibrated CLIP for Training-Free Open-Vocabulary Segmentation [19.749490092520006]
Self-Calibrated CLIP (SC-CLIP) is a training-free method that calibrates CLIP to produce finer-grained representations.
SC-CLIP boosts the performance of vanilla CLIP ViT-L/14 by 6.8 times.
arXiv Detail & Related papers (2024-11-24T15:14:05Z) - TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives [65.82577305915643]
Contrastive Language-Image Pretraining (CLIP) models maximize the mutual information between text and visual modalities to learn representations.
We show that generating "hard" negative captions via in-context learning and corresponding negative images with text-to-image generators offers a solution.
We demonstrate that our method, named TripletCLIP, enhances the compositional capabilities of CLIP, resulting in an absolute improvement of over 9% on the SugarCrepe benchmark.
arXiv Detail & Related papers (2024-11-04T19:24:59Z) - CLIP-based Point Cloud Classification via Point Cloud to Image Translation [19.836264118079573]
The Contrastive Vision-Language Pre-training (CLIP) based point cloud classification model PointCLIP has opened a new direction in point cloud classification research.
We propose a Pretrained Point Cloud to Image Translation Network (PPCITNet) that produces generalized colored images along with additional salient visual cues to the point cloud depth maps.
arXiv Detail & Related papers (2024-08-07T04:50:05Z) - Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification [54.96876797812238]
We present a novel CrOss-moDal nEighbor Representation (CODER) based on the distance structure between images and their neighbor texts.
The key to construct a high-quality CODER lies in how to create a vast amount of high-quality and diverse texts to match with images.
Experiment results across various datasets and models confirm CODER's effectiveness.
arXiv Detail & Related papers (2024-04-27T02:04:36Z) - Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z) - CLIP2Point: Transfer CLIP to Point Cloud Classification with Image-Depth Pre-training [121.46758260964114]
Pre-training across 3D vision and language remains under development because of limited training data.
Recent works attempt to transfer vision-language pre-training models to 3D vision.
PointCLIP converts point cloud data to multi-view depth maps, adopting CLIP for shape classification.
We propose CLIP2Point, an image-depth pre-training method by contrastive learning to transfer CLIP to the 3D domain.
arXiv Detail & Related papers (2022-10-03T16:13:14Z) - No Token Left Behind: Explainability-Aided Image Classification and
Generation [79.4957965474334]
We present a novel explainability-based approach, which adds a loss term to ensure that CLIP focuses on all relevant semantic parts of the input.
Our method yields an improvement in the recognition rate, without additional training or fine-tuning.
arXiv Detail & Related papers (2022-04-11T07:16:39Z) - PointCLIP: Point Cloud Understanding by CLIP [77.02399444893963]
We propose PointCLIP, which conducts alignment between CLIP-encoded point cloud and 3D category texts.
PointCLIP is a promising alternative for effective 3D point cloud understanding via CLIP under low resource cost and data regime.
arXiv Detail & Related papers (2021-12-04T19:42:40Z)
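Zero-shot classification in this family of methods (PointCLIP above, and the LidarCLIP comparison in the main abstract) reduces to comparing an encoded point cloud against CLIP text embeddings of category prompts. The snippet below is a minimal hedged sketch: the prompt template and the assumption that a point-cloud embedding in the CLIP space is already available (e.g. from a LidarCLIP-style encoder or from rendered depth-map views) are placeholders, not any specific paper's implementation.

```python
# Minimal sketch of CLIP-style zero-shot classification for point clouds:
# score a point-cloud embedding against text embeddings of class prompts.
# The class list, prompt template, and the source of the point-cloud
# embedding are illustrative assumptions.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

classes = ["car", "pedestrian", "cyclist", "truck"]
prompts = clip.tokenize([f"a photo of a {c}" for c in classes]).to(device)

with torch.no_grad():
    text_feats = model.encode_text(prompts).float()
    text_feats /= text_feats.norm(dim=-1, keepdim=True)


def classify(point_cloud_embedding: torch.Tensor) -> str:
    """Return the most likely class for a (embed_dim,) point-cloud embedding."""
    z = point_cloud_embedding.float().to(device)
    z = z / z.norm()
    logits = z @ text_feats.T  # cosine similarities to each class prompt
    return classes[int(logits.argmax())]
```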
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.