Exploring Modality Guidance to Enhance VFM-based Feature Fusion for UDA in 3D Semantic Segmentation
- URL: http://arxiv.org/abs/2504.14231v1
- Date: Sat, 19 Apr 2025 08:53:54 GMT
- Title: Exploring Modality Guidance to Enhance VFM-based Feature Fusion for UDA in 3D Semantic Segmentation
- Authors: Johannes Spoecklberger, Wei Lin, Pedro Hermosilla, Sivan Doveh, Horst Possegger, M. Jehanzeb Mirza
- Abstract summary: Vision Foundation Models (VFMs) have become a de facto choice for many downstream vision tasks, like image classification, image segmentation, and object localization. In our work, we explore the utility of VFMs for adapting from a labeled source to unlabeled target data for the task of LiDAR-based 3D semantic segmentation. Our method consumes paired 2D-3D (image and point cloud) data and relies on the robust (cross-domain) features from a VFM to train a 3D backbone on a mix of labeled source and unlabeled target data.
- Score: 14.651682743504024
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision Foundation Models (VFMs) have become a de facto choice for many downstream vision tasks, like image classification, image segmentation, and object localization. However, they can also provide significant utility for downstream 3D tasks that can leverage the cross-modal information (e.g., from paired image data). In our work, we further explore the utility of VFMs for adapting from a labeled source to unlabeled target data for the task of LiDAR-based 3D semantic segmentation. Our method consumes paired 2D-3D (image and point cloud) data and relies on the robust (cross-domain) features from a VFM to train a 3D backbone on a mix of labeled source and unlabeled target data. At the heart of our method lies a fusion network that is guided by both the image and point cloud streams, with their relative contributions adjusted based on the target domain. We extensively compare our proposed methodology with different state-of-the-art methods in several settings and achieve strong performance gains. For example, we achieve an average improvement of 6.5 mIoU over all tasks compared with the previous state of the art.
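A minimal PyTorch sketch of how such a modality-guided fusion could look, gating per-point 3D features against per-point VFM image features; the module structure and all names below are assumptions for illustration, not the paper's implementation:

```python
import torch
import torch.nn as nn

class ModalityGuidedFusion(nn.Module):
    """Toy sketch: fuse per-point 3D backbone features with per-point image (VFM)
    features via a learned gate that sets the relative contribution of each modality."""

    def __init__(self, dim_3d: int, dim_2d: int, dim_out: int):
        super().__init__()
        self.proj_3d = nn.Linear(dim_3d, dim_out)
        self.proj_2d = nn.Linear(dim_2d, dim_out)
        # The gate sees both streams and predicts a per-point mixing weight in [0, 1].
        self.gate = nn.Sequential(
            nn.Linear(2 * dim_out, dim_out), nn.ReLU(),
            nn.Linear(dim_out, 1), nn.Sigmoid(),
        )

    def forward(self, feats_3d: torch.Tensor, feats_2d: torch.Tensor) -> torch.Tensor:
        # feats_3d: (N, dim_3d) point features, feats_2d: (N, dim_2d) projected VFM features
        a = self.proj_3d(feats_3d)
        b = self.proj_2d(feats_2d)
        w = self.gate(torch.cat([a, b], dim=-1))  # (N, 1)
        return w * a + (1.0 - w) * b              # modality-weighted fused features
```

Adjusting how strongly the gate leans on each stream is one way the relative contributions of the modalities could be tuned for a given target domain.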
Related papers
- DINO in the Room: Leveraging 2D Foundation Models for 3D Segmentation [51.43837087865105]
Vision foundation models (VFMs) trained on large-scale image datasets provide high-quality features that have significantly advanced 2D visual recognition.
Their potential in 3D vision remains largely untapped, despite the common availability of 2D images alongside 3D point cloud datasets.
We introduce DITR, a simple yet effective approach that extracts 2D foundation model features, projects them to 3D, and finally injects them into a 3D point cloud segmentation model.
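A rough sketch of the 2D-to-3D feature lifting DITR describes (project each LiDAR point into the image and bilinearly sample the foundation-model feature map); the pinhole camera convention and function name are assumptions, and frustum masking is omitted for brevity:

```python
import torch
import torch.nn.functional as F

def lift_2d_features_to_points(points_cam: torch.Tensor,   # (N, 3) points in camera frame
                               intrinsics: torch.Tensor,   # (3, 3) pinhole intrinsics
                               feat_map: torch.Tensor):    # (C, H, W) 2D VFM features
    """Returns (N, C) per-point features sampled at the points' image projections."""
    C, H, W = feat_map.shape
    uvw = (intrinsics @ points_cam.T).T                     # homogeneous pixel coords
    uv = uvw[:, :2] / uvw[:, 2:].clamp(min=1e-6)            # perspective divide
    # Normalize pixel coordinates to [-1, 1] for grid_sample (points outside the
    # image or behind the camera should be masked out in practice).
    grid = torch.stack([uv[:, 0] / (W - 1) * 2 - 1,
                        uv[:, 1] / (H - 1) * 2 - 1], dim=-1).view(1, 1, -1, 2)
    sampled = F.grid_sample(feat_map[None], grid, align_corners=True)  # (1, C, 1, N)
    return sampled[0, :, 0].T                               # (N, C)
```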
arXiv Detail & Related papers (2025-03-24T17:59:11Z)
- LiOn-XA: Unsupervised Domain Adaptation via LiDAR-Only Cross-Modal Adversarial Training [61.26381389532653]
LiOn-XA is an unsupervised domain adaptation (UDA) approach that combines LiDAR-Only Cross-Modal (X) learning with Adversarial training for 3D LiDAR point cloud semantic segmentation.
Our experiments on 3 real-to-real adaptation scenarios demonstrate the effectiveness of our approach.
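The adversarial part of such training is commonly implemented with a gradient-reversal layer and a domain discriminator; the sketch below shows that generic pattern and is not LiOn-XA's actual code:

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips (and scales) gradients in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class DomainDiscriminator(nn.Module):
    """Predicts source vs. target from gradient-reversed features, pushing the
    backbone toward domain-invariant representations."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, feats, lambd: float = 1.0):
        return self.net(GradReverse.apply(feats, lambd))  # domain logits

# Training adds, e.g., bce(disc(src_feats), ones) + bce(disc(tgt_feats), zeros) to the task loss.
```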
arXiv Detail & Related papers (2024-10-21T09:50:17Z)
- VLMine: Long-Tail Data Mining with Vision Language Models [18.412533708652102]
This work focuses on the problem of identifying rare examples within a corpus of unlabeled data.
We propose a simple and scalable data mining approach that leverages the knowledge contained within a large vision language model (VLM).
Our experiments consistently show large improvements (between 10% and 50%) over the baseline techniques.
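The summary does not say how the VLM's knowledge is tapped; one plausible, purely illustrative realization is to caption the unlabeled images with any VLM and rank them by keyword rarity. caption_fn and the scoring rule below are assumptions:

```python
from collections import Counter

def mine_rare_examples(image_ids, caption_fn, top_k=100):
    """caption_fn: hypothetical VLM captioner, image_id -> caption string.
    Scores each image by the corpus rarity of its caption keywords and
    returns the top_k rarest examples."""
    captions = {i: caption_fn(i).lower().split() for i in image_ids}
    # Document frequency of each keyword across the corpus.
    freq = Counter(w for words in captions.values() for w in set(words))
    # Rarity score: inverse frequency of the rarest keyword in the caption.
    scores = {i: 1.0 / min(freq[w] for w in set(ws))
              for i, ws in captions.items() if ws}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```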
arXiv Detail & Related papers (2024-09-23T19:13:51Z)
- Visual Foundation Models Boost Cross-Modal Unsupervised Domain Adaptation for 3D Semantic Segmentation [17.875516787157018]
We study how to harness the knowledge priors learned by 2D visual foundation models to produce more accurate labels for unlabeled target domains.
Our method is evaluated on various autonomous driving datasets and the results demonstrate a significant improvement on the 3D segmentation task.
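The general recipe this points at can be sketched as: transfer the 2D VFM's class probabilities to the projected LiDAR points and keep only confident predictions as pseudo-labels for the target domain. The tensor layout and threshold are assumptions:

```python
import torch

def pseudo_labels_from_vfm(probs_2d: torch.Tensor,   # (K, H, W) 2D class probabilities
                           uv: torch.Tensor,         # (N, 2) integer pixel coords of in-image points
                           conf_thresh: float = 0.9,
                           ignore_index: int = 255):
    """Returns (N,) per-point pseudo-labels; low-confidence points get ignore_index."""
    point_probs = probs_2d[:, uv[:, 1], uv[:, 0]].T  # gather probabilities -> (N, K)
    conf, labels = point_probs.max(dim=-1)
    labels[conf < conf_thresh] = ignore_index        # discard uncertain pseudo-labels
    return labels
```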
arXiv Detail & Related papers (2024-03-15T03:58:17Z)
- CMDA: Cross-Modal and Domain Adversarial Adaptation for LiDAR-Based 3D Object Detection [14.063365469339812]
LiDAR-based 3D Object Detection methods often do not generalize well to target domains outside the source (or training) data distribution.
We introduce a novel unsupervised domain adaptation (UDA) method, called CMDA, which leverages visual semantic cues from an image modality.
We also introduce a self-training-based learning strategy, wherein a model is adversarially trained to generate domain-invariant features.
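The self-training part of such a strategy is often realized with an EMA teacher that pseudo-labels target batches for the student, while the adversarial part resembles the gradient-reversal sketch above. The step below is a generic pattern with assumed names, expecting a cross-entropy criterion with an ignore_index:

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum: float = 0.999):
    """Keep the teacher as an exponential moving average of the student weights."""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1.0 - momentum)

def self_training_loss(student, teacher, target_batch, criterion, conf_thresh: float = 0.9):
    """Teacher pseudo-labels the target batch; the student trains on confident labels only."""
    with torch.no_grad():
        probs = teacher(target_batch).softmax(dim=-1)        # (N, num_classes)
        conf, pseudo = probs.max(dim=-1)
        pseudo[conf < conf_thresh] = criterion.ignore_index  # mask unreliable points
    return criterion(student(target_batch), pseudo)
```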
arXiv Detail & Related papers (2024-03-06T14:12:38Z)
- 3DiffTection: 3D Object Detection with Geometry-Aware Diffusion Features [70.50665869806188]
3DiffTection is a state-of-the-art method for 3D object detection from single images.
We fine-tune a diffusion model to perform novel view synthesis conditioned on a single image.
We further train the model on target data with detection supervision.
arXiv Detail & Related papers (2023-11-07T23:46:41Z)
- Cross-modal & Cross-domain Learning for Unsupervised LiDAR Semantic Segmentation [82.47872784972861]
Cross-modal domain adaptation has been studied on the paired 2D image and 3D LiDAR data to ease the labeling costs for 3D LiDAR semantic segmentation (3DLSS) in the target domain.
This paper studies a new 3DLSS setting where a 2D dataset with semantic annotations and paired but unannotated 2D image and 3D LiDAR data (target) are available.
To achieve 3DLSS in this scenario, we propose Cross-Modal and Cross-Domain Learning (CoMoDaL).
arXiv Detail & Related papers (2023-08-05T14:00:05Z)
- CrossLoc3D: Aerial-Ground Cross-Source 3D Place Recognition [45.16530801796705]
CrossLoc3D is a novel 3D place recognition method that solves a large-scale point matching problem in a cross-source setting.
We present CS-Campus3D, the first 3D aerial-ground cross-source dataset consisting of point cloud data from both aerial and ground LiDAR scans.
arXiv Detail & Related papers (2023-03-31T02:50:52Z)
- Unleash the Potential of Image Branch for Cross-modal 3D Object Detection [67.94357336206136]
We present a new cross-modal 3D object detector, namely UPIDet, which aims to unleash the potential of the image branch from two aspects.
First, UPIDet introduces a new 2D auxiliary task called normalized local coordinate map estimation.
Second, we discover that the representational capability of the point cloud backbone can be enhanced through the gradients backpropagated from the training objectives of the image branch.
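The second point boils down to a shared point backbone feeding both the 3D detection objective and a 2D auxiliary objective, so the image-branch gradients flow back into the point features. A schematic training step with hypothetical heads that return their own losses:

```python
def joint_step(point_backbone, det_head, aux_2d_head, batch, w_aux: float = 0.5):
    """Sketch: both losses depend on the same point features, so one backward pass
    lets the 2D auxiliary task shape the point cloud representation as well."""
    feats = point_backbone(batch["points"])                # shared point features
    det_loss = det_head(feats, batch["boxes_3d"])          # main 3D detection loss
    aux_loss = aux_2d_head(feats, batch["aux_target_2d"])  # 2D auxiliary loss (e.g., NLC maps)
    return det_loss + w_aux * aux_loss
```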
arXiv Detail & Related papers (2023-01-22T08:26:58Z)
- Similarity-Aware Fusion Network for 3D Semantic Segmentation [87.51314162700315]
We propose a similarity-aware fusion network (SAFNet) to adaptively fuse 2D images and 3D point clouds for 3D semantic segmentation.
We employ a late fusion strategy where we first learn the geometric and contextual similarities between the input and back-projected (from 2D pixels) point clouds.
We show that SAFNet significantly outperforms existing state-of-the-art fusion-based approaches across various levels of data integrity.
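One way to picture similarity-aware late fusion: weight the image-derived features by a learned agreement score between the input points and their back-projected counterparts (geometric distance plus feature context). This is an illustrative guess at the mechanism, not SAFNet's code:

```python
import torch
import torch.nn as nn

class SimilarityAwareFusion(nn.Module):
    """Late fusion weighted by geometric + contextual similarity between the input
    point cloud and the point cloud back-projected from 2D pixels."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(2 * dim + 1, dim), nn.ReLU(),
                                   nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, xyz, xyz_backproj, feats_3d, feats_2d):
        geo = (xyz - xyz_backproj).norm(dim=-1, keepdim=True)         # geometric agreement (N, 1)
        s = self.score(torch.cat([feats_3d, feats_2d, geo], dim=-1))  # contextual + geometric score
        return feats_3d + s * feats_2d   # down-weight 2D features where the streams disagree
```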
arXiv Detail & Related papers (2021-07-04T09:28:18Z)
- Generation for adaption: a Gan-based approach for 3D Domain Adaption in Point Cloud [10.614067060304919]
Unsupervised domain adaptation (UDA) seeks to overcome the domain gap without target domain labels.
We propose a method that uses a generative adversarial network (GAN) to generate synthetic data from the source domain.
Experiments show that our approach performs better than other state-of-the-art UDA methods on three popular 3D object/scene datasets.
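A minimal sketch of the adversarial generation step such an approach implies; the mapping direction (source to target-like) and every name below are assumptions:

```python
import torch
import torch.nn as nn

def adversarial_step(G, D, source_batch, real_target, opt_g, opt_d):
    """One GAN step: G translates source-domain samples toward the target style,
    D tries to tell generated samples from real target data."""
    bce = nn.BCEWithLogitsLoss()
    fake = G(source_batch)

    # Discriminator update: real target -> 1, generated -> 0.
    d_real, d_fake = D(real_target), D(fake.detach())
    d_loss = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator update: try to fool the discriminator.
    g_logits = D(fake)
    g_loss = bce(g_logits, torch.ones_like(g_logits))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```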
arXiv Detail & Related papers (2021-02-15T07:24:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences.