TriFusion-AE: Language-Guided Depth and LiDAR Fusion for Robust Point Cloud Processing
- URL: http://arxiv.org/abs/2509.18743v1
- Date: Tue, 23 Sep 2025 07:37:28 GMT
- Title: TriFusion-AE: Language-Guided Depth and LiDAR Fusion for Robust Point Cloud Processing
- Authors: Susmit Neogi
- Abstract summary: Autoencoders offer a natural framework for denoising and reconstruction, but their performance degrades under challenging real-world conditions. We propose TriFusion-AE, a cross-attention autoencoder that integrates textual priors, monocular depth maps from multi-view images, and LiDAR point clouds to improve robustness. Our model achieves significantly more robust reconstruction under strong adversarial attacks and heavy noise, where CNN-based autoencoders collapse.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: LiDAR-based perception is central to autonomous driving and robotics, yet raw point clouds remain highly vulnerable to noise, occlusion, and adversarial corruptions. Autoencoders offer a natural framework for denoising and reconstruction, but their performance degrades under challenging real-world conditions. In this work, we propose TriFusion-AE, a multimodal cross-attention autoencoder that integrates textual priors, monocular depth maps from multi-view images, and LiDAR point clouds to improve robustness. By aligning semantic cues from text, geometric (depth) features from images, and spatial structure from LiDAR, TriFusion-AE learns representations that are resilient to stochastic noise and adversarial perturbations. Interestingly, while showing limited gains under mild perturbations, our model achieves significantly more robust reconstruction under strong adversarial attacks and heavy noise, where CNN-based autoencoders collapse. We evaluate on the nuScenes-mini dataset to reflect realistic low-data deployment scenarios. Our multimodal fusion framework is designed to be model-agnostic, enabling seamless integration with any CNN-based point cloud autoencoder for joint representation learning.
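To ground the architecture, here is a minimal PyTorch sketch of the kind of cross-attention fusion the abstract describes, in which LiDAR tokens query text and depth tokens. The module name, dimensions, and single-layer layout are illustrative assumptions, not the authors' released code; the fused output would feed whatever CNN-based point cloud autoencoder the framework wraps.

```python
# Minimal sketch of a TriFusion-style cross-attention fusion block.
# Module name, dimensions, and layout are illustrative assumptions.
import torch
import torch.nn as nn

class TriModalCrossAttentionFusion(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # LiDAR tokens act as queries over the other two modalities.
        self.attn_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_depth = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, lidar_tokens, text_tokens, depth_tokens):
        # lidar_tokens: (B, N, dim) features from a point cloud encoder
        # text_tokens:  (B, T, dim) embedded textual priors
        # depth_tokens: (B, D, dim) features from monocular depth maps
        fused = lidar_tokens
        fused = fused + self.attn_text(fused, text_tokens, text_tokens)[0]
        fused = fused + self.attn_depth(fused, depth_tokens, depth_tokens)[0]
        return self.norm(fused)  # hand off to any CNN-based AE decoder

if __name__ == "__main__":
    fusion = TriModalCrossAttentionFusion()
    out = fusion(torch.randn(2, 1024, 256),
                 torch.randn(2, 16, 256),
                 torch.randn(2, 64, 256))
    print(out.shape)  # torch.Size([2, 1024, 256])
```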
Related papers
- VLMFusionOcc3D: VLM Assisted Multi-Modal 3D Semantic Occupancy Prediction [0.0]
VLMFusionOcc3D is a robust multimodal framework for dense 3D semantic occupancy prediction in autonomous driving. We introduce Weather-Aware Adaptive Fusion, a dynamic gating mechanism that utilizes vehicle metadata and weather-conditioned prompts to re-weight sensor contributions. Our approach achieves significant improvements in challenging weather scenarios, offering a scalable and robust solution for complex urban navigation.
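A weather-conditioned gate of this kind can be sketched in a few lines; the gate inputs, layer sizes, and two-sensor setup below are guesses for illustration, not the paper's design:

```python
# Rough sketch of a weather-aware gate that re-weights sensor features
# from a metadata/prompt embedding. All sizes are illustrative assumptions.
import torch
import torch.nn as nn

class WeatherGate(nn.Module):
    def __init__(self, cond_dim: int = 32, num_sensors: int = 2):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(cond_dim, 64), nn.ReLU(),
            nn.Linear(64, num_sensors), nn.Softmax(dim=-1),
        )

    def forward(self, cond, cam_feat, lidar_feat):
        # cond: (B, cond_dim) embedding of vehicle metadata + weather prompt
        w = self.gate(cond)  # (B, 2) per-sensor contribution weights
        return w[:, 0:1] * cam_feat + w[:, 1:2] * lidar_feat

if __name__ == "__main__":
    g = WeatherGate()
    fused = g(torch.randn(4, 32), torch.randn(4, 128), torch.randn(4, 128))
    print(fused.shape)  # torch.Size([4, 128])
```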
arXiv Detail & Related papers (2026-03-03T05:22:28Z) - Task-Driven Prompt Learning: A Joint Framework for Multi-modal Cloud Removal and Segmentation [11.468907022707013]
TDP-CR is a task-driven framework that jointly performs cloud removal and land-cover segmentation. Central to our approach is a Prompt-Guided Fusion mechanism, which utilizes a learnable degradation prompt to encode cloud thickness and spatial uncertainty. Experiments on the LuojiaSET-OSFCR dataset demonstrate the superiority of our framework.
arXiv Detail & Related papers (2026-01-17T13:32:38Z) - Lemon: A Unified and Scalable 3D Multimodal Model for Universal Spatial Understanding [80.66591664266744]
Lemon is a unified transformer architecture that processes 3D point cloud patches and language tokens as a single sequence. To handle the complexity of 3D data, we develop a structured patchification and tokenization scheme that preserves spatial context. Lemon establishes new state-of-the-art performance across comprehensive 3D understanding and reasoning tasks.
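The single-sequence design can be sketched as follows; patch size, modality embeddings, and layer counts are assumptions for illustration, not Lemon's actual configuration:

```python
# Sketch of encoding point cloud patches and language tokens as one sequence.
# Patch size, modality embeddings, and depths are illustrative assumptions.
import torch
import torch.nn as nn

class UnifiedSequenceEncoder(nn.Module):
    def __init__(self, dim: int = 256, num_layers: int = 2, num_heads: int = 8):
        super().__init__()
        self.patch_embed = nn.Linear(32 * 3, dim)  # flatten 32-point patches
        self.modality = nn.Embedding(2, dim)       # 0 = point patch, 1 = text
        layer = nn.TransformerEncoderLayer(dim, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, patches, text_emb):
        # patches: (B, P, 32, 3) point patches; text_emb: (B, T, dim) tokens
        p = self.patch_embed(patches.flatten(2)) + self.modality.weight[0]
        t = text_emb + self.modality.weight[1]
        return self.encoder(torch.cat([p, t], dim=1))  # (B, P + T, dim)

if __name__ == "__main__":
    enc = UnifiedSequenceEncoder()
    out = enc(torch.randn(2, 16, 32, 3), torch.randn(2, 8, 256))
    print(out.shape)  # torch.Size([2, 24, 256])
```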
arXiv Detail & Related papers (2025-12-14T20:02:43Z) - High-Quality Proposal Encoding and Cascade Denoising for Imaginary Supervised Object Detection [20.075203668387136]
Existing imaginary supervised object detection methods suffer from simplistic prompts, poor image quality, and weak supervision. We propose Cascade HQP-DETR to address these limitations. First, we introduce a high-quality data pipeline using LLaMA-3, Flux, and Grounding DINO to generate the FluxVOC and FluxCOCO datasets. Second, our High-Quality Proposal-guided query encoding encodes object queries with image-specific priors from SAM-generated proposals. Third, our cascade denoising algorithm dynamically adjusts training weights through progressively increasing IoU thresholds across decoder layers.
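The progressive-threshold idea might look like the helper below; the threshold range and down-weighting factor are illustrative guesses, not values from the paper:

```python
# Illustrative cascade-denoising weights: later decoder layers demand
# higher IoU before a proposal gets full loss weight. Values are guesses.
import torch

def cascade_weights(ious: torch.Tensor, num_layers: int = 6,
                    t0: float = 0.5, t1: float = 0.8) -> torch.Tensor:
    """Return (num_layers, N) weights for proposals with the given IoUs."""
    thresholds = torch.linspace(t0, t1, num_layers)        # rise per layer
    passed = ious.unsqueeze(0) >= thresholds.unsqueeze(1)  # (L, N) bool
    return 0.1 + 0.9 * passed.float()  # full weight if passed, else 0.1

if __name__ == "__main__":
    print(cascade_weights(torch.tensor([0.45, 0.60, 0.85])))
```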
arXiv Detail & Related papers (2025-11-11T09:19:56Z) - Have We Scene It All? Scene Graph-Aware Deep Point Cloud Compression [18.40946383877556]
We propose a deep compression framework based on semantic scene graphs. We show that the framework achieves state-of-the-art compression rates, reducing data size by up to 98%. It supports downstream applications such as multi-robot pose graph optimization and map merging.
arXiv Detail & Related papers (2025-10-09T17:45:09Z) - R3GS: Gaussian Splatting for Robust Reconstruction and Relocalization in Unconstrained Image Collections [9.633163304379861]
R3GS is a robust reconstruction and relocalization framework tailored for unconstrained datasets. To mitigate the adverse effects of transient objects on the reconstruction process, we fine-tune a lightweight human detection network. To address the challenges posed by sky regions in outdoor scenes, we propose an effective sky-handling technique that incorporates a depth prior as a constraint.
arXiv Detail & Related papers (2025-05-21T09:25:22Z) - Robust Unsupervised Domain Adaptation for 3D Point Cloud Segmentation Under Source Adversarial Attacks [9.578322021478426]
Unsupervised domain adaptation (UDA) frameworks have shown good generalization capabilities for 3D point cloud semantic segmentation models on clean data. We propose a stealthy adversarial point cloud generation attack that can significantly contaminate datasets with only minor perturbations to the point cloud surface. With the generated corrupted data, we further develop the Adversarial Adaptation Framework (AAF) as the countermeasure.
arXiv Detail & Related papers (2025-04-02T12:11:34Z) - RelitLRM: Generative Relightable Radiance for Large Reconstruction Models [52.672706620003765]
We propose RelitLRM for generating high-quality Gaussian splatting representations of 3D objects under novel illuminations.
Unlike prior inverse rendering methods requiring dense captures and slow optimization, RelitLRM adopts a feed-forward transformer-based model.
We show that our sparse-view, feed-forward RelitLRM offers relighting results competitive with state-of-the-art dense-view optimization-based baselines.
arXiv Detail & Related papers (2024-10-08T17:40:01Z) - Few-shot point cloud reconstruction and denoising via learned Gaussian splat renderings and fine-tuned diffusion features [52.62053703535824]
We propose a method to reconstruct point clouds from few images and to denoise point clouds from their renderings.
To improve reconstruction in constrained settings, we regularize the training of a differentiable renderer with hybrid surface and appearance representations.
We demonstrate how these learned filters can be used to remove point cloud noise without 3D supervision.
arXiv Detail & Related papers (2024-04-01T13:38:16Z) - DaRF: Boosting Radiance Fields from Sparse Inputs with Monocular Depth Adaptation [31.655818586634258]
We propose a novel framework, dubbed DäRF, that achieves robust NeRF reconstruction with a handful of real-world images.
Our framework imposes the monocular depth estimation (MDE) network's powerful geometry prior on the NeRF representation at both seen and unseen viewpoints.
In addition, we overcome the ambiguity of monocular depth through patch-wise scale-shift fitting and geometry distillation.
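Patch-wise scale-shift fitting is typically a closed-form least-squares alignment of monocular depth to a reference depth; the sketch below shows that generic technique, not DaRF's exact procedure:

```python
# Generic per-patch scale/shift alignment of monocular depth (least squares).
# This is the standard technique, not necessarily DaRF's exact formulation.
import torch

def fit_scale_shift(d_mono: torch.Tensor, d_ref: torch.Tensor):
    """Solve min_{s,t} ||s * d_mono + t - d_ref||^2 over one patch."""
    x, y = d_mono.flatten(), d_ref.flatten()
    A = torch.stack([x, torch.ones_like(x)], dim=1)       # (P, 2)
    sol = torch.linalg.lstsq(A, y.unsqueeze(1)).solution  # (2, 1)
    return sol[0, 0], sol[1, 0]                           # scale, shift

if __name__ == "__main__":
    ref = torch.rand(8, 8) * 5 + 1
    mono = (ref - 0.7) / 2.0            # mono depth off by scale and shift
    s, t = fit_scale_shift(mono, ref)
    print(round(s.item(), 3), round(t.item(), 3))  # ~2.0, ~0.7
```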
arXiv Detail & Related papers (2023-05-30T16:46:41Z) - StarNet: Style-Aware 3D Point Cloud Generation [82.30389817015877]
StarNet is able to reconstruct and generate high-fidelity 3D point clouds using a mapping network.
Our framework achieves performance comparable to the state of the art on various metrics in point cloud reconstruction and generation tasks.
arXiv Detail & Related papers (2023-03-28T08:21:44Z) - Self-Supervised Point Cloud Representation Learning with Occlusion Auto-Encoder [63.77257588569852]
We present 3D Occlusion Auto-Encoder (3D-OAE) for learning representations for point clouds.
Our key idea is to randomly occlude some local patches of the input point cloud and establish the supervision via recovering the occluded patches.
In contrast with previous methods, our 3D-OAE can remove a large proportion of patches and predict them only with a small number of visible patches.
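The occlude-and-recover supervision can be approximated in a few lines; this sketch randomly splits pre-formed patches and omits the paper's local patch construction (e.g., farthest point sampling plus kNN grouping):

```python
# Simplified occlude-and-recover split over point patches. Patch grouping
# (FPS + kNN in 3D-OAE) is assumed to have happened already.
import torch

def split_visible_occluded(patches: torch.Tensor, occlude_ratio: float = 0.75):
    """patches: (num_patches, pts_per_patch, 3) -> (visible, target, idx)."""
    n = patches.shape[0]
    perm = torch.randperm(n)
    n_occ = int(n * occlude_ratio)
    occ_idx, vis_idx = perm[:n_occ], perm[n_occ:]
    return patches[vis_idx], patches[occ_idx], occ_idx

if __name__ == "__main__":
    pts = torch.randn(64, 32, 3)              # 64 patches of 32 points each
    visible, target, idx = split_visible_occluded(pts)
    # An encoder sees only `visible`; a decoder predicts `target`, trained
    # with a set distance such as Chamfer loss.
    print(visible.shape, target.shape)        # (16, 32, 3) / (48, 32, 3)
```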
arXiv Detail & Related papers (2022-03-26T14:06:29Z) - Pseudo-LiDAR Point Cloud Interpolation Based on 3D Motion Representation and Spatial Supervision [68.35777836993212]
We propose a Pseudo-LiDAR point cloud network to generate temporally and spatially high-quality point cloud sequences.
By exploiting the scene flow between point clouds, the proposed network is able to learn a more accurate representation of the 3D spatial motion relationship.
arXiv Detail & Related papers (2020-06-20T03:11:04Z)