Related papers: LatentBKI: Open-Dictionary Continuous Mapping in Visual-Language Latent Spaces with Quantifiable Uncertainty

URL: http://arxiv.org/abs/2410.11783v2
Date: Tue, 21 Jan 2025 21:46:26 GMT
Title: LatentBKI: Open-Dictionary Continuous Mapping in Visual-Language Latent Spaces with Quantifiable Uncertainty
Authors: Joey Wilson, Ruihan Xu, Yile Sun, Parker Ewen, Minghan Zhu, Kira Barton, Maani Ghaffari
Abstract summary: This paper introduces a novel probabilistic mapping algorithm, LatentBKI, which enables open-vocabulary mapping with quantifiable uncertainty. LatentBKI is evaluated against similar explicit semantic mapping and VL mapping frameworks on the popular Matterport3D and Semantic KITTI datasets. Real-world experiments demonstrate applicability to challenging indoor environments.
Score: 6.986230616834552
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This paper introduces a novel probabilistic mapping algorithm, LatentBKI, which enables open-vocabulary mapping with quantifiable uncertainty. Traditionally, semantic mapping algorithms focus on a fixed set of semantic categories which limits their applicability for complex robotic tasks. Vision-Language (VL) models have recently emerged as a technique to jointly model language and visual features in a latent space, enabling semantic recognition beyond a predefined, fixed set of semantic classes. LatentBKI recurrently incorporates neural embeddings from VL models into a voxel map with quantifiable uncertainty, leveraging the spatial correlations of nearby observations through Bayesian Kernel Inference (BKI). LatentBKI is evaluated against similar explicit semantic mapping and VL mapping frameworks on the popular Matterport3D and Semantic KITTI datasets, demonstrating that LatentBKI maintains the probabilistic benefits of continuous mapping with the additional benefit of open-dictionary queries. Real-world experiments demonstrate applicability to challenging indoor environments.
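
The abstract describes fusing VL embeddings into a voxel map using the spatial correlation of nearby observations, then querying the map with open-dictionary terms. The sketch below illustrates that general idea only; it is not the paper's update rule. It assumes a compactly supported kernel of the kind commonly used in BKI mapping work, a kernel-weighted running mean and variance per voxel, and cosine similarity for queries. `sparse_kernel`, `LatentVoxel`, the 0.3 length scale, and the 4-dimensional toy features are all hypothetical names and values chosen for illustration.

```python
import math


def sparse_kernel(dist, length_scale=0.3):
    """Compactly supported kernel: 1 at dist = 0, exactly 0 for dist >= length_scale."""
    if dist >= length_scale:
        return 0.0
    r = dist / length_scale
    k = ((2.0 + math.cos(2.0 * math.pi * r)) / 3.0) * (1.0 - r) \
        + math.sin(2.0 * math.pi * r) / (2.0 * math.pi)
    return max(k, 0.0)


class LatentVoxel:
    """One voxel holding a kernel-weighted mean and variance of latent features."""

    def __init__(self, center, dim):
        self.center = list(center)
        self.weight = 0.0          # accumulated kernel weight (acts like a pseudo-count)
        self.mean = [0.0] * dim    # weighted mean of observed features
        self.m2 = [0.0] * dim      # weighted sum of squared deviations

    def update(self, point, feature, length_scale=0.3):
        """Fold one observed (point, feature) pair into the voxel estimate."""
        w = sparse_kernel(math.dist(point, self.center), length_scale)
        if w <= 0.0:
            return  # observation lies outside the kernel support
        total = self.weight + w
        for i, f in enumerate(feature):
            delta = f - self.mean[i]
            self.mean[i] += (w / total) * delta
            self.m2[i] += w * delta * (f - self.mean[i])
        self.weight = total

    def variance(self):
        """Per-dimension weighted variance; infinite if nothing was observed."""
        if self.weight == 0.0:
            return [math.inf] * len(self.mean)
        return [m / self.weight for m in self.m2]

    def similarity(self, query):
        """Cosine similarity between the voxel mean and a query embedding."""
        dot = sum(a * b for a, b in zip(self.mean, query))
        norm = math.hypot(*self.mean) * math.hypot(*query)
        return dot / norm if norm > 0.0 else 0.0


# Toy data: stand-ins for per-point VL features (assumed, not real model output).
voxel = LatentVoxel(center=[0.0, 0.0, 0.0], dim=4)
voxel.update([0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0])
voxel.update([0.1, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0])
voxel.update([0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0])  # conflicting observation
voxel.update([5.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0])  # too far away, ignored

print(voxel.mean, voxel.variance())
print(voxel.similarity([1.0, 0.0, 0.0, 0.0]))  # query embedding is a placeholder
```

In this toy version the accumulated weight and the variance give a rough sense of how much evidence supports a voxel and how consistent it is; in a real system the query vector would come from the text encoder of the VL model.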

Related papers

FeatInv: Spatially resolved mapping from feature space to input space using conditional diffusion models [0.9503773054285559]
Internal representations are crucial for understanding deep neural networks. While mapping from feature space to input space aids in interpreting the former, existing approaches often rely on crude approximations. We propose using a conditional diffusion model to learn such a mapping in a probabilistic manner.
arXiv Detail & Related papers (2025-05-27T11:07:34Z)
PARIC: Probabilistic Attention Regularization for Language Guided Image Classification from Pre-trained Vison Language Models [2.2760325783059074]
We introduce PARIC, a probabilistic framework for guiding visual attention via language specifications. Our approach enables pre-trained vision-language models to generate probabilistic reference attention maps. Experiments on benchmark test problems demonstrate that PARIC enhances prediction accuracy, mitigates bias, ensures consistent predictions, and improves across various datasets.
arXiv Detail & Related papers (2025-03-14T12:53:37Z)
Post-hoc Probabilistic Vision-Language Models [51.12284891724463]
Vision-language models (VLMs) have found remarkable success in classification, retrieval, and generative tasks. We propose post-hoc uncertainty estimation in VLMs that does not require additional training. Our results show promise for safety-critical applications of large-scale models.
arXiv Detail & Related papers (2024-12-08T18:16:13Z)
Disentangling Dense Embeddings with Sparse Autoencoders [0.0]
Sparse autoencoders (SAEs) have shown promise in extracting interpretable features from complex neural networks. We present one of the first applications of SAEs to dense text embeddings from large language models. We show that the resulting sparse representations maintain semantic fidelity while offering interpretability.
arXiv Detail & Related papers (2024-08-01T15:46:22Z)
Evidential Semantic Mapping in Off-road Environments with Uncertainty-aware Bayesian Kernel Inference [5.120567378386614]
We propose an evidential semantic mapping framework, which can enhance reliability in perceptually challenging off-road environments. By adaptively handling semantic uncertainties, the proposed framework constructs robust representations of the surroundings even in previously unseen environments.
arXiv Detail & Related papers (2024-03-21T05:13:34Z)
Auxiliary Tasks Enhanced Dual-affinity Learning for Weakly Supervised Semantic Segmentation [79.05949524349005]
We propose AuxSegNet+, a weakly supervised auxiliary learning framework to explore the rich information from saliency maps. We also propose a cross-task affinity learning mechanism to learn pixel-level affinities from the saliency and segmentation feature maps.
arXiv Detail & Related papers (2024-03-02T10:03:21Z)
Dirichlet Active Learning [1.4277428617774877]
Dirichlet Active Learning (DiAL) is a Bayesian-inspired approach to the design of active learning algorithms. Our framework models feature-conditional class probabilities as a Dirichlet random field.
arXiv Detail & Related papers (2023-11-09T16:39:02Z)
Semantics Meets Temporal Correspondence: Self-supervised Object-centric Learning in Videos [63.94040814459116]
Self-supervised methods have shown remarkable progress in learning high-level semantics and low-level temporal correspondence. We propose a novel semantic-aware masked slot attention on top of the fused semantic features and correspondence maps. We adopt semantic- and instance-level temporal consistency as self-supervision to encourage temporally coherent object-centric representations.
arXiv Detail & Related papers (2023-08-19T09:12:13Z)
GFlowNet-EM for learning compositional latent variable models [115.96660869630227]
A key tradeoff in modeling the posteriors over latents is between expressivity and tractable optimization. We propose the use of GFlowNets, algorithms for sampling from an unnormalized density. By training GFlowNets to sample from the posterior over latents, we take advantage of their strengths as amortized variational algorithms.
arXiv Detail & Related papers (2023-02-13T18:24:21Z)
BEVBert: Multimodal Map Pre-training for Language-guided Navigation [75.23388288113817]
We propose a new map-based pre-training paradigm that is spatial-aware for use in vision-and-language navigation (VLN). We build a local metric map to explicitly aggregate incomplete observations and remove duplicates, while modeling navigation dependency in a global topological map. Based on the hybrid map, we devise a pre-training framework to learn a multimodal map representation, which enhances spatial-aware cross-modal reasoning, thereby facilitating the language-guided navigation goal.
arXiv Detail & Related papers (2022-12-08T16:27:54Z)
Convolutional Bayesian Kernel Inference for 3D Semantic Mapping [1.7615233156139762]
We introduce a Convolutional Bayesian Kernel Inference layer which learns to perform explicit Bayesian inference. We learn semantic-geometric probability distributions for LiDAR sensor information and incorporate semantic predictions into a global map. We evaluate our network against state-of-the-art semantic mapping algorithms on the KITTI data set, demonstrating improved latency with comparable semantic label inference results.
arXiv Detail & Related papers (2022-09-21T21:15:12Z)
PROB-SLAM: Real-time Visual SLAM Based on Probabilistic Graph Optimization [0.0]
Traditional SLAM algorithms are typically based on artificial features, which lack high-level information. By introducing semantic information, SLAM can achieve higher stability and robustness than with purely hand-crafted features. This paper proposes a novel probability map based on the Gaussian distribution assumption. We have demonstrated that the method can be successfully applied to environments containing dynamic objects.
arXiv Detail & Related papers (2022-09-15T05:47:17Z)
Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text. These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining. We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z)
Dive into Ambiguity: Latent Distribution Mining and Pairwise Uncertainty Estimation for Facial Expression Recognition [59.52434325897716]
We propose a solution, named DMUE, to address the problem of annotation ambiguity from two perspectives. For the former, an auxiliary multi-branch learning framework is introduced to better mine and describe the latent distribution in the label space. For the latter, the pairwise relationships of semantic features between instances are fully exploited to estimate the extent of ambiguity in the instance space.
arXiv Detail & Related papers (2021-04-01T03:21:57Z)
Inter-class Discrepancy Alignment for Face Recognition [55.578063356210144]
We propose a unified framework called Inter-class Discrepancy Alignment (IDA). IDA-DAO is used to align the similarity scores considering the discrepancy between an image and its neighbors. IDA-SSE can provide convincing inter-class neighbors by introducing virtual candidate images generated with a GAN.
arXiv Detail & Related papers (2021-03-02T08:20:08Z)

This list is automatically generated from the titles and abstracts of the papers on this site.