Compositional Semantics for Open Vocabulary Spatio-semantic
Representations
- URL: http://arxiv.org/abs/2310.04981v1
- Date: Sun, 8 Oct 2023 03:07:14 GMT
- Title: Compositional Semantics for Open Vocabulary Spatio-semantic
Representations
- Authors: Robin Karlsson, Francisco Lepe-Salazar, Kazuya Takeda
- Abstract summary: General-purpose mobile robots need to complete tasks without exact human instructions.
We propose latent compositional semantic embeddings z* as a principled learning-based knowledge representation for queryable spatio-semantic memories.
We demonstrate that a simple dense VLM trained on the COCO-Stuff dataset can learn z* for 181 overlapping semantics, achieving 42.23 mIoU.
- Score: 4.045603788443984
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: General-purpose mobile robots need to complete tasks without exact human
instructions. Large language models (LLMs) are a promising direction for
realizing commonsense world knowledge and reasoning-based planning.
Vision-language models (VLMs) transform environment percepts into
vision-language semantics interpretable by LLMs. However, completing complex
tasks often requires reasoning about information beyond what is currently
perceived. We propose latent compositional semantic embeddings z* as a
principled learning-based knowledge representation for queryable
spatio-semantic memories. We mathematically prove that z* can always be found,
and the optimal z* is the centroid of any set Z. We derive a probabilistic
bound for estimating separability of related and unrelated semantics. We prove
that z* is discoverable via iterative gradient-descent optimization from
visual appearance and singular descriptions. We experimentally verify our
findings on four embedding spaces, including CLIP and SBERT. Our results show that
z* can represent up to 10 semantics encoded by SBERT, and up to 100 semantics
for ideal uniformly distributed high-dimensional embeddings. We demonstrate
that a simple dense VLM trained on the COCO-Stuff dataset can learn z* for 181
overlapping semantics, achieving 42.23 mIoU, while improving conventional
non-overlapping open-vocabulary segmentation performance by +3.48 mIoU compared
with a popular SOTA model.
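The abstract's two central claims about z*, that the optimal compositional embedding for a set Z is its centroid and that z* is also discoverable by gradient descent, can be illustrated with a minimal sketch. The snippet below is not the authors' implementation: it substitutes random unit vectors for CLIP/SBERT embeddings and uses a plain mean-squared-distance objective, so it only demonstrates the geometric claim, not the paper's training from visual appearance and singular descriptions.
```python
# Minimal sketch (illustrative only): the minimizer of the mean squared distance
# to a set Z of embeddings is the centroid of Z, and plain gradient descent
# recovers the same point. Random unit vectors stand in for CLIP/SBERT embeddings.
import numpy as np

rng = np.random.default_rng(0)

def normalize(v):
    # Project onto the unit hypersphere, as is common for CLIP/SBERT embeddings.
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

D, N = 512, 10                      # embedding dimension, number of semantics in Z
Z = normalize(rng.normal(size=(N, D)))

# Closed-form candidate: the centroid of Z minimizes (1/N) * sum_i ||z - z_i||^2.
z_star = Z.mean(axis=0)

# Iterative candidate: gradient descent on the same objective.
# Its gradient with respect to z is 2 * (z - mean(Z)).
z = np.zeros(D)
lr = 0.1
for _ in range(500):
    grad = 2.0 * (z - Z.mean(axis=0))
    z -= lr * grad

print(np.allclose(z, z_star, atol=1e-6))  # True: both routes agree
```
In the paper, the same construction is evaluated on real embedding spaces, where z* is optimized from visual appearance and singular descriptions rather than synthetic vectors.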
Related papers
- OTFusion: Bridging Vision-only and Vision-Language Models via Optimal Transport for Transductive Zero-Shot Learning [5.818420448447699]
OTFusion aims to learn a shared probabilistic representation that aligns visual and semantic information.
OTFusion consistently outperforms the original CLIP model, achieving an average accuracy improvement of nearly 10%.
arXiv Detail & Related papers (2025-06-16T17:27:47Z) - Latent BKI: Open-Dictionary Continuous Mapping in Visual-Language Latent Spaces with Quantifiable Uncertainty [6.986230616834552]
This paper introduces a novel probabilistic mapping algorithm, Latent BKI, which enables open-vocabulary mapping with quantifiable uncertainty.
Latent BKI is evaluated against similar explicit semantic mapping and VL mapping frameworks on the popular MatterPort-3D and Semantic KITTI data sets.
Real-world experiments demonstrate applicability to challenging indoor environments.
arXiv Detail & Related papers (2024-10-15T17:02:32Z) - Epsilon: Exploring Comprehensive Visual-Semantic Projection for Multi-Label Zero-Shot Learning [23.96220607033524]
This paper investigates the challenging problem of zero-shot learning in the multi-label scenario (MLZSL).
The task is to recognize multiple unseen classes within a sample based on seen classes and auxiliary knowledge.
We propose a novel and comprehensive visual-semantic framework for MLZSL, dubbed Epsilon, to fully make use of such properties.
arXiv Detail & Related papers (2024-08-22T09:45:24Z) - Disentangling Dense Embeddings with Sparse Autoencoders [0.0]
Sparse autoencoders (SAEs) have shown promise in extracting interpretable features from complex neural networks.
We present one of the first applications of SAEs to dense text embeddings from large language models.
We show that the resulting sparse representations maintain semantic fidelity while offering interpretability.
arXiv Detail & Related papers (2024-08-01T15:46:22Z) - SUGARCREPE++ Dataset: Vision-Language Model Sensitivity to Semantic and Lexical Alterations [13.608653575298183]
We introduce the SUGARCREPE++ dataset to analyze the sensitivity of vision-and-language models to semantic alterations.
We show that models which achieve better performance on compositionality datasets do not necessarily perform equally well on SUGARCREPE++.
arXiv Detail & Related papers (2024-06-17T03:22:20Z) - Quantifying Semantic Emergence in Language Models [31.608080868988825]
Large language models (LLMs) are widely recognized for their exceptional capacity to capture semantic meaning.
In this work, we introduce a quantitative metric, Information Emergence (IE), designed to measure LLMs' ability to extract semantics from input tokens.
arXiv Detail & Related papers (2024-05-21T09:12:20Z) - Progressive Semantic-Guided Vision Transformer for Zero-Shot Learning [56.65891462413187]
We propose a progressive semantic-guided vision transformer for zero-shot learning (dubbed ZSLViT).
ZSLViT first introduces semantic-embedded token learning to improve the visual-semantic correspondences via semantic enhancement.
Then, we fuse visual tokens with low semantic-visual correspondence to discard semantically unrelated visual information for visual enhancement.
arXiv Detail & Related papers (2024-04-11T12:59:38Z) - SemGrasp: Semantic Grasp Generation via Language Aligned Discretization [53.43801984965309]
This paper presents a novel semantic-based grasp generation method, termed SemGrasp.
We introduce a discrete representation that aligns the grasp space with semantic space, enabling the generation of grasp postures.
A Multimodal Large Language Model (MLLM) is subsequently fine-tuned, integrating object, grasp, and language within a unified semantic space.
arXiv Detail & Related papers (2024-04-04T16:58:26Z) - Fantastic Semantics and Where to Find Them: Investigating Which Layers of Generative LLMs Reflect Lexical Semantics [50.982315553104975]
We investigate the bottom-up evolution of lexical semantics for a popular large language model, namely Llama2.
Our experiments show that the representations in lower layers encode lexical semantics, while the higher layers, with weaker semantic induction, are responsible for prediction.
This is in contrast to models with discriminative objectives, such as masked language modeling, where the higher layers obtain better lexical semantics.
arXiv Detail & Related papers (2024-03-03T13:14:47Z) - SEER-ZSL: Semantic Encoder-Enhanced Representations for Generalized Zero-Shot Learning [0.6792605600335813]
Zero-Shot Learning (ZSL) presents the challenge of identifying categories not seen during training.
We introduce Semantic Encoder-Enhanced Representations for Zero-Shot Learning (SEER-ZSL).
First, we aim to distill meaningful semantic information using a probabilistic encoder, enhancing the semantic consistency and robustness.
Second, we distill the visual space by exploiting the learned data distribution through an adversarially trained generator. Third, we align the distilled information, enabling a mapping of unseen categories onto the true data manifold.
arXiv Detail & Related papers (2023-12-20T15:18:51Z) - Towards Realistic Zero-Shot Classification via Self Structural Semantic
Alignment [53.2701026843921]
Large-scale pre-trained Vision Language Models (VLMs) have proven effective for zero-shot classification.
In this paper, we aim at a more challenging setting, Realistic Zero-Shot Classification, which assumes no annotation but instead a broad vocabulary.
We propose the Self Structural Semantic Alignment (S3A) framework, which extracts structural semantic information from unlabeled data while simultaneously self-learning.
arXiv Detail & Related papers (2023-08-24T17:56:46Z) - Semantic-aware Contrastive Learning for More Accurate Semantic Parsing [32.74456368167872]
We propose a semantic-aware contrastive learning algorithm, which can learn to distinguish fine-grained meaning representations.
Experiments on two standard datasets show that our approach achieves significant improvements over MLE baselines.
arXiv Detail & Related papers (2023-01-19T07:04:32Z) - Cross-modal Representation Learning for Zero-shot Action Recognition [67.57406812235767]
We present a cross-modal Transformer-based framework, which jointly encodes video data and text labels for zero-shot action recognition (ZSAR).
Our model employs a conceptually new pipeline by which visual representations are learned in conjunction with visual-semantic associations in an end-to-end manner.
Experimental results show our model considerably improves upon the state of the art in ZSAR, reaching encouraging top-1 accuracy on the UCF101, HMDB51, and ActivityNet benchmark datasets.
arXiv Detail & Related papers (2022-05-03T17:39:27Z) - Integrating Language Guidance into Vision-based Deep Metric Learning [78.18860829585182]
We propose to learn metric spaces which encode semantic similarities as embedding space distances.
These spaces should be transferable to classes beyond those seen during training.
Without language guidance, learned embedding spaces encode incomplete semantic context and misrepresent the semantic relation between classes.
arXiv Detail & Related papers (2022-03-16T11:06:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.