Compositional Semantics for Open Vocabulary Spatio-semantic
Representations
- URL: http://arxiv.org/abs/2310.04981v1
- Date: Sun, 8 Oct 2023 03:07:14 GMT
- Title: Compositional Semantics for Open Vocabulary Spatio-semantic
Representations
- Authors: Robin Karlsson, Francisco Lepe-Salazar, Kazuya Takeda
- Abstract summary: General-purpose mobile robots need to complete tasks without exact human instructions.
We propose latent compositional semantic embeddings z* as a principled learning-based knowledge representation for queryable spatio-semantic memories.
We demonstrate that a simple dense VLM trained on the COCO-Stuff dataset can learn z* for 181 overlapping semantics, achieving 42.23 mIoU.
- Score: 4.045603788443984
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: General-purpose mobile robots need to complete tasks without exact human
instructions. Large language models (LLMs) are a promising direction for
realizing commonsense world knowledge and reasoning-based planning.
Vision-language models (VLMs) transform environment percepts into
vision-language semantics interpretable by LLMs. However, completing complex
tasks often requires reasoning about information beyond what is currently
perceived. We propose latent compositional semantic embeddings z* as a
principled learning-based knowledge representation for queryable
spatio-semantic memories. We mathematically prove that z* can always be found,
and the optimal z* is the centroid for any set Z. We derive a probabilistic
bound for estimating separability of related and unrelated semantics. We prove
that z* is discoverable by iterative optimization by gradient descent from
visual appearance and singular descriptions. We experimentally verify our
findings on four embedding spaces, including CLIP and SBERT. Our results show that
z* can represent up to 10 semantics encoded by SBERT, and up to 100 semantics
for ideal uniformly distributed high-dimensional embeddings. We demonstrate
that a simple dense VLM trained on the COCO-Stuff dataset can learn z* for 181
overlapping semantics, achieving 42.23 mIoU, while improving conventional
non-overlapping open-vocabulary segmentation performance by +3.48 mIoU compared
with a popular SOTA model.
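The two core claims of the abstract can be sketched in code. The following is a minimal illustration, not the authors' implementation: it assumes an off-the-shelf SBERT checkpoint from the sentence-transformers library, and the checkpoint name, description set, learning rate, and step count are illustrative assumptions only.

```python
import torch
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

# Encode a set Z of related semantic descriptions with SBERT
# (checkpoint name and descriptions are illustrative assumptions, not the paper's setup).
model = SentenceTransformer("all-MiniLM-L6-v2")
descriptions = ["road", "sidewalk", "crosswalk"]
Z = torch.tensor(model.encode(descriptions, normalize_embeddings=True))

# Claim 1: the optimal compositional embedding z* for the set Z is its centroid
# (re-normalized here, since similarity is measured on the unit hypersphere).
z_star_centroid = F.normalize(Z.mean(dim=0), dim=0)

# Claim 2: z* is also discoverable by iterative gradient descent, maximizing
# the mean cosine similarity to every member of Z from a random initialization.
z = torch.randn(Z.shape[1], requires_grad=True)
optimizer = torch.optim.Adam([z], lr=1e-2)
for _ in range(500):
    optimizer.zero_grad()
    loss = -F.cosine_similarity(z.unsqueeze(0), Z, dim=1).mean()
    loss.backward()
    optimizer.step()

z_star_gd = F.normalize(z.detach(), dim=0)
# Both routes should agree (cosine similarity close to 1.0).
print(F.cosine_similarity(z_star_centroid, z_star_gd, dim=0).item())
```

Since cosine similarity is scale-invariant, the gradient-descent route converges to the same direction as the closed-form centroid, mirroring the abstract's claim that z* is both a closed-form optimum and discoverable by iterative optimization.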
Related papers
- Unified Lexical Representation for Interpretable Visual-Language Alignment [52.059812317944434]
We introduce LexVLA, a more interpretable VLA framework that learns a unified lexical representation for both modalities without complex design.
We demonstrate that these two pre-trained uni-modal models can be well aligned by fine-tuning on a modest multi-modal dataset.
arXiv Detail & Related papers (2024-07-25T07:35:27Z)
- SUGARCREPE++ Dataset: Vision-Language Model Sensitivity to Semantic and Lexical Alterations [13.608653575298183]
We introduce the SUGARCREPE++ dataset to analyze the sensitivity of vision-and-language models to semantic alterations.
We show that models which achieve better performance on compositionality datasets do not necessarily perform equally well on SUGARCREPE++.
arXiv Detail & Related papers (2024-06-17T03:22:20Z)
- SemGrasp: Semantic Grasp Generation via Language Aligned Discretization [53.43801984965309]
This paper presents a novel semantic-based grasp generation method, termed SemGrasp.
We introduce a discrete representation that aligns the grasp space with semantic space, enabling the generation of grasp postures.
A Multimodal Large Language Model (MLLM) is subsequently fine-tuned, integrating object, grasp, and language within a unified semantic space.
arXiv Detail & Related papers (2024-04-04T16:58:26Z)
- Fantastic Semantics and Where to Find Them: Investigating Which Layers of Generative LLMs Reflect Lexical Semantics [50.982315553104975]
We investigate the bottom-up evolution of lexical semantics for a popular large language model, namely Llama2.
Our experiments show that the representations in lower layers encode lexical semantics, while the higher layers, with weaker semantic induction, are responsible for prediction.
This is in contrast to models with discriminative objectives, such as masked language modeling, where the higher layers obtain better lexical semantics.
arXiv Detail & Related papers (2024-03-03T13:14:47Z)
- Towards Realistic Zero-Shot Classification via Self Structural Semantic Alignment [53.2701026843921]
Large-scale pre-trained Vision Language Models (VLMs) have proven effective for zero-shot classification.
In this paper, we aim at a more challenging setting, Realistic Zero-Shot Classification, which assumes no annotation but instead a broad vocabulary.
We propose the Self Structural Semantic Alignment (S3A) framework, which extracts structural semantic information from unlabeled data while simultaneously self-learning.
arXiv Detail & Related papers (2023-08-24T17:56:46Z)
- Semantic-aware Contrastive Learning for More Accurate Semantic Parsing [32.74456368167872]
We propose a semantic-aware contrastive learning algorithm, which can learn to distinguish fine-grained meaning representations.
Experiments on two standard datasets show that our approach achieves significant improvements over MLE baselines.
arXiv Detail & Related papers (2023-01-19T07:04:32Z)
- Relate to Predict: Towards Task-Independent Knowledge Representations for Reinforcement Learning [11.245432408899092]
Reinforcement Learning can enable agents to learn complex tasks.
However, the learned knowledge is difficult to interpret and reuse across tasks.
In this paper, we introduce an inductive bias for explicit object-centered knowledge separation.
We show that the degree of explicitness in knowledge separation correlates with faster learning, better accuracy, better generalization, and better interpretability.
arXiv Detail & Related papers (2022-12-10T13:33:56Z)
- Cross-modal Representation Learning for Zero-shot Action Recognition [67.57406812235767]
We present a cross-modal Transformer-based framework, which jointly encodes video data and text labels for zero-shot action recognition (ZSAR).
Our model employs a conceptually new pipeline by which visual representations are learned in conjunction with visual-semantic associations in an end-to-end manner.
Experiment results show our model considerably improves upon the state of the art in ZSAR, reaching encouraging top-1 accuracy on the UCF101, HMDB51, and ActivityNet benchmark datasets.
arXiv Detail & Related papers (2022-05-03T17:39:27Z)
- Integrating Language Guidance into Vision-based Deep Metric Learning [78.18860829585182]
We propose to learn metric spaces which encode semantic similarities as embedding space distances.
These spaces should be transferable to classes beyond those seen during training.
Purely visual training, however, causes learned embedding spaces to encode incomplete semantic context and misrepresent the semantic relations between classes.
arXiv Detail & Related papers (2022-03-16T11:06:50Z)
- Bias-Eliminated Semantic Refinement for Any-Shot Learning [27.374052527155623]
We refine the coarse-grained semantic description for any-shot learning tasks.
A new model, namely, the semantic refinement Wasserstein generative adversarial network (SRWGAN) model, is designed.
We extensively evaluate model performance on six benchmark datasets.
arXiv Detail & Related papers (2022-02-10T04:15:50Z)