SemGrasp: Semantic Grasp Generation via Language Aligned Discretization
- URL: http://arxiv.org/abs/2404.03590v1
- Date: Thu, 4 Apr 2024 16:58:26 GMT
- Title: SemGrasp: Semantic Grasp Generation via Language Aligned Discretization
- Authors: Kailin Li, Jingbo Wang, Lixin Yang, Cewu Lu, Bo Dai
- Abstract summary: This paper presents a novel semantic-based grasp generation method, termed SemGrasp.
We introduce a discrete representation that aligns the grasp space with the semantic space, enabling the generation of grasp postures in accordance with language instructions.
A Multimodal Large Language Model (MLLM) is subsequently fine-tuned, integrating object, grasp, and language within a unified semantic space.
- Score: 53.43801984965309
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating natural human grasps necessitates consideration of not just object geometry but also semantic information. Solely depending on object shape for grasp generation confines the applications of prior methods in downstream tasks. This paper presents a novel semantic-based grasp generation method, termed SemGrasp, which generates a static human grasp pose by incorporating semantic information into the grasp representation. We introduce a discrete representation that aligns the grasp space with semantic space, enabling the generation of grasp postures in accordance with language instructions. A Multimodal Large Language Model (MLLM) is subsequently fine-tuned, integrating object, grasp, and language within a unified semantic space. To facilitate the training of SemGrasp, we have compiled a large-scale, grasp-text-aligned dataset named CapGrasp, featuring about 260k detailed captions and 50k diverse grasps. Experimental findings demonstrate that SemGrasp efficiently generates natural human grasps in alignment with linguistic intentions. Our code, models, and dataset are available publicly at: https://kailinli.github.io/SemGrasp.
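The abstract includes no code, but its central mechanism, quantizing continuous grasp poses into discrete tokens that share a vocabulary with language, can be illustrated with a minimal sketch. This is not the authors' implementation: the codebook below is random rather than learned, and POSE_DIM, CODEBOOK_SIZE, and the <grasp_k> token naming are hypothetical.

```python
# Minimal sketch (not the SemGrasp code): mapping a continuous grasp
# pose into a discrete token via nearest-neighbor codebook lookup, one
# plausible way to realize a "language-aligned discretization".
import numpy as np

rng = np.random.default_rng(0)

POSE_DIM = 48        # e.g., MANO-style hand pose parameters (assumption)
CODEBOOK_SIZE = 256  # number of discrete grasp tokens (assumption)

# A learned codebook would come from VQ-style training; here it is random.
codebook = rng.normal(size=(CODEBOOK_SIZE, POSE_DIM))

def grasp_to_token(pose: np.ndarray) -> int:
    """Quantize a continuous grasp pose to its nearest codebook entry."""
    dists = np.linalg.norm(codebook - pose, axis=1)
    return int(np.argmin(dists))

def token_to_grasp(token_id: int) -> np.ndarray:
    """Decode a discrete grasp token back to a (coarse) pose."""
    return codebook[token_id]

# With such a mapping, a grasp becomes a short token sequence, e.g.
# "<grasp_17>", which an MLLM fine-tuned on (object, instruction, grasp)
# triples can emit like any other vocabulary item.
pose = rng.normal(size=POSE_DIM)
tok = grasp_to_token(pose)
recon = token_to_grasp(tok)
print(f"grasp token: <grasp_{tok}>, reconstruction error: "
      f"{np.linalg.norm(pose - recon):.3f}")
```

In the paper's framing, tokens like these live in the same vocabulary as ordinary text, which is what lets a fine-tuned MLLM integrate object, grasp, and language in one semantic space.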
Related papers
- GROUNDHOG: Grounding Large Language Models to Holistic Segmentation [22.347590874621865]
We introduce GROUNDHOG, an MLLM developed by grounding Large Language Models to holistic segmentation.
GROUNDHOG incorporates a masked feature extractor and converts extracted features into visual entity tokens for the MLLM backbone.
Our experimental results show that GROUNDHOG achieves superior performance on various language grounding tasks without task-specific fine-tuning.
arXiv Detail & Related papers (2024-02-26T18:59:33Z)
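As a rough illustration of the summary above (not GROUNDHOG's actual code), masked average pooling shows how per-entity segmentation masks can be turned into entity-level embeddings for a language-model backbone; all shapes and the entity_tokens helper are hypothetical.

```python
# Minimal sketch: pool dense visual features under each entity mask to
# produce one "visual entity token" per entity. Shapes are assumptions.
import numpy as np

rng = np.random.default_rng(1)

H, W, C = 16, 16, 64                     # feature-map size and channels
features = rng.normal(size=(H, W, C))    # dense visual features
masks = rng.random((3, H, W)) > 0.7      # 3 binary entity masks (toy)

def entity_tokens(features: np.ndarray, masks: np.ndarray) -> np.ndarray:
    """Average-pool features under each mask -> one token per entity."""
    tokens = []
    for m in masks:
        denom = max(m.sum(), 1)                      # avoid divide-by-zero
        tokens.append((features * m[..., None]).sum(axis=(0, 1)) / denom)
    return np.stack(tokens)                          # (num_entities, C)

tok = entity_tokens(features, masks)
print(tok.shape)  # (3, 64) -- one embedding per grounded visual entity
```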
- SemiVL: Semi-Supervised Semantic Segmentation with Vision-Language Guidance [97.00445262074595]
In SemiVL, we propose to integrate rich priors from vision-language models into semi-supervised semantic segmentation.
We design a language-guided decoder to jointly reason over vision and language.
We evaluate SemiVL on 4 semantic segmentation datasets, where it significantly outperforms previous semi-supervised methods.
arXiv Detail & Related papers (2023-11-27T19:00:06Z)
- Generating Data for Symbolic Language with Large Language Models [16.529863710055004]
Large language models (LLMs) have been developed to generate data for natural language tasks.
In this paper, we propose SymGen which utilizes LLMs for generating various annotation-expensive symbolic language data.
We show that generated data with only a few human demonstrations can be as effective as over 10 times the amount of human-annotated data when training the task model.
arXiv Detail & Related papers (2023-05-23T10:44:00Z)
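A hedged sketch of the general recipe this summary describes (not the SymGen implementation): a few human demonstrations are formatted into a prompt, and an LLM completes new symbolic-language pairs. The demos, build_prompt, and the call_llm stand-in are all invented for illustration.

```python
# Minimal sketch: few-shot prompt assembly for generating symbolic
# (here, SQL-like) training data from a handful of demonstrations.
demos = [
    ("list flights from boston to denver",
     "SELECT * FROM flights WHERE origin='BOS' AND dest='DEN'"),
    ("show airlines flying to dallas",
     "SELECT DISTINCT airline FROM flights WHERE dest='DFW'"),
]

def build_prompt(demos, new_utterance: str) -> str:
    """Format demonstrations as input/output pairs, then the new input."""
    lines = []
    for utterance, program in demos:
        lines.append(f"Input: {utterance}\nOutput: {program}\n")
    lines.append(f"Input: {new_utterance}\nOutput:")
    return "\n".join(lines)

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real model client."""
    raise NotImplementedError

prompt = build_prompt(demos, "list morning flights to seattle")
print(prompt)  # the model's completion would become a new training pair
```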
- Linking Emergent and Natural Languages via Corpus Transfer [98.98724497178247]
We propose a novel way to establish a link by corpus transfer between emergent languages and natural languages.
Our approach showcases non-trivial transfer benefits for two different tasks -- language modeling and image captioning.
We also introduce a novel metric to predict the transferability of an emergent language by translating emergent messages to natural language captions grounded on the same images.
arXiv Detail & Related papers (2022-03-24T21:24:54Z)
- Integrating Language Guidance into Vision-based Deep Metric Learning [78.18860829585182]
We propose to learn metric spaces that encode semantic similarities in the embedding space.
These spaces should be transferable to classes beyond those seen during training.
However, relying on class labels alone causes learned embedding spaces to encode incomplete semantic context and to misrepresent the semantic relations between classes.
arXiv Detail & Related papers (2022-03-16T11:06:50Z)
- Constrained Language Models Yield Few-Shot Semantic Parsers [73.50960967598654]
We explore the use of large pretrained language models as few-shot semantic parsers.
The goal in semantic parsing is to generate a structured meaning representation given a natural language input.
We use language models to paraphrase inputs into a controlled sublanguage resembling English that can be automatically mapped to a target meaning representation.
arXiv Detail & Related papers (2021-04-18T08:13:06Z)
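To make the pipeline concrete, here is a minimal sketch (not the paper's system) of the deterministic half: a toy controlled sublanguage whose canonical utterances map mechanically to meaning representations. The grammar and the (set-state ...) targets are invented examples.

```python
# Minimal sketch: canonical utterances in a controlled sublanguage map
# deterministically to meaning representations. In the paper's setup,
# a constrained LM first paraphrases free-form input into such forms.
import re

CANONICAL_RULES = [
    (re.compile(r"^turn (on|off) the (\w+)$"),
     lambda m: f"(set-state {m.group(2)} {m.group(1)})"),
    (re.compile(r"^what is the state of the (\w+)$"),
     lambda m: f"(get-state {m.group(1)})"),
]

def canonical_to_mr(utterance: str) -> str:
    """Map a canonical utterance to its meaning representation."""
    for pattern, build in CANONICAL_RULES:
        m = pattern.match(utterance)
        if m:
            return build(m)
    raise ValueError(f"not in the controlled sublanguage: {utterance!r}")

# A constrained decoder would only ever emit utterances this grammar
# accepts, e.g. after paraphrasing "please switch the lamp on":
print(canonical_to_mr("turn on the lamp"))  # (set-state lamp on)
```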
- Infusing Finetuning with Semantic Dependencies [62.37697048781823]
We show that, unlike syntax, semantics is not brought to the surface by today's pretrained models.
We then use convolutional graph encoders to explicitly incorporate semantic parses into task-specific finetuning.
arXiv Detail & Related papers (2020-12-10T01:27:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.