Towards Open-World Grasping with Large Vision-Language Models
- URL: http://arxiv.org/abs/2406.18722v4
- Date: Sun, 13 Oct 2024 15:19:58 GMT
- Title: Towards Open-World Grasping with Large Vision-Language Models
- Authors: Georgios Tziafas, Hamidreza Kasaei
- Abstract summary: An open-world grasping system should be able to combine high-level contextual reasoning with low-level physical-geometric reasoning.
We propose OWG, an open-world grasping pipeline that combines vision-language models with segmentation and grasp synthesis models.
We conduct evaluation in cluttered indoor scene datasets to showcase OWG's robustness in grounding from open-ended language.
- Score: 5.317624228510749
- Abstract: The ability to grasp objects in-the-wild from open-ended language instructions constitutes a fundamental challenge in robotics. An open-world grasping system should be able to combine high-level contextual reasoning with low-level physical-geometric reasoning in order to be applicable in arbitrary scenarios. Recent works exploit the web-scale knowledge inherent in large language models (LLMs) to plan and reason in a robotic context, but rely on external vision and action models to ground such knowledge into the environment and parameterize actuation. This setup suffers from two major bottlenecks: a) the LLM's reasoning capacity is constrained by the quality of visual grounding, and b) LLMs do not contain low-level spatial understanding of the world, which is essential for grasping in contact-rich scenarios. In this work we demonstrate that modern vision-language models (VLMs) are capable of tackling such limitations, as they are implicitly grounded and can jointly reason about semantics and geometry. We propose OWG, an open-world grasping pipeline that combines VLMs with segmentation and grasp synthesis models to unlock grounded world understanding in three stages: open-ended referring segmentation, grounded grasp planning, and grasp ranking via contact reasoning, all of which can be applied zero-shot via suitable visual prompting mechanisms. We conduct extensive evaluation on cluttered indoor scene datasets to showcase OWG's robustness in grounding from open-ended language, as well as open-world robotic grasping experiments in both simulation and hardware that demonstrate superior performance compared to previous supervised and zero-shot LLM-based methods. Project material is available at https://gtziafas.github.io/OWG_project/.
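The three stages named in the abstract can be sketched as a simple pipeline. This is a minimal illustration only: every function and model interface below (`segmenter`, `vlm.ground`, `vlm.plan`, `vlm.rank`, `grasp_synth`) is a hypothetical placeholder, not the authors' actual API.

```python
# Hedged sketch of an OWG-style three-stage grasping pipeline.
# All callables are injected stand-ins for the real models.

def owg_pipeline(image, instruction, segmenter, vlm, grasp_synth):
    # Stage 1: open-ended referring segmentation -- propose class-agnostic
    # masks, then let the VLM select the one the instruction refers to.
    masks = segmenter(image)
    target = vlm.ground(image, masks, instruction)

    # Stage 2: grounded grasp planning -- the VLM decides whether the
    # target is directly graspable or blocking objects must go first.
    plan = vlm.plan(image, target)  # e.g. ["remove cup", "grasp target"]

    # Stage 3: grasp ranking via contact reasoning -- synthesize grasp
    # candidates and have the VLM rank them by contact feasibility.
    candidates = grasp_synth(image, target)
    ranked = vlm.rank(image, target, candidates)
    return plan, ranked
```

The key design point the abstract emphasizes is that a single VLM handles grounding, planning, and ranking zero-shot via visual prompting, rather than delegating grounding to a separate supervised model.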
Related papers
- Learning to Ground VLMs without Forgetting [54.033346088090674]
We introduce LynX, a framework that equips pretrained Visual Language Models with visual grounding ability without forgetting their existing image and language understanding skills.
To train the model effectively, we generate a high-quality synthetic dataset we call SCouT, which mimics human reasoning in visual grounding.
We evaluate LynX on several object detection and visual grounding datasets, demonstrating strong performance in object detection, zero-shot localization and grounded reasoning.
arXiv Detail & Related papers (2024-10-14T13:35:47Z) - Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
Multimodal large language models (MLLMs) have made significant strides by training on vast high-quality image-text datasets.
However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs.
This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
arXiv Detail & Related papers (2024-07-05T17:43:30Z) - Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [56.391404083287235]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach.
Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations.
We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z) - Grounding Language Plans in Demonstrations Through Counterfactual Perturbations [25.19071357445557]
Grounding the common-sense reasoning of Large Language Models (LLMs) in physical domains remains a pivotal yet unsolved problem for embodied AI.
We show our approach improves the interpretability and reactivity of imitation learning through 2D navigation and simulated and real robot manipulation tasks.
arXiv Detail & Related papers (2024-03-25T19:04:59Z) - Look Before You Leap: Unveiling the Power of GPT-4V in Robotic Vision-Language Planning [32.045840007623276]
We introduce Robotic Vision-Language Planning (ViLa), a novel approach for long-horizon robotic planning.
ViLa directly integrates perceptual data into its reasoning and planning process.
Our evaluation, conducted in both real-robot and simulated environments, demonstrates ViLa's superiority over existing LLM-based planners.
arXiv Detail & Related papers (2023-11-29T17:46:25Z) - Language Models as Zero-Shot Trajectory Generators [10.572264780575564]
Large Language Models (LLMs) have recently shown promise as high-level planners for robots.
It is often assumed that LLMs do not possess sufficient knowledge to be used for the low-level trajectories themselves.
This work investigates if an LLM can directly predict a dense sequence of end-effector poses for manipulation tasks.
arXiv Detail & Related papers (2023-10-17T21:57:36Z) - LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent [23.134180979449823]
3D visual grounding is a critical skill for household robots, enabling them to navigate, manipulate objects, and answer questions based on their environment.
We propose LLM-Grounder, a novel zero-shot, open-vocabulary, Large Language Model (LLM)-based 3D visual grounding pipeline.
Our findings indicate that LLMs significantly improve the grounding capability, especially for complex language queries.
arXiv Detail & Related papers (2023-09-21T17:59:45Z) - VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models [38.503337052122234]
Large language models (LLMs) are shown to possess a wealth of actionable knowledge that can be extracted for robot manipulation.
We aim to synthesize robot trajectories for a variety of manipulation tasks given an open-set of instructions and an open-set of objects.
We demonstrate how the proposed framework can benefit from online experiences by efficiently learning a dynamics model for scenes that involve contact-rich interactions.
arXiv Detail & Related papers (2023-07-12T07:40:48Z) - VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks [81.32968995346775]
VisionLLM is a framework for vision-centric tasks that can be flexibly defined and managed using language instructions.
Our model can achieve over 60% mAP on COCO, on par with detection-specific models.
arXiv Detail & Related papers (2023-05-18T17:59:42Z) - Grounded Decoding: Guiding Text Generation with Grounded Models for Embodied Agents [111.15288256221764]
The Grounded Decoding project aims to solve complex, long-horizon tasks in a robotic setting by leveraging the knowledge of both a language model and a set of grounded models.
We frame this as a problem similar to probabilistic filtering: decode a sequence that has high probability under the language model and high probability under a set of grounded model objectives.
We demonstrate how such grounded models can be obtained across three simulation and real-world domains, and show that the proposed decoding strategy solves such tasks by combining the knowledge of both kinds of models.
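The probabilistic-filtering framing can be illustrated with a minimal greedy decoder that, at each step, picks the token maximizing the language-model log-probability plus the sum of grounded-objective log-probabilities. The scoring functions and vocabulary below are hypothetical stand-ins, not the paper's implementation:

```python
# Sketch of grounded decoding as probabilistic filtering: combine
# log-scores from the language model and grounded objectives per token.

def grounded_decode(lm_logprob, grounded_fns, vocab, max_len):
    seq = []
    for _ in range(max_len):
        # Combined score: log p_LM(token | seq) + sum of grounded log-probs.
        scores = {
            tok: lm_logprob(seq, tok) + sum(g(seq, tok) for g in grounded_fns)
            for tok in vocab
        }
        best = max(scores, key=scores.get)
        seq.append(best)
        if best == "<eos>":
            break
    return seq
```

A grounded objective (e.g. an affordance or safety model) can thus veto tokens the language model alone would prefer, simply by assigning them a very low log-probability.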
arXiv Detail & Related papers (2023-03-01T22:58:50Z) - Inner Monologue: Embodied Reasoning through Planning with Language Models [81.07216635735571]
Large Language Models (LLMs) can be applied to domains beyond natural language processing.
LLMs planning in embodied environments need to consider not just what skills to do, but also how and when to do them.
We propose that by leveraging environment feedback, LLMs are able to form an inner monologue that allows them to more richly process and plan in robotic control scenarios.
arXiv Detail & Related papers (2022-07-12T15:20:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.