VLMgineer: Vision Language Models as Robotic Toolsmiths
- URL: http://arxiv.org/abs/2507.12644v1
- Date: Wed, 16 Jul 2025 21:30:05 GMT
- Title: VLMgineer: Vision Language Models as Robotic Toolsmiths
- Authors: George Jiayuan Gao, Tianyu Li, Junyao Shi, Yihan Li, Zizhe Zhang, Nadia Figueroa, Dinesh Jayaraman
- Abstract summary: We present VLMgineer, a framework that harnesses the code generation abilities of vision language models (VLMs) together with evolutionary search to iteratively co-design physical tools and the action plans that operate them to perform a task. We evaluate VLMgineer on a diverse new benchmark of everyday manipulation scenarios that demand creative tool design and use.
- Score: 17.25435891046774
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Tool design and use reflect the ability to understand and manipulate the physical world through creativity, planning, and foresight. As such, these capabilities are often regarded as measurable indicators of intelligence across biological species. While much of today's research on robotic intelligence focuses on generating better controllers, inventing smarter tools offers a complementary form of physical intelligence: shifting the onus of problem-solving onto the tool's design. Given the vast and impressive common-sense, reasoning, and creative capabilities of today's foundation models, we investigate whether these models can provide useful priors to automatically design and effectively wield such tools. We present VLMgineer, a framework that harnesses the code generation abilities of vision language models (VLMs) together with evolutionary search to iteratively co-design physical tools and the action plans that operate them to perform a task. We evaluate VLMgineer on a diverse new benchmark of everyday manipulation scenarios that demand creative tool design and use. Across this suite, VLMgineer consistently discovers tools and policies that solve tasks more effectively and innovatively, transforming challenging robotics problems into straightforward executions. It also outperforms VLM-generated designs from human specifications and existing human-crafted tools for everyday tasks. To facilitate future research on automated tool invention, we will release our benchmark and code.
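The abstract sketches an iterative co-design loop: a VLM proposes tool geometry and an accompanying action plan as code, and evolutionary search refines both against task reward in simulation. Below is a minimal, hypothetical sketch of such a loop; `propose_design`, `mutate`, and `evaluate_in_sim` are stand-ins for the VLM call, the variation operator, and the simulator rollout, and none of them reflect VLMgineer's actual API.

```python
import random
from dataclasses import dataclass

@dataclass
class Candidate:
    tool_spec: str      # parametric tool geometry, emitted by the VLM as code
    action_plan: str    # the robot motion that wields the tool
    fitness: float = float("-inf")

def propose_design(task: str, scene: str) -> Candidate:
    """Stand-in for a VLM call that writes a tool + plan from the scene description."""
    return Candidate(tool_spec="hook(length=0.30)", action_plan="reach; hook; drag")

def mutate(parent: Candidate) -> Candidate:
    """Stand-in evolutionary operator: perturb the tool, keep the plan."""
    length = round(random.uniform(0.10, 0.50), 2)
    return Candidate(tool_spec=f"hook(length={length})", action_plan=parent.action_plan)

def evaluate_in_sim(c: Candidate) -> float:
    """Stand-in simulator rollout returning a task reward."""
    return random.random()

def co_design(task: str, scene: str, generations: int = 5, pop_size: int = 4) -> Candidate:
    population = [propose_design(task, scene) for _ in range(pop_size)]
    for _ in range(generations):
        for c in population:
            c.fitness = evaluate_in_sim(c)
        population.sort(key=lambda c: c.fitness, reverse=True)
        elites = population[: pop_size // 2]    # keep the best tool/plan pairs
        children = [mutate(random.choice(elites)) for _ in range(pop_size - len(elites))]
        population = elites + children
    return max(population, key=lambda c: c.fitness)

best = co_design("sweep the cube into the bin", "a cube on a table, bin to the right")
print(best.tool_spec, "|", best.action_plan, "|", round(best.fitness, 3))
```

Keeping the tool specification and the action plan in one candidate is what makes this co-design rather than sequential design: a mutation to the tool is always re-scored together with the plan that wields it.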
Related papers
- ToolCoder: A Systematic Code-Empowered Tool Learning Framework for Large Language Models [81.12673534903979]
Tool learning has emerged as a crucial capability for large language models (LLMs) to solve complex real-world tasks through interaction with external tools.
We propose ToolCoder, a novel framework that reformulates tool learning as a code generation task.
arXiv Detail & Related papers (2025-02-17T03:42:28Z)
- LLM With Tools: A Survey [0.0]
This paper examines the methodology, challenges, and developments in teaching LLMs to use external tools.
We introduce a standardized paradigm for tool integration guided by a series of functions that map user instructions to actionable plans.
Our exploration reveals the various challenges encountered, such as tool invocation timing, selection accuracy, and the need for robust reasoning processes.
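The survey's paradigm of functions that map user instructions to actionable plans can be read as a small selection-then-planning pipeline. The toy sketch below is an illustrative reading of that paradigm, not the survey's formalism; every tool name and function in it is hypothetical.

```python
from typing import Dict, List

Plan = List[str]  # an ordered list of tool invocations the agent intends to run

def select_tool(instruction: str, tools: Dict[str, str]) -> str:
    """Toy selection rule: pick the first tool whose keyword appears in the instruction."""
    for keyword, name in tools.items():
        if keyword in instruction.lower():
            return name
    return "respond_directly"   # fall back to answering without a tool

def plan(instruction: str) -> Plan:
    """Map a user instruction to an actionable plan over available tools."""
    tools = {"weather": "weather_api", "search": "web_search"}
    tool = select_tool(instruction, tools)
    return [f"call {tool}", "integrate result into the final answer"]

print(plan("What is the weather in Paris?"))
# ['call weather_api', 'integrate result into the final answer']
```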
arXiv Detail & Related papers (2024-09-24T14:08:11Z)
- MOKA: Open-World Robotic Manipulation through Mark-Based Visual Prompting [97.52388851329667]
We introduce Marking Open-world Keypoint Affordances (MOKA) to solve robotic manipulation tasks specified by free-form language instructions.
Central to our approach is a compact point-based representation of affordance, which bridges the VLM's predictions on observed images and the robot's actions in the physical world.
We evaluate and analyze MOKA's performance on various table-top manipulation tasks including tool use, deformable body manipulation, and object rearrangement.
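MOKA's compact point-based representation of affordance suggests a handful of keypoints predicted on the observed image and then lifted into robot actions. The sketch below is a hypothetical rendering of that idea; the field names and the toy deprojection are assumptions, not MOKA's actual schema.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class PointAffordance:
    """Keypoints, in image pixels, that a VLM could mark on an observation."""
    grasp_px: Tuple[int, int]     # where to grasp the object or tool
    function_px: Tuple[int, int]  # the point that should make contact
    target_px: Tuple[int, int]    # where the interaction should end up

def to_action(aff: PointAffordance, deproject) -> dict:
    """Lift 2D keypoints to 3D via a camera deprojection and emit a motion goal."""
    return {
        "grasp": deproject(aff.grasp_px),
        "contact": deproject(aff.function_px),
        "goal": deproject(aff.target_px),
    }

# Toy deprojection: pretend pixels map linearly onto a table plane at z = 0.
action = to_action(
    PointAffordance(grasp_px=(120, 200), function_px=(150, 210), target_px=(300, 220)),
    deproject=lambda p: (p[0] * 0.001, p[1] * 0.001, 0.0),
)
print(action)
```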
arXiv Detail & Related papers (2024-03-05T18:08:45Z)
- RoboGen: Towards Unleashing Infinite Data for Automated Robot Learning via Generative Simulation [68.70755196744533]
RoboGen is a generative robotic agent that automatically learns diverse robotic skills at scale via generative simulation.
Our work attempts to extract the extensive and versatile knowledge embedded in large-scale models and transfer it to the field of robotics.
arXiv Detail & Related papers (2023-11-02T17:59:21Z)
- Creative Robot Tool Use with Large Language Models [47.11935262923095]
This paper investigates the feasibility of imbuing robots with the ability to creatively use tools in tasks that involve implicit physical constraints and long-term planning.
We develop RoboTool, a system that accepts natural language instructions and outputs executable code for controlling robots in both simulated and real-world environments.
arXiv Detail & Related papers (2023-10-19T18:02:15Z)
- Learning Generalizable Tool-use Skills through Trajectory Generation [13.879860388944214]
We train a single model on four different deformable object manipulation tasks.
The model generalizes to various novel tools, significantly outperforming baselines.
We further test our trained policy in the real world with unseen tools, where it achieves performance comparable to that of humans.
arXiv Detail & Related papers (2023-09-29T21:32:42Z)
- CREATOR: Tool Creation for Disentangling Abstract and Concrete Reasoning of Large Language Models [74.22729793816451]
Large Language Models (LLMs) have made significant progress in utilizing tools, but their ability is limited by API availability.
We propose CREATOR, a novel framework that enables LLMs to create their own tools using documentation and code realization.
We evaluate CREATOR on the MATH and TabMWP benchmarks, which consist of challenging math competition problems and tabular math word problems, respectively.
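CREATOR's split between documentation and code realization suggests a create-then-use pattern: the model first writes a reusable, documented tool (abstract reasoning), then executes it on the concrete problem instance. A minimal sketch under that reading, where the generated source string stands in for LLM output:

```python
# Stage 1 (creation): the LLM would emit a documented, reusable tool as source code.
created_tool_src = '''
def solve_quadratic(a: float, b: float, c: float) -> tuple:
    """Return the roots of a*x**2 + b*x + c = 0."""
    d = (b * b - 4 * a * c) ** 0.5
    return ((-b + d) / (2 * a), (-b - d) / (2 * a))
'''

# Stage 2 (realization): execute the generated code, then apply the tool to the instance.
namespace: dict = {}
exec(created_tool_src, namespace)              # realize the tool in a fresh namespace
roots = namespace["solve_quadratic"](1, -3, 2)
print(roots)  # (2.0, 1.0)
```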
arXiv Detail & Related papers (2023-05-23T17:51:52Z)
- Tool Learning with Foundation Models [158.8640687353623]
With the advent of foundation models, AI systems have the potential to become as adept at tool use as humans.
Despite this immense potential, the field still lacks a comprehensive understanding of its key challenges, opportunities, and future directions.
arXiv Detail & Related papers (2023-04-17T15:16:10Z)
- TANGO: Commonsense Generalization in Predicting Tool Interactions for Mobile Manipulators [15.61285199988595]
We introduce TANGO, a novel neural model for predicting task-specific tool interactions.
TANGO encodes the world state, consisting of objects and the symbolic relationships between them, using a graph neural network.
We show that augmenting this representation with pre-trained embeddings derived from a knowledge base allows the model to generalize effectively to novel environments.
arXiv Detail & Related papers (2021-05-05T18:11:57Z)
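TANGO's graph encoding of objects and their symbolic relations can be pictured as message passing over an object graph; the toy below illustrates one such step and is not TANGO's architecture. The object names, features, and relations are all invented for illustration.

```python
# World state as an object graph: nodes are objects with toy feature vectors,
# edges are symbolic relations between them.
nodes = {"mug": [1.0, 0.0], "table": [0.0, 1.0], "tray": [0.5, 0.5]}
edges = [("mug", "on", "table"), ("tray", "on", "table")]

def message_pass(nodes, edges):
    """One toy message-passing step: related objects exchange and sum features."""
    updated = {k: list(v) for k, v in nodes.items()}
    for src, _relation, dst in edges:
        for i, x in enumerate(nodes[src]):
            updated[dst][i] += x   # message along the relation
        for i, x in enumerate(nodes[dst]):
            updated[src][i] += x   # symmetric message back
    return updated

# A TANGO-style model would additionally concatenate pre-trained knowledge-base
# embeddings onto each node's features to help generalize to unseen objects.
print(message_pass(nodes, edges)["table"])  # the table aggregates mug and tray features
```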
This list is automatically generated from the titles and abstracts of the papers on this site.