RobotDesignGPT: Automated Robot Design Synthesis using Vision Language Models
- URL: http://arxiv.org/abs/2601.11801v1
- Date: Fri, 16 Jan 2026 22:04:49 GMT
- Title: RobotDesignGPT: Automated Robot Design Synthesis using Vision Language Models
- Authors: Nitish Sontakke, K. Niranjan Kumar, Sehoon Ha
- Abstract summary: We propose a novel automated robot design framework, RobotDesignGPT, to automate the robot design process. Our framework synthesizes an initial robot design from a simple user prompt and a reference image. We demonstrate that our framework can design visually appealing and kinematically valid robots inspired by nature, ranging from legged animals to flying creatures.
- Score: 8.028867584692312
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Robot design is a nontrivial process that involves careful consideration of multiple criteria, including user specifications, kinematic structures, and visual appearance. Therefore, the design process often relies heavily on domain expertise and significant human effort. The majority of current methods are rule-based, requiring the specification of a grammar or a set of primitive components and modules that can be composed to create a design. We propose a novel automated robot design framework, RobotDesignGPT, that leverages the general knowledge and reasoning capabilities of large pre-trained vision-language models to automate the robot design synthesis process. Our framework synthesizes an initial robot design from a simple user prompt and a reference image. Our novel visual feedback approach allows us to greatly improve the design quality and reduce unnecessary manual feedback. We demonstrate that our framework can design visually appealing and kinematically valid robots inspired by nature, ranging from legged animals to flying creatures. We justify the proposed framework by conducting an ablation study and a user study.
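The pipeline described above, a prompt plus reference image in and an iteratively refined design out, can be sketched as a simple loop. The sketch below is illustrative only, assuming placeholder `propose`, `render`, and `critique` callables and an "ACCEPT" convention that are not part of the paper; it shows the shape of a VLM-in-the-loop design cycle with visual feedback, not the authors' implementation.

```python
# A minimal sketch (not the authors' code) of a VLM-in-the-loop design cycle:
# the model proposes a robot description (e.g. URDF text), the design is rendered,
# and the rendering is fed back to the VLM for critique and revision.
from typing import Callable

def design_loop(
    user_prompt: str,
    reference_image: bytes,
    propose: Callable[[str, bytes], str],          # VLM: prompt + image -> design text (placeholder)
    render: Callable[[str], bytes],                # renderer: design text -> image (placeholder)
    critique: Callable[[bytes, bytes, str], str],  # VLM: rendering vs. reference -> feedback (placeholder)
    max_rounds: int = 5,
) -> str:
    """Iteratively refine a robot design using visual feedback."""
    design = propose(user_prompt, reference_image)
    for _ in range(max_rounds):
        rendering = render(design)
        feedback = critique(rendering, reference_image, user_prompt)
        if "ACCEPT" in feedback:                   # acceptance convention assumed for illustration
            break
        # Ask the VLM to revise its own design given the visual feedback.
        design = propose(f"{user_prompt}\nRevise per feedback: {feedback}", reference_image)
    return design
```

The point of the sketch is that the critique step operates on a rendering rather than on the raw design file, which is the sense in which the feedback is visual rather than manual.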
Related papers
- RobotSeg: A Model and Dataset for Segmenting Robots in Image and Video [56.9581053843815]
We introduce RobotSeg, a foundation model for robot segmentation in image and video. It addresses the lack of adaptation to articulated robots, reliance on manual prompts, and the need for per-frame training mask annotations. It achieves state-of-the-art performance on both images and videos.
arXiv Detail & Related papers (2025-11-28T07:51:02Z) - Large Language Models as Natural Selector for Embodied Soft Robot Design [5.023206838671049]
This paper introduces RoboCrafter-QA, a novel benchmark to evaluate whether Large Language Models can learn representations of soft robot designs. Our experiments reveal that while these models exhibit promising capabilities in learning design representations, they struggle with fine-grained distinctions between designs with subtle performance differences.
arXiv Detail & Related papers (2025-03-04T03:55:10Z) - On the Exploration of LM-Based Soft Modular Robot Design [26.847859137653487]
Large language models (LLMs) have demonstrated promising capabilities in modeling real-world knowledge.
In this paper, we explore the potential of using LLMs to aid in the design of soft modular robots.
Our model performs well in evaluations for designing soft modular robots with uni- and bi-directional and stair-descending capabilities.
arXiv Detail & Related papers (2024-11-01T04:03:05Z) - $π_0$: A Vision-Language-Action Flow Model for General Robot Control [77.32743739202543]
We propose a novel flow matching architecture built on top of a pre-trained vision-language model (VLM) to inherit Internet-scale semantic knowledge.
We evaluate our model in terms of its ability to perform tasks zero-shot after pre-training, follow language instructions from people, and acquire new skills via fine-tuning.
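Flow matching itself can be illustrated with a generic conditional objective: a network is trained to predict the constant velocity of a straight path from noise to a target action chunk, conditioned on observation features. The sketch below is a standard-technique illustration with arbitrary dimensions and random stand-in data, not the $π_0$ architecture.

```python
# Generic conditional flow matching for action generation (illustrative only;
# not the pi_0 architecture). A small MLP predicts the velocity field that
# moves a noise sample toward a ground-truth action chunk along a straight path.
import torch
import torch.nn as nn

ACTION_DIM, OBS_DIM = 8, 32               # arbitrary sizes for illustration

net = nn.Sequential(                      # velocity network v_theta(a_t, t, obs)
    nn.Linear(ACTION_DIM + 1 + OBS_DIM, 256), nn.ReLU(),
    nn.Linear(256, ACTION_DIM),
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(100):
    obs = torch.randn(64, OBS_DIM)        # stand-in for VLM observation features
    a1 = torch.randn(64, ACTION_DIM)      # stand-in for expert action chunks
    a0 = torch.randn_like(a1)             # noise sample
    t = torch.rand(64, 1)                 # random interpolation time in [0, 1]
    a_t = (1 - t) * a0 + t * a1           # point on the straight path
    target_v = a1 - a0                    # constant velocity of that path
    pred_v = net(torch.cat([a_t, t, obs], dim=-1))
    loss = ((pred_v - target_v) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```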
arXiv Detail & Related papers (2024-10-31T17:22:30Z) - Differentiable Robot Rendering [45.23538293501457]
We introduce differentiable robot rendering, a method allowing the visual appearance of a robot body to be directly differentiable with respect to its control parameters.
We demonstrate its capability and usage in applications including reconstructing robot poses from images and controlling robots through vision-language models.
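The underlying idea, that a differentiable rendering function lets pose be recovered by gradient descent on an image loss, can be shown with a toy stand-in. Below, forward kinematics of a 2-link planar arm plays the role of the renderer and keypoint positions play the role of pixels; this illustrates the principle, not the paper's method.

```python
# Toy illustration of optimizing robot pose through a differentiable "renderer"
# (here just forward kinematics of a 2-link planar arm), not the paper's method.
import torch

L1, L2 = 1.0, 0.7                          # link lengths (arbitrary)

def render_keypoints(q: torch.Tensor) -> torch.Tensor:
    """Differentiable stand-in for rendering: joint angles -> elbow & tip xy."""
    elbow = torch.stack([L1 * torch.cos(q[0]), L1 * torch.sin(q[0])])
    tip = elbow + torch.stack([L2 * torch.cos(q[0] + q[1]),
                               L2 * torch.sin(q[0] + q[1])])
    return torch.cat([elbow, tip])

target = render_keypoints(torch.tensor([0.8, -0.4]))   # "observed image"
q = torch.zeros(2, requires_grad=True)                 # initial pose guess
opt = torch.optim.Adam([q], lr=0.05)

for _ in range(300):
    loss = ((render_keypoints(q) - target) ** 2).sum() # analogue of a pixel loss
    opt.zero_grad(); loss.backward(); opt.step()
# q now approximates the pose that produced the target keypoints.
```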
arXiv Detail & Related papers (2024-10-17T17:59:02Z) - Controlling diverse robots by inferring Jacobian fields with deep networks [48.279199537720714]
Mirroring the complex structures and diverse functions of natural organisms is a long-standing challenge in robotics. We introduce a method that uses deep neural networks to map a video stream of a robot to its visuomotor Jacobian field. Our approach achieves accurate closed-loop control and recovers the causal dynamic structure of each robot.
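The control side of a Jacobian field can be illustrated generically: if a network predicts a Jacobian relating command changes to the motion of tracked points, one closed-loop step is a damped least-squares solve for the command that best realizes the desired point motion. The sketch below uses random placeholder values in place of the learned model.

```python
# Generic Jacobian-based visual servoing step (illustration, not the paper's model):
# given a predicted Jacobian J mapping command deltas to keypoint motion,
# choose the command delta that best realizes the desired keypoint motion.
import numpy as np

rng = np.random.default_rng(0)
n_points, n_cmd = 6, 4                                  # arbitrary sizes
J = rng.standard_normal((2 * n_points, n_cmd))          # stand-in for a network-predicted Jacobian
current = rng.standard_normal(2 * n_points)             # current keypoint positions (x, y stacked)
desired = current + 0.05 * rng.standard_normal(2 * n_points)  # target positions

delta_p = desired - current                             # desired keypoint motion
lam = 1e-3                                              # damping keeps the step well-conditioned
delta_u = np.linalg.solve(J.T @ J + lam * np.eye(n_cmd), J.T @ delta_p)
print("command update:", delta_u)
```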
arXiv Detail & Related papers (2024-07-11T17:55:49Z) - Text2Robot: Evolutionary Robot Design from Text Descriptions [3.054307340752497]
We introduce Text2Robot, a framework that converts user text specifications and performance preferences into physical quadrupedal robots. Text2Robot enables rapid prototyping and opens new opportunities for robot design with generative models.
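A minimal picture of the evolutionary part of such a pipeline is a mutate-score-select loop over design parameters. The fitness function below is a placeholder standing in for simulation plus text-derived preferences; none of it reflects Text2Robot's actual operators.

```python
# Generic evolutionary search over robot design parameters (illustrative only;
# the fitness function is a stand-in for simulation + text-derived preferences).
import numpy as np

rng = np.random.default_rng(1)

def fitness(design: np.ndarray) -> float:
    """Placeholder score; a real system would simulate the robot here."""
    return -float(np.sum((design - 0.5) ** 2))

pop = rng.random((16, 8))                    # 16 candidate designs, 8 parameters each
for generation in range(50):
    scores = np.array([fitness(d) for d in pop])
    elite = pop[np.argsort(scores)[-4:]]     # keep the 4 best designs
    children = elite[rng.integers(0, 4, size=12)] + 0.05 * rng.standard_normal((12, 8))
    pop = np.vstack([elite, np.clip(children, 0.0, 1.0)])

best = pop[np.argmax([fitness(d) for d in pop])]
```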
arXiv Detail & Related papers (2024-06-28T14:51:01Z) - RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis [102.1876259853457]
We propose a tree-structured multimodal code generation framework for generalized robotic behavior synthesis, termed RoboCodeX.
RoboCodeX decomposes high-level human instructions into multiple object-centric manipulation units consisting of physical preferences such as affordance and safety constraints.
To further enhance the capability to map conceptual and perceptual understanding into control commands, a specialized multimodal reasoning dataset is collected for pre-training and an iterative self-updating methodology is introduced for supervised fine-tuning.
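The object-centric decomposition can be pictured as a small tree of manipulation units; the field names and example plan below are assumptions made for illustration, not RoboCodeX's actual schema.

```python
# Illustrative tree of object-centric manipulation units (field names are
# assumptions for this sketch, not RoboCodeX's actual schema).
from dataclasses import dataclass, field
from typing import List

@dataclass
class ManipulationUnit:
    target_object: str
    action: str
    affordance: str                               # physical preference, e.g. graspable handle
    safety_constraints: List[str] = field(default_factory=list)
    children: List["ManipulationUnit"] = field(default_factory=list)

# "Put the mug in the dishwasher" decomposed into ordered, object-centric steps.
plan = ManipulationUnit(
    target_object="scene", action="sequence", affordance="",
    children=[
        ManipulationUnit("mug", "grasp", "handle", ["avoid spilling liquid"]),
        ManipulationUnit("dishwasher_door", "open", "hinged door", ["limit joint torque"]),
        ManipulationUnit("mug", "place", "rack slot", ["gentle placement"]),
    ],
)
```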
arXiv Detail & Related papers (2024-02-25T15:31:43Z) - RoboScript: Code Generation for Free-Form Manipulation Tasks across Real and Simulation [77.41969287400977]
This paper presents RobotScript, a platform for a deployable robot manipulation pipeline powered by code generation.
We also present a benchmark for code generation for robot manipulation tasks specified in free-form natural language.
We demonstrate the adaptability of our code generation framework across multiple robot embodiments, including the Franka and UR5 robot arms.
arXiv Detail & Related papers (2024-02-22T15:12:00Z) - Singing the Body Electric: The Impact of Robot Embodiment on User Expectations [7.408858358967414]
Users develop mental models of robots to conceptualize what kind of interactions they can have with those robots.
These conceptualizations are often formed before interactions with the robot and are based only on observing the robot's physical design.
We propose to use multimodal features of robot embodiments to predict what kinds of expectations users will have about a given robot's social and physical capabilities.
arXiv Detail & Related papers (2024-01-13T04:42:48Z) - Robot Learning with Sensorimotor Pre-training [98.7755895548928]
We present a self-supervised sensorimotor pre-training approach for robotics.
Our model, called RPT, is a Transformer that operates on sequences of sensorimotor tokens.
We find that sensorimotor pre-training consistently outperforms training from scratch, has favorable scaling properties, and enables transfer across different tasks, environments, and robots.
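One common form of this kind of self-supervision is masked prediction over interleaved sensorimotor tokens. The sketch below shows that generic pattern with a tiny Transformer encoder and random stand-in tokens; it is not the paper's exact recipe.

```python
# Masked-prediction pre-training over sensorimotor token sequences (a generic
# sketch of this style of self-supervision, not the exact RPT recipe).
import torch
import torch.nn as nn

TOKEN_DIM, SEQ_LEN, BATCH = 64, 20, 32          # arbitrary sizes

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=TOKEN_DIM, nhead=4, batch_first=True),
    num_layers=2,
)
head = nn.Linear(TOKEN_DIM, TOKEN_DIM)          # reconstruct masked tokens
mask_token = nn.Parameter(torch.zeros(TOKEN_DIM))
opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()) + [mask_token], lr=1e-4)

for step in range(10):
    tokens = torch.randn(BATCH, SEQ_LEN, TOKEN_DIM)   # stand-in for image/proprio/action tokens
    mask = torch.rand(BATCH, SEQ_LEN) < 0.5           # mask half the tokens
    inp = torch.where(mask.unsqueeze(-1), mask_token.expand_as(tokens), tokens)
    pred = head(encoder(inp))
    loss = ((pred - tokens) ** 2)[mask].mean()        # loss only on masked positions
    opt.zero_grad(); loss.backward(); opt.step()
```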
arXiv Detail & Related papers (2023-06-16T17:58:10Z) - Can Foundation Models Perform Zero-Shot Task Specification For Robot Manipulation? [54.442692221567796]
Task specification is critical for engagement of non-expert end-users and adoption of personalized robots.
A widely studied approach to task specification is through goals, using either compact state vectors or goal images from the same robot scene.
In this work, we explore alternate and more general forms of goal specification that are expected to be easier for humans to specify and use.
arXiv Detail & Related papers (2022-04-23T19:39:49Z)