RobotSeg: A Model and Dataset for Segmenting Robots in Image and Video
- URL: http://arxiv.org/abs/2511.22950v1
- Date: Fri, 28 Nov 2025 07:51:02 GMT
- Title: RobotSeg: A Model and Dataset for Segmenting Robots in Image and Video
- Authors: Haiyang Mei, Qiming Huang, Hai Ci, Mike Zheng Shou
- Abstract summary: We introduce RobotSeg, a foundation model for robot segmentation in image and video. It addresses the lack of adaptation to articulated robots, reliance on manual prompts, and the need for per-frame training mask annotations. It achieves state-of-the-art performance on both images and videos.
- Score: 56.9581053843815
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Accurate robot segmentation is a fundamental capability for robotic perception. It enables precise visual servoing for VLA systems, scalable robot-centric data augmentation, accurate real-to-sim transfer, and reliable safety monitoring in dynamic human-robot environments. Despite the strong capabilities of modern segmentation models, it remains surprisingly challenging to segment robots. This is due to robot embodiment diversity, appearance ambiguity, structural complexity, and rapid shape changes. Embracing these challenges, we introduce RobotSeg, a foundation model for robot segmentation in image and video. RobotSeg is built upon the versatile SAM 2 foundation model but addresses its three limitations for robot segmentation, namely the lack of adaptation to articulated robots, reliance on manual prompts, and the need for per-frame training mask annotations, by introducing a structure-enhanced memory associator, a robot prompt generator, and a label-efficient training strategy. These innovations collectively enable a structure-aware, automatic, and label-efficient solution. We further construct the video robot segmentation (VRS) dataset comprising over 2.8k videos (138k frames) with diverse robot embodiments and environments. Extensive experiments demonstrate that RobotSeg achieves state-of-the-art performance on both images and videos, establishing a strong foundation for future advances in robot perception.
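The data flow the abstract describes — a prompt generator proposes a robot prompt automatically, which a promptable segmenter then refines — can be illustrated with a minimal toy sketch. This is not the paper's implementation: the "prompt generator" below is simply a bounding box derived from a coarse mask, and the "segmenter" a brightness threshold standing in for SAM 2's mask decoder; all function names are hypothetical.

```python
# Toy sketch of an automatic-prompt segmentation pipeline in the spirit of
# RobotSeg: a prompt generator proposes a box prompt, then a promptable
# segmenter (stand-in for SAM 2) produces a mask. Names are illustrative.

def generate_robot_prompt(coarse_mask):
    """Box prompt (x0, y0, x1, y1) from a coarse binary mask -- a toy
    stand-in for RobotSeg's learned robot prompt generator."""
    pts = [(x, y) for y, row in enumerate(coarse_mask)
           for x, v in enumerate(row) if v]
    xs = [x for x, _ in pts]
    ys = [y for _, y in pts]
    return min(xs), min(ys), max(xs), max(ys)

def segment_with_box(image, box, threshold=0.5):
    """Toy promptable segmenter: keep above-threshold pixels inside the
    box prompt (a real model predicts a dense mask from the prompt)."""
    x0, y0, x1, y1 = box
    return [[x0 <= x <= x1 and y0 <= y <= y1 and v > threshold
             for x, v in enumerate(row)]
            for y, row in enumerate(image)]

image = [[0.1, 0.9, 0.9, 0.1],
         [0.1, 0.9, 0.9, 0.1],
         [0.1, 0.1, 0.1, 0.1]]   # bright 2x2 "robot" patch
coarse = [[0, 1, 1, 0],
          [0, 1, 1, 0],
          [0, 0, 0, 0]]          # rough localisation of the robot
box = generate_robot_prompt(coarse)   # (1, 0, 2, 1)
mask = segment_with_box(image, box)   # True only on the bright patch
```

In the actual model the prompt generator is learned and the segmenter is SAM 2 with a structure-enhanced memory associator; this sketch only mirrors the prompt-then-segment data flow.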
Related papers
- H2R-Grounder: A Paired-Data-Free Paradigm for Translating Human Interaction Videos into Physically Grounded Robot Videos [58.006918399913665]
We propose a video-to-video translation framework that converts ordinary human-object interaction videos into motion-consistent robot manipulation videos. Our approach requires no paired human-robot videos for training, only a set of unpaired robot videos, making the system easy to scale. At test time, we apply the same process to human videos (inpainting the person and overlaying human pose cues) and generate high-quality robot videos that mimic the human's actions.
arXiv Detail & Related papers (2025-12-10T07:59:45Z) - GR00T N1: An Open Foundation Model for Generalist Humanoid Robots [133.23509142762356]
General-purpose robots need a versatile body and an intelligent mind. Recent advancements in humanoid robots have shown great promise as a hardware platform for building generalist autonomy. We introduce GR00T N1, an open foundation model for humanoid robots.
arXiv Detail & Related papers (2025-03-18T21:06:21Z) - VidBot: Learning Generalizable 3D Actions from In-the-Wild 2D Human Videos for Zero-Shot Robotic Manipulation [53.63540587160549]
VidBot is a framework enabling zero-shot robotic manipulation using 3D affordances learned from in-the-wild monocular RGB-only human videos. VidBot paves the way for leveraging everyday human videos to make robot learning more scalable.
arXiv Detail & Related papers (2025-03-10T10:04:58Z) - Controlling diverse robots by inferring Jacobian fields with deep networks [48.279199537720714]
Mirroring the complex structures and diverse functions of natural organisms is a long-standing challenge in robotics. We introduce a method that uses deep neural networks to map a video stream of a robot to its visuomotor Jacobian field. Our approach achieves accurate closed-loop control and recovers the causal dynamic structure of each robot.
arXiv Detail & Related papers (2024-07-11T17:55:49Z) - Vid2Robot: End-to-end Video-conditioned Policy Learning with Cross-Attention Transformers [36.497624484863785]
We introduce Vid2Robot, an end-to-end video-conditioned policy that takes human videos demonstrating manipulation tasks as input and produces robot actions.
Our model is trained with a large dataset of prompt video-robot trajectory pairs to learn unified representations of human and robot actions from videos.
We evaluate Vid2Robot on real-world robots and observe over 20% improvement over BC-Z when using human prompt videos.
arXiv Detail & Related papers (2024-03-19T17:47:37Z) - RoboScript: Code Generation for Free-Form Manipulation Tasks across Real and Simulation [77.41969287400977]
This paper presents RoboScript, a platform for a deployable robot manipulation pipeline powered by code generation.
We also present a benchmark for code generation for robot manipulation tasks specified in free-form natural language.
We demonstrate the adaptability of our code generation framework across multiple robot embodiments, including the Franka and UR5 robot arms.
arXiv Detail & Related papers (2024-02-22T15:12:00Z) - General-purpose foundation models for increased autonomy in robot-assisted surgery [4.155479231940454]
This perspective article aims to provide a path toward increasing robot autonomy in robot-assisted surgery.
We argue that surgical robots are uniquely positioned to benefit from general-purpose models and provide three guiding actions toward increased autonomy in robot-assisted surgery.
arXiv Detail & Related papers (2024-01-01T06:15:16Z) - RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation [33.10577695383743]
We propose a multi-embodiment, multi-task generalist agent for robotic manipulation called RoboCat.
Its training data spans a large repertoire of motor control skills from simulated and real robotic arms with varying sets of observations and actions. With RoboCat, we demonstrate the ability to generalise to new tasks and robots, both zero-shot and through adaptation using only 100-1000 examples.
arXiv Detail & Related papers (2023-06-20T17:35:20Z) - Single-view robot pose and joint angle estimation via render & compare [40.05546237998603]
We introduce RoboPose, a method to estimate the joint angles and the 6D camera-to-robot pose of a known articulated robot from a single RGB image.
This is an important problem to grant mobile and itinerant autonomous systems the ability to interact with other robots.
arXiv Detail & Related papers (2021-04-19T14:48:29Z)