General-purpose Clothes Manipulation with Semantic Keypoints
- URL: http://arxiv.org/abs/2408.08160v3
- Date: Wed, 26 Mar 2025 06:56:09 GMT
- Title: General-purpose Clothes Manipulation with Semantic Keypoints
- Authors: Yuhong Deng, David Hsu
- Abstract summary: This paper presents CLothes mAnipulation with Semantic keyPoints (CLASP) for general-purpose clothes manipulation. The key idea of CLASP is semantic keypoints -- e.g., "right shoulder", "left sleeve", etc. -- a sparse spatial-semantic representation that is salient for both perception and action. Experiments with a Kinova dual-arm system on four distinct tasks -- folding, flattening, hanging, and placing -- confirm CLASP's performance on a real robot.
- Score: 17.23980132793002
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Clothes manipulation is a critical capability for household robots; yet, existing methods are often confined to specific tasks, such as folding or flattening, due to the complex high-dimensional geometry of deformable fabric. This paper presents CLothes mAnipulation with Semantic keyPoints (CLASP) for general-purpose clothes manipulation, which enables the robot to perform diverse manipulation tasks over different types of clothes. The key idea of CLASP is semantic keypoints -- e.g., "right shoulder", "left sleeve", etc. -- a sparse spatial-semantic representation that is salient for both perception and action. Semantic keypoints of clothes can be effectively extracted from depth images and are sufficient to represent a broad range of clothes manipulation policies. CLASP leverages semantic keypoints to bridge LLM-powered task planning and low-level action execution in a two-level hierarchy. Extensive simulation experiments show that CLASP outperforms baseline methods across diverse clothes types in both seen and unseen tasks. Further, experiments with a Kinova dual-arm system on four distinct tasks -- folding, flattening, hanging, and placing -- confirm CLASP's performance on a real robot.
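To make the two-level hierarchy concrete, here is a minimal Python sketch of the pipeline the abstract describes, under stated assumptions: all names (Keypoint, detect_keypoints, plan_with_llm, robot.pick_and_place) are hypothetical illustrations, not the authors' actual API. Semantic keypoints are extracted from a depth image, an LLM plans a sequence of primitives that reference keypoints by label, and each primitive is grounded to 3D positions for low-level execution.

```python
# Minimal sketch of CLASP's two-level hierarchy as described in the abstract.
# All names below are hypothetical, not the authors' actual implementation.

from dataclasses import dataclass


@dataclass
class Keypoint:
    label: str                        # e.g., "right shoulder", "left sleeve"
    xyz: tuple[float, float, float]   # 3D position recovered from the depth image


def detect_keypoints(depth_image) -> list[Keypoint]:
    """Perception step (hypothetical): extract the sparse semantic keypoints
    that serve as the paper's spatial-semantic representation."""
    raise NotImplementedError


def plan_with_llm(task: str, keypoints: list[Keypoint]) -> list[dict]:
    """High-level planning step (hypothetical): an LLM maps the task plus the
    available keypoint labels to primitives that reference keypoints by name,
    e.g. [{"primitive": "fold", "pick": "left sleeve", "place": "right shoulder"}]."""
    raise NotImplementedError


def run_task(task: str, depth_image, robot) -> None:
    """Two-level hierarchy: semantic keypoints bridge planning and execution."""
    keypoints = detect_keypoints(depth_image)
    by_label = {kp.label: kp for kp in keypoints}
    for step in plan_with_llm(task, keypoints):
        pick = by_label[step["pick"]]        # grounding: label -> 3D point
        place = by_label[step["place"]]
        robot.pick_and_place(pick.xyz, place.xyz)  # low-level action execution
```

The design point suggested by the abstract is that the planner never sees raw fabric geometry; the sparse keypoint labels are the shared vocabulary between LLM-powered task planning and low-level action execution.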
Related papers
- CLIP-Guided Adaptable Self-Supervised Learning for Human-Centric Visual Tasks [76.00315860962885]
We propose CLASP (CLIP-guided Adaptable Self-suPervised learning), a novel framework for unsupervised pre-training in human-centric visual tasks. CLASP leverages the powerful vision-language model CLIP to generate both low-level (e.g., body parts) and high-level (e.g., attributes) semantic pseudo-labels. A Mixture-of-Experts (MoE) module dynamically adapts feature extraction based on task-specific prompts, mitigating potential feature conflicts and enhancing transferability.
arXiv Detail & Related papers (2026-01-19T15:19:28Z) - Mash, Spread, Slice! Learning to Manipulate Object States via Visual Spatial Progress [53.723881111373736]
We present SPARTA, the first unified framework for the family of object state change manipulation tasks. SPARTA integrates spatially progressing object change segmentation maps, a visual skill to perceive actionable vs. transformed regions, and dense rewards that capture incremental progress over time. We validate SPARTA on a real robot for three challenging tasks across 10 diverse real-world objects.
arXiv Detail & Related papers (2025-09-28T23:56:07Z) - CLASP: General-Purpose Clothes Manipulation with Semantic Keypoints [21.09454149734247]
This paper presents CLothes mAnipulation with Semantic keyPoints (CLASP), which aims at general-purpose clothes manipulation. The core idea of CLASP is semantic keypoints -- e.g., "left sleeve", "right shoulder" -- a sparse spatial-semantic representation that is salient for both perception and action. CLASP uses semantic keypoints to bridge high-level task planning and low-level action execution.
arXiv Detail & Related papers (2025-07-26T15:43:25Z) - Sim-to-Real Reinforcement Learning for Vision-Based Dexterous Manipulation on Humanoids [56.892520712892804]
We introduce a practical sim-to-real RL recipe that trains a humanoid robot to perform three dexterous manipulation tasks. We demonstrate high success rates on unseen objects and robust, adaptive policy behaviors.
arXiv Detail & Related papers (2025-02-27T18:59:52Z) - GarmentLab: A Unified Simulation and Benchmark for Garment Manipulation [12.940189262612677]
GarmentLab is a content-rich benchmark and realistic simulation designed for deformable object and garment manipulation.
Our benchmark encompasses a diverse range of garment types, robotic systems and manipulators.
We evaluate state-of-the-art vision methods, reinforcement learning, and imitation learning approaches on these tasks.
arXiv Detail & Related papers (2024-11-02T10:09:08Z) - Keypoint Abstraction using Large Models for Object-Relative Imitation Learning [78.92043196054071]
Generalization to novel object configurations and instances across diverse tasks and environments is a critical challenge in robotics.
Keypoint-based representations have proven effective as a succinct way to capture essential object features.
We propose KALM, a framework that leverages large pre-trained vision-language models to automatically generate task-relevant and cross-instance consistent keypoints.
arXiv Detail & Related papers (2024-10-30T17:37:31Z) - SKT: Integrating State-Aware Keypoint Trajectories with Vision-Language Models for Robotic Garment Manipulation [82.61572106180705]
This paper presents a unified approach using vision-language models (VLMs) to improve keypoint prediction across various garment categories.
We created a large-scale synthetic dataset using advanced simulation techniques, allowing scalable training without extensive real-world data.
Experimental results indicate that the VLM-based method significantly enhances keypoint detection accuracy and task success rates.
arXiv Detail & Related papers (2024-09-26T17:26:16Z) - Polaris: Open-ended Interactive Robotic Manipulation via Syn2Real Visual Grounding and Large Language Models [53.22792173053473]
We introduce an interactive robotic manipulation framework called Polaris.
Polaris integrates perception and interaction by utilizing GPT-4 alongside grounded vision models.
We propose a novel Synthetic-to-Real (Syn2Real) pose estimation pipeline.
arXiv Detail & Related papers (2024-08-15T06:40:38Z) - DISCO: Embodied Navigation and Interaction via Differentiable Scene Semantics and Dual-level Control [53.80518003412016]
Building a general-purpose intelligent home-assistant agent that can carry out diverse tasks from human commands is a long-term goal of embodied AI research.
We study primitive mobile manipulations for embodied agents, i.e. how to navigate and interact based on an instructed verb-noun pair.
We propose DISCO, which features non-trivial advancements in contextualized scene modeling and efficient controls.
arXiv Detail & Related papers (2024-07-20T05:39:28Z) - SAM-E: Leveraging Visual Foundation Model with Sequence Imitation for Embodied Manipulation [62.58480650443393]
SAM-E leverages Segment Anything (SAM), a vision foundation model for generalizable scene understanding, together with sequence imitation.
We develop a novel multi-channel heatmap that enables the prediction of the action sequence in a single pass.
arXiv Detail & Related papers (2024-05-30T00:32:51Z) - UniGarmentManip: A Unified Framework for Category-Level Garment Manipulation via Dense Visual Correspondence [6.9061350009929185]
Garment manipulation is essential for future robots to accomplish home-assistant tasks.
We leverage the property that garments in a certain category share similar structures.
We then learn topological dense (point-level) visual correspondence among garments at the category level under different deformations.
arXiv Detail & Related papers (2024-05-11T04:18:41Z) - ManiPose: A Comprehensive Benchmark for Pose-aware Object Manipulation in Robotics [55.85916671269219]
This paper introduces ManiPose, a pioneering benchmark designed to advance the study of pose-varying manipulation tasks.
A comprehensive dataset features geometrically consistent and manipulation-oriented 6D pose labels for 2936 real-world scanned rigid objects and 100 articulated objects.
Our benchmark demonstrates notable advancements in pose estimation, pose-aware manipulation, and real-robot skill transfer.
arXiv Detail & Related papers (2024-03-20T07:48:32Z) - Learning Reusable Manipulation Strategies [86.07442931141634]
Humans demonstrate an impressive ability to acquire and generalize manipulation "tricks".
We present a framework that enables machines to acquire such manipulation skills through a single demonstration and self-play.
These learned mechanisms and samplers can be seamlessly integrated into standard task and motion planners.
arXiv Detail & Related papers (2023-11-06T17:35:42Z) - KITE: Keypoint-Conditioned Policies for Semantic Manipulation [40.63568980167196]
Keypoints + Instructions to Execution (KITE) is a two-step framework for semantic manipulation.
It first grounds an input instruction in a visual scene through 2D image keypoints.
KITE then executes a learned keypoint-conditioned skill to carry out the instruction.
arXiv Detail & Related papers (2023-06-29T00:12:21Z) - Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model [63.66204449776262]
Instruct2Act is a framework that maps multi-modal instructions to sequential actions for robotic manipulation tasks.
Our approach is adjustable and flexible in accommodating various instruction modalities and input types.
Our zero-shot method outperformed many state-of-the-art learning-based policies in several tasks.
arXiv Detail & Related papers (2023-05-18T17:59:49Z) - Foldsformer: Learning Sequential Multi-Step Cloth Manipulation With Space-Time Attention [4.2940878152791555]
We present a novel multi-step cloth manipulation planning framework named Foldsformer.
We experimentally evaluate Foldsformer on four representative sequential multi-step manipulation tasks.
Our approach can be transferred from simulation to the real world without additional training or domain randomization.
arXiv Detail & Related papers (2023-01-08T09:15:45Z) - Inferring Versatile Behavior from Demonstrations by Matching Geometric Descriptors [72.62423312645953]
Humans intuitively solve tasks in versatile ways, varying their behavior both in trajectory-based planning and in individual steps.
Current Imitation Learning algorithms often only consider unimodal expert demonstrations and act in a state-action-based setting.
Instead, we combine a mixture of movement primitives with a distribution matching objective to learn versatile behaviors that match the expert's behavior and versatility.
arXiv Detail & Related papers (2022-10-17T16:42:59Z) - USEEK: Unsupervised SE(3)-Equivariant 3D Keypoints for Generalizable Manipulation [19.423310410631085]
USEEK is an unsupervised SE(3)-equivariant keypoint method that enjoys alignment across instances in a category.
With USEEK in hand, the robot can infer the category-level task-relevant object frames in an efficient and explainable manner.
arXiv Detail & Related papers (2022-09-28T06:42:29Z) - CLIPort: What and Where Pathways for Robotic Manipulation [35.505615833638124]
We present CLIPort, a language-conditioned imitation-learning agent that combines the broad semantic understanding of CLIP with the spatial precision of Transporter.
Our framework is capable of solving a variety of language-specified tabletop tasks without any explicit representations of object poses, instance segmentations, memory, symbolic states, or syntactic structures.
arXiv Detail & Related papers (2021-09-24T17:44:28Z) - ManiSkill: Learning-from-Demonstrations Benchmark for Generalizable Manipulation Skills [27.214053107733186]
We propose SAPIEN Manipulation Skill Benchmark (abbreviated as ManiSkill) for learning generalizable object manipulation skills.
ManiSkill supports object-level variations by utilizing a rich and diverse set of articulated objects.
ManiSkill can encourage the robot learning community to further explore learning generalizable object manipulation skills.
arXiv Detail & Related papers (2021-07-30T08:20:22Z) - S3K: Self-Supervised Semantic Keypoints for Robotic Manipulation via Multi-View Consistency [11.357804868755155]
We advocate semantic 3D keypoints as a visual representation, and present a semi-supervised training objective.
Unlike local texture-based approaches, our model integrates contextual information from a large area.
We demonstrate that this ability to locate semantic keypoints enables high-level scripting of human-understandable behaviours.
arXiv Detail & Related papers (2020-09-30T14:44:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.