CLASP: General-Purpose Clothes Manipulation with Semantic Keypoints
- URL: http://arxiv.org/abs/2507.19983v1
- Date: Sat, 26 Jul 2025 15:43:25 GMT
- Title: CLASP: General-Purpose Clothes Manipulation with Semantic Keypoints
- Authors: Yuhong Deng, Chao Tang, Cunjun Yu, Linfeng Li, David Hsu
- Abstract summary: This paper presents CLothes mAnipulation with Semantic keyPoints (CLASP), which aims at general-purpose clothes manipulation. The core idea of CLASP is semantic keypoints -- e.g., "left sleeve", "right shoulder" -- a sparse spatial-semantic representation that is salient for both perception and action. CLASP uses semantic keypoints to bridge high-level task planning and low-level action execution.
- Score: 21.09454149734247
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Clothes manipulation, such as folding or hanging, is a critical capability for home service robots. Despite recent advances, most existing methods remain limited to specific tasks and clothes types, due to the complex, high-dimensional geometry of clothes. This paper presents CLothes mAnipulation with Semantic keyPoints (CLASP), which aims at general-purpose clothes manipulation across different clothes types (T-shirts, shorts, skirts, long dresses, ...) as well as different tasks (folding, flattening, hanging, ...). The core idea of CLASP is semantic keypoints -- e.g., "left sleeve", "right shoulder", etc. -- a sparse spatial-semantic representation that is salient for both perception and action. Semantic keypoints of clothes can be reliably extracted from RGB-D images and provide an effective intermediate representation for clothes manipulation policies. CLASP uses semantic keypoints to bridge high-level task planning and low-level action execution. At the high level, it exploits vision language models (VLMs) to predict task plans over the semantic keypoints. At the low level, it executes the plans with the help of a simple pre-built manipulation skill library. Extensive simulation experiments show that CLASP outperforms state-of-the-art baseline methods on multiple tasks across diverse clothes types, demonstrating strong performance and generalization. Further experiments with a Franka dual-arm system on four distinct tasks -- folding, flattening, hanging, and placing -- confirm CLASP's performance on a real robot.
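The abstract describes the two-level structure concretely enough to sketch: semantic keypoints extracted from RGB-D observations act as the shared vocabulary between a VLM planner and a small pre-built skill library. Below is a minimal illustrative sketch, not the authors' code; the keypoint names, skill names, and the `detect_keypoints` / `vlm_plan` helpers are hypothetical stand-ins.

```python
import numpy as np

# Hypothetical perception module: maps an RGB-D observation to named 3D
# points such as "left_sleeve" or "right_shoulder".
def detect_keypoints(rgbd: np.ndarray) -> dict[str, np.ndarray]:
    raise NotImplementedError  # stand-in for CLASP's keypoint extractor

# Hypothetical pre-built skill library keyed by primitive name.
SKILLS = {
    "pick_and_place": lambda robot, src, dst: robot.pick_and_place(src, dst),
    "fling":          lambda robot, src, dst: robot.fling(src, dst),
}

def vlm_plan(instruction: str, keypoint_names: list[str]) -> list[tuple[str, str, str]]:
    """Stand-in for the VLM planner: returns (skill, source_keypoint,
    target_keypoint) steps expressed purely over semantic keypoint names."""
    raise NotImplementedError

def run(robot, rgbd, instruction: str = "fold the T-shirt"):
    keypoints = detect_keypoints(rgbd)                  # perception
    plan = vlm_plan(instruction, list(keypoints))       # high-level planning
    for skill, src, dst in plan:                        # low-level execution
        SKILLS[skill](robot, keypoints[src], keypoints[dst])
```

The point of the sketch is the interface: the planner never sees raw geometry, only keypoint names, and the skill library never sees language, only 3D keypoint positions.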
Related papers
- GarmentPile++: Affordance-Driven Cluttered Garments Retrieval with Vision-Language Reasoning [27.756766557197746]
Garment manipulation has attracted increasing attention due to its critical role in home-assistant robotics. We propose a novel garment retrieval pipeline that can not only follow language instructions to execute safe and clean retrieval but also guarantee exactly one garment is retrieved per attempt. Our pipeline seamlessly integrates vision-language reasoning with visual affordance perception, fully leveraging the high-level reasoning and planning capabilities of VLMs.
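A common way to combine affordance perception with VLM-selected targets is to restrict an affordance heatmap to the garment the instruction refers to and grasp at the best remaining point. The sketch below shows only that selection step; the affordance map and target mask are assumed inputs, not this paper's pipeline.

```python
import numpy as np

def select_grasp(affordance: np.ndarray, target_mask: np.ndarray):
    """Pick the highest-affordance pixel restricted to the garment that the
    language instruction refers to (both inputs are HxW arrays)."""
    scores = np.where(target_mask > 0, affordance, -np.inf)
    return np.unravel_index(np.argmax(scores), scores.shape)  # (row, col)
```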
arXiv Detail & Related papers (2026-03-04T15:13:40Z) - CLIP-Guided Adaptable Self-Supervised Learning for Human-Centric Visual Tasks [76.00315860962885]
We propose CLASP (CLIP-guided Adaptable Self-suPervised learning), a novel framework for unsupervised pre-training in human-centric visual tasks. CLASP leverages the powerful vision-language model CLIP to generate both low-level (e.g., body parts) and high-level (e.g., attributes) semantic pseudo-labels. A Mixture-of-Experts (MoE) module dynamically adapts feature extraction based on task-specific prompts, mitigating potential feature conflicts and enhancing transferability.
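The pseudo-labeling step described here can be illustrated with off-the-shelf CLIP: score an image crop against text prompts for body parts or attributes and take the best match as a pseudo-label. This is a generic zero-shot sketch, not the paper's pipeline; the prompt list and crop source are assumptions.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def pseudo_label(crop: Image.Image, prompts: list[str]) -> str:
    """Zero-shot CLIP classification of an image crop against text prompts."""
    inputs = processor(text=prompts, images=crop, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image   # shape (1, len(prompts))
    return prompts[int(logits.argmax())]

# Example low-level (body-part) prompts, as in the summary above.
part_prompts = ["a photo of a person's head",
                "a photo of a person's torso",
                "a photo of a person's legs"]
```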
arXiv Detail & Related papers (2026-01-19T15:19:28Z) - DexGarmentLab: Dexterous Garment Manipulation Environment with Generalizable Policy [74.9519138296936]
Garment manipulation is a critical challenge due to the diversity in garment categories, geometries, and deformations. We propose DexGarmentLab, the first environment specifically designed for dexterous (especially bimanual) garment manipulation. It features large-scale high-quality 3D assets for 15 task scenarios, and refines simulation techniques tailored for garment modeling to reduce the sim-to-real gap.
arXiv Detail & Related papers (2025-05-16T09:26:59Z) - Keypoint Abstraction using Large Models for Object-Relative Imitation Learning [78.92043196054071]
Generalization to novel object configurations and instances across diverse tasks and environments is a critical challenge in robotics.
Keypoint-based representations have proven effective as a succinct way of capturing essential object features.
We propose KALM, a framework that leverages large pre-trained vision-language models to automatically generate task-relevant and cross-instance consistent keypoints.
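The idea of VLM-proposed, cross-instance consistent keypoints can be sketched as a filtering step: keep only candidates whose visual descriptors have a strong match on every other object instance. The helpers below (`propose_keypoints`, `descriptor`) are hypothetical stand-ins, and this is a rough analogue rather than KALM's actual procedure.

```python
import numpy as np

def propose_keypoints(image, task: str) -> list[tuple[int, int]]:
    """Stand-in: query a large vision-language model for task-relevant
    candidate keypoints on one object instance."""
    raise NotImplementedError

def descriptor(image, xy) -> np.ndarray:
    """Stand-in: a dense visual feature at a pixel (e.g. from a pre-trained ViT)."""
    raise NotImplementedError

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def consistent_keypoints(images, task: str, thresh: float = 0.8):
    """Keep reference-image candidates whose descriptor has a strong match on
    every other instance -- a rough cross-instance consistency filter."""
    ref, others = images[0], images[1:]
    other_feats = [[descriptor(im, c) for c in propose_keypoints(im, task)]
                   for im in others]
    keep = []
    for kp in propose_keypoints(ref, task):
        d = descriptor(ref, kp)
        if all(max(cosine(d, f) for f in feats) > thresh for feats in other_feats):
            keep.append(kp)
    return keep
```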
arXiv Detail & Related papers (2024-10-30T17:37:31Z) - SKT: Integrating State-Aware Keypoint Trajectories with Vision-Language Models for Robotic Garment Manipulation [82.61572106180705]
This paper presents a unified approach using vision-language models (VLMs) to improve keypoint prediction across various garment categories.
We created a large-scale synthetic dataset using advanced simulation techniques, allowing scalable training without extensive real-world data.
Experimental results indicate that the VLM-based method significantly enhances keypoint detection accuracy and task success rates.
arXiv Detail & Related papers (2024-09-26T17:26:16Z) - General-purpose Clothes Manipulation with Semantic Keypoints [17.23980132793002]
This paper presents CLothes mAnipulation with Semantic keyPoints (CLASP) for general-purpose clothes manipulation. The key idea of CLASP is semantic keypoints -- e.g., "right shoulder", "left sleeve", etc. -- a sparse spatial-semantic representation that is salient for both perception and action. Experiments with a Kinova dual-arm system on four distinct tasks -- folding, flattening, hanging, and placing -- confirm CLASP's performance on a real robot.
arXiv Detail & Related papers (2024-08-15T13:49:14Z) - CLIP-Driven Cloth-Agnostic Feature Learning for Cloth-Changing Person Re-Identification [47.948622774810296]
We propose a novel framework called CLIP-Driven Cloth-Agnostic Feature Learning (CCAF) for Cloth-Changing Person Re-Identification (CC-ReID).
Two modules are custom-designed: Invariant Feature Prompting (IFP) and Clothes Feature Minimization (CFM).
Experiments have demonstrated the effectiveness of the proposed CCAF, achieving new state-of-the-art performance on several popular CC-ReID benchmarks without any additional inference time.
arXiv Detail & Related papers (2024-06-13T14:56:07Z) - UniGarmentManip: A Unified Framework for Category-Level Garment Manipulation via Dense Visual Correspondence [6.9061350009929185]
Garment manipulation is essential for future robots to accomplish home-assistant tasks.
We leverage the property that garments in a certain category share similar structures.
We then learn the topological dense (point-level) visual correspondence among garments in the category level with different deformations.
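Category-level dense correspondence of this kind is typically realized as nearest-neighbour matching in a learned per-point feature space; the sketch below illustrates only that matching step, with the per-point feature extractor assumed to exist (it is not the paper's training procedure).

```python
import numpy as np

def dense_correspondence(feat_a: np.ndarray, feat_b: np.ndarray) -> np.ndarray:
    """Match each point of garment A to its nearest neighbour on garment B in a
    shared feature space.

    feat_a: (Na, D) per-point features, feat_b: (Nb, D) per-point features.
    Returns an index array of length Na mapping A-points to B-points."""
    a = feat_a / (np.linalg.norm(feat_a, axis=1, keepdims=True) + 1e-8)
    b = feat_b / (np.linalg.norm(feat_b, axis=1, keepdims=True) + 1e-8)
    sim = a @ b.T                       # cosine similarity matrix (Na, Nb)
    return sim.argmax(axis=1)
```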
arXiv Detail & Related papers (2024-05-11T04:18:41Z) - Aligning and Prompting Everything All at Once for Universal Visual Perception [79.96124061108728]
APE is a universal visual perception model for aligning and prompting everything all at once in an image to perform diverse tasks.
APE advances the convergence of detection and grounding by reformulating language-guided grounding as open-vocabulary detection.
Experiments on over 160 datasets demonstrate that APE outperforms state-of-the-art models.
arXiv Detail & Related papers (2023-12-04T18:59:50Z) - What Makes Pre-Trained Visual Representations Successful for Robust Manipulation? [57.92924256181857]
We find that visual representations designed for manipulation and control tasks do not necessarily generalize under subtle changes in lighting and scene texture.
We find that emergent segmentation ability is a strong predictor of out-of-distribution generalization among ViT models.
arXiv Detail & Related papers (2023-11-03T18:09:08Z) - KITE: Keypoint-Conditioned Policies for Semantic Manipulation [40.63568980167196]
Keypoints + Instructions to Execution (KITE) is a two-step framework for semantic manipulation.
It first grounds an input instruction in a visual scene through 2D image keypoints.
KITE then executes a learned keypoint-conditioned skill to carry out the instruction.
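The two-step structure described above is simple enough to express directly: ground the instruction to a pixel keypoint, then run a keypoint-conditioned skill at it. The sketch below is schematic; `ground_instruction` and the skill library are hypothetical stand-ins, not KITE's actual interfaces.

```python
def ground_instruction(image, instruction: str):
    """Stand-in: returns a pixel keypoint (u, v) and the name of the skill to run."""
    raise NotImplementedError

SKILLS = {}  # hypothetical library of learned keypoint-conditioned skills

def kite_step(robot, image, instruction: str):
    keypoint_uv, skill_name = ground_instruction(image, instruction)  # step 1: ground
    SKILLS[skill_name](robot, image, keypoint_uv)                     # step 2: execute
```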
arXiv Detail & Related papers (2023-06-29T00:12:21Z) - Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model [63.66204449776262]
Instruct2Act is a framework that maps multi-modal instructions to sequential actions for robotic manipulation tasks.
Our approach is adjustable and flexible in accommodating various instruction modalities and input types.
Our zero-shot method outperformed many state-of-the-art learning-based policies in several tasks.
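Mapping instructions to sequential actions with a language model usually amounts to prompting the model with a small robot API and asking it to emit calls against that API. The sketch below shows that pattern only; the API names and prompt wording are illustrative assumptions, not the paper's.

```python
ROBOT_API = """
pick_place(object_name: str, target_name: str)
rotate(object_name: str, degrees: float)
"""

def instruction_to_actions(llm, instruction: str) -> str:
    """Ask a language model to translate an instruction into calls against a
    small robot API; the returned program would then be parsed and executed."""
    prompt = (f"You control a robot with this API:\n{ROBOT_API}\n"
              f"Write the sequence of calls for: {instruction}")
    return llm(prompt)   # e.g. 'pick_place("red block", "blue bowl")'
```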
arXiv Detail & Related papers (2023-05-18T17:59:49Z) - Arbitrary Virtual Try-On Network: Characteristics Preservation and Trade-off between Body and Clothing [85.74977256940855]
We propose an Arbitrary Virtual Try-On Network (AVTON) for all-type clothes.
AVTON can synthesize realistic try-on images by preserving and trading off characteristics of the target clothes and the reference person.
Our approach can achieve better performance compared with the state-of-the-art virtual try-on methods.
arXiv Detail & Related papers (2021-11-24T08:59:56Z) - CLIPort: What and Where Pathways for Robotic Manipulation [35.505615833638124]
We present CLIPort, a language-conditioned imitation-learning agent that combines the broad semantic understanding of CLIP with the spatial precision of Transporter.
Our framework is capable of solving a variety of language-specified tabletop tasks without any explicit representations of object poses, instance segmentations, memory, symbolic states, or syntactic structures.
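The "what/where" idea can be mirrored at a schematic level as two convolutional streams, one modulated by a language embedding, fused into a per-pixel affordance map. This toy module only illustrates the fusion pattern and greatly simplifies the actual architecture.

```python
import torch
import torch.nn as nn

class TwoStreamAffordance(nn.Module):
    """Toy two-pathway head: a language-conditioned semantic stream plus a
    spatial stream, fused into per-pixel affordance logits."""
    def __init__(self, img_ch: int = 3, lang_dim: int = 512, hidden: int = 64):
        super().__init__()
        self.spatial = nn.Conv2d(img_ch, hidden, 3, padding=1)
        self.semantic = nn.Conv2d(img_ch, hidden, 3, padding=1)
        self.lang_proj = nn.Linear(lang_dim, hidden)
        self.head = nn.Conv2d(hidden, 1, 1)

    def forward(self, image, lang_emb):
        sem = self.semantic(image) * self.lang_proj(lang_emb)[:, :, None, None]
        spa = self.spatial(image)
        return self.head(torch.relu(sem + spa))   # (B, 1, H, W) affordance logits
```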
arXiv Detail & Related papers (2021-09-24T17:44:28Z) - S3K: Self-Supervised Semantic Keypoints for Robotic Manipulation via Multi-View Consistency [11.357804868755155]
We advocate semantic 3D keypoints as a visual representation, and present a semi-supervised training objective.
Unlike local texture-based approaches, our model integrates contextual information from a large area.
We demonstrate that this ability to locate semantic keypoints enables high level scripting of human understandable behaviours.
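Multi-view consistency for keypoints is commonly expressed as a reprojection agreement term: a shared 3D keypoint, projected into each calibrated view, should land on that view's 2D prediction. The sketch below shows that generic objective; it is not the paper's exact loss, and the camera inputs are assumptions.

```python
import numpy as np

def reproject(point_3d: np.ndarray, K: np.ndarray, T_wc: np.ndarray) -> np.ndarray:
    """Project a world-frame 3D point into a camera with intrinsics K (3x3)
    and world-to-camera extrinsics T_wc (4x4 homogeneous transform)."""
    p_cam = (T_wc @ np.append(point_3d, 1.0))[:3]
    uv = K @ p_cam
    return uv[:2] / uv[2]

def multiview_consistency(pred_uv_per_view, point_3d, cams) -> float:
    """Sum of distances between each view's predicted 2D keypoint and the
    reprojection of a shared 3D keypoint; cams is a list of (K, T_wc) pairs."""
    return sum(np.linalg.norm(uv - reproject(point_3d, K, T))
               for uv, (K, T) in zip(pred_uv_per_view, cams))
```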
arXiv Detail & Related papers (2020-09-30T14:44:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.