General-purpose Clothes Manipulation with Semantic Keypoints
- URL: http://arxiv.org/abs/2408.08160v3
- Date: Wed, 26 Mar 2025 06:56:09 GMT
- Title: General-purpose Clothes Manipulation with Semantic Keypoints
- Authors: Yuhong Deng, David Hsu
- Abstract summary: This paper presents CLothes mAnipulation with Semantic keyPoints (CLASP) for general-purpose clothes manipulation. The key idea of CLASP is semantic keypoints -- e.g., "right shoulder", "left sleeve", etc. -- a sparse spatial-semantic representation that is salient for both perception and action. Experiments with a Kinova dual-arm system on four distinct tasks -- folding, flattening, hanging, and placing -- confirm CLASP's performance on a real robot.
- Score: 17.23980132793002
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Clothes manipulation is a critical capability for household robots; yet, existing methods are often confined to specific tasks, such as folding or flattening, due to the complex high-dimensional geometry of deformable fabric. This paper presents CLothes mAnipulation with Semantic keyPoints (CLASP) for general-purpose clothes manipulation, which enables the robot to perform diverse manipulation tasks over different types of clothes. The key idea of CLASP is semantic keypoints -- e.g., "right shoulder", "left sleeve", etc. -- a sparse spatial-semantic representation that is salient for both perception and action. Semantic keypoints of clothes can be effectively extracted from depth images and are sufficient to represent a broad range of clothes manipulation policies. CLASP leverages semantic keypoints to bridge LLM-powered task planning and low-level action execution in a two-level hierarchy. Extensive simulation experiments show that CLASP outperforms baseline methods across diverse clothes types in both seen and unseen tasks. Further, experiments with a Kinova dual-arm system on four distinct tasks -- folding, flattening, hanging, and placing -- confirm CLASP's performance on a real robot.
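To make the two-level hierarchy concrete, here is a minimal Python sketch of the pipeline the abstract describes, under stated assumptions: all names (Keypoint, detect_keypoints, plan_with_llm, robot.pick_and_place) are hypothetical illustrations, not the authors' actual API. Semantic keypoints are extracted from a depth image, an LLM plans a sequence of primitives that reference keypoints by label, and each primitive is grounded to 3D positions for low-level execution.

```python
# Minimal sketch of CLASP's two-level hierarchy as described in the abstract.
# All names below are hypothetical, not the authors' actual implementation.

from dataclasses import dataclass


@dataclass
class Keypoint:
    label: str                        # e.g., "right shoulder", "left sleeve"
    xyz: tuple[float, float, float]   # 3D position recovered from the depth image


def detect_keypoints(depth_image) -> list[Keypoint]:
    """Perception step (hypothetical): extract the sparse semantic keypoints
    that serve as the paper's spatial-semantic representation."""
    raise NotImplementedError


def plan_with_llm(task: str, keypoints: list[Keypoint]) -> list[dict]:
    """High-level planning step (hypothetical): an LLM maps the task plus the
    available keypoint labels to primitives that reference keypoints by name,
    e.g. [{"primitive": "fold", "pick": "left sleeve", "place": "right shoulder"}]."""
    raise NotImplementedError


def run_task(task: str, depth_image, robot) -> None:
    """Two-level hierarchy: semantic keypoints bridge planning and execution."""
    keypoints = detect_keypoints(depth_image)
    by_label = {kp.label: kp for kp in keypoints}
    for step in plan_with_llm(task, keypoints):
        pick = by_label[step["pick"]]        # grounding: label -> 3D point
        place = by_label[step["place"]]
        robot.pick_and_place(pick.xyz, place.xyz)  # low-level action execution
```

The design point suggested by the abstract is that the planner never sees raw fabric geometry; the sparse keypoint labels are the shared vocabulary between LLM-powered task planning and low-level action execution.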
Related papers
- CLIP-Guided Adaptable Self-Supervised Learning for Human-Centric Visual Tasks [76.00315860962885]
We propose CLASP (CLIP-guided Adaptable Self-suPervised learning), a novel framework for unsupervised pre-training in human-centric visual tasks. CLASP leverages the powerful vision-language model CLIP to generate both low-level (e.g., body parts) and high-level (e.g., attributes) semantic pseudo-labels. A Mixture-of-Experts (MoE) module dynamically adapts feature extraction based on task-specific prompts, mitigating potential feature conflicts and enhancing transferability.
arXiv Detail & Related papers (2026-01-19T15:19:28Z) - Mash, Spread, Slice! Learning to Manipulate Object States via Visual Spatial Progress [53.723881111373736]
We present SPARTA, the first unified framework for the family of object state change manipulation tasks. SPARTA integrates spatially progressing object change segmentation maps, a visual skill to perceive actionable vs. transformed regions, and dense rewards that capture incremental progress over time. We validate SPARTA on a real robot for three challenging tasks across 10 diverse real-world objects.
arXiv Detail & Related papers (2025-09-28T23:56:07Z) - CLASP: General-Purpose Clothes Manipulation with Semantic Keypoints [21.09454149734247]
This paper presents CLothes mAnipulation with Semantic keyPoints (CLASP), which aims at general-purpose clothes manipulation. The core idea of CLASP is semantic keypoints -- e.g., "left sleeve", "right shoulder" -- a sparse spatial-semantic representation that is salient for both perception and action. CLASP uses semantic keypoints to bridge high-level task planning and low-level action execution.
arXiv Detail & Related papers (2025-07-26T15:43:25Z) - Sim-to-Real Reinforcement Learning for Vision-Based Dexterous Manipulation on Humanoids [56.892520712892804]
We introduce a practical sim-to-real RL recipe that trains a humanoid robot to perform three dexterous manipulation tasks. We demonstrate high success rates on unseen objects and robust, adaptive policy behaviors.
arXiv Detail & Related papers (2025-02-27T18:59:52Z) - GarmentLab: A Unified Simulation and Benchmark for Garment Manipulation [12.940189262612677]
GarmentLab is a content-rich benchmark and realistic simulation designed for deformable object and garment manipulation.
Our benchmark encompasses a diverse range of garment types, robotic systems and manipulators.
We evaluate state-of-the-art vision methods, reinforcement learning, and imitation learning approaches on these tasks.
arXiv Detail & Related papers (2024-11-02T10:09:08Z) - Keypoint Abstraction using Large Models for Object-Relative Imitation Learning [78.92043196054071]
Generalization to novel object configurations and instances across diverse tasks and environments is a critical challenge in robotics.
Keypoint-based representations have proven effective as a succinct way to capture essential object features.
We propose KALM, a framework that leverages large pre-trained vision-language models to automatically generate task-relevant and cross-instance consistent keypoints.
arXiv Detail & Related papers (2024-10-30T17:37:31Z) - SKT: Integrating State-Aware Keypoint Trajectories with Vision-Language Models for Robotic Garment Manipulation [82.61572106180705]
This paper presents a unified approach using vision-language models (VLMs) to improve keypoint prediction across various garment categories.
We created a large-scale synthetic dataset using advanced simulation techniques, allowing scalable training without extensive real-world data.
Experimental results indicate that the VLM-based method significantly enhances keypoint detection accuracy and task success rates.
arXiv Detail & Related papers (2024-09-26T17:26:16Z) - Polaris: Open-ended Interactive Robotic Manipulation via Syn2Real Visual Grounding and Large Language Models [53.22792173053473]
We introduce an interactive robotic manipulation framework called Polaris.
Polaris integrates perception and interaction by utilizing GPT-4 alongside grounded vision models.
We propose a novel Synthetic-to-Real (Syn2Real) pose estimation pipeline.
arXiv Detail & Related papers (2024-08-15T06:40:38Z) - DISCO: Embodied Navigation and Interaction via Differentiable Scene Semantics and Dual-level Control [53.80518003412016]
Building a general-purpose intelligent home-assistant agent that can carry out diverse tasks from human commands is a long-term goal of embodied AI research.
We study primitive mobile manipulations for embodied agents, i.e. how to navigate and interact based on an instructed verb-noun pair.
We propose DISCO, which features non-trivial advancements in contextualized scene modeling and efficient controls.
arXiv Detail & Related papers (2024-07-20T05:39:28Z) - SAM-E: Leveraging Visual Foundation Model with Sequence Imitation for Embodied Manipulation [62.58480650443393]
SAM-E leverages Segment Anything (SAM), a vision foundation model for generalizable scene understanding, together with sequence imitation.
We develop a novel multi-channel heatmap that enables the prediction of the action sequence in a single pass.
arXiv Detail & Related papers (2024-05-30T00:32:51Z) - UniGarmentManip: A Unified Framework for Category-Level Garment Manipulation via Dense Visual Correspondence [6.9061350009929185]
Garment manipulation is essential for future robots to accomplish home-assistant tasks.
We leverage the property that garments in a certain category share similar structures.
We then learn topological dense (point-level) visual correspondence among garments at the category level under different deformations.
arXiv Detail & Related papers (2024-05-11T04:18:41Z) - ManiPose: A Comprehensive Benchmark for Pose-aware Object Manipulation in Robotics [55.85916671269219]
This paper introduces ManiPose, a pioneering benchmark designed to advance the study of pose-varying manipulation tasks.
A comprehensive dataset features geometrically consistent and manipulation-oriented 6D pose labels for 2936 real-world scanned rigid objects and 100 articulated objects.
Our benchmark demonstrates notable advancements in pose estimation, pose-aware manipulation, and real-robot skill transfer.
arXiv Detail & Related papers (2024-03-20T07:48:32Z) - Learning Reusable Manipulation Strategies [86.07442931141634]
Humans demonstrate an impressive ability to acquire and generalize manipulation "tricks".
We present a framework that enables machines to acquire such manipulation skills through a single demonstration and self-play.
These learned mechanisms and samplers can be seamlessly integrated into standard task and motion planners.
arXiv Detail & Related papers (2023-11-06T17:35:42Z) - KITE: Keypoint-Conditioned Policies for Semantic Manipulation [40.63568980167196]
Keypoints + Instructions to Execution (KITE) is a two-step framework for semantic manipulation.
It first grounds an input instruction in a visual scene through 2D image keypoints.
KITE then executes a learned keypoint-conditioned skill to carry out the instruction.
arXiv Detail & Related papers (2023-06-29T00:12:21Z) - Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model [63.66204449776262]
Instruct2Act is a framework that maps multi-modal instructions to sequential actions for robotic manipulation tasks.
Our approach is adjustable and flexible in accommodating various instruction modalities and input types.
Our zero-shot method outperformed many state-of-the-art learning-based policies in several tasks.
arXiv Detail & Related papers (2023-05-18T17:59:49Z) - Foldsformer: Learning Sequential Multi-Step Cloth Manipulation With Space-Time Attention [4.2940878152791555]
We present a novel multi-step cloth manipulation planning framework named Foldsformer.
We experimentally evaluate Foldsformer on four representative sequential multi-step manipulation tasks.
Our approach can be transferred from simulation to the real world without additional training or domain randomization.
arXiv Detail & Related papers (2023-01-08T09:15:45Z) - Inferring Versatile Behavior from Demonstrations by Matching Geometric Descriptors [72.62423312645953]
Humans intuitively solve tasks in versatile ways, varying their behavior both in trajectory-based planning and in individual steps.
Current Imitation Learning algorithms often only consider unimodal expert demonstrations and act in a state-action-based setting.
Instead, we combine a mixture of movement primitives with a distribution matching objective to learn versatile behaviors that match the expert's behavior and versatility.
arXiv Detail & Related papers (2022-10-17T16:42:59Z) - USEEK: Unsupervised SE(3)-Equivariant 3D Keypoints for Generalizable Manipulation [19.423310410631085]
USEEK is an unsupervised SE(3)-equivariant keypoint method that enjoys alignment across instances in a category.
With USEEK in hand, the robot can infer the category-level task-relevant object frames in an efficient and explainable manner.
arXiv Detail & Related papers (2022-09-28T06:42:29Z) - CLIPort: What and Where Pathways for Robotic Manipulation [35.505615833638124]
We present CLIPort, a language-conditioned imitation-learning agent that combines the broad semantic understanding of CLIP with the spatial precision of Transporter.
Our framework is capable of solving a variety of language-specified tabletop tasks without any explicit representations of object poses, instance segmentations, memory, symbolic states, or syntactic structures.
arXiv Detail & Related papers (2021-09-24T17:44:28Z) - ManiSkill: Learning-from-Demonstrations Benchmark for Generalizable Manipulation Skills [27.214053107733186]
We propose SAPIEN Manipulation Skill Benchmark (abbreviated as ManiSkill) for learning generalizable object manipulation skills.
ManiSkill supports object-level variations by utilizing a rich and diverse set of articulated objects.
ManiSkill can encourage the robot learning community to further explore learning generalizable object manipulation skills.
arXiv Detail & Related papers (2021-07-30T08:20:22Z) - S3K: Self-Supervised Semantic Keypoints for Robotic Manipulation via Multi-View Consistency [11.357804868755155]
We advocate semantic 3D keypoints as a visual representation, and present a semi-supervised training objective.
Unlike local texture-based approaches, our model integrates contextual information from a large area.
We demonstrate that this ability to locate semantic keypoints enables high-level scripting of human-understandable behaviours.
arXiv Detail & Related papers (2020-09-30T14:44:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.