FoldNet: Learning Generalizable Closed-Loop Policy for Garment Folding via Keypoint-Driven Asset and Demonstration Synthesis
- URL: http://arxiv.org/abs/2505.09109v1
- Date: Wed, 14 May 2025 03:34:30 GMT
- Title: FoldNet: Learning Generalizable Closed-Loop Policy for Garment Folding via Keypoint-Driven Asset and Demonstration Synthesis
- Authors: Yuxing Chen, Bowen Xiao, He Wang
- Abstract summary: We present a synthetic garment dataset that can be used for robotic garment folding. We generate folding demonstrations in simulation and train folding policies via closed-loop imitation learning. KG-DAgger significantly improves the model performance, boosting the real-world success rate by 25%.
- Score: 9.22657317122778
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Due to the deformability of garments, generating a large amount of high-quality data for robotic garment manipulation tasks is highly challenging. In this paper, we present a synthetic garment dataset that can be used for robotic garment folding. We begin by constructing geometric garment templates based on keypoints and applying generative models to generate realistic texture patterns. Leveraging these keypoint annotations, we generate folding demonstrations in simulation and train folding policies via closed-loop imitation learning. To improve robustness, we propose KG-DAgger, which uses a keypoint-based strategy to generate demonstration data for recovering from failures. KG-DAgger significantly improves the model performance, boosting the real-world success rate by 25%. After training with 15K trajectories (about 2M image-action pairs), the model achieves a 75% success rate in the real world. Experiments in both simulation and real-world settings validate the effectiveness of our proposed framework.
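The abstract describes KG-DAgger only at a high level, so the following is a minimal sketch of a keypoint-guided, DAgger-style data-aggregation rollout. The `env`, `policy`, `keypoint_expert`, and `failure_detector` callables are hypothetical placeholders, not the paper's actual interfaces.

```python
def kg_dagger_rollout(env, policy, keypoint_expert, failure_detector, horizon=200):
    """One closed-loop rollout in the spirit of KG-DAgger (hypothetical API).

    The learned policy drives the robot; whenever the failure detector fires,
    the keypoint-based expert supplies a recovery action that is recorded as
    an extra (observation, action) pair for the next training round.
    """
    obs = env.reset()
    aggregated = []                      # recovery demonstrations to add to the dataset
    for _ in range(horizon):
        action = policy(obs)             # closed-loop policy action
        if failure_detector(obs):
            # Keypoint annotations make the expert cheap to query in simulation.
            action = keypoint_expert(obs)
            aggregated.append((obs, action))
        obs, done = env.step(action)
        if done:
            break
    return aggregated
```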
Related papers
- HAD: Hybrid Architecture Distillation Outperforms Teacher in Genomic Sequence Modeling [52.58723853697152]
We propose a Hybrid Architecture Distillation (HAD) approach for DNA sequence modeling. We employ NTv2-500M as the teacher model and devise a grouping masking strategy. Compared to models with similar parameter counts, our model achieves excellent performance.
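As a rough illustration of the teacher-student setup described above, here is a hedged sketch of a soft-label distillation loss restricted to masked positions; the actual grouping rule and loss used by HAD are not specified here.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, mask, temperature=2.0):
    """Soft-label distillation on masked token positions (illustrative only).

    student_logits / teacher_logits: (batch, seq_len, vocab) tensors.
    mask: (batch, seq_len) boolean tensor marking the positions selected by
    the grouping/masking strategy.
    """
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    kl = F.kl_div(s, t, reduction="none").sum(-1)   # per-token KL divergence
    return (kl * mask).sum() / mask.sum().clamp(min=1)
```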
arXiv Detail & Related papers (2025-05-27T07:57:35Z) - Real-Time Manipulation Action Recognition with a Factorized Graph Sequence Encoder [0.6437284704257459]
We present a new Factorized Graph Sequence network that runs in real-time and scales effectively in the temporal dimension. We also introduce Hand Pooling, a simple operation for more focused extraction of graph-level embeddings. Our model outperforms the previous state-of-the-art real-time approach, achieving 14.3% and 5.6% improvements in F1-macro score.
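The summary does not spell out what Hand Pooling computes, so the snippet below is only a guess at the idea: pool the embeddings of the hand-joint nodes alone into the graph-level feature. `hand_node_idx` is an assumed input.

```python
import torch

def hand_pooling(node_embeddings, hand_node_idx):
    """Pool only hand-joint nodes into a graph-level embedding (illustrative).

    node_embeddings: (num_nodes, dim) per-node features from the graph encoder.
    hand_node_idx:   indices of the nodes corresponding to hand joints.
    """
    return node_embeddings[hand_node_idx].mean(dim=0)

# toy usage: 20 skeleton nodes, 64-dim features, nodes 15-19 are hand joints
emb = hand_pooling(torch.randn(20, 64), torch.arange(15, 20))
```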
arXiv Detail & Related papers (2025-03-15T07:58:25Z) - GraphGarment: Learning Garment Dynamics for Bimanual Cloth Manipulation Tasks [7.4467523788133585]
GraphGarment is a novel approach that models garment dynamics based on robot control inputs. We use graphs to represent the interactions between the robot end-effector and the garment. We conduct four experiments using six types of garments to validate our approach in both simulation and real-world settings.
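A minimal sketch of the graph the summary refers to, assuming cloth-vertex nodes plus two gripper nodes connected by proximity edges; the paper's learned dynamics model is not reproduced here.

```python
import numpy as np

def build_interaction_graph(garment_verts, gripper_pos, radius=0.03):
    """Toy garment/end-effector interaction graph (illustrative only).

    garment_verts: (N, 3) simulated cloth vertex positions.
    gripper_pos:   (2, 3) bimanual end-effector positions.
    An edge links each gripper to every cloth vertex within `radius`.
    """
    nodes = np.concatenate([garment_verts, gripper_pos], axis=0)
    edges = []
    for g, pos in enumerate(gripper_pos):
        near = np.where(np.linalg.norm(garment_verts - pos, axis=1) < radius)[0]
        edges += [(len(garment_verts) + g, int(v)) for v in near]
    return nodes, edges
```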
arXiv Detail & Related papers (2025-03-04T17:35:48Z) - DreamMask: Boosting Open-vocabulary Panoptic Segmentation with Synthetic Data [61.62554324594797]
We propose DreamMask, which explores how to generate training data in the open-vocabulary setting and how to train the model with both real and synthetic data. In general, DreamMask significantly simplifies the collection of large-scale training data, serving as a plug-and-play enhancement for existing methods. For instance, when trained on COCO and tested on ADE20K, the model equipped with DreamMask outperforms the previous state-of-the-art by a substantial margin of 2.1% mIoU.
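At the data level, training on both real and synthetic examples often reduces to a mixed sampler like the hedged sketch below; DreamMask's actual contribution (how the synthetic masks are generated and filtered) is not reproduced.

```python
import random

def mixed_batch(real_samples, synthetic_samples, batch_size=16, synth_ratio=0.5):
    """Draw a training batch that mixes real and generated examples (illustrative)."""
    n_synth = int(batch_size * synth_ratio)
    batch = random.sample(synthetic_samples, n_synth)
    batch += random.sample(real_samples, batch_size - n_synth)
    random.shuffle(batch)
    return batch
```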
arXiv Detail & Related papers (2025-01-03T19:00:00Z) - Scaling Mesh Generation via Compressive Tokenization [66.05639158028343]
We propose a compressive yet effective mesh representation, Blocked and Patchified Tokenization (BPT).
BPT compresses mesh sequences by employing block-wise indexing and patch aggregation, reducing their length by approximately 75% compared to the original sequences.
Empowered with BPT, we have built a foundation mesh generative model trained on scaled mesh data to support flexible control for point clouds and images.
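To make the roughly 75% compression claim concrete, here is a simplified stand-in for block-wise indexing: a quantized coordinate is split into a (block, offset) pair and the block token is emitted only when it changes. BPT's real tokenizer (including patch aggregation) is more involved; this is illustration only.

```python
def block_index_tokens(coords, block_size=8):
    """Emit (BLOCK, OFF) tokens, repeating the block token only on change."""
    tokens, prev_block = [], None
    for v in coords:                      # coords: flat list of quantized ints
        block, offset = v // block_size, v % block_size
        if block != prev_block:
            tokens.append(("BLOCK", block))
            prev_block = block
        tokens.append(("OFF", offset))
    return tokens

# coordinates that stay inside one block compress well
print(len(block_index_tokens([3, 5, 6, 4, 60, 61, 62])))  # 9 tokens for 7 values
```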
arXiv Detail & Related papers (2024-11-11T14:30:35Z) - SKT: Integrating State-Aware Keypoint Trajectories with Vision-Language Models for Robotic Garment Manipulation [82.61572106180705]
This paper presents a unified approach using vision-language models (VLMs) to improve keypoint prediction across various garment categories.
We created a large-scale synthetic dataset using advanced simulation techniques, allowing scalable training without extensive real-world data.
Experimental results indicate that the VLM-based method significantly enhances keypoint detection accuracy and task success rates.
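One way such a VLM-based keypoint interface might look at the output end: the hedged sketch below only parses named keypoints from a free-form text answer; SKT's actual prompting and decoding pipeline is not shown.

```python
import re

def parse_keypoints(vlm_response):
    """Parse named 2D keypoints from a VLM text answer (illustrative format)."""
    pattern = r"(\w+):\s*\((\d+),\s*(\d+)\)"
    return {name: (int(x), int(y)) for name, x, y in re.findall(pattern, vlm_response)}

print(parse_keypoints("left_sleeve: (120, 85), collar: (210, 40)"))
# {'left_sleeve': (120, 85), 'collar': (210, 40)}
```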
arXiv Detail & Related papers (2024-09-26T17:26:16Z) - Affordance-Guided Reinforcement Learning via Visual Prompting [51.361977466993345]
Keypoint-based Affordance Guidance for Improvements (KAGI) is a method leveraging rewards shaped by vision-language models (VLMs) for autonomous RL. On real-world manipulation tasks specified by natural language descriptions, KAGI improves the sample efficiency of autonomous RL and enables successful task completion within 30K online fine-tuning steps.
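A minimal sketch of a keypoint-shaped reward for RL, assuming the VLM has already proposed an affordance keypoint; the coefficients and the exact shaping used by KAGI are not reproduced.

```python
import numpy as np

def shaped_reward(ee_pos, affordance_keypoint, task_success, scale=5.0):
    """Dense keypoint-distance term plus the original sparse task reward (illustrative)."""
    dense = -scale * np.linalg.norm(np.asarray(ee_pos) - np.asarray(affordance_keypoint))
    return dense + (10.0 if task_success else 0.0)
```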
arXiv Detail & Related papers (2024-07-14T21:41:29Z) - STORM: Efficient Stochastic Transformer based World Models for Reinforcement Learning [82.03481509373037]
Recently, model-based reinforcement learning algorithms have demonstrated remarkable efficacy in visual input environments.
We introduce the Stochastic Transformer-based wORld Model (STORM), an efficient world model architecture that combines strong modeling and generation capabilities.
STORM achieves a mean human performance of 126.7% on the Atari 100k benchmark, setting a new record among state-of-the-art methods.
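World-model agents of this kind typically learn the policy from imagined rollouts in latent space; the sketch below shows that generic pattern with placeholder modules, not STORM's actual classes.

```python
import torch

@torch.no_grad()
def imagine(world_model, actor, z0, horizon=16):
    """Latent imagination rollout (placeholder interfaces, illustrative only)."""
    z, trajectory = z0, []
    for _ in range(horizon):
        a = actor(z)                 # action from the current latent state
        z, r = world_model(z, a)     # world model predicts next latent and reward
        trajectory.append((z, a, r))
    return trajectory
```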
arXiv Detail & Related papers (2023-10-14T16:42:02Z) - Retrieval-Enhanced Contrastive Vision-Text Models [61.783728119255365]
We propose to equip vision-text models with the ability to refine their embeddings at inference time using cross-modal information retrieved from a memory.
Remarkably, we show that this can be done with a lightweight, single-layer fusion transformer on top of a frozen CLIP.
Our experiments validate that our retrieval-enhanced contrastive (RECO) training improves CLIP performance substantially on several challenging fine-grained tasks.
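A hedged sketch of the single-layer fusion idea: retrieve the top-k memory items nearest to a frozen query embedding and fuse them with one transformer layer. Dimensions and the memory bank here are stand-ins, not RECO's exact setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RetrievalFusion(nn.Module):
    """Single-layer fusion over retrieved memory items (illustrative only)."""

    def __init__(self, dim=512, k=8):
        super().__init__()
        self.k = k
        self.fuse = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)

    def forward(self, query_emb, memory_embs):
        # query_emb: (B, dim) frozen CLIP embedding; memory_embs: (M, dim) memory bank
        sims = F.normalize(query_emb, dim=-1) @ F.normalize(memory_embs, dim=-1).T
        topk = memory_embs[sims.topk(self.k, dim=-1).indices]      # (B, k, dim)
        tokens = torch.cat([query_emb.unsqueeze(1), topk], dim=1)  # query + retrieved
        return self.fuse(tokens)[:, 0]                             # refined query embedding
```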
arXiv Detail & Related papers (2023-06-12T15:52:02Z) - Self-supervised Cloth Reconstruction via Action-conditioned Cloth Tracking [18.288330275993328]
We propose a self-supervised method to finetune a mesh reconstruction model in the real world.
We show that we can improve the quality of the reconstructed mesh without requiring human annotations.
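One natural self-supervised signal for this setting is a consistency loss between the model's reconstruction and the mesh obtained by action-conditioned tracking of the previous estimate; the sketch below assumes the two vertex sets are already in correspondence, which may differ from the paper's formulation.

```python
import torch

def self_supervised_loss(reconstructed_verts, tracked_verts):
    """Mean per-vertex distance between reconstruction and tracked pseudo-labels."""
    return torch.norm(reconstructed_verts - tracked_verts, dim=-1).mean()
```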
arXiv Detail & Related papers (2023-02-19T07:48:12Z)