A Two-stage Fine-tuning Strategy for Generalizable Manipulation Skill of
Embodied AI
- URL: http://arxiv.org/abs/2307.11343v1
- Date: Fri, 21 Jul 2023 04:15:36 GMT
- Title: A Two-stage Fine-tuning Strategy for Generalizable Manipulation Skill of
Embodied AI
- Authors: Fang Gao, XueTao Li, Jun Yu, Feng Shuang
- Abstract summary: We propose a novel two-stage fine-tuning strategy to enhance the generalization capability of our model on the ManiSkill2 benchmark.
Our findings highlight the potential of our method to improve the generalization abilities of Embodied AI models and pave the way for their practical applications in real-world scenarios.
- Score: 15.480968464853769
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The advent of ChatGPT has led to a surge of interest in Embodied AI.
However, many existing Embodied AI models heavily rely on massive interactions
with training environments, which may not be practical in real-world
situations. To this end, ManiSkill2 has introduced a full-physics
simulation benchmark for manipulating various 3D objects. This benchmark
enables agents to be trained using diverse datasets of demonstrations and
evaluates their ability to generalize to unseen scenarios in testing
environments. In this paper, we propose a novel two-stage fine-tuning strategy
that aims to further enhance the generalization capability of our model based
on the ManiSkill2 benchmark. Through extensive experiments, we demonstrate the
effectiveness of our approach by achieving the 1st prize in all three tracks of
the ManiSkill2 Challenge. Our findings highlight the potential of our method to
improve the generalization abilities of Embodied AI models and pave the way for
their practical applications in real-world scenarios. All code and models of
our solution are available at https://github.com/xtli12/GXU-LIPE.git
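The abstract names the method but not its recipe, so the following is a minimal illustrative sketch only: the two stages, the data split, and the learning rates are assumptions, not the authors' published procedure. The general pattern of a two-stage fine-tuning strategy is to first train on a broad mixture of demonstrations, then fine-tune on task-specific data with a smaller learning rate so that the specialization does not erase the generality gained in stage one. A toy 1-D version of that schedule:

```python
# Hypothetical sketch of a generic two-stage fine-tuning schedule.
# Stage 1: train on a broad demonstration mixture (higher learning rate).
# Stage 2: fine-tune on task-specific data (lower learning rate).

def sgd_stage(w, data, lr, epochs):
    """One fine-tuning stage: plain SGD on a 1-D linear model y = w * x."""
    for _ in range(epochs):
        for x, y in data:
            grad = 2 * (w * x - y) * x   # d/dw of the squared error (w*x - y)^2
            w -= lr * grad
    return w

def two_stage_finetune(broad_data, task_data):
    w = 0.0                                             # stands in for a pretrained weight
    w = sgd_stage(w, broad_data, lr=0.1, epochs=50)     # stage 1: broad mixture
    w = sgd_stage(w, task_data, lr=0.01, epochs=50)     # stage 2: low-lr specialization
    return w

broad = [(1.0, 2.0), (2.0, 4.0)]   # demonstrations of y = 2x
task = [(3.0, 6.3)]                # slightly shifted target task (y = 2.1x)
w = two_stage_finetune(broad, task)
```

With this toy data, stage 1 pulls the weight toward 2.0 and the low-learning-rate stage 2 then nudges it toward the shifted target 2.1; the same staged schedule applies unchanged when the model is a policy network and the stages are demonstration mixtures.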
Related papers
- Ag2Manip: Learning Novel Manipulation Skills with Agent-Agnostic Visual and Action Representations [77.31328397965653]
We introduce Ag2Manip (Agent-Agnostic representations for Manipulation), a framework aimed at surmounting challenges through two key innovations.
A novel agent-agnostic visual representation derived from human manipulation videos, with the specifics of embodiments obscured to enhance generalizability.
An agent-agnostic action representation abstracting a robot's kinematics to a universal agent proxy, emphasizing crucial interactions between end-effector and object.
arXiv Detail & Related papers (2024-04-26T16:40:17Z) - Part-Guided 3D RL for Sim2Real Articulated Object Manipulation [27.422878372169805]
We propose a part-guided 3D RL framework, which can learn to manipulate articulated objects without demonstrations.
We combine the strengths of 2D segmentation and 3D RL to improve the efficiency of RL policy training.
A single versatile RL policy can be trained on multiple articulated object manipulation tasks simultaneously in simulation.
arXiv Detail & Related papers (2024-04-26T10:18:17Z) - Real Evaluations Tractability using Continuous Goal-Directed Actions in
Smart City Applications [3.1158660854608824]
Continuous Goal-Directed Actions (CGDA) encodes actions as changes of any feature that can be extracted from the environment.
Current strategies involve performing evaluations in a simulation, transferring the final joint trajectory to the actual robot.
Two different approaches to reducing the number of evaluations using evolutionary algorithms (EA) are proposed and compared.
arXiv Detail & Related papers (2024-02-01T15:38:21Z) - GenH2R: Learning Generalizable Human-to-Robot Handover via Scalable Simulation, Demonstration, and Imitation [31.702907860448477]
GenH2R is a framework for learning generalizable vision-based human-to-robot (H2R) handover skills.
We acquire such generalizability by learning H2R handover at scale with a comprehensive solution.
We leverage large-scale 3D model repositories, dexterous grasp generation methods, and curve-based 3D animation.
arXiv Detail & Related papers (2024-01-01T18:20:43Z) - Transferring Foundation Models for Generalizable Robotic Manipulation [82.12754319808197]
We propose a novel paradigm that effectively leverages language-reasoning segmentation mask generated by internet-scale foundation models.
Our approach can effectively and robustly perceive object pose and enable sample-efficient generalization learning.
Demos can be found in our submitted video, and more comprehensive ones can be found in link1 or link2.
arXiv Detail & Related papers (2023-06-09T07:22:12Z) - Curriculum-Based Imitation of Versatile Skills [15.97723808124603]
Learning skills by imitation is a promising concept for the intuitive teaching of robots.
A common way to learn such skills is to learn a parametric model by maximizing the likelihood given the demonstrations.
Yet, human demonstrations are often multi-modal, i.e., the same task is solved in multiple ways.
arXiv Detail & Related papers (2023-04-11T12:10:41Z) - ProcTHOR: Large-Scale Embodied AI Using Procedural Generation [55.485985317538194]
ProcTHOR is a framework for procedural generation of Embodied AI environments.
We demonstrate state-of-the-art results across 6 embodied AI benchmarks for navigation, rearrangement, and arm manipulation.
arXiv Detail & Related papers (2022-06-14T17:09:35Z) - Silver-Bullet-3D at ManiSkill 2021: Learning-from-Demonstrations and
Heuristic Rule-based Methods for Object Manipulation [118.27432851053335]
This paper presents an overview and comparative analysis of our systems designed for the following two tracks in SAPIEN ManiSkill Challenge 2021: No Interaction Track.
The No Interaction track targets learning policies from pre-collected demonstration trajectories.
In this track, we design a Heuristic Rule-based Method (HRM) to trigger high-quality object manipulation by decomposing the task into a series of sub-tasks.
For each sub-task, simple rule-based control strategies are adopted to predict actions that can be applied to robotic arms.
arXiv Detail & Related papers (2022-06-13T16:20:42Z) - Demonstration-efficient Inverse Reinforcement Learning in Procedurally
Generated Environments [137.86426963572214]
Inverse Reinforcement Learning can extrapolate reward functions from expert demonstrations.
We show that our approach, DE-AIRL, is demonstration-efficient and still able to extrapolate reward functions which generalize to the fully procedural domain.
arXiv Detail & Related papers (2020-12-04T11:18:02Z) - Forgetful Experience Replay in Hierarchical Reinforcement Learning from
Demonstrations [55.41644538483948]
In this paper, we propose a combination of approaches that allow the agent to use low-quality demonstrations in complex vision-based environments.
Our proposed goal-oriented structuring of replay buffer allows the agent to automatically highlight sub-goals for solving complex hierarchical tasks in demonstrations.
The solution based on our algorithm beats all the solutions for the famous MineRL competition and allows the agent to mine a diamond in the Minecraft environment.
arXiv Detail & Related papers (2020-06-17T15:38:40Z) - Triple-GAIL: A Multi-Modal Imitation Learning Framework with Generative
Adversarial Nets [34.17829944466169]
Triple-GAIL is able to learn skill selection and imitation jointly from both expert demonstrations and continuously generated experiences, which serve as data augmentation.
Experiments on real driver trajectories and real-time strategy game datasets demonstrate that Triple-GAIL can better fit multi-modal behaviors close to those of the demonstrators.
arXiv Detail & Related papers (2020-05-19T03:24:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information (including all content) and is not responsible for any consequences.