IFG: Internet-Scale Guidance for Functional Grasping Generation
- URL: http://arxiv.org/abs/2511.09558v1
- Date: Thu, 13 Nov 2025 02:02:10 GMT
- Title: IFG: Internet-Scale Guidance for Functional Grasping Generation
- Authors: Ray Muxin Liu, Mingxuan Li, Kenneth Shaw, Deepak Pathak
- Abstract summary: Large Vision Models trained on internet-scale data have demonstrated strong capabilities in segmenting and semantically understanding object parts. We leverage simulation with a force-closure grasp generation pipeline that understands the local geometries of the hand and object in the scene. By combining the global semantic understanding of internet-scale models with the geometric precision of a simulation-based, locally-aware force-closure pipeline, our method achieves high-performance semantic grasping without any manually collected training data.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Vision Models trained on internet-scale data have demonstrated strong capabilities in segmenting and semantically understanding object parts, even in cluttered, crowded scenes. However, while these models can direct a robot toward the general region of an object, they lack the geometric understanding required to precisely control dexterous robotic hands for 3D grasping. To overcome this, our key insight is to leverage simulation with a force-closure grasp generation pipeline that understands the local geometries of the hand and object in the scene. Because this pipeline is slow and requires ground-truth observations, the resulting data is distilled into a diffusion model that operates in real time on camera point clouds. By combining the global semantic understanding of internet-scale models with the geometric precision of a simulation-based, locally-aware force-closure pipeline, our method achieves high-performance semantic grasping without any manually collected training data. For visualizations, please visit our website at https://ifgrasping.github.io/
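The force-closure criterion the abstract mentions can be illustrated with a minimal two-contact antipodal check: a grasp resists external wrenches only if the line between the contact points lies inside the friction cone at each contact. The function below is a hypothetical sketch of that test, not the paper's actual pipeline; the function name, parameters, and friction coefficient are illustrative assumptions.

```python
# Hypothetical sketch of a two-contact antipodal force-closure test, one
# geometric ingredient of force-closure grasp generation. Names and the
# default friction coefficient mu are illustrative, not from the paper.
import numpy as np

def antipodal_force_closure(p1, n1, p2, n2, mu=0.5):
    """Return True if contacts (p1, p2) with outward surface normals
    (n1, n2) pass an antipodal force-closure proxy: the grasp axis
    must lie inside both friction cones of half-angle arctan(mu)."""
    p1, n1, p2, n2 = (np.asarray(v, dtype=float) for v in (p1, n1, p2, n2))
    axis = p2 - p1
    axis /= np.linalg.norm(axis)
    half_angle = np.arctan(mu)
    # Squeezing forces act along the inward normals (-n1, -n2); the grasp
    # axis must deviate from each inward normal by less than the cone angle.
    ang1 = np.arccos(np.clip(np.dot(axis, -n1), -1.0, 1.0))
    ang2 = np.arccos(np.clip(np.dot(-axis, -n2), -1.0, 1.0))
    return bool(ang1 <= half_angle and ang2 <= half_angle)

# Opposite faces of a box: a valid antipodal grasp.
print(antipodal_force_closure([0, 0, 0], [-1, 0, 0], [1, 0, 0], [1, 0, 0]))
# Perpendicular faces: the grasp axis leaves both friction cones.
print(antipodal_force_closure([0, 0, 0], [-1, 0, 0], [0, 1, 0], [0, 1, 0]))
```

In a full pipeline, a check like this would score many sampled contact pairs on the object mesh, restricted to the part region selected by the vision model's semantic mask.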
Related papers
- Mirage2Matter: A Physically Grounded Gaussian World Model from Video [87.9732484393686]
We present Simulate Anything, a graphics-driven world modeling and simulation framework. Our approach reconstructs real-world environments into a photorealistic scene representation using 3D Gaussian Splatting (3DGS). We then leverage generative models to recover a physically realistic representation and integrate it into a simulation environment via a precision calibration target.
arXiv Detail & Related papers (2026-01-24T07:43:57Z) - R2RGEN: Real-to-Real 3D Data Generation for Spatially Generalized Manipulation [74.41728218960465]
We propose a real-to-real 3D data generation framework (R2RGen) that directly augments pointcloud observation-action pairs to generate real-world data. R2RGen substantially enhances data efficiency in extensive experiments and demonstrates strong potential for scaling and application to mobile manipulation.
arXiv Detail & Related papers (2025-10-09T17:55:44Z) - Manipulation as in Simulation: Enabling Accurate Geometry Perception in Robots [55.43376513158555]
Camera Depth Models (CDMs) are a simple plugin for daily-use depth cameras. We develop a neural data engine that generates high-quality paired data from simulation by modeling a depth camera's noise pattern. Our experiments demonstrate, for the first time, that a policy trained on raw simulated depth, without added noise or real-world fine-tuning, generalizes seamlessly to real-world robots.
arXiv Detail & Related papers (2025-09-02T17:29:38Z) - Optimizing Grasping in Legged Robots: A Deep Learning Approach to Loco-Manipulation [0.6533458718563319]
This paper presents a framework designed to enhance the grasping capabilities of quadrupeds equipped with arms. We developed a pipeline within the Genesis simulation environment to generate a synthetic dataset of grasp attempts on common objects. This dataset was used to train a custom CNN with a U-Net-like architecture that processes multi-modal input from onboard RGB and depth cameras. We validated the complete framework on a four-legged robot.
arXiv Detail & Related papers (2025-08-24T17:47:56Z) - Articulated Object Manipulation using Online Axis Estimation with SAM2-Based Tracking [57.942404069484134]
Articulated object manipulation requires precise object interaction, where the object's axis must be carefully considered. Previous research employed interactive perception for manipulating articulated objects, but such open-loop approaches often overlook interaction dynamics. We present a closed-loop pipeline integrating interactive perception with online axis estimation from segmented 3D point clouds.
arXiv Detail & Related papers (2024-09-24T17:59:56Z) - AutoDecoding Latent 3D Diffusion Models [95.7279510847827]
We present a novel approach to the generation of static and articulated 3D assets that has a 3D autodecoder at its core.
The 3D autodecoder framework embeds properties learned from the target dataset in the latent space.
We then identify the appropriate intermediate volumetric latent space, and introduce robust normalization and de-normalization operations.
arXiv Detail & Related papers (2023-07-07T17:59:14Z) - NSLF-OL: Online Learning of Neural Surface Light Fields alongside Real-time Incremental 3D Reconstruction [0.76146285961466]
The paper proposes a novel Neural Surface Light Fields model that copes with the small range of view directions while producing a good result in unseen directions.
Our model learns the Neural Surface Light Fields (NSLF) online, alongside real-time 3D reconstruction, with a sequential data stream as the shared input.
In addition to online training, our model also provides real-time rendering after completing the data stream for visualization.
arXiv Detail & Related papers (2023-04-29T15:41:15Z) - Neural Scene Representation for Locomotion on Structured Terrain [56.48607865960868]
We propose a learning-based method to reconstruct the local terrain for a mobile robot traversing urban environments.
Using a stream of depth measurements from the onboard cameras and the robot's trajectory, the method estimates the topography in the robot's vicinity.
We propose a 3D reconstruction model that faithfully reconstructs the scene, despite the noisy measurements and large amounts of missing data coming from the blind spots of the camera arrangement.
arXiv Detail & Related papers (2022-06-16T10:45:17Z) - Hindsight for Foresight: Unsupervised Structured Dynamics Models from Physical Interaction [24.72947291987545]
A key challenge for an agent learning to interact with the world is reasoning about the physical properties of objects.
We propose a novel approach for modeling the dynamics of a robot's interactions directly from unlabeled 3D point clouds and images.
arXiv Detail & Related papers (2020-08-02T11:04:49Z) - Self-Supervised Object-in-Gripper Segmentation from Robotic Motions [27.915309216800125]
We propose a robust solution for learning to segment unknown objects grasped by a robot.
We exploit motion and temporal cues in RGB video sequences.
Our approach is fully self-supervised and independent of precise camera calibration, 3D models or potentially imperfect depth data.
arXiv Detail & Related papers (2020-02-11T15:44:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.