Sim-Grasp: Learning 6-DOF Grasp Policies for Cluttered Environments Using a Synthetic Benchmark
- URL: http://arxiv.org/abs/2405.00841v2
- Date: Tue, 16 Jul 2024 22:12:11 GMT
- Title: Sim-Grasp: Learning 6-DOF Grasp Policies for Cluttered Environments Using a Synthetic Benchmark
- Authors: Juncheng Li, David J. Cappelleri
- Abstract summary: We present Sim-Grasp, a robust 6-DOF two-finger grasping system that integrates advanced language models for enhanced object manipulation in cluttered environments.
We introduce the Sim-Grasp-Dataset, which includes 1,550 objects across 500 scenarios with 7.9 million annotated labels, and develop Sim-GraspNet to generate grasp poses from point clouds.
- Score: 6.7936188782093945
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present Sim-Grasp, a robust 6-DOF two-finger grasping system that integrates advanced language models for enhanced object manipulation in cluttered environments. We introduce the Sim-Grasp-Dataset, which includes 1,550 objects across 500 scenarios with 7.9 million annotated labels, and develop Sim-GraspNet to generate grasp poses from point clouds. The Sim-Grasp-Policies achieve grasping success rates of 97.14% for single objects and 87.43% and 83.33% for mixed clutter scenarios of Levels 1-2 and Levels 3-4 objects, respectively. By incorporating language models for target identification through text and box prompts, Sim-Grasp enables both object-agnostic and target-specific picking, pushing the boundaries of intelligent robotic systems.
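As a concrete illustration of the pipeline the abstract describes, below is a minimal sketch of target-specific grasp selection: a grasp network scores 6-DOF candidates from a point cloud, a text- or box-prompted detector supplies a 2D target mask, and the best candidate landing on the mask is chosen. The candidate format, `select_target_grasp`, and the pinhole intrinsics are hypothetical stand-ins, not the authors' API.

```python
# Minimal sketch (not the authors' code): pick the best-scored 6-DOF grasp
# whose center projects onto a language-grounded target mask.

def project_to_pixel(point_cam, fx, fy, cx, cy):
    """Pinhole projection of a 3D point in the camera frame (assumes z > 0)."""
    u = fx * point_cam[0] / point_cam[2] + cx
    v = fy * point_cam[1] / point_cam[2] + cy
    return int(round(u)), int(round(v))

def select_target_grasp(candidates, target_mask, intrinsics):
    """Highest-scored grasp whose center lies on the target mask.

    candidates: list of (score, center_xyz, rotation_3x3) tuples, e.g. as a
                point-cloud grasp network such as Sim-GraspNet might output.
    target_mask: HxW boolean array from a text- or box-prompted detector.
    intrinsics: (fx, fy, cx, cy) of the camera that produced the mask.
    """
    fx, fy, cx, cy = intrinsics
    h, w = target_mask.shape
    for score, center, rot in sorted(candidates, key=lambda c: -c[0]):
        u, v = project_to_pixel(center, fx, fy, cx, cy)
        if 0 <= v < h and 0 <= u < w and target_mask[v, u]:
            return score, center, rot   # best candidate on the target
    return None                         # fall back to object-agnostic picking
```

When no candidate lands on the mask, the caller can fall back to object-agnostic picking, mirroring the dual mode the abstract mentions.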
Related papers
- Articulated 3D Scene Graphs for Open-World Mobile Manipulation [55.97942733699124]
We present MoMa-SG, a framework for building semantic-kinematic 3D scene graphs of articulated scenes. We estimate articulation models using a novel unified twist estimation formulation. We also introduce the novel Arti4D-Semantic dataset.
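For readers unfamiliar with the term, a twist is the screw-motion representation of a rigid displacement; a minimal sketch of recovering one from two observed poses of a part is below. This is generic screw-theory bookkeeping, not MoMa-SG's unified formulation.

```python
# Minimal sketch: extract a twist (rotation axis*angle plus translation) from
# the relative motion between two observed poses of an articulated part.
import numpy as np

def so3_log(R):
    """Rotation vector (axis * angle) of a rotation matrix."""
    angle = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    if angle < 1e-8:
        return np.zeros(3)
    w = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    return angle * w / (2.0 * np.sin(angle))

def relative_twist(T_a, T_b):
    """Rotational and translational parts of the motion T_a -> T_b (4x4)."""
    T_rel = np.linalg.inv(T_a) @ T_b
    omega = so3_log(T_rel[:3, :3])   # revolute component
    v = T_rel[:3, 3]                 # prismatic / translational component
    return omega, v
```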
arXiv Detail & Related papers (2026-02-18T10:40:35Z)
- SceneSmith: Agentic Generation of Simulation-Ready Indoor Scenes [19.995619927680476]
SceneSmith builds environments from architectural layout to natural furniture population. SceneSmith generates more objects than prior methods, with 2% inter-object collisions and 96% of objects remaining stable under physics simulation. SceneSmith environments can be used in an end-to-end pipeline for automatic policy evaluation.
arXiv Detail & Related papers (2026-02-09T19:56:04Z)
- URDF-Anything: Constructing Articulated Objects with 3D Multimodal Language Model [76.08429266631823]
We propose an end-to-end automatic reconstruction framework based on a 3D multimodal large language model (MLLM). URDF-Anything utilizes an autoregressive prediction framework based on point-cloud and text multimodal input to jointly optimize geometric segmentation and kinematic parameter prediction. Experiments on both simulated and real-world datasets demonstrate that our method significantly outperforms existing approaches.
arXiv Detail & Related papers (2025-11-02T13:45:51Z)
- RoboTwin 2.0: A Scalable Data Generator and Benchmark with Strong Domain Randomization for Robust Bimanual Robotic Manipulation [51.86515213749527]
We present RoboTwin 2.0, a scalable simulation framework that enables automated, large-scale generation of diverse and realistic data. To improve sim-to-real transfer, RoboTwin 2.0 incorporates structured domain randomization along five axes: clutter, lighting, background, tabletop height, and language instructions. We instantiate this framework across 50 dual-arm tasks spanning five robot embodiments, and pre-collect over 100,000 domain-randomized expert trajectories.
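A minimal sketch of what sampling along those five randomization axes could look like; all ranges, asset names, and instruction templates below are invented for illustration and are not RoboTwin 2.0's actual configuration.

```python
# Minimal sketch: one randomized tabletop configuration along the five axes
# the abstract names. Every range and asset name here is made up.
import random

def sample_scene_config(rng: random.Random) -> dict:
    return {
        "num_clutter_objects": rng.randint(0, 12),
        "light_intensity": rng.uniform(0.3, 1.5),
        "background_texture": rng.choice(["wood", "tile", "fabric"]),
        "table_height_m": rng.uniform(0.70, 0.85),
        "instruction": rng.choice([
            "pick up the {obj}",
            "hand me the {obj}",
            "move the {obj} into the bin",
        ]),
    }

# Usage: configs = [sample_scene_config(random.Random(i)) for i in range(100)]
```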
arXiv Detail & Related papers (2025-06-22T16:26:53Z)
- LENS: Multi-level Evaluation of Multimodal Reasoning with Large Language Models [59.0256377330646]
Lens is a benchmark with 3.4K contemporary images and 60K+ human-authored questions covering eight tasks and 12 daily scenarios. The dataset intrinsically supports evaluating how MLLMs handle image-invariant prompts, from basic perception to compositional reasoning. We evaluate 15+ frontier MLLMs such as Qwen2.5-VL-72B, InternVL3-78B, GPT-4o and two reasoning models, QVQ-72B-preview and Kimi-VL.
arXiv Detail & Related papers (2025-05-21T15:06:59Z)
- GraspClutter6D: A Large-scale Real-world Dataset for Robust Perception and Grasping in Cluttered Scenes [5.289647064481469]
We present GraspClutter6D, a large-scale real-world grasping dataset featuring 1,000 cluttered scenes with dense arrangements.
We benchmark state-of-the-art segmentation, object pose estimation, and grasping detection methods to provide key insights into challenges in cluttered environments.
We validate the dataset's effectiveness as a training resource, demonstrating that grasping networks trained on GraspClutter6D significantly outperform those trained on existing datasets in both simulation and real-world experiments.
arXiv Detail & Related papers (2025-04-09T13:15:46Z)
- IAAO: Interactive Affordance Learning for Articulated Objects in 3D Environments [56.85804719947]
We present IAAO, a framework that builds an explicit 3D model for intelligent agents to gain understanding of articulated objects in their environment through interaction.
We first build hierarchical features and label fields for each object state using 3D Gaussian Splatting (3DGS) by distilling mask features and view-consistent labels from multi-view images.
We then perform object- and part-level queries on the 3D Gaussian primitives to identify static and articulated elements, estimating global transformations and local articulation parameters along with affordances.
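As a minimal sketch of the last step (estimating articulation parameters for a moving element), the snippet below recovers a revolute joint's axis, angle, and a point on the axis from corresponding part points in two object states. IAAO operates on 3D Gaussian primitives; plain point correspondences are assumed here for simplicity.

```python
# Minimal sketch: revolute articulation parameters from corresponding part
# points P (state 1) and Q (state 2), both Nx3 numpy arrays.
import numpy as np

def kabsch(P, Q):
    """Least-squares rigid transform (R, t) mapping P onto Q."""
    Pc, Qc = P - P.mean(0), Q - Q.mean(0)
    U, _, Vt = np.linalg.svd(Pc.T @ Qc)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    return R, Q.mean(0) - R @ P.mean(0)

def revolute_params(P, Q):
    """Axis (unit), angle (rad), and a point on the axis for the motion P->Q."""
    R, t = kabsch(P, Q)
    angle = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    w = np.array([R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    if np.linalg.norm(w) < 1e-9:
        raise ValueError("no detectable rotation; joint may be prismatic")
    axis = w / np.linalg.norm(w)
    # A point p0 on the axis satisfies (I - R) p0 = t; the system is singular
    # along the axis, so take the least-squares (minimum-norm) solution.
    p0 = np.linalg.lstsq(np.eye(3) - R, t, rcond=None)[0]
    return axis, angle, p0
```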
arXiv Detail & Related papers (2025-04-09T12:36:48Z)
- Local Policies Enable Zero-shot Long-horizon Manipulation [80.1161776000682]
We introduce ManipGen, which leverages a new class of policies for sim2real transfer: local policies.
ManipGen outperforms SOTA approaches such as SayCan, OpenVLA, LLMTrajGen and VoxPoser across 50 real-world manipulation tasks by 36%, 76%, 62% and 60% respectively.
arXiv Detail & Related papers (2024-10-29T17:59:55Z)
- Open-vocabulary Mobile Manipulation in Unseen Dynamic Environments with 3D Semantic Maps [16.083092305930844]
Open-Vocabulary Mobile Manipulation (OVMM) is a crucial capability for autonomous robots.
We propose a novel framework that leverages the zero-shot detection and grounded recognition capabilities of pretrained vision-language models.
We have built JSR-1, a 10-DoF mobile manipulation robotic platform, and demonstrated the framework in real-world robot experiments.
arXiv Detail & Related papers (2024-06-26T07:06:42Z)
- Learning from Models and Data for Visual Grounding [55.21937116752679]
We introduce SynGround, a framework that combines data-driven learning and knowledge transfer from various large-scale pretrained models.
We finetune a pretrained vision-and-language model on this dataset by optimizing a mask-attention objective.
The resulting model improves the grounding capabilities of an off-the-shelf vision-and-language model.
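A minimal sketch of a mask-attention objective of the kind described: push the model's attention mass for a grounded phrase inside that phrase's region mask. SynGround's exact formulation may differ; `mask_attention_loss` is illustrative only.

```python
# Minimal sketch: encourage an attention map to concentrate inside the region
# mask of the phrase being grounded (minimized when all mass lies inside).
import numpy as np

def mask_attention_loss(attn, mask, eps=1e-8):
    """Negative log of attention mass inside the mask.

    attn: HxW non-negative attention map (normalized here to sum to 1).
    mask: HxW binary region mask for the grounded phrase.
    """
    attn = attn / (attn.sum() + eps)
    inside = (attn * mask).sum()     # attention mass on the target region
    return -np.log(inside + eps)
```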
arXiv Detail & Related papers (2024-03-20T17:59:43Z)
- AGILE: Approach-based Grasp Inference Learned from Element Decomposition [2.812395851874055]
Humans can grasp objects by taking into account hand-object positioning information.
This work proposes a method that enables a robot manipulator to learn the same skill, grasping objects optimally.
arXiv Detail & Related papers (2024-02-02T10:47:08Z)
- DeMuX: Data-efficient Multilingual Learning [57.37123046817781]
DEMUX is a framework that prescribes exact data-points to label from vast amounts of unlabelled multilingual data.
Our end-to-end framework is language-agnostic, accounts for model representations, and supports multilingual target configurations.
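A minimal sketch of representation-aware data selection in the spirit described: label the unlabelled points whose embeddings sit closest to the target configuration's embeddings. This is a generic nearest-neighbour acquisition rule, not DeMuX's exact strategy.

```python
# Minimal sketch: choose which unlabelled points to annotate based on how
# close their embeddings are to a target set's embeddings.
import numpy as np

def select_to_label(unlabelled_emb, target_emb, budget):
    """Indices of `budget` unlabelled points nearest to any target embedding.

    unlabelled_emb: (n_unlab, d) array; target_emb: (n_target, d) array.
    """
    d2 = ((unlabelled_emb[:, None, :] - target_emb[None, :, :]) ** 2).sum(-1)
    nearest = d2.min(axis=1)             # distance to the closest target point
    return np.argsort(nearest)[:budget]  # most target-like points first
```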
arXiv Detail & Related papers (2023-11-10T20:09:08Z)
- Sim-Suction: Learning a Suction Grasp Policy for Cluttered Environments Using a Synthetic Benchmark [8.025760743074066]
Sim-Suction is a robust object-aware suction grasp policy for mobile manipulation platforms with dynamic camera viewpoints.
Sim-Suction-Dataset comprises 500 cluttered environments with 3.2 million annotated suction grasp poses.
Sim-Suction-Pointnet generates robust 6D suction grasp poses by learning point-wise affordances from the Sim-Suction-Dataset.
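A minimal sketch of turning a point-wise suction affordance into a 6D pose: estimate the local surface normal and align the approach axis with it. The PCA normal estimation here is a simple stand-in, not what Sim-Suction-Pointnet learns.

```python
# Minimal sketch: 6D suction pose at a chosen cloud point, z-axis into surface.
import numpy as np

def estimate_normal(cloud, idx, k=20):
    """Unit normal at cloud[idx] from PCA over its k nearest neighbours."""
    d = np.linalg.norm(cloud - cloud[idx], axis=1)
    nbrs = cloud[np.argsort(d)[:k]]
    cov = np.cov((nbrs - nbrs.mean(0)).T)
    normal = np.linalg.eigh(cov)[1][:, 0]  # eigenvector of smallest eigenvalue
    return normal / np.linalg.norm(normal)

def suction_pose(cloud, idx, camera_origin=np.zeros(3)):
    """4x4 pose whose z-axis points into the surface at the chosen point."""
    n = estimate_normal(cloud, idx)
    if n @ (camera_origin - cloud[idx]) > 0:  # n faces the camera...
        n = -n                                # ...flip so z goes into surface
    z = n
    x = np.cross(z, [0.0, 0.0, 1.0])
    if np.linalg.norm(x) < 1e-6:              # z parallel to world up
        x = np.array([1.0, 0.0, 0.0])
    x = x / np.linalg.norm(x)
    y = np.cross(z, x)
    T = np.eye(4)
    T[:3, :3] = np.column_stack([x, y, z])
    T[:3, 3] = cloud[idx]
    return T
```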
arXiv Detail & Related papers (2023-05-25T15:31:08Z)
- Sim-MEES: Modular End-Effector System Grasping Dataset for Mobile Manipulators in Cluttered Environments [10.414347878456852]
We present a large-scale synthetic dataset that contains 1,550 objects with varying difficulty levels and physics properties, as well as 11 million grasp labels for mobile manipulators to plan grasps using different modalities in cluttered environments.
Our dataset generation process combines analytic models and dynamic simulations of the entire cluttered environment to provide accurate grasp labels.
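One classic analytic criterion that such labelling pipelines combine with dynamic simulation is the antipodal force-closure test for two-finger grasps, sketched below with numpy arrays as inputs; whether Sim-MEES uses exactly this criterion is an assumption.

```python
# Minimal sketch: antipodal force-closure test for a two-finger grasp under
# Coulomb friction. Contacts are (point, inward unit normal) numpy arrays.
import numpy as np

def antipodal(p1, n1, p2, n2, mu=0.4):
    """True if both contact normals lie inside the friction cone around the
    line connecting the contacts (friction coefficient `mu`)."""
    axis = p2 - p1
    axis = axis / np.linalg.norm(axis)
    half_angle = np.arctan(mu)                    # friction cone half-angle
    a1 = np.arccos(np.clip(n1 @ axis, -1, 1))     # n1 vs direction p1 -> p2
    a2 = np.arccos(np.clip(n2 @ -axis, -1, 1))    # n2 vs direction p2 -> p1
    return a1 <= half_angle and a2 <= half_angle
```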
arXiv Detail & Related papers (2023-05-17T21:40:26Z)
- Using Detection, Tracking and Prediction in Visual SLAM to Achieve Real-time Semantic Mapping of Dynamic Scenarios [70.70421502784598]
RDS-SLAM can build object-level semantic maps for dynamic scenarios in real time using only a commonly used Intel Core i7 CPU.
We evaluate RDS-SLAM on the TUM RGB-D dataset, and experimental results show that it runs at 30.3 ms per frame in dynamic scenarios.
arXiv Detail & Related papers (2022-10-10T11:03:32Z)
- Self-Supervised Interactive Object Segmentation Through a Singulation-and-Grasping Approach [9.029861710944704]
We propose a robot learning approach to interact with novel objects and collect each object's training label.
The Singulation-and-Grasping (SaG) policy is trained through end-to-end reinforcement learning.
Our system achieves 70% singulation success rate in simulated cluttered scenes.
arXiv Detail & Related papers (2022-07-19T15:01:36Z)
- Sim-to-Real Transfer for Vision-and-Language Navigation [70.86250473583354]
We study the problem of releasing a robot in a previously unseen environment, and having it follow unconstrained natural language navigation instructions.
Recent work on the task of Vision-and-Language Navigation (VLN) has achieved significant progress in simulation.
To assess the implications of this work for robotics, we transfer a VLN agent trained in simulation to a physical robot.
arXiv Detail & Related papers (2020-11-07T16:49:04Z)
- Meta-Sim2: Unsupervised Learning of Scene Structure for Synthetic Data Generation [88.04759848307687]
In Meta-Sim2, we aim to learn the scene structure in addition to parameters, which is a challenging problem due to its discrete nature.
We use Reinforcement Learning to train our model, and design a feature space divergence between our synthesized and target images that is key to successful training.
We also show that this leads to downstream improvement in the performance of an object detector trained on our generated dataset as opposed to other baseline simulation methods.
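A minimal sketch of a feature-space divergence used as a (negative) training reward: compare batch-mean features of synthesized and target images. Meta-Sim2 designs its own divergence; this mean-feature statistic only shows the shape of the signal.

```python
# Minimal sketch: mean-feature distance between synthesized and target image
# batches, an MMD-style statistic (lower means the batches look more alike).
import numpy as np

def feature_divergence(synth_feats, target_feats):
    """Squared distance between batch-mean feature vectors (N x d arrays)."""
    return float(((synth_feats.mean(0) - target_feats.mean(0)) ** 2).sum())

# The RL reward for the scene-structure generator would then be the negative
# divergence: reward = -feature_divergence(f_synth, f_target)
```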
arXiv Detail & Related papers (2020-08-20T17:28:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.