RAGNet: Large-scale Reasoning-based Affordance Segmentation Benchmark towards General Grasping
- URL: http://arxiv.org/abs/2507.23734v1
- Date: Thu, 31 Jul 2025 17:17:05 GMT
- Title: RAGNet: Large-scale Reasoning-based Affordance Segmentation Benchmark towards General Grasping
- Authors: Dongming Wu, Yanping Fu, Saike Huang, Yingfei Liu, Fan Jia, Nian Liu, Feng Dai, Tiancai Wang, Rao Muhammad Anwer, Fahad Shahbaz Khan, Jianbing Shen
- Abstract summary: We build a large-scale grasping-oriented affordance segmentation benchmark with human-like instructions, named RAGNet. The images cover diverse embodied data domains, such as wild, robot, ego-centric, and even simulation data. We propose a comprehensive affordance-based grasping framework, named AffordanceNet, which consists of a VLM pre-trained on our massive affordance data and a grasping network that is conditioned on an affordance map to grasp the target.
- Score: 101.22617426879079
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: General robotic grasping systems require accurate object affordance perception in diverse open-world scenarios following human instructions. However, current studies suffer from a lack of reasoning-based, large-scale affordance prediction data, raising considerable concern about their open-world effectiveness. To address this limitation, we build a large-scale grasping-oriented affordance segmentation benchmark with human-like instructions, named RAGNet. It contains 273k images, 180 categories, and 26k reasoning instructions. The images cover diverse embodied data domains, such as wild, robot, ego-centric, and even simulation data. They are carefully annotated with an affordance map, while the difficulty of the language instructions is substantially increased by removing category names and providing only functional descriptions. Furthermore, we propose a comprehensive affordance-based grasping framework, named AffordanceNet, which consists of a VLM pre-trained on our massive affordance data and a grasping network that is conditioned on an affordance map to grasp the target. Extensive experiments on affordance segmentation benchmarks and real-robot manipulation tasks show that our model has powerful open-world generalization ability. Our data and code are available at https://github.com/wudongming97/AffordanceNet.
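The abstract's framing suggests a simple two-stage interface: an affordance segmenter maps an image (and an instruction) to an affordance map, and a grasping network consumes the image together with that map. Below is a minimal PyTorch sketch of that interface only; the instruction/VLM branch is omitted for brevity, and the module names, channel sizes, and the (x, y, theta, width) grasp parameterization are illustrative assumptions, not the authors' AffordanceNet implementation.

```python
# Minimal sketch of an affordance-conditioned grasping interface (PyTorch).
# The instruction/VLM branch is omitted for brevity; names and shapes are
# illustrative assumptions, not the authors' AffordanceNet implementation.
import torch
import torch.nn as nn

class AffordanceConditionedGrasper(nn.Module):
    def __init__(self):
        super().__init__()
        # Stand-in for the affordance segmenter: RGB image -> 1-channel map.
        self.affordance_head = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 1), nn.Sigmoid(),
        )
        # Stand-in grasp network: image stacked with the affordance map
        # (4 channels) -> a hypothetical (x, y, theta, width) grasp vector.
        self.grasp_head = nn.Sequential(
            nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 4),
        )

    def forward(self, image):
        affordance = self.affordance_head(image)               # (B, 1, H, W)
        grasp = self.grasp_head(torch.cat([image, affordance], dim=1))
        return affordance, grasp

model = AffordanceConditionedGrasper()
aff, grasp = model(torch.randn(1, 3, 224, 224))
print(aff.shape, grasp.shape)  # (1, 1, 224, 224) and (1, 4)
```

One plausible reading of "conditioned on an affordance map" is that the grasp head never sees the instruction directly, only the map distilled from it, which is the design point this toy interface mirrors.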
Related papers
- From Ideal to Real: Unified and Data-Efficient Dense Prediction for Real-World Scenarios [12.06521067086988]
We propose DenseDiT, which exploits generative models' visual priors to perform diverse real-world dense prediction tasks.
DenseDiT achieves superior results using less than 0.01% of the baselines' training data, underscoring its practical value for real-world deployment.
arXiv Detail & Related papers (2025-06-25T09:40:50Z)
- RemoteSAM: Towards Segment Anything for Earth Observation [29.707796048411705]
We aim to develop a robust yet flexible visual foundation model for Earth observation.
It should possess strong capabilities in recognizing and localizing diverse visual targets.
We present RemoteSAM, a foundation model that establishes new SoTA on several Earth observation perception benchmarks.
arXiv Detail & Related papers (2025-05-23T15:27:57Z)
- HAMSTER: Hierarchical Action Models For Open-World Robot Manipulation [54.03004125910057]
We show that hierarchical vision-language-action models can be more effective in utilizing off-domain data than standard monolithic VLA models.
With this hierarchical design, the high-level VLM can transfer across significant domain gaps between the off-domain fine-tuning data and real-robot testing scenarios.
arXiv Detail & Related papers (2025-02-08T07:50:22Z)
- Chameleon: A Data-Efficient Generalist for Dense Visual Prediction in the Wild [32.33035216140421]
Large language models have evolved into data-efficient generalists, benefiting from the universal language interface and large-scale pre-training.
However, constructing a data-efficient generalist for dense visual prediction presents a distinct challenge due to the variation in label structures across different tasks.
In this study, we explore a universal model that can flexibly adapt to unseen dense label structures with a few examples.
We evaluate our model across a spectrum of unseen real-world scenarios where low-shot learning is desirable, including video, 3D, medical, biological, and user-interactive tasks.
arXiv Detail & Related papers (2024-04-29T06:35:34Z)
- On-Device Domain Generalization [93.79736882489982]
Domain generalization is critical to on-device machine learning applications.
We find that knowledge distillation is a strong candidate for solving the problem.
We propose a simple idea called out-of-distribution knowledge distillation (OKD), which aims to teach the student how the teacher handles (synthetic) out-of-distribution data (a minimal training-loss sketch follows this entry).
arXiv Detail & Related papers (2022-09-15T17:59:31Z)
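A hedged sketch of the OKD recipe as the summary above describes it: supervise the student on labeled in-distribution data, and additionally distill the teacher's outputs on synthetic OOD inputs. The noise-based OOD synthesis, the loss weight, and the temperature are my assumptions, not the paper's settings.

```python
# Hedged sketch of out-of-distribution knowledge distillation (OKD):
# cross-entropy on in-distribution data plus KL distillation of the
# teacher's behavior on a synthetic OOD batch. The noise mix-in, weight,
# and temperature are illustrative assumptions, not the paper's recipe.
import torch
import torch.nn.functional as F

def okd_loss(student, teacher, x_in, y_in, ood_alpha=0.5, tau=4.0):
    # Standard supervised loss on labeled in-distribution inputs.
    ce = F.cross_entropy(student(x_in), y_in)
    # Toy OOD synthesis; any augmentation or generator could replace it.
    x_ood = x_in + torch.randn_like(x_in)
    with torch.no_grad():
        t_logits = teacher(x_ood)
    s_logits = student(x_ood)
    # Temperature-scaled KL between student and teacher on the OOD batch.
    kd = F.kl_div(
        F.log_softmax(s_logits / tau, dim=1),
        F.softmax(t_logits / tau, dim=1),
        reduction="batchmean",
    ) * tau * tau
    return ce + ood_alpha * kd
```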
- MetaGraspNet: A Large-Scale Benchmark Dataset for Vision-driven Robotic Grasping via Physics-based Metaverse Synthesis [78.26022688167133]
We present a large-scale benchmark dataset for vision-driven robotic grasping via physics-based metaverse synthesis.
The proposed dataset contains 100,000 images and 25 different object types.
We also propose a new layout-weighted performance metric alongside the dataset for evaluating object detection and segmentation performance.
arXiv Detail & Related papers (2021-12-29T17:23:24Z)
- Large-scale Unsupervised Semantic Segmentation [163.3568726730319]
We propose a new problem of large-scale unsupervised semantic segmentation (LUSS) with a newly created benchmark dataset to track the research progress.
Based on the ImageNet dataset, we propose the ImageNet-S dataset with 1.2 million training images and 40k high-quality semantic segmentation annotations for evaluation.
arXiv Detail & Related papers (2021-06-06T15:02:11Z)
- Graph Backdoor [53.70971502299977]
We present GTA, the first backdoor attack on graph neural networks (GNNs).
GTA departs in significant ways: it defines triggers as specific subgraphs, including both topological structures and descriptive features.
It can be instantiated for both transductive (e.g., node classification) and inductive (e.g., graph classification) tasks (a toy trigger-injection sketch follows this entry).
arXiv Detail & Related papers (2020-06-21T19:45:30Z)
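A toy illustration of the "trigger as a subgraph" idea from the Graph Backdoor entry above: graft a small fixed subgraph onto a host graph at an anchor node. GTA optimizes trigger topology and features per input; here the trigger is a fixed clique and the attachment rule is a single edge, both purely illustrative assumptions.

```python
# Toy subgraph-trigger injection in the spirit of GTA; the fixed clique
# trigger and single-edge attachment are assumptions, not the paper's
# optimized, input-adaptive trigger.
import networkx as nx

def inject_trigger(g: nx.Graph, trigger: nx.Graph, anchor: int) -> nx.Graph:
    """Graft `trigger` onto `g` at `anchor`, relabeling trigger nodes to
    avoid collisions; compose() carries over the trigger's node/edge
    attributes (its "descriptive features")."""
    offset = max(g.nodes) + 1
    relabeled = nx.relabel_nodes(trigger, {n: n + offset for n in trigger.nodes})
    poisoned = nx.compose(g, relabeled)
    poisoned.add_edge(anchor, offset)  # connect trigger to the host graph
    return poisoned

host = nx.cycle_graph(6)
trigger = nx.complete_graph(3)  # toy topological trigger
poisoned = inject_trigger(host, trigger, anchor=0)
print(poisoned.number_of_nodes(), poisoned.number_of_edges())  # 9 10
```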
- Learning Cross-domain Generalizable Features by Representation Disentanglement [11.74643883335152]
Deep learning models exhibit limited generalizability across different domains.
We propose Mutual-Information-based Disentangled Neural Networks (MIDNet) to extract generalizable features that enable transferring knowledge to unseen categorical features in target domains.
We demonstrate our method on handwritten-digit datasets and a fetal ultrasound dataset for image classification tasks (a toy disentanglement sketch follows this entry).
arXiv Detail & Related papers (2020-02-29T17:53:16Z)
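For the MIDNet entry above: the paper's objective is mutual-information-based; as a crude, clearly-substituted stand-in, the sketch below penalizes the cross-covariance between a "content" code and a "domain" code, which only enforces linear decorrelation rather than true MI minimization. All names and sizes are assumptions.

```python
# Hedged sketch of representation disentanglement for cross-domain features.
# MIDNet minimizes mutual information between codes; here a cross-covariance
# penalty serves as a crude linear-decorrelation proxy, not the paper's
# estimator. All module names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class DisentangledEncoder(nn.Module):
    def __init__(self, in_dim=784, code_dim=32, n_classes=10):
        super().__init__()
        self.content_enc = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, code_dim))
        self.domain_enc = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, code_dim))
        self.classifier = nn.Linear(code_dim, n_classes)  # reads content only

    def forward(self, x):
        c, d = self.content_enc(x), self.domain_enc(x)
        return self.classifier(c), c, d

def cross_cov_penalty(c, d):
    """Squared Frobenius norm of the cross-covariance between the codes:
    zero iff the centered codes are linearly uncorrelated."""
    c = c - c.mean(dim=0, keepdim=True)
    d = d - d.mean(dim=0, keepdim=True)
    cov = c.t() @ d / (c.shape[0] - 1)
    return (cov ** 2).sum()
```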
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.