ClickDiff: Click to Induce Semantic Contact Map for Controllable Grasp Generation with Diffusion Models
- URL: http://arxiv.org/abs/2407.19370v1
- Date: Sun, 28 Jul 2024 02:42:29 GMT
- Title: ClickDiff: Click to Induce Semantic Contact Map for Controllable Grasp Generation with Diffusion Models
- Authors: Peiming Li, Ziyi Wang, Mengyuan Liu, Hong Liu, Chen Chen
- Abstract summary: ClickDiff is a controllable conditional generation model that leverages a fine-grained Semantic Contact Map.
Within this framework, the Semantic Conditional Module generates reasonable contact maps based on fine-grained contact information.
We evaluate the validity of our proposed method, demonstrating the efficacy and robustness of ClickDiff, even with previously unseen objects.
- Score: 17.438429495623755
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Grasp generation aims to create complex hand-object interactions with a specified object. While traditional approaches for hand generation have primarily focused on visibility and diversity under scene constraints, they tend to overlook fine-grained hand-object interactions such as contacts, resulting in inaccurate and undesired grasps. To address these challenges, we propose a controllable grasp generation task and introduce ClickDiff, a controllable conditional generation model that leverages a fine-grained Semantic Contact Map (SCM). Particularly when synthesizing interactive grasps, the method enables precise control of grasp synthesis through either a user-specified or an algorithmically predicted Semantic Contact Map. Specifically, to optimally utilize contact supervision constraints and to accurately model the complex physical structure of hands, we propose a Dual Generation Framework. Within this framework, the Semantic Conditional Module generates reasonable contact maps based on fine-grained contact information, while the Contact Conditional Module utilizes contact maps alongside object point clouds to generate realistic grasps. We define evaluation criteria applicable to controllable grasp generation. Both unimanual and bimanual generation experiments on the GRAB and ARCTIC datasets verify the validity of our proposed method, demonstrating the efficacy and robustness of ClickDiff, even with previously unseen objects. Our code is available at https://github.com/adventurer-w/ClickDiff.
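To make the two-stage Dual Generation Framework concrete, here is a minimal PyTorch sketch of how such a pipeline could be wired: clicked part labels on the object yield a Semantic Contact Map, which then conditions grasp generation. All module interfaces, layer sizes, and the 61-dimensional hand parameterization are illustrative assumptions, not the authors' implementation; the real code is in the linked repository.

```python
# Hedged sketch of a two-stage "Dual Generation Framework" pipeline.
# Module names and shapes are illustrative assumptions, not the authors' API;
# see https://github.com/adventurer-w/ClickDiff for the real implementation.
import torch
import torch.nn as nn

class SemanticConditionalModule(nn.Module):
    """Maps fine-grained contact semantics (e.g. clicked finger-part labels
    per object point) to a per-point contact probability map."""
    def __init__(self, num_parts: int = 16, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + num_parts, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, points: torch.Tensor, part_onehot: torch.Tensor) -> torch.Tensor:
        # points: (B, N, 3), part_onehot: (B, N, num_parts) -> (B, N) contact map
        return self.mlp(torch.cat([points, part_onehot], dim=-1)).squeeze(-1)

class ContactConditionalModule(nn.Module):
    """Predicts hand parameters (e.g. a MANO-style pose vector) conditioned on
    the object point cloud and a contact map; stands in for the diffusion denoiser."""
    def __init__(self, hand_dim: int = 61, hidden: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(4, hidden), nn.ReLU())
        self.head = nn.Linear(hidden, hand_dim)

    def forward(self, points: torch.Tensor, contact_map: torch.Tensor) -> torch.Tensor:
        feats = self.encoder(torch.cat([points, contact_map.unsqueeze(-1)], dim=-1))
        return self.head(feats.max(dim=1).values)  # (B, hand_dim)

# Stage 1: user clicks (or a predictor) yield a Semantic Contact Map;
# Stage 2: the contact map conditions grasp generation.
points = torch.randn(2, 1024, 3)
parts = torch.zeros(2, 1024, 16).scatter_(-1, torch.randint(0, 16, (2, 1024, 1)), 1.0)
scm = SemanticConditionalModule()(points, parts)
grasp = ContactConditionalModule()(points, scm)
print(grasp.shape)  # torch.Size([2, 61])
```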
Related papers
- Exploiting Contextual Target Attributes for Target Sentiment Classification [53.30511968323911]
Existing PTLM-based models for TSC can be categorized into two groups: 1) fine-tuning-based models that adopt the PTLM as the context encoder; 2) prompting-based models that recast the classification task as a text/word generation task.
We present a new perspective of leveraging PTLM for TSC: simultaneously leveraging the merits of both language modeling and explicit target-context interactions via contextual target attributes.
arXiv Detail & Related papers (2023-12-21T11:45:28Z)
- Fine-grained Controllable Video Generation via Object Appearance and Context [74.23066823064575]
We propose fine-grained controllable video generation (FACTOR) to achieve detailed control.
FACTOR aims to control objects' appearances and context, including their location and category.
Our method achieves controllability of object appearances without finetuning, which reduces the per-subject optimization efforts for the users.
arXiv Detail & Related papers (2023-12-05T17:47:33Z)
- ContactGen: Generative Contact Modeling for Grasp Generation [37.56729700157981]
This paper presents a novel object-centric contact representation ContactGen for hand-object interaction.
We propose a conditional generative model to predict ContactGen and adopt model-based optimization to predict diverse and geometrically feasible grasps.
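As a hedged illustration of what model-based optimization against a predicted contact map might look like, the sketch below pulls a crude free point set (standing in for an articulated hand model) toward high-contact object points. The loss terms and thresholds are assumptions for the sketch, not ContactGen's actual procedure.

```python
# Toy contact-map-driven grasp optimization via gradient descent.
import torch

def optimize_grasp(obj_points, contact_map, steps=200, lr=1e-2):
    """obj_points: (N, 3); contact_map: (N,) in [0, 1].
    Returns hand surface points pulled toward high-contact object points."""
    hand_points = torch.randn(64, 3, requires_grad=True)  # crude hand proxy
    opt = torch.optim.Adam([hand_points], lr=lr)
    for _ in range(steps):
        # distance from each object point to its nearest hand point
        d = torch.cdist(obj_points, hand_points).min(dim=1).values  # (N,)
        # attract the hand to contact regions, repel it from non-contact regions
        loss = (contact_map * d).mean() + ((1 - contact_map) * torch.relu(0.01 - d)).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return hand_points.detach()

obj = torch.randn(512, 3)
cmap = torch.rand(512)
hand = optimize_grasp(obj, cmap)
```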
arXiv Detail & Related papers (2023-10-05T17:59:45Z)
- SSMG: Spatial-Semantic Map Guided Diffusion Model for Free-form Layout-to-Image Generation [68.42476385214785]
We propose a novel Spatial-Semantic Map Guided (SSMG) diffusion model that adopts the feature map, derived from the layout, as guidance.
SSMG achieves superior generation quality with sufficient spatial and semantic controllability compared to previous works.
We also propose the Relation-Sensitive Attention (RSA) and Location-Sensitive Attention (LSA) mechanisms.
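As a rough illustration of feature-map guidance, the sketch below embeds a per-pixel layout label map into a spatial-semantic feature map and adds it to the denoiser input; the RSA/LSA attention mechanisms are omitted, and all names and shapes are assumptions rather than the paper's code.

```python
# Minimal sketch of layout-feature-map guidance for a diffusion denoiser.
import torch
import torch.nn as nn

class GuidedDenoiser(nn.Module):
    def __init__(self, channels: int = 64, num_classes: int = 10):
        super().__init__()
        # embed per-pixel layout labels into a spatial-semantic feature map
        self.layout_embed = nn.Embedding(num_classes, channels)
        self.body = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x_noisy: torch.Tensor, layout: torch.Tensor) -> torch.Tensor:
        # layout: (B, H, W) integer class map -> (B, C, H, W) guidance features
        g = self.layout_embed(layout).permute(0, 3, 1, 2)
        return self.body(x_noisy + g)  # additive guidance; attention omitted

x = torch.randn(1, 64, 32, 32)
layout = torch.randint(0, 10, (1, 32, 32))
eps_pred = GuidedDenoiser()(x, layout)
```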
arXiv Detail & Related papers (2023-08-20T04:09:12Z)
- Integrated Object Deformation and Contact Patch Estimation from Visuo-Tactile Feedback [8.420670642409219]
We propose a representation that jointly models object deformations and contact patches from visuo-tactile feedback.
We propose a neural network architecture to learn an NDCF and train it using simulated data.
We demonstrate that the learned NDCF transfers directly to the real-world without the need for fine-tuning.
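The following is a minimal neural-field sketch of the idea: query points, conditioned on a latent code from a visuo-tactile encoder, map to a deformation offset and a contact probability. The interface and layer sizes are assumptions for illustration, not the paper's architecture.

```python
# Toy deformation-and-contact field: 3D queries -> (offset, contact prob).
import torch
import torch.nn as nn

class DeformationContactField(nn.Module):
    def __init__(self, latent_dim: int = 32, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.deform_head = nn.Linear(hidden, 3)   # per-point deformation offset
        self.contact_head = nn.Linear(hidden, 1)  # per-point contact logit

    def forward(self, query: torch.Tensor, z: torch.Tensor):
        # query: (B, N, 3); z: (B, latent_dim) from a visuo-tactile encoder
        h = self.net(torch.cat([query, z[:, None].expand(-1, query.shape[1], -1)], -1))
        return self.deform_head(h), torch.sigmoid(self.contact_head(h))

deform, contact = DeformationContactField()(torch.randn(2, 256, 3), torch.randn(2, 32))
```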
arXiv Detail & Related papers (2023-05-23T18:53:24Z)
- Unified Visual Relationship Detection with Vision and Language Models [89.77838890788638]
This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets.
We propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models.
Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model.
arXiv Detail & Related papers (2023-03-16T00:06:28Z)
- Contact2Grasp: 3D Grasp Synthesis via Hand-Object Contact Constraint [18.201389966034263]
3D grasp synthesis generates grasping poses given an input object.
We introduce an intermediate variable for grasp contact areas to constrain the grasp generation.
Our method outperforms state-of-the-art grasp generation methods on various metrics.
arXiv Detail & Related papers (2022-10-17T16:39:25Z)
- S$^2$Contact: Graph-based Network for 3D Hand-Object Contact Estimation with Semi-Supervised Learning [70.72037296392642]
We propose a novel semi-supervised framework that allows us to learn contact from monocular images.
Specifically, we leverage visual and geometric consistency constraints in large-scale datasets for generating pseudo-labels.
We show the benefits of using a contact map that constrains hand-object interactions to produce more accurate reconstructions.
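A toy sketch of how consistency between two cues could filter pseudo-labels, assuming hypothetical visual and geometric contact estimates; the agreement test and thresholds are illustrative choices, not the paper's procedure.

```python
# Consistency-filtered pseudo-labels for contact estimation.
import torch

def make_pseudo_labels(visual_contact, geometric_contact, agree_thresh=0.1):
    """Both inputs: (N,) contact probabilities from two independent cues.
    Keep a point's pseudo-label only where the cues agree."""
    agree = (visual_contact - geometric_contact).abs() < agree_thresh
    pseudo = ((visual_contact + geometric_contact) / 2 > 0.5).float()
    return pseudo[agree], agree  # labels and a mask of confident points

labels, mask = make_pseudo_labels(torch.rand(1024), torch.rand(1024))
```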
arXiv Detail & Related papers (2022-08-01T14:05:23Z)
- Hand-Object Contact Prediction via Motion-Based Pseudo-Labeling and Guided Progressive Label Correction [27.87570749976023]
We introduce a video-based method for predicting contact between a hand and an object.
Annotating a large number of hand-object tracks and contact labels is costly.
We propose a semi-supervised framework consisting of (i) automatic collection of training data with motion-based pseudo-labels and (ii) guided progressive label correction (gPLC).
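A toy version of such a pipeline, with illustrative heuristics (velocity coupling for pseudo-labels, a confidence margin for correction) standing in for the paper's actual criteria:

```python
# (i) motion-based pseudo-labels, (ii) confidence-based label correction.
import numpy as np

def motion_pseudo_labels(hand_vel, obj_vel, couple_thresh=0.05):
    # if the object's velocity tracks the hand's, the hand is likely in contact
    return (np.linalg.norm(hand_vel - obj_vel, axis=-1) < couple_thresh).astype(float)

def correct_labels(labels, model_probs, margin=0.9):
    # flip labels that the current model contradicts with high confidence
    corrected = labels.copy()
    corrected[model_probs > margin] = 1.0
    corrected[model_probs < 1 - margin] = 0.0
    return corrected

labels = motion_pseudo_labels(np.random.randn(100, 3), np.random.randn(100, 3))
labels = correct_labels(labels, np.random.rand(100))
```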
arXiv Detail & Related papers (2021-10-19T18:00:02Z)
- Mutual Graph Learning for Camouflaged Object Detection [31.422775969808434]
A major challenge is that intrinsic similarities between foreground objects and their background surroundings make the features extracted by deep models indistinguishable.
We design a novel Mutual Graph Learning (MGL) model, which generalizes the idea of conventional mutual learning from regular grids to the graph domain.
In contrast to most mutual learning approaches that use a shared function to model all between-task interactions, MGL is equipped with typed functions for handling different complementary relations.
arXiv Detail & Related papers (2021-04-03T10:14:39Z)
- Relational Message Passing for Knowledge Graph Completion [78.47976646383222]
We propose a relational message passing method for knowledge graph completion.
It passes relational messages among edges iteratively to aggregate neighborhood information.
Results show our method outperforms state-of-the-art knowledge graph completion methods by a large margin.
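To illustrate the idea of passing messages among edges rather than nodes, here is a toy sketch in which each knowledge-graph edge iteratively aggregates the relation labels of edges sharing an endpoint; this is a simplified reading of the idea, not the paper's exact update rule.

```python
# Edge-to-edge message passing over a tiny knowledge graph.
from collections import defaultdict

triples = [("alice", "works_at", "acme"), ("acme", "based_in", "paris"),
           ("alice", "lives_in", "paris")]

# start each edge's message as its own relation label
messages = {i: {t[1]} for i, t in enumerate(triples)}
by_entity = defaultdict(set)
for i, (h, _, t) in enumerate(triples):
    by_entity[h].add(i)
    by_entity[t].add(i)

for _ in range(2):  # two rounds of edge-to-edge propagation
    new = {}
    for i, (h, _, t) in enumerate(triples):
        neighbors = (by_entity[h] | by_entity[t]) - {i}
        new[i] = messages[i] | set().union(*(messages[j] for j in neighbors))
    messages = new

print(messages)  # each edge now "sees" the relations in its neighborhood
```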
arXiv Detail & Related papers (2020-02-17T03:33:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.