Language-driven Grasp Detection with Mask-guided Attention
- URL: http://arxiv.org/abs/2407.19877v1
- Date: Mon, 29 Jul 2024 10:55:17 GMT
- Title: Language-driven Grasp Detection with Mask-guided Attention
- Authors: Tuan Van Vo, Minh Nhat Vu, Baoru Huang, An Vuong, Ngan Le, Thieu Vo, Anh Nguyen
- Abstract summary: We propose a new method for language-driven grasp detection with mask-guided attention.
Our approach integrates visual data, segmentation mask features, and natural language instructions.
Our work introduces a new framework for language-driven grasp detection, paving the way for language-driven robotic applications.
- Score: 10.231956034184265
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Grasp detection is an essential task in robotics with various industrial applications. However, traditional methods often struggle with occlusions and do not utilize language for grasping. Incorporating natural language into grasp detection remains a challenging and largely unexplored task. To address this gap, we propose a new method for language-driven grasp detection with mask-guided attention by utilizing the transformer attention mechanism with semantic segmentation features. Our approach integrates visual data, segmentation mask features, and natural language instructions, significantly improving grasp detection accuracy. Our work introduces a new framework for language-driven grasp detection, paving the way for language-driven robotic applications. Intensive experiments show that our method outperforms other recent baselines by a clear margin, with a 10.0% success score improvement. We further validate our method in real-world robotic experiments, confirming the effectiveness of our approach.
Related papers
- Lightweight Language-driven Grasp Detection using Conditional Consistency Model [10.254392362201308]
We present a new approach for language-driven grasp detection that leverages the concept of lightweight diffusion models.
Our method can effectively encode visual and textual information, enabling more accurate and versatile grasp positioning.
We further validate our method in real-world robotic experiments to demonstrate its fast inference time capability.
arXiv Detail & Related papers (2024-07-25T11:39:20Z)
- Language-Driven 6-DoF Grasp Detection Using Negative Prompt Guidance [13.246380364455494]
We present a new approach for language-driven 6-DoF grasp detection in cluttered point clouds.
The proposed negative prompt strategy directs the detection process toward the desired object while steering away from unwanted ones.
Our method enables an end-to-end framework where humans can command the robot to grasp desired objects in a cluttered scene using natural language.
arXiv Detail & Related papers (2024-07-18T18:24:51Z)
- Language-driven Grasp Detection [12.78625719116471]
We introduce a new language-driven grasp detection dataset featuring 1M samples, over 3M objects, and upwards of 10M grasping instructions.
We propose a new language-driven grasp detection method based on diffusion models.
Our method outperforms state-of-the-art approaches and allows real-world robotic grasping.
arXiv Detail & Related papers (2024-06-13T16:06:59Z)
- MENTOR: Multilingual tExt detectioN TOward leaRning by analogy [59.37382045577384]
We propose a framework to detect and identify both seen and unseen language regions inside scene images.
"MENTOR" is the first work to realize a learning strategy between zero-shot learning and few-shot learning for multilingual scene text detection.
arXiv Detail & Related papers (2024-03-12T03:35:17Z)
- Efficiently Leveraging Linguistic Priors for Scene Text Spotting [63.22351047545888]
This paper proposes a method that leverages linguistic knowledge from a large text corpus to replace the traditional one-hot encoding used in auto-regressive scene text spotting and recognition models.
We generate text distributions that align well with scene text datasets, removing the need for in-domain fine-tuning.
Experimental results show that our method not only improves recognition accuracy but also enables more accurate localization of words.
arXiv Detail & Related papers (2024-02-27T01:57:09Z)
- Sample Efficient Approaches for Idiomaticity Detection [6.481818246474555]
This work explores sample efficient methods of idiomaticity detection.
In particular, we study the impact of Pattern Exploit Training (PET), a few-shot method of classification, and BERTRAM, an efficient method of creating contextual embeddings.
Our experiments show that while PET improves performance on English, they are much less effective on Portuguese and Galician, leading to an overall performance about on par with vanilla mBERT.
arXiv Detail & Related papers (2022-05-23T13:46:35Z)
- Deep Learning for Hate Speech Detection: A Comparative Study [54.42226495344908]
We present here a large-scale empirical comparison of deep and shallow hate-speech detection methods.
Our goal is to illuminate progress in the area, and identify strengths and weaknesses in the current state-of-the-art.
In doing so we aim to provide guidance as to the use of hate-speech detection in practice, quantify the state-of-the-art, and identify future research directions.
arXiv Detail & Related papers (2022-02-19T03:48:20Z)
- Exploring Sub-skeleton Trajectories for Interpretable Recognition of Sign Language [2.1178416840822027]
We study the problem of accurately recognizing sign language words.
Our method explores a geometric feature space that we call 'sub-skeleton' aspects of movement.
Surprisingly, our simple methods improve sign recognition over recent, state-of-the-art approaches.
arXiv Detail & Related papers (2022-02-03T03:32:28Z)
- Learning Language-Conditioned Robot Behavior from Offline Data and Crowd-Sourced Annotation [80.29069988090912]
We study the problem of learning a range of vision-based manipulation tasks from a large offline dataset of robot interaction.
We propose to leverage offline robot datasets with crowd-sourced natural language labels.
We find that our approach outperforms both goal-image specifications and language conditioned imitation techniques by more than 25%.
arXiv Detail & Related papers (2021-09-02T17:42:13Z)
- Discriminative Nearest Neighbor Few-Shot Intent Detection by Transferring Natural Language Inference [150.07326223077405]
Few-shot learning is attracting much attention to mitigate data scarcity.
We present a discriminative nearest neighbor classification with deep self-attention.
We propose to boost the discriminative ability by transferring a natural language inference (NLI) model.
arXiv Detail & Related papers (2020-10-25T00:39:32Z)
- Kungfupanda at SemEval-2020 Task 12: BERT-Based Multi-Task Learning for Offensive Language Detection [55.445023584632175]
We build an offensive language detection system, which combines multi-task learning with BERT-based models.
Our model achieves 91.51% F1 score in English Sub-task A, which is comparable to the first place.
arXiv Detail & Related papers (2020-04-28T11:27:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.