When Transformer Meets Robotic Grasping: Exploits Context for Efficient
Grasp Detection
- URL: http://arxiv.org/abs/2202.11911v1
- Date: Thu, 24 Feb 2022 05:47:25 GMT
- Title: When Transformer Meets Robotic Grasping: Exploits Context for Efficient
Grasp Detection
- Authors: Shaochen Wang, Zhangli Zhou, and Zhen Kan
- Abstract summary: We present a transformer-based architecture, namely TF-Grasp, for robotic grasp detection.
The framework has two elaborate designs making it well suited to visual grasping tasks.
TF-Grasp achieves superior results versus state-of-the-art grasping convolutional models.
- Score: 1.7013938542585925
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we present a transformer-based architecture, namely TF-Grasp,
for robotic grasp detection. The developed TF-Grasp framework has two elaborate
designs making it well suited to visual grasping tasks. The first key design
is that we adopt the local window attention to capture local contextual
information and detailed features of graspable objects. Then, we apply the
cross window attention to model the long-term dependencies between distant
pixels. Object knowledge, environmental configuration, and relationships
between different visual entities are aggregated for subsequent grasp
detection. The second key design is that we build a hierarchical
encoder-decoder architecture with skip-connections, delivering shallow features
from encoder to decoder to enable a multi-scale feature fusion. Due to the
powerful attention mechanism, the TF-Grasp can simultaneously obtain the local
information (i.e., the contours of objects), and model long-term connections
such as the relationships between distinct visual concepts in clutter.
Extensive computational experiments demonstrate that TF-Grasp achieves
superior results versus state-of-the-art grasping convolutional models, attaining
accuracies of 97.99% and 94.6% on the Cornell and Jacquard grasping datasets,
respectively. Real-world experiments using a 7DoF Franka Emika Panda robot also
demonstrate its capability of grasping unseen objects in a variety of
scenarios. The code and pre-trained models will be available at
https://github.com/WangShaoSUN/grasp-transformer
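The local window attention described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; the window size, identity Q/K/V projections, and feature-map shapes are illustrative assumptions. Attention is computed only among tokens inside each non-overlapping window, which is what lets the model capture local contours cheaply before cross-window attention propagates context between distant regions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def window_partition(x, w):
    # (H, W, C) feature map -> (num_windows, w*w, C) token groups
    H, W, C = x.shape
    x = x.reshape(H // w, w, W // w, w, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, w * w, C)

def window_attention(tokens):
    # scaled dot-product self-attention within each window;
    # identity Q/K/V projections for brevity (real models use learned ones)
    C = tokens.shape[2]
    scores = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(C)
    return softmax(scores, axis=-1) @ tokens

rng = np.random.default_rng(0)
feat = rng.random((8, 8, 16))                        # toy 8x8 feature map, 16 channels
local = window_attention(window_partition(feat, 4))  # local context within 4x4 windows
print(local.shape)  # (4, 16, 16): 4 windows of 16 tokens each
```

A cross-window step would then mix information across the window axis (e.g., by attending over corresponding tokens of different windows), giving the long-range dependencies the paper aggregates for grasp detection.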
Related papers
- CFPFormer: Feature-pyramid like Transformer Decoder for Segmentation and Detection [1.837431956557716]
Feature pyramids have been widely adopted in convolutional neural networks (CNNs) and transformers for tasks like medical image segmentation and object detection.
We propose a novel decoder block that integrates feature pyramids and transformers.
Our model achieves superior performance in detecting small objects compared to existing methods.
arXiv Detail & Related papers (2024-04-23T18:46:07Z)
- MISSU: 3D Medical Image Segmentation via Self-distilling TransUNet [55.16833099336073]
We propose to self-distill a Transformer-based UNet for medical image segmentation.
It simultaneously learns global semantic information and local spatial-detailed features.
Our MISSU achieves the best performance over previous state-of-the-art methods.
arXiv Detail & Related papers (2022-06-02T07:38:53Z)
- Joint Spatial-Temporal and Appearance Modeling with Transformer for Multiple Object Tracking [59.79252390626194]
We propose a novel solution named TransSTAM, which leverages Transformer to model both the appearance features of each object and the spatial-temporal relationships among objects.
The proposed method is evaluated on multiple public benchmarks including MOT16, MOT17, and MOT20, and it achieves a clear performance improvement in both IDF1 and HOTA.
arXiv Detail & Related papers (2022-05-31T01:19:18Z)
- MulT: An End-to-End Multitask Learning Transformer [66.52419626048115]
We propose an end-to-end Multitask Learning Transformer framework, named MulT, to simultaneously learn multiple high-level vision tasks.
Our framework encodes the input image into a shared representation and makes predictions for each vision task using task-specific transformer-based decoder heads.
arXiv Detail & Related papers (2022-05-17T13:03:18Z)
- Probabilistic Tracking with Deep Factors [8.030212474745879]
We show how to use a deep feature encoding in conjunction with generative densities over the features in a factor-graph based, probabilistic tracking framework.
We present a likelihood model that combines a learned feature encoder with generative densities over them, both trained in a supervised manner.
arXiv Detail & Related papers (2021-12-02T21:31:51Z)
- Global Filter Networks for Image Classification [90.81352483076323]
We present a conceptually simple yet computationally efficient architecture that learns long-term spatial dependencies in the frequency domain with log-linear complexity.
Our results demonstrate that GFNet can be a very competitive alternative to transformer-style models and CNNs in efficiency, generalization ability and robustness.
arXiv Detail & Related papers (2021-07-01T17:58:16Z)
- TrTr: Visual Tracking with Transformer [29.415900191169587]
We propose a novel tracker network based on a powerful attention mechanism called Transformer encoder-decoder architecture.
We design the classification and regression heads using the output of Transformer to localize target based on shape-agnostic anchor.
Our method performs favorably against state-of-the-art algorithms.
arXiv Detail & Related papers (2021-05-09T02:32:28Z)
- Visual Saliency Transformer [127.33678448761599]
We develop a novel unified model based on a pure transformer, Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD).
It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches.
Experimental results show that our model outperforms existing state-of-the-art results on both RGB and RGB-D SOD benchmark datasets.
arXiv Detail & Related papers (2021-04-25T08:24:06Z)
- Transformers Solve the Limited Receptive Field for Monocular Depth Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper which applies transformers into pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z)