When Transformer Meets Robotic Grasping: Exploits Context for Efficient
Grasp Detection
- URL: http://arxiv.org/abs/2202.11911v1
- Date: Thu, 24 Feb 2022 05:47:25 GMT
- Title: When Transformer Meets Robotic Grasping: Exploits Context for Efficient
Grasp Detection
- Authors: Shaochen Wang, Zhangli Zhou, and Zhen Kan
- Abstract summary: We present a transformer-based architecture, namely TF-Grasp, for robotic grasp detection.
The framework has two elaborate designs making it well suited to visual grasping tasks.
TF-Grasp achieves superior results versus state-of-the-art grasping convolutional models.
- Score: 1.7013938542585925
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we present a transformer-based architecture, namely TF-Grasp,
for robotic grasp detection. The developed TF-Grasp framework has two elaborate
designs making it well suited to visual grasping tasks. The first key design
is that we adopt the local window attention to capture local contextual
information and detailed features of graspable objects. Then, we apply the
cross window attention to model the long-term dependencies between distant
pixels. Object knowledge, environmental configuration, and relationships
between different visual entities are aggregated for subsequent grasp
detection. The second key design is that we build a hierarchical
encoder-decoder architecture with skip-connections, delivering shallow features
from encoder to decoder to enable a multi-scale feature fusion. Due to the
powerful attention mechanism, the TF-Grasp can simultaneously obtain the local
information (i.e., the contours of objects), and model long-term connections
such as the relationships between distinct visual concepts in clutter.
Extensive computational experiments demonstrate that TF-Grasp achieves
superior results versus state-of-the-art grasping convolutional models, attaining
accuracies of 97.99% and 94.6% on the Cornell and Jacquard grasping datasets,
respectively. Real-world experiments using a 7DoF Franka Emika Panda robot also
demonstrate its capability of grasping unseen objects in a variety of
scenarios. The code and pre-trained models will be available at
https://github.com/WangShaoSUN/grasp-transformer
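The local window attention described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; the window size, identity Q/K/V projections, and feature-map shapes are illustrative assumptions. Attention is computed only among tokens inside each non-overlapping window, which is what lets the model capture local contours cheaply before cross-window attention propagates context between distant regions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def window_partition(x, w):
    # (H, W, C) feature map -> (num_windows, w*w, C) token groups
    H, W, C = x.shape
    x = x.reshape(H // w, w, W // w, w, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, w * w, C)

def window_attention(tokens):
    # scaled dot-product self-attention within each window;
    # identity Q/K/V projections for brevity (real models use learned ones)
    C = tokens.shape[2]
    scores = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(C)
    return softmax(scores, axis=-1) @ tokens

rng = np.random.default_rng(0)
feat = rng.random((8, 8, 16))                        # toy 8x8 feature map, 16 channels
local = window_attention(window_partition(feat, 4))  # local context within 4x4 windows
print(local.shape)  # (4, 16, 16): 4 windows of 16 tokens each
```

A cross-window step would then mix information across the window axis (e.g., by attending over corresponding tokens of different windows), giving the long-range dependencies the paper aggregates for grasp detection.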
Related papers
- CFPFormer: Feature-pyramid like Transformer Decoder for Segmentation and Detection [1.837431956557716]
Feature pyramids have been widely adopted in convolutional neural networks (CNNs) and transformers for tasks like medical image segmentation and object detection.
We propose a novel decoder block that integrates feature pyramids and transformers.
Our model achieves superior performance in detecting small objects compared to existing methods.
arXiv Detail & Related papers (2024-04-23T18:46:07Z)
- MISSU: 3D Medical Image Segmentation via Self-distilling TransUNet [55.16833099336073]
We propose to self-distill a Transformer-based UNet for medical image segmentation.
It simultaneously learns global semantic information and local spatial-detailed features.
Our MISSU achieves the best performance over previous state-of-the-art methods.
arXiv Detail & Related papers (2022-06-02T07:38:53Z)
- Joint Spatial-Temporal and Appearance Modeling with Transformer for Multiple Object Tracking [59.79252390626194]
We propose a novel solution named TransSTAM, which leverages Transformer to model both the appearance features of each object and the spatial-temporal relationships among objects.
The proposed method is evaluated on multiple public benchmarks including MOT16, MOT17, and MOT20, and it achieves a clear performance improvement in both IDF1 and HOTA.
arXiv Detail & Related papers (2022-05-31T01:19:18Z)
- MulT: An End-to-End Multitask Learning Transformer [66.52419626048115]
We propose an end-to-end Multitask Learning Transformer framework, named MulT, to simultaneously learn multiple high-level vision tasks.
Our framework encodes the input image into a shared representation and makes predictions for each vision task using task-specific transformer-based decoder heads.
arXiv Detail & Related papers (2022-05-17T13:03:18Z)
- Probabilistic Tracking with Deep Factors [8.030212474745879]
We show how to use a deep feature encoding in conjunction with generative densities over the features in a factor-graph based, probabilistic tracking framework.
We present a likelihood model that combines a learned feature encoder with generative densities over them, both trained in a supervised manner.
arXiv Detail & Related papers (2021-12-02T21:31:51Z)
- Global Filter Networks for Image Classification [90.81352483076323]
We present a conceptually simple yet computationally efficient architecture that learns long-term spatial dependencies in the frequency domain with log-linear complexity.
Our results demonstrate that GFNet can be a very competitive alternative to transformer-style models and CNNs in efficiency, generalization ability and robustness.
arXiv Detail & Related papers (2021-07-01T17:58:16Z)
- TrTr: Visual Tracking with Transformer [29.415900191169587]
We propose a novel tracker network based on a powerful attention mechanism called Transformer encoder-decoder architecture.
We design the classification and regression heads using the output of Transformer to localize target based on shape-agnostic anchor.
Our method performs favorably against state-of-the-art algorithms.
arXiv Detail & Related papers (2021-05-09T02:32:28Z)
- Visual Saliency Transformer [127.33678448761599]
We develop a novel unified model based on a pure transformer, Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD).
It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches.
Experimental results show that our model outperforms existing state-of-the-art results on both RGB and RGB-D SOD benchmark datasets.
arXiv Detail & Related papers (2021-04-25T08:24:06Z)
- Transformers Solve the Limited Receptive Field for Monocular Depth Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper which applies transformers into pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z)