Efficient Multi-Object Pose Estimation using Multi-Resolution Deformable
Attention and Query Aggregation
- URL: http://arxiv.org/abs/2312.08268v1
- Date: Wed, 13 Dec 2023 16:30:00 GMT
- Title: Efficient Multi-Object Pose Estimation using Multi-Resolution Deformable
Attention and Query Aggregation
- Authors: Arul Selvam Periyasamy, Vladimir Tsaturyan, Sven Behnke
- Abstract summary: We investigate incorporating inductive biases in vision transformer models for multi-object pose estimation.
We propose a query aggregation mechanism that enables increasing the number of object queries without increasing the computational complexity.
We evaluate the proposed model on the challenging YCB-Video dataset and report state-of-the-art results.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Object pose estimation is a long-standing problem in computer vision.
Recently, attention-based vision transformer models have achieved
state-of-the-art results in many computer vision applications. Exploiting the
permutation-invariant nature of the attention mechanism, a family of vision
transformer models formulate multi-object pose estimation as a set prediction
problem. However, existing vision transformer models for multi-object pose
estimation rely exclusively on the attention mechanism. Convolutional neural
networks, on the other hand, hard-wire various inductive biases into their
architecture. In this paper, we investigate incorporating inductive biases in
vision transformer models for multi-object pose estimation, which facilitates
learning long-range dependencies while circumventing the costly global
attention. In particular, we use multi-resolution deformable attention, where
the attention operation is performed only between a few deformed reference
points. Furthermore, we propose a query aggregation mechanism that enables
increasing the number of object queries without increasing the computational
complexity. We evaluate the proposed model on the challenging YCB-Video dataset
and report state-of-the-art results.
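The core idea — each query attends to only a handful of sampled points around a reference location instead of the full feature map — can be sketched as below. This is an illustrative single-level, nearest-neighbour simplification written for this summary, not the paper's implementation: the actual model uses bilinear interpolation over multiple feature-map resolutions, and the function name and shapes here are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def deformable_attention_1level(feature_map, ref_points, offsets, attn_logits):
    """Simplified single-level deformable attention (illustrative sketch,
    nearest-neighbour sampling; the paper uses bilinear interpolation
    across multiple resolutions).

    feature_map : (H, W, C) encoder feature map
    ref_points  : (Q, 2) normalised (y, x) reference point per query
    offsets     : (Q, K, 2) learned offsets for K sampling points per query
    attn_logits : (Q, K) learned attention logits per sampling point
    returns     : (Q, C) aggregated features, one vector per query
    """
    H, W, C = feature_map.shape
    Q, K, _ = offsets.shape
    weights = softmax(attn_logits, axis=-1)        # (Q, K), sums to 1 per query
    out = np.zeros((Q, C))
    for q in range(Q):
        for k in range(K):
            # Deform the reference point by the learned offset,
            # then sample the feature map at the nearest pixel.
            y, x = ref_points[q] + offsets[q, k]
            iy = int(np.clip(round(y * (H - 1)), 0, H - 1))
            ix = int(np.clip(round(x * (W - 1)), 0, W - 1))
            out[q] += weights[q, k] * feature_map[iy, ix]
    return out
```

Because each query touches only K sampling points rather than all H·W locations, the cost per query is O(K) instead of O(HW) — which is what lets the model attend over several resolutions without the quadratic cost of global attention.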
Related papers
- Investigating the Role of Instruction Variety and Task Difficulty in Robotic Manipulation Tasks [50.75902473813379]
This work introduces a comprehensive evaluation framework that systematically examines the role of instructions and inputs in the generalisation abilities of such models.
The proposed framework uncovers the resilience of multimodal models to extreme instruction perturbations and their vulnerability to observational changes.
arXiv Detail & Related papers (2024-07-04T14:36:49Z)
- LVLM-Interpret: An Interpretability Tool for Large Vision-Language Models [50.259006481656094]
We present a novel interactive application aimed at understanding the internal mechanisms of large vision-language models.
Our interface is designed to enhance the interpretability of the image patches, which are instrumental in generating an answer.
We present a case study of how our application can aid in understanding failure mechanisms in a popular large multi-modal model: LLaVA.
arXiv Detail & Related papers (2024-04-03T23:57:34Z)
- Solving the Clustering Reasoning Problems by Modeling a Deep-Learning-Based Probabilistic Model [1.7955614278088239]
We introduce PMoC, a deep-learning-based probabilistic model that achieves high reasoning accuracy on the Bongard-Logo problem.
As a bonus, we also designed Pose-Transformer for complex visual abstract reasoning tasks.
arXiv Detail & Related papers (2024-03-05T18:08:29Z)
- OtterHD: A High-Resolution Multi-modality Model [57.16481886807386]
OtterHD-8B is an innovative multimodal model engineered to interpret high-resolution visual inputs with granular precision.
Our study highlights the critical role of flexibility and high-resolution input capabilities in large multimodal models.
arXiv Detail & Related papers (2023-11-07T18:59:58Z) - AttentionViz: A Global View of Transformer Attention [60.82904477362676]
We present a new visualization technique designed to help researchers understand the self-attention mechanism in transformers.
The main idea behind our method is to visualize a joint embedding of the query and key vectors used by transformer models to compute attention.
We create an interactive visualization tool, AttentionViz, based on these joint query-key embeddings.
arXiv Detail & Related papers (2023-05-04T23:46:49Z) - Multimodal Adaptive Fusion of Face and Gait Features using Keyless
attention based Deep Neural Networks for Human Identification [67.64124512185087]
Soft biometrics such as gait are widely used with face in surveillance tasks like person recognition and re-identification.
We propose a novel adaptive multi-biometric fusion strategy for the dynamic incorporation of gait and face biometric cues by leveraging keyless attention deep neural networks.
arXiv Detail & Related papers (2023-03-24T05:28:35Z) - Learning to reason over visual objects [6.835410768769661]
We investigate the extent to which a general-purpose mechanism for processing visual scenes in terms of objects might help promote abstract visual reasoning.
We find that an inductive bias for object-centric processing may be a key component of abstract visual reasoning.
arXiv Detail & Related papers (2023-03-03T23:19:42Z) - Multi-manifold Attention for Vision Transformers [12.862540139118073]
Vision Transformers are widely used today due to their state-of-the-art performance in several computer vision tasks.
A novel attention mechanism, called multi-manifold multihead attention, is proposed in this work to substitute the vanilla self-attention of a Transformer.
arXiv Detail & Related papers (2022-07-18T12:53:53Z) - Vision Transformer with Convolutions Architecture Search [72.70461709267497]
We propose an architecture search method, Vision Transformer with Convolutions Architecture Search (VTCAS).
The high-performance backbone network searched by VTCAS introduces the desirable features of convolutional neural networks into the Transformer architecture.
It enhances the robustness of the neural network for object recognition, especially in low-illumination indoor scenes.
arXiv Detail & Related papers (2022-03-20T02:59:51Z) - An empirical evaluation of attention-based multi-head models for
improved turbofan engine remaining useful life prediction [9.282239595143787]
A single unit (head) is the conventional input feature extractor in deep learning architectures trained on multivariate time series signals.
This work extends the conventional single-head deep learning models to a more robust form by developing context-specific heads.
arXiv Detail & Related papers (2021-09-04T01:13:47Z) - Adaptive Multi-Resolution Attention with Linear Complexity [18.64163036371161]
We propose a novel structure named Adaptive Multi-Resolution Attention (AdaMRA).
We leverage a multi-resolution multi-head attention mechanism, enabling attention heads to capture long-range contextual information in a coarse-to-fine fashion.
To facilitate AdaMRA utilization by the scientific community, the code implementation will be made publicly available.
arXiv Detail & Related papers (2021-08-10T23:17:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.