Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual
Grounding
- URL: http://arxiv.org/abs/2209.13959v2
- Date: Thu, 26 Oct 2023 05:41:20 GMT
- Title: Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual
Grounding
- Authors: Fengyuan Shi, Ruopeng Gao, Weilin Huang, Limin Wang
- Abstract summary: The multimodal transformer exhibits high capacity and flexibility in aligning image and text for visual grounding.
Existing encoder-only grounding frameworks suffer from heavy computation due to the self-attention operation with quadratic time complexity.
We present Dynamic Multimodal DETR (Dynamic MDETR), which decouples the whole grounding process into encoding and decoding phases.
- Score: 27.568879624013576
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The multimodal transformer exhibits high capacity and flexibility
to align image and text for visual grounding. However, the existing
encoder-only grounding framework (e.g., TransVG) suffers from heavy
computation due to the self-attention operation with quadratic time
complexity. To address this issue, we present a new multimodal transformer
architecture, coined Dynamic Multimodal DETR (Dynamic MDETR), which decouples
the whole grounding process into encoding and decoding phases. The key
observation is that there is high spatial redundancy in images. Thus, we
devise a new dynamic multimodal transformer decoder that exploits this
sparsity prior in order to speed up the visual grounding process.
Specifically, our dynamic decoder is composed of a 2D adaptive sampling module
and a text-guided decoding module. The sampling module selects informative
patches by predicting their offsets with respect to a reference point, while
the decoding module extracts the grounded object information by performing
cross-attention between image features and text features. These two modules
are stacked alternately to gradually bridge the modality gap and iteratively
refine the reference point of the grounded object, eventually realizing the
objective of visual grounding. Extensive experiments on five benchmarks
demonstrate that the proposed Dynamic MDETR achieves competitive trade-offs
between computation and accuracy. Notably, using only 9% of the feature points
in the decoder, we reduce the GFLOPs of the multimodal transformer by ~44%
while still obtaining higher accuracy than the encoder-only counterpart. In
addition, to verify its generalization ability and scale up Dynamic MDETR, we
build the first one-stage CLIP-empowered visual grounding framework and
achieve state-of-the-art performance on these benchmarks.
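The decoder described in the abstract maps naturally onto a short sketch. The PyTorch snippet below is a minimal, illustrative rendering of one dynamic decoder layer, not the authors' released code: module names, the offset scaling, and the single-query setup are assumptions. It predicts sampling offsets around a reference point, gathers those sparse visual features with bilinear sampling, runs text-guided cross-attention, and refines the reference point.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DynamicDecoderLayer(nn.Module):
    """One sketched decoder layer: 2D adaptive sampling + text-guided cross-attention."""

    def __init__(self, d_model=256, n_heads=8, n_points=36):
        super().__init__()
        self.n_points = n_points
        # Predict 2D offsets of the sampled points relative to the reference point.
        self.offset_head = nn.Linear(d_model, n_points * 2)
        # Text-guided decoding: the grounding query attends to the sampled visual features.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        # Predict a small update to the reference point from the refined query.
        self.ref_head = nn.Linear(d_model, 2)

    def forward(self, query, visual_feat, ref_point):
        # query:       (B, 1, C)    grounding query fused with text information
        # visual_feat: (B, C, H, W) encoded image feature map
        # ref_point:   (B, 1, 2)    normalized (x, y) in [0, 1]
        B, _, _ = query.shape

        # 1) 2D adaptive sampling: predict offsets and gather a sparse set of points.
        offsets = self.offset_head(query).view(B, self.n_points, 2).tanh()
        sample_xy = (ref_point + 0.1 * offsets).clamp(0, 1)           # (B, P, 2)
        grid = sample_xy * 2 - 1                                      # grid_sample expects [-1, 1]
        sampled = F.grid_sample(visual_feat, grid.unsqueeze(2),
                                align_corners=False)                  # (B, C, P, 1)
        sampled = sampled.squeeze(-1).transpose(1, 2)                 # (B, P, C)

        # 2) Text-guided decoding: cross-attention over the sampled features only.
        attn_out, _ = self.cross_attn(query, sampled, sampled)
        query = self.norm(query + attn_out)

        # 3) Iteratively refine the reference point of the grounded object.
        ref_point = (ref_point + 0.1 * self.ref_head(query).tanh()).clamp(0, 1)
        return query, ref_point
```

Stacking a few such layers, with the final query fed to a box-regression head, mirrors the alternating sample-and-decode pipeline described above; attending to only n_points sampled locations per layer, rather than the full H×W feature map, is what yields the reported computation savings.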
Related papers
- Bi-directional Adapter for Multi-modal Tracking [67.01179868400229]
We propose a novel multi-modal visual prompt tracking model based on a universal bi-directional adapter.
We develop a simple but effective light feature adapter to transfer modality-specific information from one modality to another.
Our model achieves superior tracking performance compared with both full fine-tuning methods and prompt-learning-based methods.
arXiv Detail & Related papers (2023-12-17T05:27:31Z)
- Minimalist and High-Performance Semantic Segmentation with Plain Vision Transformers [10.72362704573323]
We introduce PlainSeg, a model comprising only three 3×3 convolutions in addition to the transformer layers.
We also present PlainSeg-Hier, which allows the utilization of hierarchical features.
arXiv Detail & Related papers (2023-10-19T14:01:40Z)
- Dynamic Grained Encoder for Vision Transformers [150.02797954201424]
This paper introduces sparse queries for vision transformers to exploit the intrinsic spatial redundancy of natural images.
We propose a Dynamic Grained Encoder for vision transformers, which can adaptively assign a suitable number of queries to each spatial region.
Our encoder allows state-of-the-art vision transformers to reduce computational complexity by 40%-60% while maintaining comparable performance on image classification.
arXiv Detail & Related papers (2023-01-10T07:55:29Z)
- DynaST: Dynamic Sparse Transformer for Exemplar-Guided Image Generation [56.514462874501675]
We propose a dynamic sparse attention based Transformer model to achieve fine-level matching with favorable efficiency.
The heart of our approach is a novel dynamic-attention unit dedicated to handling the variation in the optimal number of tokens each position should attend to.
Experiments on three applications, pose-guided person image generation, edge-based face synthesis, and undistorted image style transfer, demonstrate that DynaST achieves superior performance in local details.
arXiv Detail & Related papers (2022-07-13T11:12:03Z)
- Spatial-Temporal Transformer for Dynamic Scene Graph Generation [34.190733855032065]
We propose a neural network that consists of two core modules: (1) a spatial encoder that takes an input frame to extract spatial context and reason about the visual relationships within a frame, and (2) a temporal decoder which takes the output of the spatial encoder as input.
Our method is validated on the benchmark dataset Action Genome (AG).
arXiv Detail & Related papers (2021-07-26T16:30:30Z)
- MODETR: Moving Object Detection with Transformers [2.4366811507669124]
Moving Object Detection (MOD) is a crucial task for the Autonomous Driving pipeline.
In this paper, we tackle this problem through multi-head attention mechanisms, both across the spatial and motion streams.
We propose MODETR, a Moving Object DEtection TRansformer network composed of multi-stream transformers for both spatial and motion modalities.
arXiv Detail & Related papers (2021-06-21T21:56:46Z)
- TransVG: End-to-End Visual Grounding with Transformers [102.11922622103613]
We present a transformer-based framework for visual grounding, namely TransVG, to address the task of grounding a language query to an image.
We show that the complex fusion modules can be replaced by a simple stack of transformer encoder layers with higher performance.
arXiv Detail & Related papers (2021-04-17T13:35:24Z)
- Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers [149.78470371525754]
We treat semantic segmentation as a sequence-to-sequence prediction task. Specifically, we deploy a pure transformer to encode an image as a sequence of patches.
With the global context modeled in every layer of the transformer, this encoder can be combined with a simple decoder to provide a powerful segmentation model, termed SEgmentation TRansformer (SETR).
SETR achieves a new state of the art on ADE20K (50.28% mIoU) and Pascal Context (55.83% mIoU), and competitive results on Cityscapes.
arXiv Detail & Related papers (2020-12-31T18:55:57Z)
- Motion-Attentive Transition for Zero-Shot Video Object Segmentation [99.44383412488703]
We present a Motion-Attentive Transition Network (MATNet) for zero-shot video object segmentation.
An asymmetric attention block, called Motion-Attentive Transition (MAT), is designed within a two-stream encoder.
In this way, the encoder becomes deeply interleaved, allowing for close hierarchical interactions between object motion and appearance.
arXiv Detail & Related papers (2020-03-09T16:58:42Z)