TransVG: End-to-End Visual Grounding with Transformers
- URL: http://arxiv.org/abs/2104.08541v1
- Date: Sat, 17 Apr 2021 13:35:24 GMT
- Title: TransVG: End-to-End Visual Grounding with Transformers
- Authors: Jiajun Deng, Zhengyuan Yang, Tianlang Chen, Wengang Zhou, and Houqiang Li
- Abstract summary: We present a transformer-based framework for visual grounding, namely TransVG, to address the task of grounding a language query to an image.
We show that the complex fusion modules can be replaced by a simple stack of transformer encoder layers with higher performance.
- Score: 102.11922622103613
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we present a neat yet effective transformer-based framework
for visual grounding, namely TransVG, to address the task of grounding a
language query to the corresponding region of an image. The state-of-the-art
methods, both two-stage and one-stage, rely on a complex module with
manually designed mechanisms to perform query reasoning and multi-modal
fusion. However, the involvement of certain mechanisms in fusion module design,
such as query decomposition and image scene graphs, makes the models prone to
overfitting to datasets with specific scenarios and limits the full
interaction between the visual and linguistic context. To avoid these issues, we
propose to establish the multi-modal correspondence by leveraging transformers,
and empirically show that the complex fusion modules (e.g., modular attention
network, dynamic graph, and multi-modal tree) can be replaced by a simple stack
of transformer encoder layers with higher performance. Moreover, we
re-formulate visual grounding as a direct coordinate regression problem
and avoid making predictions out of a set of candidates (i.e., region proposals
or anchor boxes). Extensive experiments are conducted on five widely used
datasets, and a series of state-of-the-art records are set by our TransVG. We
establish a benchmark for transformer-based visual grounding frameworks and will
make our code available to the public.
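As a rough illustration of the abstract's two key ideas (multi-modal fusion via a plain stack of transformer encoder layers, and direct box-coordinate regression instead of selecting from proposals or anchors), the sketch below shows one possible minimal PyTorch implementation. It is not the authors' released code: the learnable regression token, the feature dimensions, and the layer counts are illustrative assumptions inferred from the abstract.

```python
import torch
import torch.nn as nn

class VisualGroundingSketch(nn.Module):
    """Minimal sketch of TransVG-style fusion (illustrative, not the paper's
    exact configuration): a learnable regression token is concatenated with
    flattened visual tokens and linguistic tokens, the sequence is passed
    through a plain stack of transformer encoder layers, and the regression
    token's output is mapped directly to four normalized box coordinates."""

    def __init__(self, d_model=256, num_layers=6, num_heads=8):
        super().__init__()
        # Learnable regression token (assumed design, following the abstract's
        # "direct coordinates regression" formulation).
        self.reg_token = nn.Parameter(torch.zeros(1, 1, d_model))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=num_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        # Prediction head: (cx, cy, w, h) in [0, 1], no proposals or anchors.
        self.bbox_head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, 4), nn.Sigmoid())

    def forward(self, visual_tokens, text_tokens):
        # visual_tokens: (B, Nv, d_model) flattened image features
        # text_tokens:   (B, Nt, d_model) language embeddings
        B = visual_tokens.size(0)
        reg = self.reg_token.expand(B, -1, -1)
        tokens = torch.cat([reg, visual_tokens, text_tokens], dim=1)
        fused = self.fusion(tokens)
        return self.bbox_head(fused[:, 0])  # (B, 4) normalized box

# Usage with dummy backbone features
model = VisualGroundingSketch()
vis = torch.randn(2, 400, 256)   # e.g. a 20x20 feature map, flattened
txt = torch.randn(2, 20, 256)    # e.g. 20 word embeddings
print(model(vis, txt).shape)     # torch.Size([2, 4])
```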
Related papers
- Unified Frequency-Assisted Transformer Framework for Detecting and Grounding Multi-Modal Manipulation [109.1912721224697]
We present the Unified Frequency-Assisted transFormer framework, named UFAFormer, to address the DGM4 problem.
By leveraging the discrete wavelet transform, we decompose images into several frequency sub-bands, capturing rich face forgery artifacts.
Our proposed frequency encoder, incorporating intra-band and inter-band self-attentions, explicitly aggregates forgery features within and across diverse sub-bands.
arXiv Detail & Related papers (2023-09-18T11:06:42Z)
- ModeT: Learning Deformable Image Registration via Motion Decomposition Transformer [7.629385629884155]
We propose a novel motion decomposition Transformer (ModeT) to explicitly model multiple motion modalities.
Our method outperforms current state-of-the-art registration networks and Transformers.
arXiv Detail & Related papers (2023-06-09T06:00:05Z)
- Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding [27.568879624013576]
The multimodal transformer exhibits high capacity and flexibility in aligning images and text for visual grounding.
Existing encoder-only grounding frameworks suffer from heavy computation due to the quadratic time complexity of self-attention.
We present Dynamic Multimodal DETR (Dynamic MDETR), which decouples the whole grounding process into encoding and decoding phases.
arXiv Detail & Related papers (2022-09-28T09:43:02Z)
- Hierarchical Local-Global Transformer for Temporal Sentence Grounding [58.247592985849124]
This paper studies the multimedia problem of temporal sentence grounding.
It aims to accurately determine the specific video segment in an untrimmed video according to a given sentence query.
arXiv Detail & Related papers (2022-08-31T14:16:56Z)
- TransVG++: End-to-End Visual Grounding with Language Conditioned Vision Transformer [188.00681648113223]
We explore neat yet effective Transformer-based frameworks for visual grounding.
TransVG establishes multi-modal correspondences by Transformers and localizes referred regions by directly regressing box coordinates.
We upgrade our framework to a purely Transformer-based one by leveraging Vision Transformer (ViT) for vision feature encoding.
arXiv Detail & Related papers (2022-06-14T06:27:38Z)
- Hybrid Transformer with Multi-level Fusion for Multimodal Knowledge Graph Completion [112.27103169303184]
Multimodal Knowledge Graphs (MKGs) organize visual-text factual knowledge.
MKGformer can obtain SOTA performance on four datasets of multimodal link prediction, multimodal RE, and multimodal NER.
arXiv Detail & Related papers (2022-05-04T23:40:04Z)
- Transformers Solve the Limited Receptive Field for Monocular Depth Prediction [82.90445525977904]
We propose TransDepth, an architecture that benefits from both convolutional neural networks and transformers.
This is the first paper to apply transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z)