TransVG++: End-to-End Visual Grounding with Language Conditioned Vision
Transformer
- URL: http://arxiv.org/abs/2206.06619v1
- Date: Tue, 14 Jun 2022 06:27:38 GMT
- Title: TransVG++: End-to-End Visual Grounding with Language Conditioned Vision
Transformer
- Authors: Jiajun Deng, Zhengyuan Yang, Daqing Liu, Tianlang Chen, Wengang Zhou,
Yanyong Zhang, Houqiang Li and Wanli Ouyang
- Abstract summary: We explore neat yet effective Transformer-based frameworks for visual grounding.
TransVG establishes multi-modal correspondences by Transformers and localizes referred regions by directly regressing box coordinates.
We upgrade our framework to a purely Transformer-based one by leveraging Vision Transformer (ViT) for vision feature encoding.
- Score: 188.00681648113223
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we explore neat yet effective Transformer-based frameworks for
visual grounding. The previous methods generally address the core problem of
visual grounding, i.e., multi-modal fusion and reasoning, with
manually-designed mechanisms. Such heuristic designs are not only complicated
but also make models easily overfit specific data distributions. To avoid this,
we first propose TransVG, which establishes multi-modal correspondences by
Transformers and localizes referred regions by directly regressing box
coordinates. We empirically show that complicated fusion modules can be
replaced by a simple stack of Transformer encoder layers with higher
performance. However, the core fusion Transformer in TransVG is stand-alone
from the uni-modal encoders and thus has to be trained from scratch on limited
visual grounding data, which makes it hard to optimize and leads to
sub-optimal performance. To this end, we further introduce TransVG++ to make
two-fold improvements. For one thing, we upgrade our framework to a purely
Transformer-based one by leveraging Vision Transformer (ViT) for vision feature
encoding. For another, we devise a Language Conditioned Vision Transformer that
removes external fusion modules and reuses the uni-modal ViT for
vision-language fusion at the intermediate layers. We conduct extensive
experiments on five prevalent datasets, and report a series of state-of-the-art
records.
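The abstract describes the core mechanism at a high level: visual and textual tokens are fused by Transformer layers and box coordinates are regressed directly, with TransVG++ moving the fusion inside the intermediate layers of the ViT itself rather than using an external fusion module. As a rough illustration of that idea (not the authors' actual architecture), the minimal PyTorch sketch below lets visual tokens attend to language tokens from a chosen layer onward and regresses a box from a learnable [REG] token; the names LanguageConditionedBlock, TinyGroundingHead, and fuse_from_layer are hypothetical, and the paper's concrete conditioning mechanism is not reproduced here.

```python
# Illustrative sketch only: a ViT-style encoder whose later blocks mix language
# tokens into self-attention, approximating "fusion at intermediate layers"
# without an external fusion module. All class and argument names are
# hypothetical and do not follow the TransVG++ implementation.
import torch
import torch.nn as nn


class LanguageConditionedBlock(nn.Module):
    """Transformer encoder block; optionally attends over language tokens too."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, vis, lang=None):
        # Keys/values come from visual tokens alone (uni-modal mode) or from the
        # concatenation of visual and language tokens (fused mode).
        q = self.norm1(vis)
        kv = q if lang is None else torch.cat([q, lang], dim=1)
        vis = vis + self.attn(q, kv, kv, need_weights=False)[0]
        vis = vis + self.mlp(self.norm2(vis))
        return vis


class TinyGroundingHead(nn.Module):
    """Stacks blocks; fusion starts at `fuse_from_layer`; a [REG] token regresses the box."""

    def __init__(self, dim: int = 256, depth: int = 6, fuse_from_layer: int = 3):
        super().__init__()
        self.reg_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.blocks = nn.ModuleList([LanguageConditionedBlock(dim) for _ in range(depth)])
        self.fuse_from_layer = fuse_from_layer
        self.box_head = nn.Linear(dim, 4)  # (cx, cy, w, h), squashed to [0, 1]

    def forward(self, vis_tokens, lang_tokens):
        b = vis_tokens.size(0)
        vis = torch.cat([self.reg_token.expand(b, -1, -1), vis_tokens], dim=1)
        for i, blk in enumerate(self.blocks):
            vis = blk(vis, lang_tokens if i >= self.fuse_from_layer else None)
        return self.box_head(vis[:, 0]).sigmoid()


if __name__ == "__main__":
    model = TinyGroundingHead()
    boxes = model(torch.randn(2, 196, 256), torch.randn(2, 20, 256))
    print(boxes.shape)  # torch.Size([2, 4])
```

Feeding language tokens only into the later blocks mirrors the idea of reusing the uni-modal ViT for vision-language fusion at intermediate layers instead of training a separate fusion Transformer from scratch.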
Related papers
- Transformer Fusion with Optimal Transport [25.022849817421964]
Fusion is a technique for merging multiple independently-trained neural networks in order to combine their capabilities.
This paper presents a systematic approach for fusing two or more transformer-based networks exploiting Optimal Transport to (soft-)align the various architectural components.
arXiv Detail & Related papers (2023-10-09T13:40:31Z)
- Isomer: Isomerous Transformer for Zero-shot Video Object Segmentation [59.91357714415056]
We propose two Transformer variants: Context-Sharing Transformer (CST) and Semantic Gathering-Scattering Transformer (SGST).
CST learns the global-shared contextual information within image frames with lightweight computation; SGST models the semantic correlation separately for the foreground and the background.
Compared with the baseline that uses vanilla Transformers for multi-stage fusion, ours significantly increases the speed by 13 times and achieves new state-of-the-art ZVOS performance.
arXiv Detail & Related papers (2023-08-13T06:12:00Z)
- HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling [126.89573619301953]
We propose a new design of hierarchical vision transformers named HiViT (short for Hierarchical ViT).
HiViT enjoys both high efficiency and good performance in MIM.
When running MAE on ImageNet-1K, HiViT-B reports a +0.6% accuracy gain over ViT-B and a 1.9× speed-up over Swin-B.
arXiv Detail & Related papers (2022-05-30T09:34:44Z)
- Glance-and-Gaze Vision Transformer [13.77016463781053]
We propose a new vision Transformer, named Glance-and-Gaze Transformer (GG-Transformer).
It is motivated by the Glance and Gaze behavior of human beings when recognizing objects in natural scenes.
We empirically demonstrate that our method achieves consistently superior performance over previous state-of-the-art Transformers.
arXiv Detail & Related papers (2021-06-04T06:13:47Z)
- TransVG: End-to-End Visual Grounding with Transformers [102.11922622103613]
We present a transformer-based framework for visual grounding, namely TransVG, to address the task of grounding a language query to an image.
We show that the complex fusion modules can be replaced by a simple stack of transformer encoder layers with higher performance.
arXiv Detail & Related papers (2021-04-17T13:35:24Z)
- TransCenter: Transformers with Dense Queries for Multiple-Object Tracking [87.75122600164167]
We argue that the standard representation -- bounding boxes -- is not well suited to learning transformers for multiple-object tracking.
We propose TransCenter, the first transformer-based architecture for tracking the centers of multiple targets.
arXiv Detail & Related papers (2021-03-28T14:49:36Z)
- Transformers Solve the Limited Receptive Field for Monocular Depth Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper to apply transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z)