SeqTR: A Simple yet Universal Network for Visual Grounding
- URL: http://arxiv.org/abs/2203.16265v1
- Date: Wed, 30 Mar 2022 12:52:46 GMT
- Title: SeqTR: A Simple yet Universal Network for Visual Grounding
- Authors: Chaoyang Zhu, Yiyi Zhou, Yunhang Shen, Gen Luo, Xingjia Pan, Mingbao
Lin, Chao Chen, Liujuan Cao, Xiaoshuai Sun, Rongrong Ji
- Abstract summary: We propose a simple yet universal network termed SeqTR for visual grounding tasks.
We cast visual grounding as a point prediction problem conditioned on image and text inputs.
Under this paradigm, visual grounding tasks are unified in our SeqTR network without task-specific branches or heads.
- Score: 88.03253818868204
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a simple yet universal network termed SeqTR for
visual grounding tasks, e.g., phrase localization, referring expression
comprehension (REC) and segmentation (RES). The canonical paradigms for visual
grounding often require substantial expertise in designing network
architectures and loss functions, making them hard to generalize across tasks.
To simplify and unify the modeling, we cast visual grounding as a point
prediction problem conditioned on image and text inputs, where either the
bounding box or binary mask is represented as a sequence of discrete coordinate
tokens. Under this paradigm, visual grounding tasks are unified in our SeqTR
network without task-specific branches or heads, e.g., the convolutional mask
decoder for RES, which greatly reduces the complexity of multi-task modeling.
In addition, SeqTR also shares the same optimization objective for all tasks
with a simple cross-entropy loss, further reducing the complexity of deploying
hand-crafted loss functions. Experiments on five benchmark datasets demonstrate
that the proposed SeqTR outperforms (or is on par with) existing state-of-the-art
methods, demonstrating that a simple yet universal approach to visual grounding is
indeed feasible.
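To make the point-prediction paradigm concrete, the sketch below illustrates how a bounding box (or a set of sampled mask contour points) can be quantized into discrete coordinate tokens and supervised with a plain cross-entropy loss. This is a minimal illustration under stated assumptions, not the authors' implementation: the bin count NUM_BINS, the helper quantize_points, and the randomly initialized decoder logits are hypothetical stand-ins for demonstration only.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of "coordinates as discrete tokens": quantize continuous
# (x, y) positions into integer bins so that a bounding box (or sampled mask
# contour points) becomes a short token sequence that can be supervised with
# ordinary cross-entropy.
NUM_BINS = 1000  # assumed vocabulary size; the paper's exact value may differ


def quantize_points(points_xy: torch.Tensor, img_w: int, img_h: int) -> torch.Tensor:
    """Map pixel-space (x, y) points of shape (..., K, 2) to interleaved tokens."""
    x_tok = (points_xy[..., 0] / img_w * (NUM_BINS - 1)).round().long()
    y_tok = (points_xy[..., 1] / img_h * (NUM_BINS - 1)).round().long()
    return torch.stack([x_tok, y_tok], dim=-1).flatten(-2)  # (..., 2*K)


# A bounding box becomes a 4-token sequence [x1, y1, x2, y2].
box = torch.tensor([[48.0, 32.0, 256.0, 200.0]]).view(1, 2, 2)
box_tokens = quantize_points(box, img_w=640, img_h=480)  # shape (1, 4)

# Training then reduces to per-token classification over the coordinate bins;
# the random logits below stand in for a sequence decoder's output.
logits = torch.randn(1, box_tokens.size(1), NUM_BINS)
loss = F.cross_entropy(logits.flatten(0, 1), box_tokens.flatten())
print(box_tokens.tolist(), loss.item())
```

Under this token interface, REC (a short box sequence) and RES (a longer sequence of sampled mask points) can share the same decoder and the same cross-entropy objective, which is the unification the abstract describes.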
Related papers
- Human-Guided Complexity-Controlled Abstractions [30.38996929410352]
We train neural models to generate a spectrum of discrete representations and control the complexity.
We show that tuning the representation to a task-appropriate complexity level supports the highest finetuning performance.
Our results indicate a promising direction for rapid model finetuning by leveraging human insight.
arXiv Detail & Related papers (2023-10-26T16:45:34Z)
- ClusVPR: Efficient Visual Place Recognition with Clustering-based Weighted Transformer [13.0858576267115]
We present ClusVPR, a novel approach that tackles the specific issues of redundant information in duplicate regions and representations of small objects.
ClusVPR introduces a unique paradigm called the Clustering-based Weighted Transformer Network (CWTNet).
We also introduce the optimized-VLAD layer that significantly reduces the number of parameters and enhances model efficiency.
arXiv Detail & Related papers (2023-10-06T09:01:15Z)
- Semantic-aware Modular Capsule Routing for Visual Question Answering [55.03883681191765]
We propose a Semantic-aware modUlar caPsulE framework, termed SUPER, to better capture the instance-specific vision-semantic characteristics.
We comparatively justify the effectiveness and generalization ability of our proposed SUPER scheme over five benchmark datasets.
arXiv Detail & Related papers (2022-07-21T10:48:37Z)
- A Unified Architecture of Semantic Segmentation and Hierarchical Generative Adversarial Networks for Expression Manipulation [52.911307452212256]
We develop a unified architecture of semantic segmentation and hierarchical GANs.
A unique advantage of our framework is that, on the forward pass, the semantic segmentation network conditions the generative model.
We evaluate our method on two challenging facial expression translation benchmarks, AffectNet and RaFD, and a semantic segmentation benchmark, CelebAMask-HQ.
arXiv Detail & Related papers (2021-12-08T22:06:31Z)
- Referring Transformer: A One-step Approach to Multi-task Visual Grounding [45.42959940733406]
We propose a simple one-stage multi-task framework for visual grounding tasks.
Specifically, we leverage a transformer architecture, where two modalities are fused in a visual-lingual encoder.
We show that our model benefits greatly from contextualized information and multi-task training.
arXiv Detail & Related papers (2021-06-06T10:53:39Z)
- End-to-End Object Detection with Transformers [88.06357745922716]
We present a new method that views object detection as a direct set prediction problem.
Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components.
The main ingredients of the new framework, called DEtection TRansformer (DETR), are a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture.
arXiv Detail & Related papers (2020-05-26T17:06:38Z)
- A Model-driven Deep Neural Network for Single Image Rain Removal [52.787356046951494]
We propose a model-driven deep neural network for the task, with fully interpretable network structures.
Based on the convolutional dictionary learning mechanism for representing rain, we propose a novel single image deraining model.
All the rain kernels and operators can be automatically extracted, faithfully characterizing the features of both rain and clean background layers.
arXiv Detail & Related papers (2020-05-04T09:13:25Z)
- LSM: Learning Subspace Minimization for Low-level Vision [78.27774638569218]
We replace the regularization term with a learnable subspace constraint, and preserve the data term to exploit domain knowledge.
This learning subspace minimization (LSM) framework unifies the network structures and the parameters for many low-level vision tasks.
We demonstrate our LSM framework on four low-level tasks including interactive image segmentation, video segmentation, stereo matching, and optical flow, and validate the network on various datasets.
arXiv Detail & Related papers (2020-04-20T10:49:38Z)