AttnGrounder: Talking to Cars with Attention
- URL: http://arxiv.org/abs/2009.05684v2
- Date: Fri, 11 Dec 2020 10:00:22 GMT
- Title: AttnGrounder: Talking to Cars with Attention
- Authors: Vivek Mittal
- Abstract summary: We propose a single-stage end-to-end trainable model for the task of visual grounding.
Visual grounding aims to localize a specific object in an image based on a given natural language text query.
We evaluate AttnGrounder on the Talk2Car dataset and show an improvement of 3.26% over the existing methods.
- Score: 6.09170287691728
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose Attention Grounder (AttnGrounder), a single-stage end-to-end
trainable model for the task of visual grounding. Visual grounding aims to
localize a specific object in an image based on a given natural language text
query. Unlike previous methods that use the same text representation for every
image region, we use a visual-text attention module that relates each word in
the given query with every region in the corresponding image to construct a
region-dependent text representation. Furthermore, to improve the
localization ability of our model, we use our visual-text attention module to
generate an attention mask around the referred object. The attention mask is
trained as an auxiliary task using a rectangular mask generated with the
provided ground-truth coordinates. We evaluate AttnGrounder on the Talk2Car
dataset and show an improvement of 3.26% over the existing methods.
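The two components described in the abstract can be illustrated with a short, hypothetical PyTorch sketch. The names below (VisualTextAttention, rectangular_mask, attention_mask_loss) and the exact way the per-region mask logits are derived are our own assumptions for illustration, not the paper's released implementation: the sketch only shows word-region attention producing a region-dependent text representation, and a rectangular target mask built from ground-truth box coordinates for the auxiliary loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisualTextAttention(nn.Module):
    """Relates each word in the text query to every image region and returns
    a region-dependent text representation plus per-region attention logits."""

    def __init__(self, dim: int):
        super().__init__()
        self.query_proj = nn.Linear(dim, dim)  # projects region (visual) features
        self.key_proj = nn.Linear(dim, dim)    # projects word (text) features
        self.value_proj = nn.Linear(dim, dim)

    def forward(self, region_feats: torch.Tensor, word_feats: torch.Tensor):
        # region_feats: (B, R, D) with R = H*W spatial regions
        # word_feats:   (B, T, D) with T = words in the query
        q = self.query_proj(region_feats)                             # (B, R, D)
        k = self.key_proj(word_feats)                                 # (B, T, D)
        v = self.value_proj(word_feats)                               # (B, T, D)
        scores = torch.bmm(q, k.transpose(1, 2)) / q.size(-1) ** 0.5  # (B, R, T)
        attn = scores.softmax(dim=-1)
        # Each region gets its own weighted mixture of word features,
        # i.e. a region-dependent text representation.
        region_text = torch.bmm(attn, v)                              # (B, R, D)
        # Assumed per-region saliency: the strongest word response per region,
        # later reshaped to H x W and supervised as the attention mask.
        mask_logits = scores.max(dim=-1).values                       # (B, R)
        return region_text, mask_logits


def rectangular_mask(boxes: torch.Tensor, grid_h: int, grid_w: int) -> torch.Tensor:
    """Builds the auxiliary rectangular target from ground-truth boxes.

    boxes: (B, 4) normalized (x1, y1, x2, y2) in [0, 1].
    Returns a (B, grid_h, grid_w) binary mask that is 1 inside each box.
    """
    mask = torch.zeros(boxes.size(0), grid_h, grid_w)
    for i, (x1, y1, x2, y2) in enumerate(boxes.tolist()):
        r1, c1 = int(y1 * grid_h), int(x1 * grid_w)
        r2, c2 = max(r1 + 1, int(y2 * grid_h)), max(c1 + 1, int(x2 * grid_w))
        mask[i, r1:r2, c1:c2] = 1.0
    return mask


def attention_mask_loss(mask_logits, boxes, grid_h, grid_w):
    # Auxiliary task: push the predicted per-region mask toward the
    # rectangular target derived from the ground-truth box.
    target = rectangular_mask(boxes, grid_h, grid_w).view(boxes.size(0), -1)
    return F.binary_cross_entropy_with_logits(mask_logits, target)
```

In training, such an auxiliary loss would typically be added to the main localization loss with a weighting factor; the backbone, feature dimensions, and how the mask logits are actually produced differ in the paper itself.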
Related papers
- In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation [50.79940712523551]
We present lazy visual grounding, a two-stage approach of unsupervised object mask discovery followed by object grounding.
Our model requires no additional training yet shows great performance on five public datasets.
arXiv Detail & Related papers (2024-08-09T09:28:35Z)
- Top-Down Framework for Weakly-supervised Grounded Image Captioning [19.00510117145054]
Weakly-supervised grounded image captioning aims to generate the caption and ground (localize) predicted object words in the input image without using bounding box supervision.
We propose a one-stage weakly-supervised grounded captioner that directly takes the RGB image as input to perform captioning and grounding at the top-down image level.
arXiv Detail & Related papers (2023-06-13T01:42:18Z)
- Locate Then Generate: Bridging Vision and Language with Bounding Box for Scene-Text VQA [15.74007067413724]
We propose a novel framework for Scene Text Visual Question Answering (STVQA), which requires models to read scene text in images for question answering.
arXiv Detail & Related papers (2023-04-04T07:46:40Z)
- Neural Implicit Vision-Language Feature Fields [40.248658511361015]
We present a zero-shot volumetric open-vocabulary semantic scene segmentation method.
Our method builds on the insight that we can fuse image features from a vision-language model into a neural implicit representation.
We show that our method works on noisy real-world data and can run in real time on live sensor data, dynamically adjusting to text prompts.
arXiv Detail & Related papers (2023-03-20T09:38:09Z)
- SpaText: Spatio-Textual Representation for Controllable Image Generation [61.89548017729586]
SpaText is a new method for text-to-image generation using open-vocabulary scene control.
In addition to a global text prompt that describes the entire scene, the user provides a segmentation map.
We show its effectiveness on two state-of-the-art diffusion models: pixel-based and latent-conditional-based.
arXiv Detail & Related papers (2022-11-25T18:59:10Z)
- Visually-Augmented Language Modeling [137.36789885105642]
We propose a novel pre-training framework, named VaLM, to Visually-augment text tokens with retrieved relevant images for Language Modeling.
With the visually-augmented context, VaLM uses a visual knowledge fusion layer to enable multimodal grounded language modeling.
We evaluate the proposed model on various multimodal commonsense reasoning tasks, which require visual information to excel.
arXiv Detail & Related papers (2022-05-20T13:41:12Z)
- Improving Visual Grounding with Visual-Linguistic Verification and Iterative Reasoning [42.29650807349636]
We propose a transformer-based framework for accurate visual grounding.
We develop a visual-linguistic verification module to focus the visual features on regions relevant to the textual descriptions.
A language-guided feature encoder is also devised to aggregate the visual contexts of the target object to improve the object's distinctiveness.
arXiv Detail & Related papers (2022-04-30T13:48:15Z)
- RegionCLIP: Region-based Language-Image Pretraining [94.29924084715316]
Contrastive language-image pretraining (CLIP) using image-text pairs has achieved impressive results on image classification.
We propose a new method called RegionCLIP that significantly extends CLIP to learn region-level visual representations.
Our method significantly outperforms the state of the art by 3.8 AP50 and 2.2 AP for novel categories on the COCO and LVIS datasets.
arXiv Detail & Related papers (2021-12-16T18:39:36Z)
- Integrating Visuospatial, Linguistic and Commonsense Structure into Story Visualization [81.26077816854449]
We first explore the use of constituency parse trees for encoding structured input.
Second, we augment the structured input with commonsense information and study the impact of this external knowledge on the generation of visual stories.
Third, we incorporate visual structure via bounding boxes and dense captioning to provide feedback about the characters/objects in generated images.
arXiv Detail & Related papers (2021-10-21T00:16:02Z)
- Distributed Attention for Grounded Image Captioning [55.752968732796354]
We study the problem of weakly supervised grounded image captioning.
The goal is to automatically generate a sentence describing the context of the image, with each noun word grounded to the corresponding region in the image.
arXiv Detail & Related papers (2021-08-02T17:28:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.