MAGNet: Multi-Region Attention-Assisted Grounding of Natural Language
Queries at Phrase Level
- URL: http://arxiv.org/abs/2006.03776v1
- Date: Sat, 6 Jun 2020 04:14:15 GMT
- Title: MAGNet: Multi-Region Attention-Assisted Grounding of Natural Language
Queries at Phrase Level
- Authors: Amar Shrestha, Krittaphat Pugdeethosapol, Haowen Fang, Qinru Qiu
- Abstract summary: We propose to utilize spatial attention networks for image-level visual-textual fusion.
We refine region proposals with an in-network Region Proposal Network (RPN) and detect single or multiple regions for a phrase query.
On the ReferIt Game referring-expression dataset, our Multi-region Attention-assisted Grounding network (MAGNet) achieves over a 12% improvement over the state of the art.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Grounding free-form textual queries necessitates an understanding of these
textual phrases and their relation to the visual cues to reliably reason about
the described locations. Spatial attention networks are known to learn this
relationship and focus their gaze on salient objects in the image. Thus, we
propose to utilize spatial attention networks for image-level visual-textual
fusion preserving local (word) and global (phrase) information to refine region
proposals with an in-network Region Proposal Network (RPN) and detect single or
multiple regions for a phrase query. We focus only on the phrase query - ground
truth pair (referring expression), so the model is independent of
dataset-specific constraints such as additional attributes and context. On the
ReferIt Game referring-expression dataset, our Multi-region Attention-assisted
Grounding network (MAGNet) achieves over a 12% improvement over the state of the art.
Without the context from image captions and attribute information in Flickr30k
Entities, we still achieve competitive results compared to the
state-of-the-art.
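The abstract's core idea can be illustrated with a minimal numpy sketch of image-level visual-textual fusion via spatial attention: each spatial location of an image feature map is scored against both word-level (local) and phrase-level (global) text embeddings, and the resulting attention map weights the features. This is a simplified illustration under assumed shapes, not the authors' exact MAGNet architecture (which additionally refines region proposals with an in-network RPN).

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array
    e = np.exp(x - x.max())
    return e / e.sum()

def spatial_attention_fusion(feat_map, word_embs, phrase_emb):
    """Fuse textual cues into an image feature map via spatial attention.

    feat_map:   (H, W, D) image feature map
    word_embs:  (T, D) per-word embeddings (local cues)
    phrase_emb: (D,) pooled phrase embedding (global cue)
    Returns a spatial attention map (H, W) and the attended feature (D,).
    """
    H, W, D = feat_map.shape
    flat = feat_map.reshape(-1, D)               # (H*W, D)
    # Local relevance: best-matching word at each spatial location
    local = (flat @ word_embs.T).max(axis=1)     # (H*W,)
    # Global relevance: similarity to the pooled phrase embedding
    global_ = flat @ phrase_emb                  # (H*W,)
    attn = softmax(local + global_)              # attention over locations
    fused = attn @ flat                          # attention-weighted pooling
    return attn.reshape(H, W), fused
```

In the full model, the attention-fused features would feed a region proposal stage that scores candidate boxes for the phrase; here the sketch stops at the fused representation.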
Related papers
- Question-Answer Cross Language Image Matching for Weakly Supervised
Semantic Segmentation
Class Activation Map (CAM) has emerged as a popular tool for weakly supervised semantic segmentation.
We propose a novel Question-Answer Cross-Language-Image Matching framework for WSSS (QA-CLIMS).
arXiv Detail & Related papers (2024-01-18T10:55:13Z)
- Top-Down Framework for Weakly-supervised Grounded Image Captioning
Weakly-supervised grounded image captioning aims to generate the caption and ground (localize) predicted object words in the input image without using bounding box supervision.
We propose a one-stage weakly-supervised grounded captioner that directly takes the RGB image as input to perform captioning and grounding at the top-down image level.
arXiv Detail & Related papers (2023-06-13T01:42:18Z)
- Scene Graph Based Fusion Network For Image-Text Retrieval
A critical challenge to image-text retrieval is how to learn accurate correspondences between images and texts.
We propose a Scene Graph based Fusion Network (dubbed SGFN) which enhances the images'/texts' features through intra- and cross-modal fusion.
Our SGFN outperforms several state-of-the-art image-text retrieval methods.
arXiv Detail & Related papers (2023-03-20T13:22:56Z)
- HGAN: Hierarchical Graph Alignment Network for Image-Text Retrieval
We propose a novel Hierarchical Graph Alignment Network (HGAN) for image-text retrieval.
First, to capture the comprehensive multimodal features, we construct the feature graphs for the image and text modality respectively.
Then, a multi-granularity shared space is established with a designed Multi-granularity Feature Aggregation and Rearrangement (MFAR) module.
Finally, the ultimate image and text features are further refined through three-level similarity functions to achieve the hierarchical alignment.
arXiv Detail & Related papers (2022-12-16T05:08:52Z)
- Image-Specific Information Suppression and Implicit Local Alignment for
Text-based Person Search
Text-based person search (TBPS) is a challenging task that aims to search pedestrian images with the same identity from an image gallery given a query text.
Most existing methods rely on explicitly generated local parts to model fine-grained correspondence between modalities.
We propose an efficient joint Multi-level Alignment Network (MANet) for TBPS, which can learn aligned image/text feature representations between modalities at multiple levels.
arXiv Detail & Related papers (2022-08-30T16:14:18Z)
- RegionCLIP: Region-based Language-Image Pretraining
Contrastive language-image pretraining (CLIP) using image-text pairs has achieved impressive results on image classification.
We propose a new method called RegionCLIP that significantly extends CLIP to learn region-level visual representations.
Our method significantly outperforms the state of the art by 3.8 AP50 and 2.2 AP for novel categories on COCO and LVIS datasets.
arXiv Detail & Related papers (2021-12-16T18:39:36Z)
- Distributed Attention for Grounded Image Captioning
We study the problem of weakly supervised grounded image captioning.
The goal is to automatically generate a sentence describing the context of the image with each noun word grounded to the corresponding region in the image.
arXiv Detail & Related papers (2021-08-02T17:28:33Z)
- Disentangled Motif-aware Graph Learning for Phrase Grounding
We propose a novel graph learning framework for phrase grounding in the image.
We devise the disentangled graph network to integrate the motif-aware contextual information into representations.
Our model achieves state-of-the-art performance on Flickr30K Entities and ReferIt Game benchmarks.
arXiv Detail & Related papers (2021-04-13T08:20:07Z)
- MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase
Grounding
We present algorithms to model phrase-object relevance by leveraging fine-grained visual representations and visually-aware language representations.
Experiments conducted on the widely-adopted Flickr30k dataset show a significant improvement over existing weakly-supervised methods.
arXiv Detail & Related papers (2020-10-12T00:43:52Z)
- PhraseCut: Language-based Image Segmentation in the Wild
We consider the problem of segmenting image regions given a natural language phrase.
Our dataset is collected on top of the Visual Genome dataset.
Our experiments show that the scale and diversity of concepts in our dataset pose significant challenges to existing state-of-the-art methods.
arXiv Detail & Related papers (2020-08-03T20:58:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.