Context Does Matter: End-to-end Panoptic Narrative Grounding with
Deformable Attention Refined Matching Network
- URL: http://arxiv.org/abs/2310.16616v1
- Date: Wed, 25 Oct 2023 13:12:39 GMT
- Title: Context Does Matter: End-to-end Panoptic Narrative Grounding with
Deformable Attention Refined Matching Network
- Authors: Yiming Lin, Xiao-Bo Jin, Qiufeng Wang, Kaizhu Huang
- Abstract summary: Panoptic Narrative Grounding (PNG) aims to segment visual objects in images based on dense narrative captions.
We propose a novel learning framework called Deformable Attention Refined Matching Network (DRMN).
DRMN iteratively re-encodes pixels with the deformable attention network after updating the feature representation of the top-$k$ most similar pixels.
- Score: 25.511804582983977
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Panoptic Narrative Grounding (PNG) is an emerging visual grounding task that
aims to segment visual objects in images based on dense narrative captions. The
current state-of-the-art methods first refine the phrase representations by
aggregating the $k$ most similar image pixels, and then match the refined text
representations with the pixels of the image feature map to generate
segmentation results. However, simply aggregating sampled image features
ignores the contextual information, which can lead to phrase-to-pixel
mismatch. In this paper, we propose a novel learning framework called
Deformable Attention Refined Matching Network (DRMN), whose main idea is to
bring deformable attention into the iterative process of feature learning to
incorporate essential contextual information across different pixel scales. DRMN
iteratively re-encodes pixels with the deformable attention network after
updating the feature representation of the top-$k$ most similar pixels. As
such, DRMN can lead to accurate yet discriminative pixel representations,
purify the top-$k$ most similar pixels, and consequently alleviate the
phrase-to-pixel mismatch substantially. Experimental results show that our
novel design significantly improves the matching results between text phrases
and image pixels. Concretely, DRMN achieves new state-of-the-art performance on
the PNG benchmark with an average recall improvement of 3.5%. The code is
available at: https://github.com/JaMesLiMers/DRMN.
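To make the described loop concrete, here is a minimal PyTorch sketch of the iterative top-$k$ matching-and-re-encoding idea from the abstract. It is an illustration under stated assumptions, not the authors' implementation: the paper's multi-scale deformable attention is replaced by a simplified single-scale stand-in built on learned offsets and grid sampling, and the mean-pooling of top-$k$ pixel features, the residual updates, and all names and hyperparameters (SimpleDeformableAttention, drmn_refinement, k, n_iters) are hypothetical.

```python
# Illustrative sketch only; module names, shapes, and the simplified
# single-scale deformable attention are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDeformableAttention(nn.Module):
    """Single-scale stand-in for deformable attention: each pixel attends to
    n_points learned offset locations sampled from the feature map."""
    def __init__(self, dim: int, n_points: int = 4):
        super().__init__()
        self.n_points = n_points
        self.offset_head = nn.Conv2d(dim, 2 * n_points, 1)  # (dx, dy) per point
        self.weight_head = nn.Conv2d(dim, n_points, 1)      # per-point weights
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        b, c, h, w = feat.shape
        offsets = self.offset_head(feat).view(b, self.n_points, 2, h, w)
        weights = self.weight_head(feat).softmax(dim=1)     # (b, n_points, h, w)
        # Base sampling grid in [-1, 1], (x, y) order as F.grid_sample expects.
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h, device=feat.device),
            torch.linspace(-1, 1, w, device=feat.device),
            indexing="ij",
        )
        base = torch.stack((xs, ys), dim=-1)                # (h, w, 2)
        out = 0.0
        for p in range(self.n_points):
            off = offsets[:, p].permute(0, 2, 3, 1)         # (b, h, w, 2)
            grid = (base.unsqueeze(0) + off).clamp(-1, 1)
            sampled = F.grid_sample(feat, grid, align_corners=True)
            out = out + weights[:, p : p + 1] * sampled
        return self.proj(out)

def drmn_refinement(pixel_feat, phrase_feat, deform_attn, k=50, n_iters=2):
    """Iterative phrase-to-pixel matching with deformable re-encoding.
    pixel_feat: (b, c, h, w); phrase_feat: (b, n_phrases, c)."""
    b, c, h, w = pixel_feat.shape
    for _ in range(n_iters):
        pixels = pixel_feat.flatten(2).transpose(1, 2)      # (b, h*w, c)
        sim = phrase_feat @ pixels.transpose(1, 2)          # (b, n_phrases, h*w)
        topk = sim.topk(k, dim=-1).indices                  # (b, n_phrases, k)
        gathered = torch.gather(
            pixels.unsqueeze(1).expand(-1, phrase_feat.size(1), -1, -1),
            2, topk.unsqueeze(-1).expand(-1, -1, -1, c),
        )                                                   # (b, n_phrases, k, c)
        # Refine each phrase with its top-k most similar pixels, then
        # re-encode pixels with deformable attention to restore context
        # before the next round of top-k selection.
        phrase_feat = phrase_feat + gathered.mean(dim=2)
        pixel_feat = pixel_feat + deform_attn(pixel_feat)
    # Final matching: per-phrase segmentation logits over the feature map.
    logits = phrase_feat @ pixel_feat.flatten(2)            # (b, n_phrases, h*w)
    return logits.view(b, -1, h, w)

# Example usage (shapes are illustrative):
feats = torch.randn(2, 256, 32, 32)
phrases = torch.randn(2, 10, 256)
attn = SimpleDeformableAttention(256)
masks = drmn_refinement(feats, phrases, attn, k=50, n_iters=2)  # (2, 10, 32, 32)
```

The point the sketch tries to capture is the ordering the abstract emphasizes: after each phrase update from its top-$k$ most similar pixels, the pixels themselves are re-encoded with deformable attention so the next round of top-$k$ selection sees context-aware features, which is what the paper credits for alleviating phrase-to-pixel mismatch.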
Related papers
- Exploring Fine-Grained Image-Text Alignment for Referring Remote Sensing Image Segmentation [27.95875467352853]
We propose a new referring remote sensing image segmentation method, FIANet, that fully exploits the visual and linguistic representations.
The proposed fine-grained image-text alignment module (FIAM) simultaneously leverages the features of the input image and the corresponding texts.
We evaluate the effectiveness of the proposed methods on two public referring remote sensing datasets including RefSegRS and RRSIS-D.
arXiv Detail & Related papers (2024-09-20T16:45:32Z) - Improving fine-grained understanding in image-text pre-training [37.163228122323865]
We introduce SPARse Fine-grained Contrastive Alignment (SPARC), a simple method for pretraining more fine-grained multimodal representations from image-text pairs.
We show improved performance over competing approaches on both image-level tasks relying on coarse-grained information and region-level tasks relying on fine-grained information.
arXiv Detail & Related papers (2024-01-18T10:28:45Z) - Enriching Phrases with Coupled Pixel and Object Contexts for Panoptic
Narrative Grounding [43.657151728626125]
Panoptic narrative grounding aims to segment things and stuff objects in an image described by noun phrases of a narrative caption.
We propose a Phrase-Pixel-Object Transformer Decoder (PPO-TD) to enrich phrases with coupled pixel and object contexts.
Our method achieves new state-of-the-art performance with large margins.
arXiv Detail & Related papers (2023-11-02T08:55:28Z) - Text Augmented Spatial-aware Zero-shot Referring Image Segmentation [60.84423786769453]
We introduce a Text Augmented Spatial-aware (TAS) zero-shot referring image segmentation framework.
TAS incorporates a mask proposal network for instance-level mask extraction, a text-augmented visual-text matching score for mining the image-text correlation, and a spatial rectifier for mask post-processing.
The proposed method clearly outperforms state-of-the-art zero-shot referring image segmentation methods.
arXiv Detail & Related papers (2023-10-27T10:52:50Z) - CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning to achieve text-to-pixel alignment.
Our proposed framework significantly outperforms the state-of-the-art performance without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z) - ISNet: Integrate Image-Level and Semantic-Level Context for Semantic
Segmentation [64.56511597220837]
Co-occurrent visual patterns make aggregating contextual information a common paradigm for enhancing pixel representations in semantic image segmentation.
Existing approaches focus on modeling the context from the perspective of the whole image, i.e., aggregating the image-level contextual information.
This paper proposes to augment the pixel representations by aggregating the image-level and semantic-level contextual information.
arXiv Detail & Related papers (2021-08-27T16:38:22Z) - DAE-GAN: Dynamic Aspect-aware GAN for Text-to-Image Synthesis [55.788772366325105]
We propose a Dynamic Aspect-awarE GAN (DAE-GAN) that represents text information comprehensively from multiple granularities, including sentence-level, word-level, and aspect-level.
Inspired by human learning behaviors, we develop a novel Aspect-aware Dynamic Re-drawer (ADR) for image refinement, in which an Attended Global Refinement (AGR) module and an Aspect-aware Local Refinement (ALR) module are alternately employed.
arXiv Detail & Related papers (2021-08-27T07:20:34Z) - Mining Contextual Information Beyond Image for Semantic Segmentation [37.783233906684444]
The paper studies the context aggregation problem in semantic image segmentation.
It proposes to mine the contextual information beyond individual images to further augment the pixel representations.
The proposed method could be effortlessly incorporated into existing segmentation frameworks.
arXiv Detail & Related papers (2021-08-26T14:34:23Z) - AINet: Association Implantation for Superpixel Segmentation [82.21559299694555]
We propose a novel Association Implantation (AI) module to enable the network to explicitly capture the relations between the pixel and its surrounding grids.
Our method could not only achieve state-of-the-art performance but maintain satisfactory inference efficiency.
arXiv Detail & Related papers (2021-01-26T10:40:13Z) - Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal
Transformers [46.275416873403614]
We propose Pixel-BERT to align image pixels with text by deep multi-modal transformers that jointly learn visual and language embedding.
Our approach achieves state-of-the-art results in downstream tasks, including Visual Question Answering (VQA), image-text retrieval, and Natural Language for Visual Reasoning for Real (NLVR).
arXiv Detail & Related papers (2020-04-02T07:39:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.