Enriching Phrases with Coupled Pixel and Object Contexts for Panoptic
Narrative Grounding
- URL: http://arxiv.org/abs/2311.01091v2
- Date: Sun, 10 Mar 2024 12:59:53 GMT
- Title: Enriching Phrases with Coupled Pixel and Object Contexts for Panoptic
Narrative Grounding
- Authors: Tianrui Hui, Zihan Ding, Junshi Huang, Xiaoming Wei, Xiaolin Wei, Jiao
Dai, Jizhong Han, Si Liu
- Abstract summary: Panoptic narrative grounding aims to segment things and stuff objects in an image described by noun phrases of a narrative caption.
We propose a Phrase-Pixel-Object Transformer Decoder (PPO-TD) to enrich phrases with coupled pixel and object contexts.
Our method achieves new state-of-the-art performance by large margins.
- Score: 43.657151728626125
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Panoptic narrative grounding (PNG) aims to segment things and stuff objects
in an image described by noun phrases of a narrative caption. As a multimodal
task, an essential aspect of PNG is the visual-linguistic interaction between
image and caption. The previous two-stage method aggregates visual contexts
from offline-generated mask proposals, which tend to be noisy and fragmentary,
into phrase features. The recent one-stage method aggregates only pixel
contexts from image features into phrase features, which may incur semantic
misalignment due to the lack of object priors. To realize more comprehensive
visual-linguistic interaction, we propose to enrich phrases with coupled pixel
and object contexts by designing a Phrase-Pixel-Object Transformer Decoder
(PPO-TD), where both fine-grained part details and coarse-grained entity clues
are aggregated into phrase features. In addition, we propose a Phrase-Object
Contrastive Loss (POCL) that pulls matched phrase-object pairs closer and
pushes unmatched ones apart, so that more precise object contexts are
aggregated from more phrase-relevant object tokens. Extensive experiments on
the PNG benchmark show that our method achieves new state-of-the-art
performance by large margins.
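To make the two ideas in the abstract concrete, below is a minimal, illustrative sketch (not the authors' released code) of a decoder layer that enriches phrase features with coupled pixel and object contexts, together with an InfoNCE-style phrase-object contrastive loss. The module names, feature dimensions, attention ordering, and the exact form of the loss are assumptions made for illustration.

```python
# Illustrative sketch only: module names, dimensions, attention ordering,
# and the InfoNCE-style loss form are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PhrasePixelObjectLayer(nn.Module):
    """One decoder layer: phrase queries attend to pixel features
    (fine-grained part details) and to object tokens (coarse-grained
    entity clues), then pass through a feed-forward block."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.pixel_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.object_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(),
                                 nn.Linear(4 * dim, dim))
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(4)])

    def forward(self, phrases, pixels, objects):
        # phrases: (B, Np, D), pixels: (B, H*W, D), objects: (B, No, D)
        x = self.norms[0](phrases + self.pixel_attn(phrases, pixels, pixels)[0])
        x = self.norms[1](x + self.object_attn(x, objects, objects)[0])
        x = self.norms[2](x + self.self_attn(x, x, x)[0])
        return self.norms[3](x + self.ffn(x))


def phrase_object_contrastive_loss(phrases, objects, match, tau: float = 0.07):
    """InfoNCE-style loss: for each phrase, its matched object token is the
    positive and all other object tokens of the same image are negatives.
    match[b, i] holds the index of the object matched to phrase i."""
    p = F.normalize(phrases, dim=-1)                     # (B, Np, D)
    o = F.normalize(objects, dim=-1)                     # (B, No, D)
    logits = torch.einsum("bpd,bod->bpo", p, o) / tau    # (B, Np, No)
    return F.cross_entropy(logits.flatten(0, 1), match.flatten())
```

In the full model, several such layers would presumably be stacked on top of an image backbone and a set of object queries, with POCL added to the standard segmentation losses during training.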
Related papers
- Dynamic Prompting of Frozen Text-to-Image Diffusion Models for Panoptic Narrative Grounding [39.73180294057053]
We propose an Extractive-Injective Phrase Adapter (EIPA) bypass within the Diffusion UNet to dynamically update phrase prompts with image features.
We also design a Multi-Level Mutual Aggregation (MLMA) module to reciprocally fuse multi-level image and phrase features for segmentation refinement.
arXiv Detail & Related papers (2024-09-12T17:48:22Z) - In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation [50.79940712523551]
We present lazy visual grounding, a two-stage approach of unsupervised object mask discovery followed by object grounding.
Our model requires no additional training yet shows great performance on five public datasets.
arXiv Detail & Related papers (2024-08-09T09:28:35Z) - Exploring Phrase-Level Grounding with Text-to-Image Diffusion Model [61.389233691596004]
We introduce the DiffPNG framework, which capitalizes on the diffusion model's architecture for segmentation by decomposing the process into a sequence of localization, segmentation, and refinement steps.
Our experiments on the PNG dataset demonstrate that DiffPNG achieves strong performance in the zero-shot PNG task setting.
arXiv Detail & Related papers (2024-07-07T13:06:34Z) - Context Does Matter: End-to-end Panoptic Narrative Grounding with
Deformable Attention Refined Matching Network [25.511804582983977]
Panoptic Narrative Grounding (PNG) aims to segment visual objects in images based on dense narrative captions.
We propose a novel learning framework called Deformable Attention Refined Matching Network (DRMN)
DRMN iteratively re-encodes pixels with the deformable attention network after updating the feature representations of the top-k most similar pixels.
arXiv Detail & Related papers (2023-10-25T13:12:39Z) - Visually-Aware Context Modeling for News Image Captioning [54.31708859631821]
News Image Captioning aims to create captions from news articles and images.
We propose a face-naming module for learning better name embeddings.
We use CLIP to retrieve sentences that are semantically close to the image.
arXiv Detail & Related papers (2023-08-16T12:39:39Z) - PPMN: Pixel-Phrase Matching Network for One-Stage Panoptic Narrative
Grounding [24.787497472368244]
We propose a one-stage end-to-end Pixel-Phrase Matching Network (PPMN), which directly matches each phrase to its corresponding pixels instead of region proposals (see the sketch after this list).
Our method achieves new state-of-the-art performance on the PNG benchmark with 4.0 absolute Average Recall gains.
arXiv Detail & Related papers (2022-08-11T05:42:12Z) - Panoptic-based Object Style-Align for Image-to-Image Translation [2.226472061870956]
We propose panoptic-based object style-align generative adversarial networks (POSA-GANs) for image-to-image translation.
The proposed method was systematically compared with competing methods and obtained significant improvements in both image quality and object recognition performance for translated images.
arXiv Detail & Related papers (2021-12-03T14:28:11Z) - MOC-GAN: Mixing Objects and Captions to Generate Realistic Images [21.240099965546637]
We introduce a more rational setting, generating a realistic image from the objects and captions.
Under this setting, objects explicitly define the critical roles in the targeted images and captions implicitly describe their rich attributes and connections.
A MOC-GAN is proposed to mix the inputs of two modalities to generate realistic images.
arXiv Detail & Related papers (2021-06-06T14:04:07Z) - Dense Relational Image Captioning via Multi-task Triple-Stream Networks [95.0476489266988]
We introduce dense relational captioning, a novel task which aims to generate captions with respect to relational information between objects in a visual scene.
This framework is advantageous in both diversity and amount of information, leading to a comprehensive image understanding.
arXiv Detail & Related papers (2020-10-08T09:17:55Z) - Expressing Objects just like Words: Recurrent Visual Embedding for
Image-Text Matching [102.62343739435289]
Existing image-text matching approaches infer the similarity of an image-text pair by capturing and aggregating the affinities between the text and each independent object of the image.
We propose a Dual Path Recurrent Neural Network (DP-RNN), which processes images and sentences symmetrically with recurrent neural networks (RNNs).
Our model achieves the state-of-the-art performance on Flickr30K dataset and competitive performance on MS-COCO dataset.
arXiv Detail & Related papers (2020-02-20T00:51:01Z)
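For contrast with the coupled pixel-and-object aggregation of PPO-TD above, the one-stage pixel-phrase matching idea behind PPMN (the "recent one-stage method" discussed in the abstract) can be sketched as a direct similarity between phrase features and per-pixel features. The shapes and the normalized dot-product matching head below are assumptions for illustration, not the PPMN authors' code.

```python
# Assumed shapes and a cosine-similarity matching head; illustrative only.
import torch
import torch.nn.functional as F


def pixel_phrase_matching(pixel_feats: torch.Tensor,
                          phrase_feats: torch.Tensor) -> torch.Tensor:
    """pixel_feats: (B, D, H, W) features from a visual backbone.
    phrase_feats: (B, Np, D), one feature per noun phrase.
    Returns per-phrase mask logits of shape (B, Np, H, W)."""
    pixel_feats = F.normalize(pixel_feats, dim=1)
    phrase_feats = F.normalize(phrase_feats, dim=-1)
    # Similarity between every phrase and every pixel, no region proposals.
    return torch.einsum("bnd,bdhw->bnhw", phrase_feats, pixel_feats)


# Usage: apply a sigmoid and threshold the logits to obtain binary masks.
masks = pixel_phrase_matching(torch.randn(2, 256, 64, 64),
                              torch.randn(2, 5, 256)).sigmoid() > 0.5
```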