Exploring Phrase-Level Grounding with Text-to-Image Diffusion Model
- URL: http://arxiv.org/abs/2407.05352v1
- Date: Sun, 7 Jul 2024 13:06:34 GMT
- Title: Exploring Phrase-Level Grounding with Text-to-Image Diffusion Model
- Authors: Danni Yang, Ruohan Dong, Jiayi Ji, Yiwei Ma, Haowei Wang, Xiaoshuai Sun, Rongrong Ji
- Abstract summary: We introduce the DiffPNG framework, which capitalizes on the diffusion model's architecture for segmentation by decomposing the process into a sequence of localization, segmentation, and refinement steps.
Our experiments on the PNG dataset demonstrate that DiffPNG achieves strong performance in the zero-shot PNG task setting.
- Score: 61.389233691596004
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recently, diffusion models have increasingly demonstrated their capabilities in vision understanding. By leveraging prompt-based learning to construct sentences, these models have shown proficiency in classification and visual grounding tasks. However, existing approaches primarily showcase their ability to perform sentence-level localization, leaving the potential for leveraging contextual information for phrase-level understanding largely unexplored. In this paper, we utilize Panoptic Narrative Grounding (PNG) as a proxy task to investigate this capability further. PNG aims to segment object instances mentioned by multiple noun phrases within a given narrative text. Specifically, we introduce the DiffPNG framework, a straightforward yet effective approach that fully capitalizes on the diffusion model's architecture for segmentation by decomposing the process into a sequence of localization, segmentation, and refinement steps. The framework initially identifies anchor points using cross-attention mechanisms and subsequently performs segmentation with self-attention to achieve zero-shot PNG. Moreover, we introduce a refinement module based on SAM to enhance the quality of the segmentation masks. Our extensive experiments on the PNG dataset demonstrate that DiffPNG achieves strong performance in the zero-shot PNG task setting, conclusively proving the diffusion model's capability for context-aware, phrase-level understanding. Source code is available at https://github.com/nini0919/DiffPNG.
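As a rough illustration of the locate-then-segment decomposition described in the abstract, the sketch below (not the authors' code; the tensor shapes, the `locate_then_segment` helper, and the single-anchor-per-phrase simplification are all assumptions) shows how per-phrase cross-attention maps from a diffusion UNet could yield anchor points that are then propagated through a self-attention map into binary masks. The SAM-based refinement step is omitted.

```python
# Minimal sketch of the locate-then-segment idea, under assumed shapes:
# cross-attention of shape (P, H, W) for P noun phrases, and a
# self-attention map of shape (H*W, H*W) from the same UNet layer.
import torch

def locate_then_segment(cross_attn, self_attn, threshold=0.5):
    """cross_attn: (P, H, W) phrase-to-pixel attention.
    self_attn:  (H*W, H*W) pixel-to-pixel attention.
    Returns binary masks of shape (P, H, W)."""
    P, H, W = cross_attn.shape
    flat = cross_attn.view(P, -1)                   # (P, H*W)
    anchors = flat.argmax(dim=1)                    # one anchor pixel per phrase
    # Propagate each anchor through self-attention to obtain a dense map.
    dense = self_attn[anchors]                      # (P, H*W)
    dense = dense - dense.min(dim=1, keepdim=True).values
    dense = dense / dense.max(dim=1, keepdim=True).values.clamp(min=1e-8)
    return (dense.view(P, H, W) > threshold).float()

# Toy usage with random tensors standing in for UNet attention maps.
H = W = 16
cross = torch.rand(3, H, W)                         # 3 noun phrases
self_a = torch.softmax(torch.rand(H * W, H * W), dim=1)
masks = locate_then_segment(cross, self_a)
print(masks.shape)                                  # torch.Size([3, 16, 16])
```

A faithful implementation would presumably aggregate attention across multiple UNet layers and denoising timesteps, and may use several anchor points per phrase, rather than the single map and single anchor used here.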
Related papers
- Dynamic Prompting of Frozen Text-to-Image Diffusion Models for Panoptic Narrative Grounding [39.73180294057053]
We propose an Extractive-Injective Phrase Adapter (EIPA) bypass within the Diffusion UNet to dynamically update phrase prompts with image features.
We also design a Multi-Level Mutual Aggregation (MLMA) module to reciprocally fuse multi-level image and phrase features for segmentation refinement.
arXiv Detail & Related papers (2024-09-12T17:48:22Z)
- Leveraging Open-Vocabulary Diffusion to Camouflaged Instance Segmentation [59.78520153338878]
Text-to-image diffusion techniques have shown an exceptional capability to produce high-quality images from text descriptions.
We propose a method built upon a state-of-the-art diffusion model, empowered by open-vocabulary learning to extract multi-scale textual-visual features for camouflaged object representations.
arXiv Detail & Related papers (2023-12-29T07:59:07Z)
- CPSeg: Finer-grained Image Semantic Segmentation via Chain-of-Thought Language Prompting [8.12405696290333]
CPSeg is a framework designed to augment image segmentation performance by integrating a novel "Chain-of-Thought" process.
We propose a new vision-language dataset, FloodPrompt, which includes images, semantic masks, and corresponding text information.
arXiv Detail & Related papers (2023-10-24T13:32:32Z)
- Holistic Prototype Attention Network for Few-Shot VOS [74.25124421163542]
Few-shot video object segmentation (FSVOS) aims to segment dynamic objects of unseen classes by resorting to a small set of support images.
We propose a holistic prototype attention network (HPAN) for advancing FSVOS.
arXiv Detail & Related papers (2023-07-16T03:48:57Z)
- Prompting Diffusion Representations for Cross-Domain Semantic Segmentation [101.04326113360342]
Diffusion pretraining achieves extraordinary domain generalization results for semantic segmentation.
We introduce a scene prompt and a prompt randomization strategy to help further disentangle the domain-invariant information when training the segmentation head.
arXiv Detail & Related papers (2023-07-05T09:28:25Z)
- Diffusion Models for Open-Vocabulary Segmentation [79.02153797465324]
OVDiff is a novel method that leverages generative text-to-image diffusion models for unsupervised open-vocabulary segmentation.
It relies solely on pre-trained components and outputs the synthesised segmenter directly, without training.
arXiv Detail & Related papers (2023-06-15T17:51:28Z)
- Towards Real-Time Panoptic Narrative Grounding by an End-to-End Grounding Network [39.64953170583401]
Panoptic Narrative Grounding (PNG) is an emerging cross-modal grounding task.
We propose a one-stage network for real-time PNG, termed the End-to-End Panoptic Narrative Grounding network (EPNG).
Our method achieves a significant improvement of up to 9.4% in accuracy.
arXiv Detail & Related papers (2023-01-09T03:57:14Z)
- Group-Wise Semantic Mining for Weakly Supervised Semantic Segmentation [49.90178055521207]
This work addresses weakly supervised semantic segmentation (WSSS), with the goal of bridging the gap between image-level annotations and pixel-level segmentation.
We formulate WSSS as a novel group-wise learning task that explicitly models semantic dependencies in a group of images to estimate more reliable pseudo ground-truths.
In particular, we devise a graph neural network (GNN) for group-wise semantic mining, wherein input images are represented as graph nodes.
arXiv Detail & Related papers (2020-12-09T12:40:13Z)