PPMN: Pixel-Phrase Matching Network for One-Stage Panoptic Narrative
Grounding
- URL: http://arxiv.org/abs/2208.05647v1
- Date: Thu, 11 Aug 2022 05:42:12 GMT
- Title: PPMN: Pixel-Phrase Matching Network for One-Stage Panoptic Narrative
Grounding
- Authors: Zihan Ding, Zi-han Ding, Tianrui Hui, Junshi Huang, Xiaoming Wei,
Xiaolin Wei, Si Liu
- Abstract summary: We propose a one-stage end-to-end Pixel-Phrase Matching Network (PPMN), which directly matches each phrase to its corresponding pixels instead of region proposals.
Our method achieves new state-of-the-art performance on the PNG benchmark with 4.0 absolute Average Recall gains.
- Score: 24.787497472368244
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Panoptic Narrative Grounding (PNG) is an emerging task whose goal is to
segment visual objects of things and stuff categories described by dense
narrative captions of a still image. The previous two-stage approach first
extracts segmentation region proposals by an off-the-shelf panoptic
segmentation model, then conducts coarse region-phrase matching to ground the
candidate regions for each noun phrase. However, the two-stage pipeline usually
suffers from the performance limitation of low-quality proposals in the first
stage and the loss of spatial details caused by region feature pooling, as well
as complicated strategies designed for things and stuff categories separately.
To alleviate these drawbacks, we propose a one-stage end-to-end Pixel-Phrase
Matching Network (PPMN), which directly matches each phrase to its
corresponding pixels instead of region proposals and outputs panoptic
segmentation by simple combination. Thus, our model can exploit sufficient and
finer cross-modal semantic correspondence from the supervision of densely
annotated pixel-phrase pairs rather than sparse region-phrase pairs. In
addition, we propose a Language-Compatible Pixel Aggregation (LCPA) module
to further enhance the discriminative ability of phrase features through
multi-round refinement, which selects the most compatible pixels for each
phrase to adaptively aggregate the corresponding visual context. Extensive
experiments show that our method achieves new state-of-the-art performance on
the PNG benchmark with 4.0 absolute Average Recall gains.
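The core idea in the abstract above can be sketched in code: each phrase embedding is matched against every pixel embedding to produce a per-phrase segmentation logit map, and an LCPA-style round gathers each phrase's top-k most compatible pixels to refine the phrase feature. This is a minimal illustrative sketch, not the authors' implementation; the tensor shapes, the use of cosine similarity, the softmax weighting, and the residual update are all assumptions.

```python
import torch
import torch.nn.functional as F


def pixel_phrase_matching(pixel_feats, phrase_feats):
    """Match every phrase to every pixel via cosine similarity.

    pixel_feats:  (C, H, W) visual feature map
    phrase_feats: (N, C) one embedding per noun phrase
    returns:      (N, H, W) per-phrase segmentation logits
    """
    C, H, W = pixel_feats.shape
    pixels = F.normalize(pixel_feats.flatten(1), dim=0)   # (C, H*W)
    phrases = F.normalize(phrase_feats, dim=1)            # (N, C)
    logits = phrases @ pixels                             # (N, H*W)
    return logits.view(-1, H, W)


def lcpa_round(phrase_feats, pixel_feats, k=16):
    """One hypothetical LCPA-style refinement round: for each phrase,
    select its top-k most compatible pixels and aggregate their features
    back into the phrase embedding (residual update)."""
    C, H, W = pixel_feats.shape
    pixels = pixel_feats.flatten(1)                       # (C, H*W)
    sim = F.normalize(phrase_feats, dim=1) @ F.normalize(pixels, dim=0)
    topk = sim.topk(k, dim=1)                             # values/indices: (N, k)
    weights = topk.values.softmax(dim=1)                  # attention over the k pixels
    gathered = pixels.t()[topk.indices]                   # (N, k, C)
    context = (weights.unsqueeze(-1) * gathered).sum(1)   # (N, C)
    return phrase_feats + context
```

In this reading, multi-round refinement would simply apply `lcpa_round` several times before recomputing the final pixel-phrase logits.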
Related papers
- FGAseg: Fine-Grained Pixel-Text Alignment for Open-Vocabulary Semantic Segmentation [63.31007867379312]
Open-vocabulary segmentation aims to identify and segment specific regions and objects based on text-based descriptions.
A common solution is to leverage powerful vision-language models (VLMs), such as CLIP, to bridge the gap between vision and text information.
However, such models are typically trained with image-level alignment, whereas segmentation tasks require fine-grained pixel-level alignment and detailed category boundary information.
We propose FGAseg, a model designed for fine-grained pixel-text alignment and category boundary supplementation.
arXiv Detail & Related papers (2025-01-01T15:47:04Z) - Dynamic Prompting of Frozen Text-to-Image Diffusion Models for Panoptic Narrative Grounding [39.73180294057053]
We propose an Extractive-Injective Phrase Adapter (EIPA) bypass within the Diffusion UNet to dynamically update phrase prompts with image features.
We also design a Multi-Level Mutual Aggregation (MLMA) module to reciprocally fuse multi-level image and phrase features for segmentation refinement.
arXiv Detail & Related papers (2024-09-12T17:48:22Z) - Fine-grained Background Representation for Weakly Supervised Semantic Segmentation [35.346567242839065]
This paper proposes a simple fine-grained background representation (FBR) method to discover and represent diverse BG semantics.
We present an active sampling strategy to mine the FG negatives on-the-fly, enabling efficient pixel-to-pixel intra-foreground contrastive learning.
Our method achieves 73.2 mIoU and 45.6 mIoU on the Pascal VOC and MS COCO test sets, respectively.
arXiv Detail & Related papers (2024-06-22T06:45:25Z) - Enriching Phrases with Coupled Pixel and Object Contexts for Panoptic
Narrative Grounding [43.657151728626125]
Panoptic narrative grounding aims to segment things and stuff objects in an image described by noun phrases of a narrative caption.
We propose a Phrase-Pixel-Object Transformer Decoder (PPO-TD) to enrich phrases with coupled pixel and object contexts.
Our method achieves new state-of-the-art performance with large margins.
arXiv Detail & Related papers (2023-11-02T08:55:28Z) - Context Does Matter: End-to-end Panoptic Narrative Grounding with
Deformable Attention Refined Matching Network [25.511804582983977]
Panoptic Narrative Grounding (PNG) aims to segment visual objects in images based on dense narrative captions.
We propose a novel learning framework called Deformable Attention Refined Matching Network (DRMN)
DRMN iteratively re-encodes pixels with the deformable attention network after updating the feature representation of the top-$k$ most similar pixels.
arXiv Detail & Related papers (2023-10-25T13:12:39Z) - Diffusion Models for Open-Vocabulary Segmentation [79.02153797465324]
OVDiff is a novel method that leverages generative text-to-image diffusion models for unsupervised open-vocabulary segmentation.
It relies solely on pre-trained components and outputs the synthesised segmenter directly, without training.
arXiv Detail & Related papers (2023-06-15T17:51:28Z) - Associating Spatially-Consistent Grouping with Text-supervised Semantic
Segmentation [117.36746226803993]
We introduce self-supervised spatially-consistent grouping with text-supervised semantic segmentation.
Considering the part-like grouped results, we further adapt a text-supervised model from image-level to region-level recognition.
Our method achieves 59.2% mIoU and 32.4% mIoU on Pascal VOC and Pascal Context benchmarks.
arXiv Detail & Related papers (2023-04-03T16:24:39Z) - Weakly-supervised segmentation of referring expressions [81.73850439141374]
Text-grounded semantic SEGmentation (TSEG) learns segmentation masks directly from image-level referring expressions without pixel-level annotations.
Our approach demonstrates promising results for weakly-supervised referring expression segmentation on the PhraseCut and RefCOCO datasets.
arXiv Detail & Related papers (2022-05-10T07:52:24Z) - Semi-supervised Semantic Segmentation with Directional Context-aware
Consistency [66.49995436833667]
We focus on the semi-supervised segmentation problem where only a small set of labeled data is provided with a much larger collection of totally unlabeled images.
A preferred high-level representation should capture the contextual information while not losing self-awareness.
We present the Directional Contrastive Loss (DC Loss) to accomplish the consistency in a pixel-to-pixel manner.
arXiv Detail & Related papers (2021-06-27T03:42:40Z) - SCNet: Enhancing Few-Shot Semantic Segmentation by Self-Contrastive
Background Prototypes [56.387647750094466]
Few-shot semantic segmentation aims to segment novel-class objects in a query image with only a few annotated examples.
Most advanced solutions exploit a metric learning framework that performs segmentation by matching each pixel to a learned foreground prototype.
This framework suffers from biased classification due to incomplete construction of sample pairs with the foreground prototype only.
arXiv Detail & Related papers (2021-04-19T11:21:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.