PPMN: Pixel-Phrase Matching Network for One-Stage Panoptic Narrative Grounding
- URL: http://arxiv.org/abs/2208.05647v1
- Date: Thu, 11 Aug 2022 05:42:12 GMT
- Title: PPMN: Pixel-Phrase Matching Network for One-Stage Panoptic Narrative Grounding
- Authors: Zihan Ding, Zi-han Ding, Tianrui Hui, Junshi Huang, Xiaoming Wei, Xiaolin Wei, Si Liu
- Abstract summary: We propose a one-stage end-to-end Pixel-Phrase Matching Network (PPMN), which directly matches each phrase to its corresponding pixels instead of region proposals.
Our method achieves new state-of-the-art performance on the PNG benchmark with a 4.0-point absolute gain in Average Recall.
- Score: 24.787497472368244
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Panoptic Narrative Grounding (PNG) is an emerging task whose goal is to
segment visual objects of things and stuff categories described by dense
narrative captions of a still image. The previous two-stage approach first
extracts segmentation region proposals by an off-the-shelf panoptic
segmentation model, then conducts coarse region-phrase matching to ground the
candidate regions for each noun phrase. However, the two-stage pipeline usually
suffers from low-quality proposals in the first stage, the loss of spatial
details caused by region feature pooling, and the complicated strategies
designed separately for things and stuff categories.
To alleviate these drawbacks, we propose a one-stage end-to-end Pixel-Phrase
Matching Network (PPMN), which directly matches each phrase to its
corresponding pixels instead of region proposals and outputs panoptic
segmentation by simple combination. Thus, our model can exploit sufficient and
finer cross-modal semantic correspondence from the supervision of densely
annotated pixel-phrase pairs rather than sparse region-phrase pairs. In
addition, we propose a Language-Compatible Pixel Aggregation (LCPA) module
to further enhance the discriminative ability of phrase features through
multi-round refinement, which selects the most compatible pixels for each
phrase to adaptively aggregate the corresponding visual context. Extensive
experiments show that our method achieves new state-of-the-art performance on
the PNG benchmark, with a 4.0-point absolute gain in Average Recall.
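
To make the abstract's two ingredients concrete, here is a minimal PyTorch sketch of dense pixel-phrase matching and one round of LCPA-style aggregation, in which each phrase selects its most compatible pixels and absorbs their visual context. The function names, tensor shapes, the value of k, cosine normalization, and the softmax-weighted residual update are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def pixel_phrase_matching(pixel_feats, phrase_feats):
    """Dense pixel-phrase matching.

    pixel_feats:  (B, C, H, W) visual feature map
    phrase_feats: (B, N, C)    one embedding per noun phrase
    Returns per-phrase response maps of shape (B, N, H, W); thresholding
    (or an argmax over phrases) would yield the grounding masks.
    """
    pixel_feats = F.normalize(pixel_feats, dim=1)
    phrase_feats = F.normalize(phrase_feats, dim=-1)
    # Cosine similarity between every phrase and every pixel.
    return torch.einsum("bnc,bchw->bnhw", phrase_feats, pixel_feats)


def lcpa_round(pixel_feats, phrase_feats, k=64):
    """One round of LCPA-style aggregation (assumed form): each phrase
    selects its top-k most compatible pixels and aggregates their
    features to refine its own embedding."""
    B, C, H, W = pixel_feats.shape
    N = phrase_feats.size(1)
    flat = pixel_feats.flatten(2)                            # (B, C, HW)
    sim = torch.einsum("bnc,bcm->bnm", phrase_feats, flat)   # (B, N, HW)
    topk_sim, topk_idx = sim.topk(k, dim=-1)                 # (B, N, k)
    # Gather the selected pixel features for each phrase: (B, N, k, C).
    idx = topk_idx.unsqueeze(-1).expand(-1, -1, -1, C)
    pixels = flat.transpose(1, 2).unsqueeze(1).expand(-1, N, -1, -1)
    selected = pixels.gather(2, idx)
    # Compatibility-weighted aggregation of the visual context.
    weights = topk_sim.softmax(dim=-1).unsqueeze(-1)         # (B, N, k, 1)
    context = (weights * selected).sum(dim=2)                # (B, N, C)
    return phrase_feats + context
```

In the setting the abstract describes, several such refinement rounds would alternate with re-matching, and the response maps would be supervised by the densely annotated pixel-phrase pairs before being combined into the final panoptic segmentation.
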
Related papers
- Dynamic Prompting of Frozen Text-to-Image Diffusion Models for Panoptic Narrative Grounding [39.73180294057053]
We propose an Extractive-Injective Phrase Adapter (EIPA) bypass within the Diffusion UNet to dynamically update phrase prompts with image features.
We also design a Multi-Level Mutual Aggregation (MLMA) module to reciprocally fuse multi-level image and phrase features for segmentation refinement.
arXiv Detail & Related papers (2024-09-12T17:48:22Z)
- MROVSeg: Breaking the Resolution Curse of Vision-Language Models in Open-Vocabulary Semantic Segmentation [33.67313662538398]
We propose a multi-resolution training framework for open-vocabulary semantic segmentation with a single pretrained CLIP backbone.
MROVSeg uses sliding windows to slice the high-resolution input into uniform patches, each matching the input size of the well-trained image encoder (a minimal slicing sketch appears after this list).
We demonstrate the superiority of MROVSeg on well-established open-vocabulary semantic segmentation benchmarks.
arXiv Detail & Related papers (2024-08-27T04:45:53Z)
- Fine-grained Background Representation for Weakly Supervised Semantic Segmentation [35.346567242839065]
This paper proposes a simple fine-grained background representation (FBR) method to discover and represent diverse background (BG) semantics.
We present an active sampling strategy to mine foreground (FG) negatives on-the-fly, enabling efficient pixel-to-pixel intra-foreground contrastive learning.
Our method achieves 73.2 mIoU and 45.6 mIoU on the Pascal VOC and MS COCO test sets, respectively.
arXiv Detail & Related papers (2024-06-22T06:45:25Z)
- GP-NeRF: Generalized Perception NeRF for Context-Aware 3D Scene Understanding [101.32590239809113]
Generalized Perception NeRF (GP-NeRF) is a novel pipeline that makes the widely used segmentation model and NeRF work compatibly under a unified framework.
We propose two self-distillation mechanisms, i.e., the Semantic Distill Loss and the Depth-Guided Semantic Distill Loss, to enhance the discrimination and quality of the semantic field.
arXiv Detail & Related papers (2023-11-20T15:59:41Z)
- Enriching Phrases with Coupled Pixel and Object Contexts for Panoptic Narrative Grounding [43.657151728626125]
Panoptic narrative grounding aims to segment things and stuff objects in an image described by noun phrases of a narrative caption.
We propose a Phrase-Pixel-Object Transformer Decoder (PPO-TD) to enrich phrases with coupled pixel and object contexts.
Our method achieves new state-of-the-art performance with large margins.
arXiv Detail & Related papers (2023-11-02T08:55:28Z)
- Context Does Matter: End-to-end Panoptic Narrative Grounding with Deformable Attention Refined Matching Network [25.511804582983977]
Panoptic Narrative Grounding (PNG) aims to segment visual objects in images based on dense narrative captions.
We propose a novel learning framework called Deformable Attention Refined Matching Network (DRMN).
DRMN iteratively re-encodes pixels with the deformable attention network after updating the feature representation of the top-$k$ most similar pixels.
arXiv Detail & Related papers (2023-10-25T13:12:39Z)
- Diffusion Models for Open-Vocabulary Segmentation [79.02153797465324]
OVDiff is a novel method that leverages generative text-to-image diffusion models for unsupervised open-vocabulary segmentation.
It relies solely on pre-trained components and outputs the synthesised segmenter directly, without training.
arXiv Detail & Related papers (2023-06-15T17:51:28Z)
- Associating Spatially-Consistent Grouping with Text-supervised Semantic Segmentation [117.36746226803993]
We introduce self-supervised spatially-consistent grouping with text-supervised semantic segmentation.
Considering the part-like grouped results, we further adapt a text-supervised model from image-level to region-level recognition.
Our method achieves 59.2% mIoU and 32.4% mIoU on the Pascal VOC and Pascal Context benchmarks, respectively.
arXiv Detail & Related papers (2023-04-03T16:24:39Z)
- Weakly-supervised segmentation of referring expressions [81.73850439141374]
Text-grounded semantic SEGmentation learns segmentation masks directly from image-level referring expressions without pixel-level annotations.
Our approach demonstrates promising results for weakly-supervised referring expression segmentation on the PhraseCut and RefCOCO datasets.
arXiv Detail & Related papers (2022-05-10T07:52:24Z)
- Semi-supervised Semantic Segmentation with Directional Context-aware Consistency [66.49995436833667]
We focus on the semi-supervised segmentation problem where only a small set of labeled data is provided with a much larger collection of totally unlabeled images.
A preferred high-level representation should capture the contextual information while not losing self-awareness.
We present the Directional Contrastive Loss (DC Loss) to accomplish the consistency in a pixel-to-pixel manner.
arXiv Detail & Related papers (2021-06-27T03:42:40Z)
- SCNet: Enhancing Few-Shot Semantic Segmentation by Self-Contrastive Background Prototypes [56.387647750094466]
Few-shot semantic segmentation aims to segment novel-class objects in a query image with only a few annotated examples.
Most advanced solutions exploit a metric learning framework that performs segmentation by matching each pixel to a learned foreground prototype.
This framework suffers from biased classification because sample pairs are constructed with the foreground prototype only.
arXiv Detail & Related papers (2021-04-19T11:21:47Z)
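
As referenced in the MROVSeg entry above, here is a minimal PyTorch sketch of the sliding-window slicing step that description mentions: a high-resolution image is padded and cut into uniform patches matching a pretrained encoder's input size. The function name, the 336-pixel window, the stride, and the zero-padding policy are illustrative assumptions, not MROVSeg's implementation; per-patch encoding and the stitching of predictions back to full resolution are not shown.

```python
import torch.nn.functional as F


def slice_into_windows(image, window=336, stride=336):
    """Slice one high-resolution image (C, H, W) into uniform patches that
    match a pretrained encoder's input size (window x window)."""
    C, H, W = image.shape
    # Pad on the bottom/right so the sliding window tiles the image exactly.
    pad_h = (-(H - window)) % stride if H > window else window - H
    pad_w = (-(W - window)) % stride if W > window else window - W
    image = F.pad(image, (0, pad_w, 0, pad_h))
    # Unfold over height, then width: (C, nH, nW, window, window).
    patches = image.unfold(1, window, stride).unfold(2, window, stride)
    n_h, n_w = patches.shape[1], patches.shape[2]
    # Reorder to a batch of encoder-ready crops: (nH * nW, C, window, window).
    patches = patches.permute(1, 2, 0, 3, 4).reshape(n_h * n_w, C, window, window)
    return patches, (n_h, n_w)
```

For instance, a 3x1024x768 input with these defaults would be padded to 1344x1008 and cut into a 4x3 grid of twelve 336x336 patches, each of which can be fed to the image encoder independently.
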
This list is automatically generated from the titles and abstracts of the papers in this site.