Towards Real-Time Panoptic Narrative Grounding by an End-to-End
Grounding Network
- URL: http://arxiv.org/abs/2301.03160v1
- Date: Mon, 9 Jan 2023 03:57:14 GMT
- Title: Towards Real-Time Panoptic Narrative Grounding by an End-to-End
Grounding Network
- Authors: Haowei Wang, Jiayi Ji, Yiyi Zhou, Yongjian Wu, Xiaoshuai Sun
- Abstract summary: Panoptic Narrative Grounding (PNG) is an emerging cross-modal grounding task.
We propose a one-stage network for real-time PNG, termed End-to-End Panoptic Narrative Grounding network (EPNG).
Our method achieves a significant improvement of up to 9.4% in accuracy.
- Score: 39.64953170583401
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Panoptic Narrative Grounding (PNG) is an emerging cross-modal grounding task,
which locates the target regions of an image corresponding to the text
description. Existing approaches for PNG are mainly based on a two-stage
paradigm, which is computationally expensive. In this paper, we propose a
one-stage network for real-time PNG, termed End-to-End Panoptic Narrative
Grounding network (EPNG), which directly generates masks for referents.
Specifically, we propose two innovative designs, i.e., Locality-Perceptive
Attention (LPA) and a bidirectional Semantic Alignment Loss (SAL), to properly
handle the many-to-many relationship between textual expressions and visual
objects. LPA embeds local spatial priors into attention modeling to account for
the fact that a pixel may belong to multiple masks at different scales, thereby
improving segmentation. To capture the complex semantic relationships, SAL
imposes a bidirectional contrastive objective that regularizes semantic
consistency across modalities. Extensive experiments on the PNG benchmark
dataset demonstrate the effectiveness and efficiency of our method. Compared to
the single-stage baseline, our method achieves a significant improvement of up
to 9.4% in accuracy. More importantly, our EPNG is 10 times faster than the
two-stage model. Meanwhile, the generalization ability of EPNG is also
validated by zero-shot experiments on other grounding tasks.
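A minimal PyTorch sketch of the two designs described above, for illustration only: the module and function names, tensor shapes, the Gaussian form of the local spatial prior, and the InfoNCE-style form of the alignment loss are assumptions made here, not details taken from the paper or its code.

import torch
import torch.nn as nn
import torch.nn.functional as F


class LocalityBiasedAttention(nn.Module):
    """Self-attention over pixels whose logits are biased toward nearby locations.

    The bias is a fixed Gaussian over pairwise pixel distances, an assumed
    stand-in for the local spatial prior that LPA injects into attention.
    """

    def __init__(self, dim: int, sigma: float = 4.0):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.sigma = sigma

    def forward(self, pixel_feats: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # pixel_feats: (B, H*W, C) flattened feature map
        q, k, v = self.q(pixel_feats), self.k(pixel_feats), self.v(pixel_feats)
        logits = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5   # (B, HW, HW)

        # Pairwise squared distances between pixel locations on the H x W grid.
        ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
        coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()  # (HW, 2)
        dist2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)    # (HW, HW)
        locality_bias = -dist2 / (2 * self.sigma ** 2)  # less negative for nearby pixels

        attn = (logits + locality_bias.to(logits.device)).softmax(dim=-1)
        return attn @ v  # (B, HW, C)


def bidirectional_alignment_loss(pixel_emb, phrase_emb, match, tau: float = 0.07):
    """Symmetric (pixel-to-phrase and phrase-to-pixel) contrastive objective.

    pixel_emb:  (N, C) pixel or mask embeddings
    phrase_emb: (M, C) noun-phrase embeddings
    match:      (N, M) binary matrix, 1 where a pixel belongs to a phrase's mask;
                rows and columns may contain several ones (many-to-many).
    """
    match = match.float()
    sim = F.normalize(pixel_emb, dim=-1) @ F.normalize(phrase_emb, dim=-1).T / tau
    # Treat every matched pair as a positive and everything else as a negative,
    # averaging an InfoNCE-style loss over both directions.
    p2t = -(F.log_softmax(sim, dim=1) * match).sum(1) / match.sum(1).clamp(min=1)
    t2p = -(F.log_softmax(sim, dim=0) * match).sum(0) / match.sum(0).clamp(min=1)
    return 0.5 * (p2t.mean() + t2p.mean())

The sketch keeps a single feature scale for brevity; the abstract's wording suggests the actual designs operate across multiple scales.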
Related papers
- DIAL: Dense Image-text ALignment for Weakly Supervised Semantic Segmentation [8.422110274212503]
Weakly supervised semantic segmentation approaches typically rely on class activation maps (CAMs) for initial seed generation.
We introduce DALNet, which leverages text embeddings to enhance the comprehensive understanding and precise localization of objects across different levels of granularity.
In particular, our approach allows for a more efficient end-to-end process as a single-stage method.
arXiv Detail & Related papers (2024-09-24T06:51:49Z)
- Exploring Phrase-Level Grounding with Text-to-Image Diffusion Model [61.389233691596004]
We introduce the DiffPNG framework, which capitalizes on the diffusion architecture for segmentation by decomposing the process into a sequence of localization, segmentation, and refinement steps.
Our experiments on the PNG dataset demonstrate that DiffPNG achieves strong performance in the zero-shot PNG task setting.
arXiv Detail & Related papers (2024-07-07T13:06:34Z)
- Fine-grained Background Representation for Weakly Supervised Semantic Segmentation [35.346567242839065]
This paper proposes a simple fine-grained background representation (FBR) method to discover and represent diverse background (BG) semantics.
We present an active sampling strategy to mine foreground (FG) negatives on the fly, enabling efficient pixel-to-pixel intra-foreground contrastive learning.
Our method achieves 73.2 mIoU and 45.6 mIoU segmentation results on the PASCAL VOC and MS COCO test sets, respectively.
arXiv Detail & Related papers (2024-06-22T06:45:25Z)
- Aligning and Prompting Everything All at Once for Universal Visual Perception [79.96124061108728]
APE is a universal visual perception model for aligning and prompting everything all at once in an image to perform diverse tasks.
APE advances the convergence of detection and grounding by reformulating language-guided grounding as open-vocabulary detection.
Experiments on over 160 datasets demonstrate that APE outperforms state-of-the-art models.
arXiv Detail & Related papers (2023-12-04T18:59:50Z)
- SwIPE: Efficient and Robust Medical Image Segmentation with Implicit Patch Embeddings [12.79344668998054]
We propose SwIPE (Segmentation with Implicit Patch Embeddings) to enable accurate local boundary delineation and global shape coherence.
We show that SwIPE significantly improves over recent implicit approaches and outperforms state-of-the-art discrete methods with over 10x fewer parameters.
arXiv Detail & Related papers (2023-07-23T20:55:11Z)
- Towards Effective Image Manipulation Detection with Proposal Contrastive Learning [61.5469708038966]
We propose Proposal Contrastive Learning (PCL) for effective image manipulation detection.
Our PCL adopts a two-stream architecture that extracts two types of global features from RGB and noise views, respectively.
Our PCL can be easily adapted to unlabeled data in practice, which can reduce manual labeling costs and promote more generalizable features.
arXiv Detail & Related papers (2022-10-16T13:30:13Z)
- Rethinking of the Image Salient Object Detection: Object-level Semantic Saliency Re-ranking First, Pixel-wise Saliency Refinement Latter [62.26677215668959]
We propose a lightweight, weakly supervised deep network to coarsely locate semantically salient regions.
We then fuse multiple off-the-shelf deep models on these semantically salient regions as the pixel-wise saliency refinement.
Our method is simple yet effective, and is the first attempt to treat salient object detection mainly as an object-level semantic re-ranking problem.
arXiv Detail & Related papers (2020-08-10T07:12:43Z)
- Multi-Margin based Decorrelation Learning for Heterogeneous Face Recognition [90.26023388850771]
This paper presents a deep neural network approach to extract decorrelation representations in a hyperspherical space for cross-domain face images.
The proposed framework can be divided into two components: heterogeneous representation network and decorrelation representation learning.
Experimental results on two challenging heterogeneous face databases show that our approach achieves superior performance on both verification and recognition tasks.
arXiv Detail & Related papers (2020-05-25T07:01:12Z)