TextPSG: Panoptic Scene Graph Generation from Textual Descriptions
- URL: http://arxiv.org/abs/2310.07056v1
- Date: Tue, 10 Oct 2023 22:36:15 GMT
- Title: TextPSG: Panoptic Scene Graph Generation from Textual Descriptions
- Authors: Chengyang Zhao, Yikang Shen, Zhenfang Chen, Mingyu Ding, Chuang Gan
- Abstract summary: We study a new problem of Panoptic Scene Graph Generation from Purely Textual Descriptions (Caption-to-PSG).
The key idea is to leverage the large collection of free image-caption data on the Web alone to generate panoptic scene graphs.
We propose a new framework TextPSG consisting of four modules, i.e., a region grouper, an entity grounder, a segment merger, and a label generator.
- Score: 78.1140391134517
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Panoptic Scene Graph has recently been proposed for comprehensive scene
understanding. However, previous works adopt a fully-supervised learning
manner, requiring large amounts of pixel-wise densely-annotated data, which is
always tedious and expensive to obtain. To address this limitation, we study a
new problem of Panoptic Scene Graph Generation from Purely Textual Descriptions
(Caption-to-PSG). The key idea is to leverage the large collection of free
image-caption data on the Web alone to generate panoptic scene graphs. The
problem is very challenging due to three constraints: 1) no location priors; 2) no
explicit links between visual regions and textual entities; and 3) no
pre-defined concept sets. To tackle this problem, we propose a new framework
TextPSG consisting of four modules, i.e., a region grouper, an entity grounder,
a segment merger, and a label generator, with several novel techniques. The
region grouper first groups image pixels into different segments and the entity
grounder then aligns visual segments with language entities based on the
textual description of the segment being referred to. The grounding results can
thus serve as pseudo labels enabling the segment merger to learn the segment
similarity as well as guiding the label generator to learn object semantics and
relation predicates, resulting in a fine-grained structured scene
understanding. Our framework is effective, significantly outperforming the
baselines and achieving strong out-of-distribution robustness. We perform
comprehensive ablation studies to corroborate the effectiveness of our design
choices and provide an in-depth analysis to highlight future directions. Our
code, data, and results are available on our project page:
https://vis-www.cs.umass.edu/TextPSG.
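As a concrete illustration of the four-module pipeline, the following minimal Python sketch mirrors the data flow described in the abstract. The module names come from the paper; every internal rule (grouping by pixel value, order-based grounding) is a toy assumption standing in for the learned components, not the paper's implementation.

```python
# Toy sketch of the TextPSG data flow; all internals are illustrative
# assumptions, not the paper's learned modules.

def region_grouper(image):
    """Group pixels into segments. Toy rule: one segment per distinct
    pixel value (the learned grouper uses no such prior)."""
    groups = {}
    for idx, val in enumerate(image):
        groups.setdefault(val, set()).add(idx)
    return list(groups.values())

def entity_grounder(segments, entities):
    """Align visual segments with caption entities. Toy rule: pair them
    in order; the learned grounder scores segment-text similarity."""
    return list(zip(entities, segments))

def segment_merger(grounding):
    """Merge segments grounded to the same entity; in the paper the
    grounding results act as pseudo labels for segment similarity."""
    merged = {}
    for entity, pixels in grounding:
        merged.setdefault(entity, set()).update(pixels)
    return merged

def label_generator(merged_segments, relations):
    """Keep (subject, predicate, object) triples whose endpoints were
    grounded, yielding the final panoptic scene graph."""
    return [(s, p, o) for s, p, o in relations
            if s in merged_segments and o in merged_segments]

# A 1-D "image" plus a caption already parsed into entities and relations.
image = [0, 0, 1, 1, 2]
entities = ["person", "horse", "grass"]
relations = [("person", "riding", "horse"), ("horse", "on", "grass")]

graph = label_generator(
    segment_merger(entity_grounder(region_grouper(image), entities)),
    relations)
print(graph)  # [('person', 'riding', 'horse'), ('horse', 'on', 'grass')]
```

In the real framework the grounding scores additionally supervise the merger and label generator; the toy version only mirrors the order of computation.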
Related papers
- GPT4SGG: Synthesizing Scene Graphs from Holistic and Region-specific Narratives [69.36723767339001]
We propose a novel framework named GPT4SGG to obtain more accurate and comprehensive scene graph signals.
We show GPT4SGG significantly improves the performance of SGG models trained on image-caption data.
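The core move in GPT4SGG is to have a large language model turn narratives into scene graph signals. A minimal sketch of that idea follows; `ask_llm` is a hypothetical stub standing in for the actual GPT-4 call, and the prompt wording and canned reply are illustrative only.

```python
import json

def ask_llm(prompt: str) -> str:
    """Hypothetical stand-in for a GPT-4 call; returns a canned reply so
    the sketch stays self-contained and runnable."""
    return json.dumps([{"subject": "person", "predicate": "riding",
                        "object": "horse"}])

def caption_to_triples(caption: str) -> list:
    """Prompt the model for structured (subject, predicate, object) triples."""
    prompt = ('Extract scene graph triples as a JSON list of objects with '
              '"subject", "predicate", and "object" keys from: ' + caption)
    return json.loads(ask_llm(prompt))

print(caption_to_triples("a person riding a horse across a grassy field"))
```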
arXiv Detail & Related papers (2023-12-07T14:11:00Z)
- Betrayed by Captions: Joint Caption Grounding and Generation for Open Vocabulary Instance Segmentation [80.48979302400868]
We focus on open vocabulary instance segmentation to expand a segmentation model to classify and segment instance-level novel categories.
Previous approaches have relied on massive caption datasets and complex pipelines to establish one-to-one mappings between image regions and nouns in captions.
We devise a joint Caption Grounding and Generation (CGG) framework, which incorporates a novel grounding loss that focuses only on matching objects to improve learning efficiency.
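The grounding loss can be pictured as a contrastive objective that pulls each caption noun toward its assigned image region. The sketch below is an assumption about its general shape, not CGG's exact formulation; the noun-to-region assignment `match` is taken as given.

```python
import torch
import torch.nn.functional as F

def grounding_loss(region_feats, noun_feats, match, temperature=0.07):
    """Contrastive noun-to-region matching. region_feats: (R, D) region
    embeddings; noun_feats: (N, D) noun embeddings; match: (N,) index of
    the region assigned to each noun. The assignment rule and temperature
    are illustrative assumptions, not CGG's exact recipe."""
    region = F.normalize(region_feats, dim=-1)
    noun = F.normalize(noun_feats, dim=-1)
    logits = noun @ region.T / temperature   # (N, R) scaled similarities
    return F.cross_entropy(logits, match)    # pull each noun to its region

loss = grounding_loss(torch.randn(8, 256), torch.randn(3, 256),
                      torch.tensor([0, 5, 2]))
```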
arXiv Detail & Related papers (2023-01-02T18:52:12Z)
- Image Semantic Relation Generation [0.76146285961466]
Scene graphs can distil complex image information and correct the bias of visual models using semantic-level relations.
In this work, we introduce image semantic relation generation (ISRG), a simple but effective image-to-text model.
arXiv Detail & Related papers (2022-10-19T16:15:19Z)
- Panoptic Scene Graph Generation [41.534209967051645]
Panoptic scene graph generation (PSG) is a new problem that requires the model to generate a more comprehensive scene graph representation.
A high-quality PSG dataset contains 49k well-annotated images from the overlap of COCO and Visual Genome.
arXiv Detail & Related papers (2022-07-22T17:59:53Z)
- GroupViT: Semantic Segmentation Emerges from Text Supervision [82.02467579704091]
Grouping and recognition are important components of visual scene understanding.
We propose a hierarchical Grouping Vision Transformer (GroupViT).
GroupViT learns to group together semantic regions and successfully transfers to the task of semantic segmentation in a zero-shot manner.
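GroupViT's key operation is a grouping block in which learnable group tokens softly absorb patch tokens. The simplified stand-in below keeps only that idea; the real block uses attention layers and Gumbel-Softmax assignment rather than a single plain softmax.

```python
import torch

def grouping_block(patch_tokens, group_tokens):
    """Softly assign each patch token to a group token and re-estimate the
    groups as weighted means of their patches. patch_tokens: (P, D);
    group_tokens: (G, D). Simplified stand-in for GroupViT's block."""
    logits = patch_tokens @ group_tokens.T            # (P, G) affinities
    assign = logits.softmax(dim=-1)                   # soft assignment map
    weights = assign / assign.sum(dim=0, keepdim=True).clamp(min=1e-6)
    return weights.T @ patch_tokens, assign           # (G, D) segment tokens

segments, assignment = grouping_block(torch.randn(196, 384),
                                      torch.randn(8, 384))
```

Stacking such blocks yields progressively coarser segment tokens, which is what lets segmentation emerge from text supervision alone.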
arXiv Detail & Related papers (2022-02-22T18:56:04Z)
- Segmentation-grounded Scene Graph Generation [47.34166260639392]
We propose a framework for pixel-level segmentation-grounded scene graph generation.
Our framework is agnostic to the underlying scene graph generation method.
It is learned in a multi-task manner with both target and auxiliary datasets.
arXiv Detail & Related papers (2021-04-29T08:54:08Z)
- Semantic Segmentation with Generative Models: Semi-Supervised Learning and Strong Out-of-Domain Generalization [112.68171734288237]
We propose a novel framework for discriminative pixel-level tasks using a generative model of both images and labels.
We learn a generative adversarial network that captures the joint image-label distribution and is trained efficiently using a large set of unlabeled images.
We demonstrate strong in-domain performance compared to several baselines, and are the first to showcase extreme out-of-domain generalization.
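The central design is that one latent code decodes jointly to an image and an aligned label map, so label synthesis comes along with image synthesis. The toy module below illustrates only that branching; layer sizes and architecture are placeholder assumptions, and the adversarial training on unlabeled images is omitted.

```python
import torch
import torch.nn as nn

class JointGenerator(nn.Module):
    """One latent code, two aligned outputs: an image and a per-pixel label
    map. Sizes are placeholder assumptions, not the paper's architecture."""
    def __init__(self, z_dim=64, pixels=32 * 32, num_classes=10):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU())
        self.to_image = nn.Linear(256, pixels)                 # image branch
        self.to_label = nn.Linear(256, pixels * num_classes)   # label branch
        self.num_classes = num_classes

    def forward(self, z):
        h = self.backbone(z)
        image = self.to_image(h)
        labels = self.to_label(h).view(len(z), -1, self.num_classes)
        return image, labels

image, labels = JointGenerator()(torch.randn(4, 64))  # (4, 1024), (4, 1024, 10)
```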
arXiv Detail & Related papers (2021-04-12T21:41:25Z)
- Learning Physical Graph Representations from Visual Scenes [56.7938395379406]
Physical Scene Graphs (PSGs) represent scenes as hierarchical graphs with nodes corresponding intuitively to object parts at different scales, and edges to physical connections between parts.
PSGNet augments standard CNNs by including recurrent feedback connections to combine low- and high-level image information, and graph pooling and vectorization operations that convert spatially-uniform feature maps into object-centric graph structures.
We show that PSGNet outperforms alternative self-supervised scene representation algorithms at scene segmentation tasks.
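The graph-pooling step can be read as: average the CNN features falling into each putative object part to obtain one node vector per part. The hand-rolled version below assumes the cluster assignment is given; in PSGNet that assignment is learned.

```python
import torch

def graph_pool(feature_map, cluster_ids, num_nodes):
    """Turn a spatially uniform feature map into per-node vectors by
    mean-pooling over a cluster assignment. feature_map: (H*W, D) flattened
    features; cluster_ids: (H*W,) node index per location. The assignment
    is taken as given here; PSGNet learns it."""
    d = feature_map.shape[1]
    nodes = torch.zeros(num_nodes, d)
    counts = torch.zeros(num_nodes, 1)
    nodes.index_add_(0, cluster_ids, feature_map)              # per-node sums
    counts.index_add_(0, cluster_ids, torch.ones(len(cluster_ids), 1))
    return nodes / counts.clamp(min=1)                         # per-node means

node_feats = graph_pool(torch.randn(64, 32),               # 8x8 map, D=32
                        torch.randint(0, 5, (64,)), num_nodes=5)
```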
arXiv Detail & Related papers (2020-06-22T16:10:26Z)
- PuzzleNet: Scene Text Detection by Segment Context Graph Learning [9.701699882807251]
We propose a novel decomposition-based method, termed Puzzle Networks (PuzzleNet), to address the challenging scene text detection task.
By building segments as context graphs, MSGCN effectively employs segment context to predict combinations of segments.
Our method achieves performance better than or comparable to the current state of the art, benefiting from the exploitation of the segment context graph.
arXiv Detail & Related papers (2020-02-26T09:21:05Z)