1st Place Solution for PSG competition with ECCV'22 SenseHuman Workshop
- URL: http://arxiv.org/abs/2302.02651v1
- Date: Mon, 6 Feb 2023 09:47:46 GMT
- Title: 1st Place Solution for PSG competition with ECCV'22 SenseHuman Workshop
- Authors: Qixun Wang, Xiaofeng Guo and Haofan Wang
- Abstract summary: Panoptic Scene Graph (PSG) generation aims to generate scene graph representations based on panoptic segmentation instead of rigid bounding boxes.
We propose GRNet, a Global Relation Network in a two-stage paradigm, where the pre-extracted local object features and their corresponding masks are fed into a transformer with class embeddings.
We conduct comprehensive experiments on the OpenPSG dataset and achieve state-of-the-art performance on the leaderboard.
- Score: 1.5362025549031049
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Panoptic Scene Graph (PSG) generation aims to generate scene graph
representations based on panoptic segmentation instead of rigid bounding boxes.
Existing PSG methods follow either a one-stage paradigm, which simultaneously
generates scene graphs and predicts semantic segmentation masks, or a
two-stage paradigm, which first adopts an off-the-shelf panoptic segmenter and
then predicts pairwise relationships between the predicted objects. The
one-stage approach, despite its simplified training pipeline, usually yields
unsatisfactory segmentation results, while the two-stage approach lacks global
context and performs poorly on relation prediction. To bridge this gap, we
propose GRNet, a Global Relation Network in a two-stage paradigm, where
the pre-extracted local object features and their corresponding masks are fed
into a transformer with class embeddings. To handle relation ambiguity and
predicate classification bias caused by the long-tailed predicate distribution,
we formulate relation prediction in the second stage as a multi-class
classification task with soft labels. We conduct comprehensive experiments on
the OpenPSG dataset and achieve state-of-the-art performance on the
leaderboard. We also show the effectiveness of our soft-label strategy for
long-tailed classes in ablation studies. Our code has been released at
https://github.com/wangqixun/mfpsg.
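Below is a minimal, hypothetical sketch of the approach the abstract describes: pooled per-object features and mask features from a frozen panoptic segmenter are combined with class embeddings, refined by a transformer encoder for global context, and every subject-object pair is scored over predicate classes against soft labels. All module names, dimensions, and the soft-label handling here are illustrative assumptions, not the released GRNet implementation (see the repository above for that).

```python
# Hypothetical sketch of a two-stage "global relation" model with a
# soft-label relation loss; shapes and hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalRelationSketch(nn.Module):
    def __init__(self, feat_dim=256, num_classes=133, num_predicates=56):
        super().__init__()
        self.class_embed = nn.Embedding(num_classes, feat_dim)
        self.mask_proj = nn.Linear(feat_dim, feat_dim)  # placeholder mask encoding
        layer = nn.TransformerEncoderLayer(feat_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.rel_head = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, num_predicates))

    def forward(self, obj_feats, mask_feats, obj_labels):
        # obj_feats / mask_feats: (B, N, D) pooled per-object features from a
        # frozen panoptic segmenter; obj_labels: (B, N) predicted class ids.
        tokens = obj_feats + self.mask_proj(mask_feats) + self.class_embed(obj_labels)
        ctx = self.encoder(tokens)  # (B, N, D), objects attend to each other
        B, N, D = ctx.shape
        subj = ctx.unsqueeze(2).expand(B, N, N, D)
        obj = ctx.unsqueeze(1).expand(B, N, N, D)
        # Score every (subject, object) pair over R predicate classes.
        return self.rel_head(torch.cat([subj, obj], dim=-1))  # (B, N, N, R)

def soft_label_loss(logits, soft_targets):
    # Multi-class classification against a soft target distribution rather
    # than a one-hot label, which softens predicate bias from the
    # long-tailed distribution.
    return -(soft_targets * F.log_softmax(logits, dim=-1)).sum(-1).mean()
```

When the targets are one-hot, `soft_label_loss` reduces to standard cross-entropy, so the soft-label scheme can be seen as a drop-in relaxation of the usual predicate classification objective.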
Related papers
- OpenPSG: Open-set Panoptic Scene Graph Generation via Large Multimodal Models [28.742671870397757]
Panoptic Scene Graph Generation (PSG) aims to segment objects and recognize their relations, enabling the structured understanding of an image.
Previous methods focus on predicting predefined object and relation categories, which limits their application in open-world scenarios.
In this paper, we focus on the task of open-set relation prediction integrated with a pretrained open-set panoptic segmentation model.
arXiv Detail & Related papers (2024-07-15T19:56:42Z)
- Pair then Relation: Pair-Net for Panoptic Scene Graph Generation [54.92476119356985]
Panoptic Scene Graph (PSG) aims to create a more comprehensive scene graph representation using panoptic segmentation instead of boxes.
Current PSG methods have limited performance, which hinders downstream tasks or applications.
We present a novel framework: Pair then Relation (Pair-Net), which uses a Pair Proposal Network (PPN) to learn and filter sparse pair-wise relationships between subjects and objects.
arXiv Detail & Related papers (2023-07-17T17:58:37Z)
- PUPS: Point Cloud Unified Panoptic Segmentation [13.668363631123649]
We propose a simple but effective point cloud unified panoptic segmentation (PUPS) framework.
PUPS uses a set of point-level classifiers to directly predict semantic and instance groupings in an end-to-end manner.
PUPS achieves 1st place on the leaderboard of the SemanticKITTI panoptic segmentation task and state-of-the-art results on nuScenes.
arXiv Detail & Related papers (2023-02-13T08:42:41Z)
- Panoptic Scene Graph Generation [41.534209967051645]
Panoptic scene graph generation (PSG) is a new task that requires the model to generate a more comprehensive scene graph representation.
A high-quality PSG dataset contains 49k well-annotated overlapping images from COCO and Visual Genome.
arXiv Detail & Related papers (2022-07-22T17:59:53Z)
- Vision Transformers: From Semantic Segmentation to Dense Prediction [139.15562023284187]
We explore the global context learning potentials of vision transformers (ViTs) for dense visual prediction.
Our motivation is that through learning global context at full receptive field layer by layer, ViTs may capture stronger long-range dependency information.
We formulate a family of Hierarchical Local-Global (HLG) Transformers, characterized by local attention within windows and global attention across windows in a pyramidal architecture.
arXiv Detail & Related papers (2022-07-19T15:49:35Z)
- Stacked Hybrid-Attention and Group Collaborative Learning for Unbiased Scene Graph Generation [62.96628432641806]
Scene Graph Generation aims to first encode the visual contents within the given image and then parse them into a compact summary graph.
We first present a novel Stacked Hybrid-Attention network, which facilitates the intra-modal refinement as well as the inter-modal interaction.
We then devise an innovative Group Collaborative Learning strategy to optimize the decoder.
arXiv Detail & Related papers (2022-03-18T09:14:13Z)
- Boundary-aware Self-supervised Learning for Video Scene Segmentation [20.713635723315527]
Video scene segmentation is a task of temporally localizing scene boundaries in a video.
We introduce three novel boundary-aware pretext tasks: Shot-Scene Matching, Contextual Group Matching and Pseudo-boundary Prediction.
We achieve the new state-of-the-art on the MovieNet-SSeg benchmark.
arXiv Detail & Related papers (2022-01-14T02:14:07Z)
- A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained Vision-language Model [61.58071099082296]
It is unclear how to make zero-shot recognition work well on broader vision problems, such as object detection and semantic segmentation.
In this paper, we target zero-shot semantic segmentation, building on an off-the-shelf pre-trained vision-language model, i.e., CLIP.
Our experimental results show that this simple framework surpasses the previous state of the art by a large margin (see the sketch after this list).
arXiv Detail & Related papers (2021-12-29T18:56:18Z)
- Adversarial Bipartite Graph Learning for Video Domain Adaptation [50.68420708387015]
Domain adaptation techniques, which focus on adapting models between distributionally different domains, are rarely explored in the video recognition area.
Recent works on visual domain adaptation that leverage adversarial learning to unify the source and target video representations are not highly effective on videos.
This paper proposes an Adversarial Bipartite Graph (ABG) learning framework which directly models the source-target interactions.
arXiv Detail & Related papers (2020-07-31T03:48:41Z)
- MatchGAN: A Self-Supervised Semi-Supervised Conditional Generative Adversarial Network [51.84251358009803]
We present a novel self-supervised learning approach for conditional generative adversarial networks (GANs) under a semi-supervised setting.
We perform augmentation by randomly sampling sensible labels from the label space of the few labelled examples available.
Our method surpasses the baseline with only 20% of the labelled examples used to train the baseline.
arXiv Detail & Related papers (2020-06-11T17:14:55Z)
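Referring back to the zero-shot segmentation baseline entry above, here is a minimal sketch of the general embedding-matching idea: mask-pooled visual features are classified against text embeddings of class-name prompts in a shared vision-language space. The function name, tensor shapes, and temperature value are illustrative assumptions, not that paper's actual implementation.

```python
# Sketch of zero-shot mask classification via vision-language embeddings;
# all inputs are assumed to be pre-computed (e.g., with a CLIP-style model).
import torch
import torch.nn.functional as F

def classify_masks(pixel_feats, masks, text_embeds, temperature=0.01):
    """pixel_feats: (D, H, W) image features projected into the shared
    embedding space; masks: (N, H, W) binary mask proposals;
    text_embeds: (C, D) embeddings of class-name prompts."""
    D, H, W = pixel_feats.shape
    flat = pixel_feats.reshape(D, H * W)              # (D, HW)
    m = masks.reshape(masks.shape[0], H * W).float()  # (N, HW)
    # Average-pool features inside each mask proposal.
    pooled = m @ flat.t() / m.sum(dim=1, keepdim=True).clamp(min=1)  # (N, D)
    pooled = F.normalize(pooled, dim=-1)
    text = F.normalize(text_embeds, dim=-1)
    logits = pooled @ text.t() / temperature          # (N, C) cosine scores
    return logits.softmax(dim=-1)                     # per-mask class probs
```

Because both sides live in the same embedding space, swapping in new class names only requires encoding new text prompts, which is what makes the approach zero-shot.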
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.