Panoptic Video Scene Graph Generation
- URL: http://arxiv.org/abs/2311.17058v1
- Date: Tue, 28 Nov 2023 18:59:57 GMT
- Title: Panoptic Video Scene Graph Generation
- Authors: Jingkang Yang, Wenxuan Peng, Xiangtai Li, Zujin Guo, Liangyu Chen, Bo
Li, Zheng Ma, Kaiyang Zhou, Wayne Zhang, Chen Change Loy, Ziwei Liu
- Abstract summary: We propose and study a new problem called panoptic video scene graph generation (PVSG).
PVSG relates to the existing video scene graph generation problem, which focuses on temporal interactions between humans and objects grounded with bounding boxes in videos.
We contribute the PVSG dataset, which consists of 400 videos (289 third-person + 111 egocentric videos) with a total of 150K frames labeled with panoptic segmentation masks as well as fine, temporal scene graphs.
- Score: 110.82362282102288
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Towards building comprehensive real-world visual perception systems, we
propose and study a new problem called panoptic video scene graph generation (PVSG).
PVSG relates to the existing video scene graph generation (VidSGG) problem,
which focuses on temporal interactions between humans and objects grounded with
bounding boxes in videos. However, the limitation of bounding boxes in
detecting non-rigid objects and backgrounds often causes VidSGG to miss key
details crucial for comprehensive video understanding. In contrast, PVSG
requires nodes in scene graphs to be grounded by more precise, pixel-level
segmentation masks, which facilitate holistic scene understanding. To advance
research in this new area, we contribute the PVSG dataset, which consists of
400 videos (289 third-person + 111 egocentric videos) with a total of 150K
frames labeled with panoptic segmentation masks as well as fine, temporal scene
graphs. We also provide a variety of baseline methods and share useful design
practices for future work.
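To make the task output concrete, below is a minimal sketch, in Python, of one way such mask-grounded, temporal scene graphs could be represented. The class and field names are illustrative assumptions, not the PVSG dataset's actual annotation schema.

```python
# A minimal, hypothetical sketch of a panoptic video scene graph record.
# Field names are illustrative assumptions, not the official PVSG schema.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

import numpy as np


@dataclass
class MaskTube:
    """An object or background ("stuff") region tracked through the video."""
    node_id: int
    category: str                               # e.g. "person", "water" (hypothetical labels)
    masks: Dict[int, np.ndarray] = field(default_factory=dict)  # frame index -> H x W boolean mask


@dataclass
class TemporalRelation:
    """A subject-predicate-object triple with the frame span over which it holds."""
    subject_id: int
    predicate: str                              # e.g. "washing" (hypothetical label)
    object_id: int
    span: Tuple[int, int]                       # (start_frame, end_frame), inclusive


@dataclass
class PanopticVideoSceneGraph:
    """All mask-grounded nodes and temporal relations for one video."""
    video_id: str
    nodes: List[MaskTube] = field(default_factory=list)
    relations: List[TemporalRelation] = field(default_factory=list)

    def relations_at(self, frame: int) -> List[TemporalRelation]:
        """Relations active at a given frame index."""
        return [r for r in self.relations if r.span[0] <= frame <= r.span[1]]
```

Under this view, a method must both segment and track every node at the pixel level and predict when each relation holds.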
Related papers
- 2nd Place Solution for PVUW Challenge 2024: Video Panoptic Segmentation [12.274092278786966]
Video Panoptic Segmentation (VPS) aims to simultaneously classify, track, and segment all objects in a video.
We propose a robust integrated video panoptic segmentation solution.
Our method achieves state-of-the-art performance, with VPQ scores of 56.36 and 57.12 in the development and test phases, respectively (a sketch of the VPQ metric follows this list).
arXiv Detail & Related papers (2024-06-01T17:03:16Z)
- TextPSG: Panoptic Scene Graph Generation from Textual Descriptions [78.1140391134517]
We study a new problem of Panoptic Scene Graph Generation from Purely Textual Descriptions (Caption-to-PSG)
The key idea is to leverage the large collection of free image-caption data on the Web alone to generate panoptic scene graphs.
We propose a new framework TextPSG consisting of four modules, i.e., a region grouper, an entity grounder, a segment merger, and a label generator.
arXiv Detail & Related papers (2023-10-10T22:36:15Z)
- PanoVOS: Bridging Non-panoramic and Panoramic Views with Transformer for Video Segmentation [39.269864548255576]
We present a panoramic video dataset, PanoVOS.
The dataset provides 150 videos with high video resolutions and diverse motions.
We present a Panoramic Space Consistency Transformer (PSCFormer) which can effectively utilize the semantic boundary information of the previous frame for pixel-level matching with the current frame.
arXiv Detail & Related papers (2023-09-21T17:59:02Z)
- Panoptic Scene Graph Generation [41.534209967051645]
Panoptic scene graph generation (PSG) is a new problem that requires the model to generate a more comprehensive scene graph representation.
The accompanying PSG dataset contains 49k well-annotated, overlapping images from COCO and Visual Genome.
arXiv Detail & Related papers (2022-07-22T17:59:53Z)
- Pano-AVQA: Grounded Audio-Visual Question Answering on 360$^\circ$ Videos [42.32743253830288]
We propose a benchmark named Pano-AVQA as a large-scale grounded audio-visual question answering dataset on panoramic videos.
Using 5.4K 360$^\circ$ video clips harvested online, we collect two types of novel question-answer pairs with bounding-box grounding.
We train several transformer-based models on Pano-AVQA; the results suggest that our proposed spherical spatial embeddings and multimodal training objectives contribute meaningfully to a better semantic understanding of the panoramic surroundings on the dataset.
arXiv Detail & Related papers (2021-10-11T09:58:05Z)
- Generating Masks from Boxes by Mining Spatio-Temporal Consistencies in Videos [159.02703673838639]
We introduce a method for generating segmentation masks from per-frame bounding box annotations in videos.
We use our resulting accurate masks for weakly supervised training of video object segmentation (VOS) networks.
The additional data provides substantially better generalization performance, leading to state-of-the-art results in both the VOS and the more challenging tracking domains.
arXiv Detail & Related papers (2021-01-06T18:56:24Z)
- Learning Joint Spatial-Temporal Transformations for Video Inpainting [58.939131620135235]
We propose to learn a joint Spatial-Temporal Transformer Network (STTN) for video inpainting.
We simultaneously fill missing regions in all input frames by self-attention, and propose to optimize STTN by a spatial-temporal adversarial loss.
arXiv Detail & Related papers (2020-07-20T16:35:48Z)
- Learning Physical Graph Representations from Visual Scenes [56.7938395379406]
Physical Scene Graphs (PSGs) represent scenes as hierarchical graphs with nodes corresponding intuitively to object parts at different scales, and edges to physical connections between parts.
PSGNet augments standard CNNs with recurrent feedback connections that combine low- and high-level image information, and with graph pooling and vectorization operations that convert spatially uniform feature maps into object-centric graph structures.
We show that PSGNet outperforms alternative self-supervised scene representation algorithms at scene segmentation tasks.
arXiv Detail & Related papers (2020-06-22T16:10:26Z)
- Video Panoptic Segmentation [117.08520543864054]
We propose and explore a new video extension of panoptic segmentation, called video panoptic segmentation.
To invigorate research on this new task, we present two types of video panoptic datasets.
We propose a novel video panoptic segmentation network (VPSNet) which jointly predicts object classes, bounding boxes, masks, instance id tracking, and semantic segmentation in video frames.
arXiv Detail & Related papers (2020-06-19T19:35:47Z)
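For reference on the VPQ numbers quoted in the first entry above: Video Panoptic Quality (VPQ) is the metric introduced by the Video Panoptic Segmentation paper listed last. A simplified sketch of its usual form (our summary, not quoted from either abstract): for a temporal window of $k$ frames, predicted and ground-truth segments are stacked into tubes, a tube pair counts as a true positive when its tube IoU exceeds 0.5, and

\[
\mathrm{VPQ}^{k} = \frac{1}{|\mathcal{C}|}\sum_{c \in \mathcal{C}}
\frac{\sum_{(u,\hat{u}) \in \mathit{TP}_c} \mathrm{IoU}(u,\hat{u})}
     {|\mathit{TP}_c| + \tfrac{1}{2}\,|\mathit{FP}_c| + \tfrac{1}{2}\,|\mathit{FN}_c|},
\qquad
\mathrm{VPQ} = \frac{1}{|K|}\sum_{k \in K} \mathrm{VPQ}^{k},
\]

where $\mathcal{C}$ is the set of classes and $K$ the set of temporal window sizes.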
This list is automatically generated from the titles and abstracts of the papers listed on this site.