ASOD60K: Audio-Induced Salient Object Detection in Panoramic Videos
- URL: http://arxiv.org/abs/2107.11629v1
- Date: Sat, 24 Jul 2021 15:14:20 GMT
- Title: ASOD60K: Audio-Induced Salient Object Detection in Panoramic Videos
- Authors: Yi Zhang, Fang-Yi Chao, Ge-Peng Ji, Deng-Ping Fan, Lu Zhang, Ling Shao
- Abstract summary: We propose PV-SOD, a new task that aims to segment salient objects from panoramic videos.
In contrast to existing fixation-level or object-level saliency detection tasks, we focus on multi-modal salient object detection (SOD).
We collect the first large-scale dataset, named ASOD60K, which contains 4K-resolution video frames annotated with a six-level hierarchy.
- Score: 79.05486554647918
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Exploring to what humans pay attention in dynamic panoramic scenes is useful
for many fundamental applications, including augmented reality (AR) in retail,
AR-powered recruitment, and visual language navigation. With this goal in mind,
we propose PV-SOD, a new task that aims to segment salient objects from
panoramic videos. In contrast to existing fixation-level or object-level
saliency detection tasks, we focus on multi-modal salient object detection
(SOD), which mimics human attention mechanism by segmenting salient objects
with the guidance of audio-visual cues. To support this task, we collect the
first large-scale dataset, named ASOD60K, which contains 4K-resolution video
frames annotated with a six-level hierarchy, distinguished by its richness,
diversity, and quality. Specifically, each sequence is marked with
both its super-/sub-class, with objects of each sub-class being further
annotated with human eye fixations, bounding boxes, object-/instance-level
masks, and associated attributes (e.g., geometrical distortion). These
coarse-to-fine annotations enable detailed analysis for PV-SOD modeling, e.g.,
determining the major challenges for existing SOD models, and predicting
scanpaths to study the long-term eye fixation behaviors of humans. We
systematically benchmark 11 representative approaches on ASOD60K and derive
several interesting findings. We hope this study could serve as a good starting
point for advancing SOD research towards panoramic videos.
Related papers
- 3D-Aware Instance Segmentation and Tracking in Egocentric Videos [107.10661490652822]
Egocentric videos present unique challenges for 3D scene understanding.
This paper introduces a novel approach to instance segmentation and tracking in first-person video.
By incorporating spatial and temporal cues, we achieve superior performance compared to state-of-the-art 2D approaches.
arXiv Detail & Related papers (2024-08-19T10:08:25Z)
- Learning Spatial-Semantic Features for Robust Video Object Segmentation [108.045326229865]
We propose a robust video object segmentation framework equipped with spatial-semantic features and discriminative object queries.
We show that the proposed method sets new state-of-the-art performance on multiple datasets.
arXiv Detail & Related papers (2024-07-10T15:36:00Z)
- Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z)
- Segmenting Moving Objects via an Object-Centric Layered Representation [100.26138772664811]
We introduce an object-centric segmentation model with a depth-ordered layer representation.
We introduce a scalable pipeline for generating synthetic training data with multiple objects.
We evaluate the model on standard video segmentation benchmarks.
arXiv Detail & Related papers (2022-07-05T17:59:43Z)
- Recent Trends in 2D Object Detection and Applications in Video Event Recognition [0.76146285961466]
We discuss the pioneering works in object detection, followed by the recent breakthroughs that employ deep learning.
We highlight recent datasets for 2D object detection both in images and videos, and present a comparative performance summary of various state-of-the-art object detection techniques.
arXiv Detail & Related papers (2022-02-07T14:15:11Z)
- Benchmarking Unsupervised Object Representations for Video Sequences [111.81492107649889]
We compare the perceptual abilities of four object-centric approaches: ViMON, OP3, TBA and SCALOR.
Our results suggest that the architectures with unconstrained latent representations learn more powerful representations in terms of object detection, segmentation and tracking.
Our benchmark may provide fruitful guidance towards learning more robust object-centric video representations.
arXiv Detail & Related papers (2020-06-12T09:37:24Z)
- MOPT: Multi-Object Panoptic Tracking [33.77171216778909]
We introduce a novel perception task denoted as multi-object panoptic tracking (MOPT).
MOPT allows for exploiting pixel-level semantic information of 'thing' and 'stuff' classes, temporal coherence, and pixel-level associations over time.
We present extensive quantitative and qualitative evaluations of both vision-based and LiDAR-based MOPT that demonstrate encouraging results.
arXiv Detail & Related papers (2020-04-17T11:45:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.