360+x: A Panoptic Multi-modal Scene Understanding Dataset
- URL: http://arxiv.org/abs/2404.00989v2
- Date: Mon, 8 Apr 2024 02:37:25 GMT
- Title: 360+x: A Panoptic Multi-modal Scene Understanding Dataset
- Authors: Hao Chen, Yuqi Hou, Chenyuan Qu, Irene Testini, Xiaohan Hong, Jianbo Jiao
- Abstract summary: 360+x is the first database that covers multiple viewpoints with multiple data modalities to mimic how daily information is accessed in the real world.
- Score: 13.823967656097146
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Human perception of the world is shaped by a multitude of viewpoints and modalities. While many existing datasets focus on scene understanding from a certain perspective (e.g. egocentric or third-person views), our dataset offers a panoptic perspective (i.e. multiple viewpoints with multiple data modalities). Specifically, we encapsulate third-person panoramic and front views, as well as egocentric monocular/binocular views, with rich modalities including video, multi-channel audio, directional binaural delay, location data, and textual scene descriptions within each captured scene, presenting a comprehensive observation of the world. Figure 1 offers a glimpse of all 28 scene categories of our 360+x dataset. To the best of our knowledge, this is the first database that covers multiple viewpoints with multiple data modalities to mimic how daily information is accessed in the real world. Through our benchmark analysis, we present five scene understanding tasks on the proposed 360+x dataset to evaluate the impact and benefit of each data modality and perspective in panoptic scene understanding. We hope this unique dataset can broaden the scope of comprehensive scene understanding and encourage the community to approach these problems from more diverse perspectives.
Related papers
- Which Viewpoint Shows it Best? Language for Weakly Supervising View Selection in Multi-view Videos [66.1935609072708]
The key hypothesis is that the more accurately an individual view can predict a view-agnostic text summary, the more informative it is.
We propose a framework that uses the relative accuracy of view-dependent caption predictions as a proxy for best view pseudo-labels.
During inference, our model takes as input only a multi-view video -- no language or camera poses -- and returns the best viewpoint to watch at each timestep.
arXiv Detail & Related papers (2024-11-13T16:31:08Z) - 360 in the Wild: Dataset for Depth Prediction and View Synthesis [66.58513725342125]
We introduce a large-scale 360° video dataset captured in the wild.
The dataset has been carefully scraped from the Internet and captured at various locations worldwide.
Each of the 25K images constituting our dataset is provided with its respective camera's pose and depth map.
arXiv Detail & Related papers (2024-06-27T05:26:38Z) - Video Question Answering for People with Visual Impairments Using an Egocentric 360-Degree Camera [2.427410108595295]
This paper addresses the daily challenges encountered by visually impaired individuals, such as limited access to information, navigation difficulties, and barriers to social interaction.
To alleviate these challenges, we introduce a novel visual question answering dataset.
It features videos captured using a 360-degree egocentric wearable camera, enabling observation of the entire surroundings.
arXiv Detail & Related papers (2024-05-30T08:02:05Z) - Panonut360: A Head and Eye Tracking Dataset for Panoramic Video [0.0]
We present a head and eye tracking dataset involving 50 users watching 15 panoramic videos.
The dataset provides details on the viewport and gaze attention locations of users.
Our analysis reveals a consistent downward offset in gaze fixations relative to the Field of View.
arXiv Detail & Related papers (2024-03-26T13:54:52Z) - POV: Prompt-Oriented View-Agnostic Learning for Egocentric Hand-Object Interaction in the Multi-View World [59.545114016224254]
Humans are good at translating third-person observations of hand-object interactions into an egocentric view.
We propose a Prompt-Oriented View-agnostic learning framework, which enables this view adaptation with few egocentric videos.
arXiv Detail & Related papers (2024-03-09T09:54:44Z) - EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI [88.03089807278188]
EmbodiedScan is a multi-modal, ego-centric 3D perception dataset and benchmark for holistic 3D scene understanding.
It encompasses over 5k scans with 1M ego-centric RGB-D views, 1M language prompts, and 160k 3D-oriented boxes spanning over 760 categories.
Building upon this database, we introduce a baseline framework named Embodied Perceptron.
It is capable of processing an arbitrary number of multi-modal inputs and demonstrates remarkable 3D perception capabilities.
arXiv Detail & Related papers (2023-12-26T18:59:11Z) - The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World [71.52132776748628]
We present the All-Seeing (AS) project: a large-scale data and model for recognizing and understanding everything in the open world.
We create a new dataset (AS-1B) with over 1 billion regions annotated with semantic tags, question-answering pairs, and detailed captions.
We develop the All-Seeing model (ASM), a unified framework for panoptic visual recognition and understanding.
arXiv Detail & Related papers (2023-08-03T17:59:47Z) - FSVVD: A Dataset of Full Scene Volumetric Video [2.9151420469958533]
In this paper, we focus on the most widely used data format, the point cloud, and release the first full-scene volumetric video dataset.
A comprehensive description and analysis of the dataset are provided, along with its potential uses.
arXiv Detail & Related papers (2023-03-07T02:31:08Z) - Unsupervised Learning of Compositional Scene Representations from Multiple Unspecified Viewpoints [41.07379505694274]
We consider a novel problem of learning compositional scene representations from multiple unspecified viewpoints without using any supervision.
We propose a deep generative model which separates latent representations into a viewpoint-independent part and a viewpoint-dependent part to solve this problem.
Experiments on several specifically designed synthetic datasets have shown that the proposed method is able to effectively learn from multiple unspecified viewpoints.
arXiv Detail & Related papers (2021-12-07T08:45:21Z) - Multiview Detection with Feature Perspective Transformation [59.34619548026885]
We propose a novel multiview detection system, MVDet.
We take an anchor-free approach to aggregate multiview information by projecting feature maps onto the ground plane.
Our entire model is end-to-end learnable and achieves 88.2% MODA on the standard Wildtrack dataset.
arXiv Detail & Related papers (2020-07-14T17:58:30Z)
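The MVDet entry above describes aggregating multiview information by projecting per-camera feature maps onto a common ground plane. The core geometric operation is a perspective (homography) transform from image-plane coordinates to ground-plane coordinates. The sketch below illustrates that transform on point coordinates only; the homography values and function names are illustrative assumptions, not taken from the MVDet implementation.

```python
import numpy as np

# Hypothetical 3x3 homography mapping image-plane pixel coordinates (u, v)
# to ground-plane world coordinates. The values are illustrative only;
# in practice the matrix comes from camera calibration.
H = np.array([
    [0.02, 0.0,   -5.0],
    [0.0,  0.03,  -4.0],
    [0.0,  0.001,  1.0],
])

def project_to_ground(points_uv: np.ndarray, H: np.ndarray) -> np.ndarray:
    """Apply a homography to (u, v) pixel coordinates.

    points_uv: (N, 2) array of pixel coordinates.
    Returns an (N, 2) array of ground-plane coordinates.
    """
    ones = np.ones((points_uv.shape[0], 1))
    homo = np.hstack([points_uv, ones])           # lift to homogeneous coords
    projected = homo @ H.T                        # apply the homography
    return projected[:, :2] / projected[:, 2:3]   # divide out the w component

pts = np.array([[320.0, 240.0], [100.0, 400.0]])
ground = project_to_ground(pts, H)
```

Applying this transform per camera places all feature maps in a shared ground-plane frame, where they can simply be stacked and convolved; that is what makes the aggregation anchor-free and end-to-end learnable.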
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.