FSVVD: A Dataset of Full Scene Volumetric Video
- URL: http://arxiv.org/abs/2303.03599v2
- Date: Mon, 17 Apr 2023 08:50:55 GMT
- Title: FSVVD: A Dataset of Full Scene Volumetric Video
- Authors: Kaiyuan Hu, Yili Jin, Haowen Yang, Junhua Liu, Fangxin Wang
- Abstract summary: In this paper, we focus on the currently most widely used data format, the point cloud, and for the first time release a full-scene volumetric video dataset.
A comprehensive dataset description and analysis are provided, along with potential usages of this dataset.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent years have witnessed a rapid development of immersive multimedia which
bridges the gap between the real world and virtual space. Volumetric videos, as
an emerging representative 3D video paradigm that empowers extended reality,
stand out by providing an unprecedented immersive and interactive viewing
experience. Despite this tremendous potential, research on 3D volumetric video
is still in its infancy and relies on sufficient, complete datasets for further
exploration. However, existing volumetric video datasets mostly include only a
single object, lacking the surrounding scene and the interactions between
objects and that scene. In this paper, we focus on the currently most widely
used data format, the point cloud, and for the first time release a full-scene
volumetric video dataset that includes multiple people and their daily
activities interacting with the external environment. A comprehensive dataset
description and analysis are provided, along with potential usages of this
dataset. The dataset and additional tools can be accessed via the following
website: https://cuhksz-inml.github.io/full_scene_volumetric_video_dataset/.
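Since the dataset is released as point cloud frames, the minimal sketch below shows how a single frame might be inspected. It assumes the frames are distributed as PLY files and that Open3D is available; the file name frame_0000.ply is a hypothetical placeholder, and the actual file layout is documented on the dataset website above.

```python
# Minimal sketch for inspecting one volumetric video frame stored as a point
# cloud. Assumes a PLY file named "frame_0000.ply" (hypothetical placeholder);
# consult the dataset website for the real file naming and format.
import numpy as np
import open3d as o3d  # pip install open3d


def inspect_frame(path: str) -> None:
    pcd = o3d.io.read_point_cloud(path)        # load the point cloud frame
    points = np.asarray(pcd.points)            # (N, 3) XYZ coordinates
    print(f"number of points: {points.shape[0]}")
    if pcd.has_colors():
        colors = np.asarray(pcd.colors)        # (N, 3) RGB values in [0, 1]
        print(f"mean color: {colors.mean(axis=0)}")
    # The axis-aligned bounding box gives a rough sense of the scene extent.
    bbox = pcd.get_axis_aligned_bounding_box()
    print(f"scene extent (x, y, z): {bbox.get_extent()}")


if __name__ == "__main__":
    inspect_frame("frame_0000.ply")  # hypothetical path
```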
Related papers
- CinePile: A Long Video Question Answering Dataset and Benchmark
We present a novel dataset and benchmark, CinePile, specifically designed for authentic long-form video understanding.
Our comprehensive dataset comprises 305,000 multiple-choice questions (MCQs), covering various visual and multimodal aspects.
We fine-tuned open-source Video-LLMs on the training split and evaluated both open-source and proprietary video-centric LLMs on the test split of our dataset.
arXiv Detail & Related papers (2024-05-14T17:59:02Z)
- Panonut360: A Head and Eye Tracking Dataset for Panoramic Video
We present a head and eye tracking dataset involving 50 users watching 15 panoramic videos.
The dataset provides details on the viewport and gaze attention locations of users.
Our analysis reveals a consistent downward offset in gaze fixations relative to the Field of View.
arXiv Detail & Related papers (2024-03-26T13:54:52Z)
- EasyVolcap: Accelerating Neural Volumetric Video Research
Volumetric video is a technology that digitally records dynamic events such as artistic performances, sporting events, and remote conversations.
EasyVolcap is a Python & PyTorch library for unifying the process of multi-view data processing, 4D scene reconstruction, and efficient dynamic volumetric video rendering.
arXiv Detail & Related papers (2023-12-11T17:59:46Z)
- VEATIC: Video-based Emotion and Affect Tracking in Context Dataset
We introduce a brand-new large dataset, the Video-based Emotion and Affect Tracking in Context dataset (VEATIC).
VEATIC has 124 video clips from Hollywood movies, documentaries, and home videos with continuous valence and arousal ratings of each frame via real-time annotation.
Along with the dataset, we propose a new computer vision task to infer the affect of the selected character via both context and character information in each video frame.
arXiv Detail & Related papers (2023-09-13T06:31:35Z)
- MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions
We present MAD (Movie Audio Descriptions), a novel benchmark that departs from the paradigm of augmenting existing video datasets with text annotations.
MAD contains over 384,000 natural language sentences grounded in over 1,200 hours of video and exhibits a significant reduction in the currently diagnosed biases for video-language grounding datasets.
arXiv Detail & Related papers (2021-12-01T11:47:09Z)
- Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions
We present the Spoken Moments dataset of 500k spoken captions each attributed to a unique short video depicting a broad range of different events.
We show that our AMM approach consistently improves our results and that models trained on our Spoken Moments dataset generalize better than those trained on other video-caption datasets.
arXiv Detail & Related papers (2021-05-10T16:30:46Z)
- QuerYD: A video dataset with high-quality text and audio narrations
We introduce QuerYD, a new large-scale dataset for retrieval and event localisation in video.
A unique feature of our dataset is the availability of two audio tracks for each video: the original audio, and a high-quality spoken description.
The dataset is based on YouDescribe, a volunteer project that assists visually-impaired people by attaching voiced narrations to existing YouTube videos.
arXiv Detail & Related papers (2020-11-22T17:33:44Z)
- The IKEA ASM Dataset: Understanding People Assembling Furniture through Actions, Objects and Pose
IKEA ASM is a three million frame, multi-view, furniture assembly video dataset that includes depth, atomic actions, object segmentation, and human pose.
We benchmark prominent methods for video action recognition, object segmentation and human pose estimation tasks on this challenging dataset.
The dataset enables the development of holistic methods, which integrate multi-modal and multi-view data to better perform on these tasks.
arXiv Detail & Related papers (2020-07-01T11:34:46Z)
- Learning Disentangled Representations of Video with Missing Data
We present Disentangled Imputed Video autoEncoder (DIVE), a deep generative model that imputes and predicts future video frames in the presence of missing data.
Specifically, DIVE introduces a missingness latent variable and disentangles the hidden video representations into static and dynamic appearance, pose, and missingness factors for each object.
On a moving MNIST dataset with various missing-data scenarios, DIVE outperforms state-of-the-art baselines by a substantial margin.
arXiv Detail & Related papers (2020-06-23T23:54:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.