A New People-Object Interaction Dataset and NVS Benchmarks
- URL: http://arxiv.org/abs/2409.12980v1
- Date: Tue, 3 Sep 2024 08:54:15 GMT
- Title: A New People-Object Interaction Dataset and NVS Benchmarks
- Authors: Shuai Guo, Houqiang Zhong, Qiuwen Wang, Ziyu Chen, Yijie Gao, Jiajing Yuan, Chenyu Zhang, Rong Xie, Li Song
- Abstract summary: We introduce a new people-object interaction dataset that comprises 38 series of 30-view multi-person or single-person RGB-D video sequences.
Video sequences are captured by 30 Kinect Azures uniformly surrounding the scene, each at 4K resolution and 25 FPS, lasting 1$\sim$19 seconds.
- Score: 16.909004722367644
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recently, NVS (novel view synthesis) in human-object interaction scenes has received increasing attention. Existing human-object interaction datasets mainly consist of static data with limited views, offering only RGB images or videos, mostly containing interactions between a single person and objects. Moreover, these datasets exhibit complexities in lighting environments, poor synchronization, and low resolution, hindering high-quality human-object interaction studies. In this paper, we introduce a new people-object interaction dataset that comprises 38 series of 30-view multi-person or single-person RGB-D video sequences, accompanied by camera parameters, foreground masks, SMPL models, some point clouds, and mesh files. Video sequences are captured by 30 Kinect Azures uniformly surrounding the scene, each at 4K resolution and 25 FPS, lasting 1$\sim$19 seconds. Meanwhile, we evaluate some SOTA NVS models on our dataset to establish the NVS benchmarks. We hope our work can inspire further research in human-object interaction.
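The dataset ships camera parameters, foreground masks, SMPL models, point clouds, and meshes alongside the 30-view RGB-D video. As a minimal sketch of how one synchronized frame might be loaded, assuming a purely hypothetical directory layout and file naming (the abstract does not specify the release format, and depth in millimetres is an assumption based on Kinect Azure conventions):

```python
import json
from pathlib import Path

import imageio.v3 as iio
import numpy as np

# Hypothetical layout -- the paper does not publish a directory spec:
#   seq_001/view_00/color/000000.jpg   RGB frame (4K)
#   seq_001/view_00/depth/000000.png   16-bit depth, assumed millimetres
#   seq_001/view_00/camera.json        intrinsics K (3x3), extrinsics T (4x4)

def load_frame(seq_dir: Path, frame_idx: int, num_views: int = 30) -> list[dict]:
    """Load one synchronized frame across all 30 views of a sequence."""
    views = []
    for v in range(num_views):
        view_dir = seq_dir / f"view_{v:02d}"
        cam = json.loads((view_dir / "camera.json").read_text())
        views.append({
            "K": np.asarray(cam["K"], dtype=np.float64),   # 3x3 intrinsics
            "T": np.asarray(cam["T"], dtype=np.float64),   # 4x4 camera pose
            "color": iio.imread(view_dir / "color" / f"{frame_idx:06d}.jpg"),
            "depth": iio.imread(view_dir / "depth" / f"{frame_idx:06d}.png")
                     .astype(np.float32) / 1000.0,         # mm -> metres (assumed)
        })
    return views
```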
Related papers
- iTACO: Interactable Digital Twins of Articulated Objects from Casually Captured RGBD Videos [52.398752421673144]
We focus on motion analysis and part-level segmentation of an articulated object from a casually captured RGBD video shot with a hand-held camera.
A casually captured video of an interaction with an articulated object is easy to obtain at scale using smartphones.
We introduce iTACO: a coarse-to-fine framework that infers joint parameters and segments movable parts of the object from a dynamic RGBD video.
arXiv Detail & Related papers (2025-06-10T01:41:46Z) - CinePile: A Long Video Question Answering Dataset and Benchmark [55.30860239555001]
We present a novel dataset and benchmark, CinePile, specifically designed for authentic long-form video understanding.
Our comprehensive dataset comprises 305,000 multiple-choice questions (MCQs), covering various visual and multimodal aspects.
We fine-tuned open-source Video-LLMs on the training split and evaluated both open-source and proprietary video-centric LLMs on the test split of our dataset.
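For concreteness, one record in such an MCQ benchmark could look like the following; this is a purely hypothetical schema and set of field names, since the summary above does not specify CinePile's release format:

```python
# Hypothetical MCQ record; CinePile's actual field names may differ.
example_mcq = {
    "video_id": "clip_00042",
    "question": "Why does the character leave the room after the phone call?",
    "choices": [
        "To follow the person seen outside",
        "To answer the door",
        "To retrieve the dropped letter",
        "To avoid the argument",
        "To turn off the alarm",
    ],
    "answer_index": 3,                  # index into `choices`
    "category": "narrative reasoning",  # e.g. visual, temporal, multimodal
}
```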
arXiv Detail & Related papers (2024-05-14T17:59:02Z) - HOI-M3: Capture Multiple Humans and Objects Interaction within Contextual Environment [43.6454394625555]
HOI-M3 is a novel large-scale dataset for modeling the interactions of Multiple huMans and Multiple objects.
It provides accurate 3D tracking for both humans and objects from dense RGB and object-mounted IMU inputs.
arXiv Detail & Related papers (2024-03-30T09:24:25Z) - InterCap: Joint Markerless 3D Tracking of Humans and Objects in Interaction [0.0]
InterCap reconstructs whole-bodies and objects from multi-view RGB-D data.
Azure Kinect sensors allow us to set up a simple multi-view RGB-D capture system.
InterCap has 223 RGB-D videos, resulting in 67,357 multi-view frames, each containing 6 RGB-D images.
arXiv Detail & Related papers (2022-09-26T00:46:49Z) - Learning Dynamic View Synthesis With Few RGBD Cameras [60.36357774688289]
We propose to utilize RGBD cameras to synthesize free-viewpoint videos of dynamic indoor scenes.
We generate point clouds from RGBD frames and then render them into free-viewpoint videos via a neural feature.
We introduce a simple Regional Depth-Inpainting module that adaptively inpaints missing depth values to render complete novel views.
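The point-cloud generation step described above reduces to standard pinhole back-projection of each depth pixel. A minimal NumPy sketch of that geometry (generic textbook math, not the authors' code), assuming a 3x3 intrinsic matrix K and metric depth:

```python
import numpy as np

def depth_to_pointcloud(depth: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Back-project an (H, W) metric depth map into an (N, 3) point cloud
    in the camera frame, given intrinsics K = [[fx,0,cx],[0,fy,cy],[0,0,1]]."""
    h, w = depth.shape
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # per-pixel column/row indices
    x = (u - cx) * depth / fx                       # X = (u - cx) * Z / fx
    y = (v - cy) * depth / fy                       # Y = (v - cy) * Z / fy
    points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]                 # drop invalid zero-depth pixels
```

Per-camera clouds can then be transformed into a common world frame with each camera's extrinsics before rendering.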
arXiv Detail & Related papers (2022-04-22T03:17:35Z) - BEHAVE: Dataset and Method for Tracking Human Object Interactions [105.77368488612704]
We present the first full-body human-object interaction dataset with multi-view RGBD frames and corresponding 3D SMPL and object fits, along with the annotated contacts between them.
We use this data to learn a model that can jointly track humans and objects in natural environments with an easy-to-use portable multi-camera setup.
arXiv Detail & Related papers (2022-04-14T13:21:19Z) - EgoBody: Human Body Shape, Motion and Social Interactions from Head-Mounted Devices [76.50816193153098]
EgoBody is a novel large-scale dataset for social interactions in complex 3D scenes.
We employ Microsoft HoloLens2 headsets to record rich egocentric data streams including RGB, depth, eye gaze, head and hand tracking.
To obtain accurate 3D ground-truth, we calibrate the headset with a multi-Kinect rig and fit expressive SMPL-X body meshes to multi-view RGB-D frames.
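Fitting SMPL-X to calibrated multi-view RGB-D observations is typically posed as an energy minimization; a generic formulation of that standard approach (a sketch, not necessarily EgoBody's exact objective) is

$$E(\theta,\beta)=\sum_{c=1}^{C}\sum_{j}\rho\big(\Pi_c(J_j(\theta,\beta))-\hat{x}_{c,j}\big)+\lambda_{\mathrm{d}}\,E_{\mathrm{depth}}+\lambda_{\mathrm{p}}\,E_{\mathrm{prior}}(\theta),$$

where $\theta,\beta$ are pose and shape parameters, $J_j$ the model's 3D joints, $\Pi_c$ the projection of calibrated camera $c$, $\hat{x}_{c,j}$ detected 2D keypoints, $\rho$ a robust penalty, $E_{\mathrm{depth}}$ a mesh-to-point-cloud distance term, and $E_{\mathrm{prior}}$ a pose prior.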
arXiv Detail & Related papers (2021-12-14T18:41:28Z) - Spatio-Temporal Interaction Graph Parsing Networks for Human-Object Interaction Recognition [55.7731053128204]
In a given video-based Human-Object Interaction scene, modeling the temporal relationship between humans and objects is an important cue for understanding the contextual information presented in the video.
With effective spatio-temporal relationship modeling, it is possible not only to uncover contextual information in each frame but also to directly capture inter-time dependencies.
Making full use of appearance features, spatial locations, and semantic information is also key to improving video-based Human-Object Interaction recognition performance.
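Such a spatio-temporal graph can be made concrete with per-frame human/object nodes carrying appearance, spatial, and semantic features, spatial edges within a frame, and temporal edges linking the same entity across frames. The following is a hypothetical data-structure sketch, not the paper's implementation:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    frame: int
    entity_id: int                 # same id tracks an entity across frames
    kind: str                      # "human" or "object"
    bbox: tuple                    # (x1, y1, x2, y2) spatial location
    appearance: list = field(default_factory=list)  # e.g. a CNN feature vector
    semantic: str = ""             # e.g. object class label

@dataclass
class InteractionGraph:
    nodes: list = field(default_factory=list)
    spatial_edges: list = field(default_factory=list)   # (i, j) within one frame
    temporal_edges: list = field(default_factory=list)  # (i, j) across frames

    def connect(self) -> None:
        """Add intra-frame spatial edges and inter-frame temporal edges."""
        for i, a in enumerate(self.nodes):
            for j in range(i + 1, len(self.nodes)):
                b = self.nodes[j]
                if a.frame == b.frame:
                    self.spatial_edges.append((i, j))
                elif a.entity_id == b.entity_id and abs(a.frame - b.frame) == 1:
                    self.temporal_edges.append((i, j))
```

A message-passing network would then propagate features along these two edge types to fuse per-frame context with inter-time dependencies.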
arXiv Detail & Related papers (2021-08-19T11:57:27Z) - D3D-HOI: Dynamic 3D Human-Object Interactions from Videos [49.38319295373466]
We introduce D3D-HOI: a dataset of monocular videos with ground truth annotations of 3D object pose, shape and part motion during human-object interactions.
Our dataset consists of several common articulated objects captured from diverse real-world scenes and camera viewpoints.
We leverage the estimated 3D human pose for more accurate inference of the object spatial layout and dynamics.
arXiv Detail & Related papers (2021-08-19T00:49:01Z)