Related papers: Multimodal Datasets and Benchmarks for Reasoning about Dynamic Spatio-Temporality in Everyday Environments

Multimodal Datasets and Benchmarks for Reasoning about Dynamic Spatio-Temporality in Everyday Environments

URL: http://arxiv.org/abs/2408.11347v2
Date: Tue, 17 Sep 2024 01:30:36 GMT
Title: Multimodal Datasets and Benchmarks for Reasoning about Dynamic Spatio-Temporality in Everyday Environments
Authors: Takanori Ugai, Kensho Hara, Shusaku Egami, Ken Fukuda,
Abstract summary: Our dataset measures the extent to which a robot can understand human behavior and the environment in a home setting. Preliminary experiments suggest our dataset is useful in measuring AI's comprehension of daily life.
Score: 4.024850952459759
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We used a 3D simulator to create artificial video data with standardized annotations, aiming to aid in the development of Embodied AI. Our question answering (QA) dataset measures the extent to which a robot can understand human behavior and the environment in a home setting. Preliminary experiments suggest our dataset is useful in measuring AI's comprehension of daily life. \end{abstract}

Related papers

Perceiving and Acting in First-Person: A Dataset and Benchmark for Egocentric Human-Object-Human Interactions [110.43343503158306]
This paper embeds the manual-assisted task into a vision-language-action framework, where the assistant provides services to the instructor following egocentric vision and commands.<n>Under this setting, we accomplish InterVLA, the first large-scale human-object-human interaction dataset with 11.4 hours and 1.2M frames of multimodal data.<n>We establish novel benchmarks on egocentric human motion estimation, interaction synthesis, and interaction prediction with comprehensive analysis.
arXiv Detail & Related papers (2025-08-06T17:46:23Z)
Multi-Transmotion: Pre-trained Model for Human Motion Prediction [68.87010221355223]
Multi-Transmotion is an innovative transformer-based model designed for cross-modality pre-training. Our methodology demonstrates competitive performance across various datasets on several downstream tasks.
arXiv Detail & Related papers (2024-11-04T23:15:21Z)
RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots [25.650235551519952]
We present RoboCasa, a large-scale simulation framework for training generalist robots in everyday environments. We provide thousands of 3D assets across over 150 object categories and dozens of interactable furniture and appliances. Our experiments show a clear scaling trend in using synthetically generated robot data for large-scale imitation learning.
arXiv Detail & Related papers (2024-06-04T17:41:31Z)
Interactive Planning Using Large Language Models for Partially Observable Robotics Tasks [54.60571399091711]
Large Language Models (LLMs) have achieved impressive results in creating robotic agents for performing open vocabulary tasks. We present an interactive planning technique for partially observable tasks using LLMs.
arXiv Detail & Related papers (2023-12-11T22:54:44Z)
Care3D: An Active 3D Object Detection Dataset of Real Robotic-Care Environments [52.425280825457385]
This paper introduces an annotated dataset of real environments. The captured environments represent areas which are already in use in the field of robotic health care research. We also provide ground truth data within one room, for assessing SLAM algorithms running directly on a health care robot.
arXiv Detail & Related papers (2023-10-09T10:35:37Z)
Development of a Realistic Crowd Simulation Environment for Fine-grained Validation of People Tracking Methods [0.7223361655030193]
This work develops an extension of crowd simulation (named CrowdSim2) and prove its usability in the application of people-tracking algorithms. The simulator is developed using the very popular Unity 3D engine with particular emphasis on the aspects of realism in the environment. Three methods of tracking were used to validate generated dataset: IOU-Tracker, Deep-Sort, and Deep-TAMA.
arXiv Detail & Related papers (2023-04-26T09:29:58Z)
HabitatDyn Dataset: Dynamic Object Detection to Kinematics Estimation [16.36110033895749]
We propose the dataset HabitatDyn, which contains both synthetic RGB videos, semantic labels, and depth information, as well as kinetics information. HabitatDyn was created from the perspective of a mobile robot with a moving camera, and contains 30 scenes featuring six different types of moving objects with varying velocities.
arXiv Detail & Related papers (2023-04-21T09:57:35Z)
Video-based Pose-Estimation Data as Source for Transfer Learning in Human Activity Recognition [71.91734471596433]
Human Activity Recognition (HAR) using on-body devices identifies specific human actions in unconstrained environments. Previous works demonstrated that transfer learning is a good strategy for addressing scenarios with scarce data. This paper proposes using datasets intended for human-pose estimation as a source for transfer learning.
arXiv Detail & Related papers (2022-12-02T18:19:36Z)
Palm up: Playing in the Latent Manifold for Unsupervised Pretraining [31.92145741769497]
We propose an algorithm that exhibits an exploratory behavior whilst it utilizes large diverse datasets. Our key idea is to leverage deep generative models that are pretrained on static datasets and introduce a dynamic model in the latent space. We then employ an unsupervised reinforcement learning algorithm to explore in this environment and perform unsupervised representation learning on the collected data.
arXiv Detail & Related papers (2022-10-19T22:26:12Z)
ProcTHOR: Large-Scale Embodied AI Using Procedural Generation [55.485985317538194]
ProcTHOR is a framework for procedural generation of Embodied AI environments. We demonstrate state-of-the-art results across 6 embodied AI benchmarks for navigation, rearrangement, and arm manipulation.
arXiv Detail & Related papers (2022-06-14T17:09:35Z)
AcinoSet: A 3D Pose Estimation Dataset and Baseline Models for Cheetahs in the Wild [51.35013619649463]
We present an extensive dataset of free-running cheetahs in the wild, called AcinoSet. The dataset contains 119,490 frames of multi-view synchronized high-speed video footage, camera calibration files and 7,588 human-annotated frames. The resulting 3D trajectories, human-checked 3D ground truth, and an interactive tool to inspect the data is also provided.
arXiv Detail & Related papers (2021-03-24T15:54:11Z)
Few-Shot Visual Grounding for Natural Human-Robot Interaction [0.0]
We propose a software architecture that segments a target object from a crowded scene, indicated verbally by a human user. At the core of our system, we employ a multi-modal deep neural network for visual grounding. We evaluate the performance of the proposed model on real RGB-D data collected from public scene datasets.
arXiv Detail & Related papers (2021-03-17T15:24:02Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.