Go Beyond Earth: Understanding Human Actions and Scenes in Microgravity Environments
- URL: http://arxiv.org/abs/2506.02845v3
- Date: Mon, 13 Oct 2025 08:22:34 GMT
- Title: Go Beyond Earth: Understanding Human Actions and Scenes in Microgravity Environments
- Authors: Di Wen, Lei Qi, Kunyu Peng, Kailun Yang, Fei Teng, Ao Luo, Jia Fu, Yufan Chen, Ruiping Liu, Yitian Shi, M. Saquib Sarfraz, Rainer Stiefelhagen
- Abstract summary: MicroG-4M is the first benchmark for semantic understanding of human activities in microgravity. The dataset includes 4,759 clips covering 50 actions, 1,238 context-rich captions, and over 7,000 question-answer pairs on astronaut activities and scene understanding. MicroG-4M supports three core tasks: fine-grained multi-label action recognition, temporal video captioning, and visual question answering.
- Score: 56.75900993615519
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite substantial progress in video understanding, most existing datasets are limited to Earth's gravitational conditions. However, microgravity alters human motion, interactions, and visual semantics, revealing a critical gap for real-world vision systems. This presents a challenge for domain-robust video understanding in safety-critical space applications. To address this, we introduce MicroG-4M, the first benchmark for spatio-temporal and semantic understanding of human activities in microgravity. Constructed from real-world space missions and cinematic simulations, the dataset includes 4,759 clips covering 50 actions, 1,238 context-rich captions, and over 7,000 question-answer pairs on astronaut activities and scene understanding. MicroG-4M supports three core tasks: fine-grained multi-label action recognition, temporal video captioning, and visual question answering, enabling a comprehensive evaluation of both spatial localization and semantic reasoning in microgravity contexts. We establish baselines using state-of-the-art models. All data, annotations, and code are available at https://github.com/LEI-QI-233/HAR-in-Space.
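As a rough illustration of the first core task, the snippet below sketches how multi-label action recognition over the 50 MicroG-4M action classes might be scored with mean average precision (mAP). The array shapes, the demo data, and the metric choice are assumptions for illustration, not the benchmark's official evaluation protocol (see the repository linked above for the actual code).

```python
# Minimal sketch: mAP for multi-label action recognition over 50 classes.
# Shapes, demo data, and metric choice are illustrative assumptions only.
import numpy as np
from sklearn.metrics import average_precision_score

NUM_ACTIONS = 50  # MicroG-4M covers 50 action classes

def mean_average_precision(scores: np.ndarray, labels: np.ndarray) -> float:
    """scores: (num_clips, 50) per-class confidences; labels: (num_clips, 50) multi-hot targets."""
    per_class_ap = []
    for c in range(NUM_ACTIONS):
        if labels[:, c].sum() == 0:
            continue  # skip classes with no positive clips in this split
        per_class_ap.append(average_precision_score(labels[:, c], scores[:, c]))
    return float(np.mean(per_class_ap))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    demo_labels = (rng.random((100, NUM_ACTIONS)) > 0.9).astype(int)        # fake multi-hot labels
    demo_scores = 0.7 * demo_labels + 0.3 * rng.random((100, NUM_ACTIONS))  # noisy fake scores
    print(f"mAP = {mean_average_precision(demo_scores, demo_labels):.3f}")
```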
Related papers
- Learning Situated Awareness in the Real World [63.75211123289058]
SAW-Bench is a novel benchmark for evaluating egocentric situated awareness using real-world videos. It probes a model's observer-centric understanding with six different awareness tasks. Our comprehensive evaluation reveals a human-model performance gap of 37.66%, even with the best-performing MFM, Gemini 3 Flash.
arXiv Detail & Related papers (2026-02-18T18:22:52Z) - Articulated 3D Scene Graphs for Open-World Mobile Manipulation [55.97942733699124]
We present MoMa-SG, a framework for building semantic-kinematic 3D scene graphs of articulated scenes. We estimate articulation models using a novel unified twist estimation formulation. We also introduce the novel Arti4D-Semantic dataset.
arXiv Detail & Related papers (2026-02-18T10:40:35Z) - Know-Show: Benchmarking Video-Language Models on Spatio-Temporal Grounded Reasoning [18.15310805625469]
We present Know-Show, a new benchmark designed to evaluate multimodal Video-Language Models (Video-LMs). Know-Show unifies reasoning and localization within a single evaluation framework consisting of five scenarios across spatial (person, object, person-object, and hand-object) and temporal dimensions. Built from Charades, Action Genome, and Ego4D with 2.5K human-language questions, the benchmark exposes significant gaps between current Video-LMs and human reasoning. To bridge this gap, we propose GRAM, a training-free plug-in that augments Video-LMs with fine-grained grounding.
arXiv Detail & Related papers (2025-12-05T08:15:49Z) - MASS: Motion-Aware Spatial-Temporal Grounding for Physics Reasoning and Comprehension in Vision-Language Models [45.450035386882824]
Vision Language Models (VLMs) perform well on standard video tasks but struggle with physics-driven reasoning involving motion dynamics and spatial interactions. We present an approach that addresses this gap by translating physical-world context cues into interpretable representations aligned with VLMs' perception, comprehension, and reasoning.
arXiv Detail & Related papers (2025-11-23T09:43:44Z) - Visual Grounding from Event Cameras [26.670030443187482]
We introduce Talk2Event, the first large-scale benchmark for language-driven object grounding using event data. Talk2Event comprises 5,567 scenes, 13,458 annotated objects, and more than 30,000 carefully validated referring expressions. We envision Talk2Event as a foundation for advancing multimodal and temporally-aware perception, with applications spanning robotics, human-AI interaction, and beyond.
arXiv Detail & Related papers (2025-09-11T16:21:59Z) - LRR-Bench: Left, Right or Rotate? Vision-Language models Still Struggle With Spatial Understanding Tasks [22.011855291417856]
It remains underexplored how Vision-Language Models (VLMs) recognize spatial relationships and perceive spatial movement. In this work, we introduce a spatial evaluation pipeline and construct a corresponding benchmark. We conduct experiments on multiple state-of-the-art VLMs and observe that there is significant room for improvement.
arXiv Detail & Related papers (2025-07-27T08:31:24Z) - EmbRACE-3K: Embodied Reasoning and Action in Complex Environments [48.32142591866083]
EmbRACE-3K is a dataset of over 3,000 language-guided tasks constructed using Unreal Engine and the UnrealCV-Zoo framework. We establish a benchmark to evaluate the embodied reasoning capabilities of VLMs across three key dimensions: Exploration, Dynamic Spatial-Semantic Reasoning, and Multi-stage Goal Execution. In zero-shot settings, all models achieve success rates below 20%, underscoring the challenge posed by our benchmark and the current limitations of VLMs in interactive environments.
arXiv Detail & Related papers (2025-07-14T17:59:46Z) - Martian World Models: Controllable Video Synthesis with Physically Accurate 3D Reconstructions [116.56517155163716]
We propose a data curation pipeline that reconstructs 3D Martian environments from real stereo navigation images. A Martian terrain video generator, MarsGen, synthesizes novel videos that are visually realistic and geometrically consistent with the 3D structure encoded in the data. Our approach outperforms video synthesis models trained on terrestrial datasets, achieving superior visual fidelity and 3D structural consistency.
arXiv Detail & Related papers (2025-07-10T17:54:27Z) - SURPRISE3D: A Dataset for Spatial Understanding and Reasoning in Complex 3D Scenes [105.8644620467576]
We introduce Surprise3D, a novel dataset designed to evaluate language-guided spatial reasoning segmentation in complex 3D scenes. Surprise3D consists of more than 200k vision-language pairs across 900+ detailed indoor scenes from ScanNet++ v2. The dataset contains 89k+ human-annotated spatial queries deliberately crafted without object names.
arXiv Detail & Related papers (2025-07-10T14:01:24Z) - HOSIG: Full-Body Human-Object-Scene Interaction Generation with Hierarchical Scene Perception [57.37135310143126]
HOSIG is a novel framework for synthesizing full-body interactions through hierarchical scene perception. Our framework supports unlimited motion length through autoregressive generation and requires minimal manual intervention. This work bridges the critical gap between scene-aware navigation and dexterous object manipulation.
arXiv Detail & Related papers (2025-06-02T12:08:08Z) - Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models [14.442394137843923]
We present a detailed analysis that first delineates the core elements of spatial reasoning. We then assess the performance of these models in both synthetic and real-world images.
arXiv Detail & Related papers (2025-03-25T14:34:06Z) - SpaceVLLM: Endowing Multimodal Large Language Model with Spatio-Temporal Video Grounding Capability [58.46310813774538]
Multimodal large language models (MLLMs) have made remarkable progress in either temporal or spatial localization. However, they struggle to perform spatio-temporal video grounding. This limitation stems from two major challenges. We introduce SpaceVLLM, an MLLM endowed with spatio-temporal video grounding capability.
arXiv Detail & Related papers (2025-03-18T07:40:36Z) - Free-form language-based robotic reasoning and grasping [9.866754994504324]
Vision-Language Models (VLMs) have demonstrated remarkable reasoning capabilities across both text and images. We propose a novel method, FreeGrasp, leveraging the pre-trained VLMs' world knowledge to reason about human instructions and object spatial arrangements. Our method detects all objects as keypoints and uses these keypoints to annotate marks on images, aiming to facilitate GPT-4o's zero-shot spatial reasoning.
arXiv Detail & Related papers (2025-03-17T11:41:16Z) - RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics [26.42651735582044]
We introduce RoboSpatial, a large-scale dataset for spatial understanding in robotics. It consists of real indoor and tabletop scenes, captured as 3D scans and egocentric images, and annotated with rich spatial information relevant to robotics. Our experiments show that models trained with RoboSpatial outperform baselines on downstream tasks such as spatial affordance prediction, spatial relationship prediction, and robot manipulation.
arXiv Detail & Related papers (2024-11-25T16:21:34Z) - Space-LLaVA: a Vision-Language Model Adapted to Extraterrestrial Applications [14.89043819048682]
We see three core challenges in the future of space robotics that motivate building a foundation model (FM) for space robotics. As a first step towards a space foundation model, we augment three extraterrestrial databases with fine-grained annotations. We fine-tune a Vision-Language Model to adapt to the semantic features in an extraterrestrial environment.
arXiv Detail & Related papers (2024-08-12T05:07:24Z) - Nymeria: A Massive Collection of Multimodal Egocentric Daily Motion in the Wild [66.34146236875822]
The Nymeria dataset is a large-scale, diverse, richly annotated human motion dataset collected in the wild with multiple multimodal egocentric devices.
It contains 1200 recordings of 300 hours of daily activities from 264 participants across 50 locations, travelling a total of 399 km.
The motion-language descriptions provide 310.5K sentences in 8.64M words from a vocabulary size of 6545.
arXiv Detail & Related papers (2024-06-14T10:23:53Z) - CIRCLE: Capture In Rich Contextual Environments [69.97976304918149]
We propose a novel motion acquisition system in which the actor perceives and operates in a highly contextual virtual world.
We present CIRCLE, a dataset containing 10 hours of full-body reaching motion from 5 subjects across nine scenes.
We use this dataset to train a model that generates human motion conditioned on scene information.
arXiv Detail & Related papers (2023-03-31T09:18:12Z) - Towards Robust Monocular Visual Odometry for Flying Robots on Planetary Missions [49.79068659889639]
Ingenuity, which has just landed on Mars, will mark the beginning of a new era of exploration unhindered by traversability.
We present an advanced robust monocular odometry algorithm that uses efficient optical flow tracking.
We also present a novel approach to estimate the current risk of scale drift based on a principal component analysis of the relative translation information matrix.
arXiv Detail & Related papers (2021-09-12T12:52:20Z)
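The last entry above mentions estimating scale-drift risk via a principal component analysis of the relative translation information matrix. The sketch below is one speculative reading of that cue, not the paper's implementation: the eigenvalue spread of a hypothetical 3x3 translation information block is used as a proxy for how well the translation magnitude, and hence the metric scale, is constrained.

```python
# Speculative sketch of a scale-drift risk cue: eigen-decomposition (PCA) of a 3x3
# relative-translation information block, where a weakly constrained direction is read
# as elevated risk of scale drift. How the matrix is extracted from the odometry backend
# and how the score would be thresholded are assumptions, not the paper's method.
import numpy as np

def scale_drift_risk(translation_info: np.ndarray) -> float:
    """translation_info: symmetric 3x3 information (inverse-covariance) block of the
    relative translation. Returns a score in [0, 1]; higher means scale is less observable."""
    eigvals = np.linalg.eigvalsh(translation_info)  # ascending order, >= 0 for a valid block
    weakest, strongest = eigvals[0], eigvals[-1]
    # If the weakest principal direction carries almost no information compared to the
    # strongest, the translation magnitude (and hence the scale) is poorly constrained.
    return float(1.0 - weakest / max(strongest, 1e-12))

if __name__ == "__main__":
    well_constrained = np.diag([50.0, 48.0, 47.0])
    nearly_degenerate = np.diag([50.0, 45.0, 0.2])
    print(scale_drift_risk(well_constrained), scale_drift_risk(nearly_degenerate))
```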