ERMV: Editing 4D Robotic Multi-view images to enhance embodied agents
- URL: http://arxiv.org/abs/2507.17462v1
- Date: Wed, 23 Jul 2025 12:41:11 GMT
- Title: ERMV: Editing 4D Robotic Multi-view images to enhance embodied agents
- Authors: Chang Nie, Guangming Wang, Zhe Liu, Hesheng Wang
- Abstract summary: ERMV (Editing Robotic Multi-View 4D data) is a data augmentation framework that efficiently edits an entire multi-view sequence based on single-frame editing and robot state conditions. The augmented data significantly boosts the robustness and generalization of VLA models in both simulated and real-world environments.
- Score: 14.75400720374728
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Robot imitation learning relies on 4D multi-view sequential images. However, the high cost of data collection and the scarcity of high-quality data severely constrain the generalization and application of embodied intelligence policies like Vision-Language-Action (VLA) models. Data augmentation is a powerful strategy to overcome data scarcity, but methods for editing 4D multi-view sequential images for manipulation tasks are currently lacking. Thus, we propose ERMV (Editing Robotic Multi-View 4D data), a novel data augmentation framework that efficiently edits an entire multi-view sequence based on single-frame editing and robot state conditions. This task presents three core challenges: (1) maintaining geometric and appearance consistency across dynamic views and long time horizons; (2) expanding the working window with low computational costs; and (3) ensuring the semantic integrity of critical objects like the robot arm. ERMV addresses these challenges through a series of innovations. First, to ensure spatio-temporal consistency in motion blur, we introduce a novel Epipolar Motion-Aware Attention (EMA-Attn) mechanism that learns the pixel shifts caused by movement before applying geometric constraints. Second, to maximize the editing working window, ERMV pioneers a Sparse Spatio-Temporal (STT) module, which decouples the temporal and spatial views and recasts the sequence as a single-frame multi-view problem through sparse sampling of the views to reduce computational demands. Third, to alleviate error accumulation, we incorporate a feedback intervention mechanism, which uses a Multimodal Large Language Model (MLLM) to check editing inconsistencies and request targeted expert guidance only when necessary. Extensive experiments demonstrate that ERMV-augmented data significantly boosts the robustness and generalization of VLA models in both simulated and real-world environments.
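To make the EMA-Attn idea concrete, the sketch below shows one plausible reading of epipolar motion-aware attention: a small head predicts a per-query pixel shift from motion cues, the shift offsets the epipolar sampling locations in a reference view, and attention is then computed over the sampled features. All module names, tensor shapes, and the motion-embedding input are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of Epipolar Motion-Aware Attention (EMA-Attn):
# learn the motion-induced pixel shift first, then apply the epipolar constraint.
import torch
import torch.nn as nn
import torch.nn.functional as F

class EpipolarMotionAwareAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # Predicts a 2D shift per query (in normalized image coordinates)
        # from the query features and an assumed motion embedding.
        self.shift_head = nn.Sequential(nn.Linear(dim * 2, dim), nn.ReLU(), nn.Linear(dim, 2))
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)

    def forward(self, query_feat, motion_emb, ref_feat, epipolar_xy):
        """
        query_feat : (B, N, C)    features of query pixels in the edited view
        motion_emb : (B, N, C)    embedding of robot/camera motion (assumed input)
        ref_feat   : (B, C, H, W) feature map of a reference view
        epipolar_xy: (B, N, S, 2) normalized sample coords along each epipolar line
        """
        B, N, C = query_feat.shape
        # 1) learn the motion-induced shift before applying the geometric constraint
        shift = self.shift_head(torch.cat([query_feat, motion_emb], dim=-1))  # (B, N, 2)
        sample_xy = epipolar_xy + shift.unsqueeze(2)                          # offset epipolar samples
        # 2) sample reference features at the shifted epipolar locations
        sampled = F.grid_sample(ref_feat, sample_xy, align_corners=False)     # (B, C, N, S)
        sampled = sampled.permute(0, 2, 3, 1)                                 # (B, N, S, C)
        # 3) each query attends over its sampled epipolar features
        q = self.q_proj(query_feat).unsqueeze(2)                              # (B, N, 1, C)
        k, v = self.k_proj(sampled), self.v_proj(sampled)
        attn = torch.softmax((q * k).sum(-1) / C ** 0.5, dim=-1)              # (B, N, S)
        return (attn.unsqueeze(-1) * v).sum(2)                                # (B, N, C)
```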
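The Sparse Spatio-Temporal (STT) module is described as decoupling temporal and spatial views and sparsely sampling them so that a long sequence can be handled as a single-frame multi-view problem. The snippet below illustrates only the sampling step, under an assumed uniform-without-replacement policy; the paper's actual sampling strategy may differ.

```python
# Minimal sketch of sparse spatio-temporal view sampling: draw a fixed budget
# of (timestep, camera) pairs instead of attending over every view at every step.
import random

def sample_sparse_views(num_timesteps: int, num_cameras: int, budget: int, seed: int = 0):
    """Return a sparse, sorted set of (t, cam) indices covering the sequence."""
    rng = random.Random(seed)
    all_pairs = [(t, c) for t in range(num_timesteps) for c in range(num_cameras)]
    budget = min(budget, len(all_pairs))
    return sorted(rng.sample(all_pairs, budget))

# Example: a 30-step, 4-camera episode edited with a budget of 12 views
# instead of all 120, shrinking the attention working set considerably.
print(sample_sparse_views(num_timesteps=30, num_cameras=4, budget=12))
```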
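The feedback intervention mechanism can be pictured as a check-and-escalate loop: an MLLM inspects each edited frame for inconsistencies, and expert guidance is requested only when a problem is reported. The `query_mllm` and `request_expert_guidance` callables below are placeholders standing in for whatever model and intervention interface is actually used, not a real API.

```python
# Hedged sketch of the feedback-intervention idea: check with an MLLM,
# escalate to expert guidance only when an inconsistency is detected.
from typing import Callable

def feedback_intervention(edited_frame,
                          query_mllm: Callable[[object, str], str],
                          request_expert_guidance: Callable[[object], object]):
    prompt = ("Does this edited robot image keep the robot arm and task-relevant "
              "objects geometrically and semantically consistent? Answer yes or no.")
    verdict = query_mllm(edited_frame, prompt).strip().lower()
    if verdict.startswith("no"):
        # Inconsistency detected: request targeted expert guidance, then re-edit.
        return request_expert_guidance(edited_frame)
    return edited_frame  # consistent: no intervention, limiting error accumulation
```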
Related papers
- Auto-Regressively Generating Multi-View Consistent Images [10.513203377236744]
We propose the Multi-View Auto-Regressive (MV-AR) method to generate consistent multi-view images from arbitrary prompts. When generating widely-separated views, MV-AR can utilize all its preceding views to extract effective reference information. Experiments demonstrate the performance and versatility of MV-AR, which reliably generates consistent multi-view images.
arXiv Detail & Related papers (2025-06-23T11:28:37Z) - MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning [54.47710436807661]
MORSE-500 is a video benchmark composed of 500 fully scripted clips with embedded questions spanning six complementary reasoning categories. Each instance is generated using deterministic Python scripts (Manim, Matplotlib, MoviePy), generative video models, and real footage. Unlike static benchmarks that become obsolete once saturated, MORSE-500 is built to evolve.
arXiv Detail & Related papers (2025-06-05T19:12:45Z) - VideoMolmo: Spatio-Temporal Grounding Meets Pointing [66.19964563104385]
VideoMolmo is a model tailored for fine-grained pointing in video sequences. A novel temporal mask fusion module employs SAM2 for bidirectional point propagation. To evaluate the generalization of VideoMolmo, we introduce VPoMolS-temporal, a challenging out-of-distribution benchmark spanning five real-world scenarios.
arXiv Detail & Related papers (2025-06-05T17:59:29Z) - ORV: 4D Occupancy-centric Robot Video Generation [33.360345403049685]
Acquiring real-world robotic simulation data through teleoperation is notoriously time-consuming and labor-intensive. We propose ORV, an Occupancy-centric Robot Video generation framework, which utilizes 4D semantic occupancy sequences as a fine-grained representation. By leveraging occupancy-based representations, ORV enables seamless translation of simulation data into photorealistic robot videos, while ensuring high temporal consistency and precise controllability.
arXiv Detail & Related papers (2025-06-03T17:00:32Z) - ReAgent-V: A Reward-Driven Multi-Agent Framework for Video Understanding [71.654781631463]
ReAgent-V is a novel agentic video understanding framework. It integrates efficient frame selection with real-time reward generation during inference. Extensive experiments on 12 datasets demonstrate significant gains in generalization and reasoning.
arXiv Detail & Related papers (2025-06-02T04:23:21Z) - DiVE: Efficient Multi-View Driving Scenes Generation Based on Video Diffusion Transformer [56.98400572837792]
DiVE produces high-fidelity, temporally coherent, and cross-view consistent multi-view videos. These innovations collectively achieve a 2.62x speedup with minimal quality degradation.
arXiv Detail & Related papers (2025-04-28T09:20:50Z) - Marmot: Multi-Agent Reasoning for Multi-Object Self-Correcting in Improving Image-Text Alignment [55.74860093731475]
Marmot is a novel framework that employs Multi-Agent Reasoning for Multi-Object Self-Correcting. We construct a multi-agent self-correcting system featuring a decision-execution-verification mechanism. Experiments demonstrate that Marmot significantly improves accuracy in object counting, attribute assignment, and spatial relationships.
arXiv Detail & Related papers (2025-04-10T16:54:28Z) - PartRM: Modeling Part-Level Dynamics with Large Cross-State Reconstruction Model [23.768571323272152]
PartRM is a novel 4D reconstruction framework that simultaneously models appearance, geometry, and part-level motion from multi-view images of a static object. We introduce the PartDrag-4D dataset, providing multi-view observations of part-level dynamics across over 20,000 states. Experimental results show that PartRM establishes a new state-of-the-art in part-level motion learning and can be applied in manipulation tasks in robotics.
arXiv Detail & Related papers (2025-03-25T17:59:58Z) - VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval [8.908777234657046]
Large language and vision-language models (LLMs/LVLMs) have gained prominence across various domains. Here we propose VideoLights, a novel HD/MR framework addressing these limitations through (i) Convolutional Projection and Feature Refinement modules. Comprehensive experiments on QVHighlights, TVSum, and Charades-STA benchmarks demonstrate state-of-the-art performance.
arXiv Detail & Related papers (2024-12-02T14:45:53Z) - Large Motion Model for Unified Multi-Modal Motion Generation [50.56268006354396]
Large Motion Model (LMM) is a motion-centric, multi-modal framework that unifies mainstream motion generation tasks into a generalist model.
LMM tackles these challenges from three principled aspects.
arXiv Detail & Related papers (2024-04-01T17:55:11Z)