DIV-FF: Dynamic Image-Video Feature Fields For Environment Understanding in Egocentric Videos
- URL: http://arxiv.org/abs/2503.08344v1
- Date: Tue, 11 Mar 2025 11:55:04 GMT
- Title: DIV-FF: Dynamic Image-Video Feature Fields For Environment Understanding in Egocentric Videos
- Authors: Lorenzo Mur-Labadia, Josechu Guerrero, Ruben Martinez-Cantin
- Abstract summary: We introduce Dynamic Image-Video Feature Fields (DIV-FF), a framework that decomposes the egocentric scene into persistent, dynamic, and actor-based components. Our model enables detailed segmentation, captures affordances, understands the surroundings and maintains consistent understanding over time.
- Score: 3.2771631221674333
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Environment understanding in egocentric videos is an important step for applications like robotics, augmented reality and assistive technologies. These videos are characterized by dynamic interactions and a strong dependence on the wearer's engagement with the environment. Traditional approaches often focus on isolated clips or fail to integrate rich semantic and geometric information, limiting scene comprehension. We introduce Dynamic Image-Video Feature Fields (DIV-FF), a framework that decomposes the egocentric scene into persistent, dynamic, and actor-based components while integrating both image and video language features. Our model enables detailed segmentation, captures affordances, understands the surroundings and maintains consistent understanding over time. DIV-FF outperforms state-of-the-art methods, particularly in dynamically evolving scenarios, demonstrating its potential to advance long-term, spatio-temporal scene understanding.
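The decomposition described above lends itself to a small illustrative sketch. The code below is not the authors' implementation; it is a minimal PyTorch example, under the assumption of a NeRF-style volumetric pipeline, of how three branches (persistent scene, dynamic objects, actor) could each predict a density and a semantic feature per 3D point, have their features composited along a camera ray, and be queried against a CLIP-style text embedding for open-vocabulary segmentation. All module names, feature dimensions, and the placeholder text embedding are assumptions.

```python
# Minimal illustrative sketch (not the authors' code): three NeRF-style
# branches predict per-point density and semantic features, which are
# composited along a ray and compared against a text embedding.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureFieldBranch(nn.Module):
    """One branch (persistent / dynamic / actor): 3D point -> (density, feature)."""

    def __init__(self, in_dim=3, hidden=128, feat_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.density_head = nn.Linear(hidden, 1)
        self.feature_head = nn.Linear(hidden, feat_dim)

    def forward(self, x):
        h = self.mlp(x)
        sigma = F.softplus(self.density_head(h))   # non-negative density
        feat = self.feature_head(h)                # per-point semantic feature
        return sigma, feat


def render_ray_feature(branches, points, deltas):
    """Composite the branch features along one ray with volume rendering."""
    sigmas, feats = zip(*(branch(points) for branch in branches))
    sigma = torch.stack(sigmas).sum(0)                         # (N, 1) total density
    # density-weighted mix of the three branch features at each sample
    mix = torch.stack(sigmas) / sigma.clamp(min=1e-8)          # (3, N, 1)
    feat = (mix * torch.stack(feats)).sum(0)                   # (N, feat_dim)
    # standard alpha compositing along the ray
    alpha = 1.0 - torch.exp(-sigma.squeeze(-1) * deltas)       # (N,)
    trans = torch.cumprod(torch.cat([alpha.new_ones(1), 1.0 - alpha[:-1]]), dim=0)
    weights = (alpha * trans).unsqueeze(-1)                    # (N, 1)
    return (weights * feat).sum(0)                             # (feat_dim,)


if __name__ == "__main__":
    branches = [FeatureFieldBranch() for _ in range(3)]  # persistent, dynamic, actor
    points = torch.rand(64, 3)              # samples along a single camera ray
    deltas = torch.full((64,), 0.02)        # distances between consecutive samples
    ray_feat = render_ray_feature(branches, points, deltas)
    text_emb = torch.randn(512)             # placeholder for a text query embedding
    score = F.cosine_similarity(ray_feat, text_emb, dim=0)
    print(ray_feat.shape, score.item())
```

In a real feature field, the per-branch features would be distilled from image-language and video-language encoders rather than learned from scratch, and the text embedding would come from the matching text encoder; here both are random stand-ins so the sketch stays self-contained.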
Related papers
- CrossVideoMAE: Self-Supervised Image-Video Representation Learning with Masked Autoencoders [6.159948396712944]
CrossVideoMAE learns both video-level and frame-level rich spatiotemporal representations and semantic attributes. Our method integrates mutual spatiotemporal information from videos with spatial information from sampled frames. This is critical for acquiring rich, label-free guiding signals from both video and frame image modalities in a self-supervised manner.
arXiv Detail & Related papers (2025-02-08T06:15:39Z) - DynVFX: Augmenting Real Videos with Dynamic Content [19.393567535259518]
We present a method for augmenting real-world videos with newly generated dynamic content.
Given an input video and a simple user-provided text instruction describing the desired content, our method synthesizes dynamic objects or complex scene effects.
The position, appearance, and motion of the new content are seamlessly integrated into the original footage.
arXiv Detail & Related papers (2025-02-05T21:14:55Z) - Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs [66.57518905079262]
VideoMind organizes critical video moments into a chronologically structured semantic graph. The "Mind Palace" organizes key information through (i) hand-object tracking, (ii) clustered activity zones representing specific areas of recurring activities, and (iii) environment layout mapping.
arXiv Detail & Related papers (2025-01-08T08:15:29Z) - One-Shot Learning Meets Depth Diffusion in Multi-Object Videos [0.0]
This paper introduces a novel depth-conditioning approach that enables the generation of coherent and diverse videos from just a single text-video pair.
Our method fine-tunes the pre-trained model to capture continuous motion by employing custom-designed spatial and temporal attention mechanisms.
During inference, we use the DDIM inversion to provide structural guidance for video generation.
arXiv Detail & Related papers (2024-08-29T16:58:10Z) - EgoGaussian: Dynamic Scene Understanding from Egocentric Video with 3D Gaussian Splatting [95.44545809256473]
EgoGaussian is a method capable of simultaneously reconstructing 3D scenes and dynamically tracking 3D object motion from RGB egocentric input alone.
We show significant improvements in terms of both dynamic object and background reconstruction quality compared to the state-of-the-art.
arXiv Detail & Related papers (2024-06-28T10:39:36Z) - HENASY: Learning to Assemble Scene-Entities for Egocentric Video-Language Model [9.762722976833581]
Current models rely extensively on instance-level alignment between video and language modalities.
We take inspiration from human perception and explore a compositional approach for ego video representation.
arXiv Detail & Related papers (2024-06-01T05:41:12Z) - HIG: Hierarchical Interlacement Graph Approach to Scene Graph Generation in Video Understanding [8.10024991952397]
Existing methods focus on complex interactivities while leveraging a simple relationship model.
We propose a new approach named Hierarchical Interlacement Graph (HIG), which leverages a unified layer and graph within a hierarchical structure.
Our approach demonstrates superior performance to other methods through extensive experiments conducted in various scenarios.
arXiv Detail & Related papers (2023-12-05T18:47:19Z) - DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors [63.43133768897087]
We propose a method to convert open-domain images into animated videos.
The key idea is to utilize the motion prior of text-to-video diffusion models by incorporating the image into the generative process as guidance.
Our proposed method can produce visually convincing and more logical & natural motions, as well as higher conformity to the input image.
arXiv Detail & Related papers (2023-10-18T14:42:16Z) - DynIBaR: Neural Dynamic Image-Based Rendering [79.44655794967741]
We address the problem of synthesizing novel views from a monocular video depicting a complex dynamic scene.
We adopt a volumetric image-based rendering framework that synthesizes new viewpoints by aggregating features from nearby views.
We demonstrate significant improvements over state-of-the-art methods on dynamic scene datasets.
arXiv Detail & Related papers (2022-11-20T20:57:02Z) - EgoEnv: Human-centric environment representations from egocentric video [60.34649902578047]
First-person video highlights a camera-wearer's activities in the context of their persistent environment.
Current video understanding approaches reason over visual features from short video clips that are detached from the underlying physical space.
We present an approach that links egocentric video and the environment by learning representations that are predictive of the camera-wearer's (potentially unseen) local surroundings.
arXiv Detail & Related papers (2022-07-22T22:39:57Z) - End-to-end Multi-modal Video Temporal Grounding [105.36814858748285]
We propose a multi-modal framework to extract complementary information from videos.
We adopt RGB images for appearance, optical flow for motion, and depth maps for image structure.
We conduct experiments on the Charades-STA and ActivityNet Captions datasets, and show that the proposed method performs favorably against state-of-the-art approaches.
arXiv Detail & Related papers (2021-07-12T17:58:10Z) - Relation-aware Hierarchical Attention Framework for Video Question Answering [6.312182279855817]
We propose a novel Relation-aware Hierarchical Attention (RHA) framework to learn both the static and dynamic relations of the objects in videos.
In particular, videos and questions are first embedded by pre-trained models to obtain the visual and textual features.
We consider the temporal, spatial, and semantic relations, and fuse the multimodal features with a hierarchical attention mechanism to predict the answer (see the sketch after this list).
arXiv Detail & Related papers (2021-05-13T09:35:42Z)
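As referenced in the RHA entry above, the embed-then-fuse pipeline it describes can be pictured with a short sketch. This is not the RHA model; it only illustrates a two-stage attention fusion for video question answering in which question tokens first attend over frame features and a learned query then pools the fused sequence before an answer classifier. The pre-trained encoders are replaced by random placeholder tensors, and all dimensions and module names are assumptions.

```python
# Minimal illustrative sketch (not the RHA authors' code) of a two-stage
# attention fusion for video question answering. Pre-trained video and text
# encoders are stood in for by random tensors.
import torch
import torch.nn as nn


class HierarchicalAttentionVQA(nn.Module):
    def __init__(self, dim=256, num_answers=1000, heads=4):
        super().__init__()
        # stage 1: question tokens attend over video frame features
        self.frame_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # stage 2: a single learned query pools the fused sequence
        self.pool_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.pool_query = nn.Parameter(torch.randn(1, 1, dim))
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, frame_feats, question_feats):
        # (batch, n_question_tokens, dim): each token mixes in frame evidence
        fused, _ = self.frame_attn(question_feats, frame_feats, frame_feats)
        # (batch, 1, dim): pooled question-video representation
        pooled, _ = self.pool_attn(
            self.pool_query.expand(fused.size(0), -1, -1), fused, fused)
        return self.classifier(pooled.squeeze(1))   # (batch, num_answers)


if __name__ == "__main__":
    model = HierarchicalAttentionVQA()
    frames = torch.randn(2, 32, 256)     # stand-in for pre-trained visual features
    question = torch.randn(2, 12, 256)   # stand-in for pre-trained text features
    logits = model(frames, question)
    print(logits.shape)                  # torch.Size([2, 1000])
```

A relation-aware model would additionally build temporal, spatial, and semantic relation structures before fusion; the sketch keeps only the attention skeleton.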
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.