SAMJAM: Zero-Shot Video Scene Graph Generation for Egocentric Kitchen Videos
- URL: http://arxiv.org/abs/2504.07867v1
- Date: Thu, 10 Apr 2025 15:43:10 GMT
- Title: SAMJAM: Zero-Shot Video Scene Graph Generation for Egocentric Kitchen Videos
- Authors: Joshua Li, Fernando Jose Pena Cantu, Emily Yu, Alexander Wong, Yuchen Cui, Yuhao Chen
- Abstract summary: Current models for VidSGG require extensive training to produce scene graphs. We propose SAMJAM, a zero-shot pipeline that combines SAM2's temporal tracking with Gemini's semantic understanding. We empirically demonstrate that SAMJAM outperforms Gemini by 8.33% in mean recall on the EPIC-KITCHENS and EPIC-KITCHENS-100 datasets.
- Score: 93.29815497662877
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video Scene Graph Generation (VidSGG) is an important topic in understanding dynamic kitchen environments. Current models for VidSGG require extensive training to produce scene graphs. Recently, Vision Language Models (VLM) and Vision Foundation Models (VFM) have demonstrated impressive zero-shot capabilities in a variety of tasks. However, VLMs like Gemini struggle with the dynamics for VidSGG, failing to maintain stable object identities across frames. To overcome this limitation, we propose SAMJAM, a zero-shot pipeline that combines SAM2's temporal tracking with Gemini's semantic understanding. SAM2 also improves upon Gemini's object grounding by producing more accurate bounding boxes. In our method, we first prompt Gemini to generate a frame-level scene graph. Then, we employ a matching algorithm to map each object in the scene graph with a SAM2-generated or SAM2-propagated mask, producing a temporally-consistent scene graph in dynamic environments. Finally, we repeat this process again in each of the following frames. We empirically demonstrate that SAMJAM outperforms Gemini by 8.33% in mean recall on the EPIC-KITCHENS and EPIC-KITCHENS-100 datasets.
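The abstract describes the core step only at a high level: each object in Gemini's frame-level scene graph is mapped to a SAM2-generated or SAM2-propagated mask so that object identities stay stable across frames. The paper does not specify the matching algorithm here, so the following is a minimal sketch of one plausible approach, greedy IoU matching between Gemini's predicted boxes and the bounding boxes of SAM2's tracked masks; the function names (`iou`, `match_objects`), the data layout, and the IoU threshold are all illustrative assumptions, not the authors' implementation.

```python
from itertools import product

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def match_objects(gemini_objs, sam2_tracks, iou_thresh=0.5):
    """Greedily assign each Gemini-detected object to the SAM2 track
    whose mask bounding box overlaps it most, so a track ID carried
    over from earlier frames keeps labeling the same physical object.
    Objects left unmatched would start new tracks in a full pipeline.

    gemini_objs: {object_name: {"box": (x1, y1, x2, y2)}}
    sam2_tracks: {track_id: {"box": (x1, y1, x2, y2)}}
    returns: {object_name: track_id}
    """
    assignments, used = {}, set()
    # Rank every (object, track) pair by overlap, best first.
    pairs = sorted(
        ((iou(o["box"], t["box"]), name, tid)
         for (name, o), (tid, t) in product(gemini_objs.items(),
                                            sam2_tracks.items())),
        reverse=True,
    )
    for score, name, tid in pairs:
        if score < iou_thresh:
            break  # remaining pairs overlap too little to match
        if name in assignments or tid in used:
            continue  # each object and each track matched at most once
        assignments[name] = tid
        used.add(tid)
    return assignments
```

Running this once per frame, with `sam2_tracks` holding the masks SAM2 propagated from previous frames, yields object nodes whose identities persist over time, which is the temporal-consistency property the abstract claims over Gemini alone.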
Related papers
- VideoAnchor: Reinforcing Subspace-Structured Visual Cues for Coherent Visual-Spatial Reasoning [69.64660280965971]
VideoAnchor is a plug-and-play module that leverages subspace affinities to reinforce visual cues across frames without retraining. We show consistent performance gains on benchmarks with InternVL2-8B and Q2.5VL-72B. Our code will be made public at https://github.com/feufhd/VideoAnchor.
arXiv Detail & Related papers (2025-09-29T17:54:04Z) - OOTSM: A Decoupled Linguistic Framework for Effective Scene Graph Anticipation [14.938566273427098]
Scene Graph Anticipation (SGA) involves predicting future scene graphs from video clips. Existing SGA approaches leverage visual cues, often struggling to integrate valuable commonsense knowledge. We propose a new approach to better understand the objects, concepts, and relationships in a scene graph.
arXiv Detail & Related papers (2025-09-06T09:35:15Z) - VideoMolmo: Spatio-Temporal Grounding Meets Pointing [66.19964563104385]
VideoMolmo is a model tailored for fine-grained pointing in video sequences. A novel temporal mask fusion module employs SAM2 for bidirectional point propagation. To evaluate the generalization of VideoMolmo, we introduce VPoMolS-temporal, a challenging out-of-distribution benchmark spanning five real-world scenarios.
arXiv Detail & Related papers (2025-06-05T17:59:29Z) - LLM Meets Scene Graph: Can Large Language Models Understand and Generate Scene Graphs? A Benchmark and Empirical Study [12.90392791734461]
Large Language Models (LLMs) have paved the way for their expanding applications in embodied AI, robotics, and other real-world tasks. Recent works have leveraged scene graphs, a structured representation that encodes entities, attributes, and their relationships in a scene. We introduce Text-Scene Graph (TSG) Bench, a benchmark designed to assess LLMs' ability to understand scene graphs.
arXiv Detail & Related papers (2025-05-26T04:45:12Z) - CamSAM2: Segment Anything Accurately in Camouflaged Videos [37.0152845263844]
We propose Camouflaged SAM2 (CamSAM2) to handle camouflaged scenes without modifying SAM2's parameters. To make full use of fine-grained and high-resolution features from the current frame and previous frames, we propose implicit object-aware fusion (IOF) and explicit object-aware fusion (EOF) modules. While CamSAM2 only adds negligible learnable parameters to SAM2, it substantially outperforms SAM2 on three VCOS datasets.
arXiv Detail & Related papers (2025-03-25T14:58:52Z) - Fake It To Make It: Virtual Multiviews to Enhance Monocular Indoor Semantic Scene Completion [0.8669877024051931]
Monocular Indoor Semantic Scene Completion aims to reconstruct a 3D semantic occupancy map from a single RGB image of an indoor scene.
We introduce an innovative approach that leverages novel view synthesis and multiview fusion.
We demonstrate IoU score improvements of up to 2.8% for Scene Completion and 4.9% for Semantic Scene Completion when integrated with existing SSC networks on the NYUv2 dataset.
arXiv Detail & Related papers (2025-03-07T02:09:38Z) - Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video Generation [52.337472185022136]
We consider the task of Image-to-Video (I2V) generation, which involves transforming static images into realistic video sequences based on a textual description.
We propose a two-stage compositional framework that decomposes I2V generation into: (i) An explicit intermediate representation generation stage, followed by (ii) A video generation stage that is conditioned on this representation.
We evaluate our method on challenging benchmarks with multi-object and high-motion scenarios and empirically demonstrate that the proposed method achieves state-of-the-art consistency.
arXiv Detail & Related papers (2025-01-06T14:49:26Z) - Generate Any Scene: Scene Graph Driven Data Synthesis for Visual Generation Training [61.75337990107149]
We introduce Generate Any Scene, a data engine that enumerates scene graphs representing an array of possible visual scenes. Given a sampled scene graph, Generate Any Scene translates it into a caption for text-to-image or text-to-video generation. It also translates it into a set of visual question answers that allow automatic evaluation and reward modeling of semantic alignment.
arXiv Detail & Related papers (2024-12-11T09:17:39Z) - HyperGLM: HyperGraph for Video Scene Graph Generation and Anticipation [7.027942200231825]
Video Scene Graph Generation (VidSGG) has emerged to capture multi-object relationships across video frames. We propose Multimodal LLMs on a Scene HyperGraph (HyperGLM), promoting reasoning about multi-way interactions and higher-order relationships. We introduce a new Video Scene Graph Reasoning dataset featuring 1.9M frames from third-person, egocentric, and drone views.
arXiv Detail & Related papers (2024-11-27T04:24:39Z) - TESGNN: Temporal Equivariant Scene Graph Neural Networks for Efficient and Robust Multi-View 3D Scene Understanding [8.32401190051443]
We propose Temporal Equivariant Scene Graph Neural Network (TESGNN), consisting of two key components. ESGNN extracts information from 3D point clouds to generate scene graphs while preserving crucial symmetry properties. We show that leveraging the symmetry-preserving property produces a more stable and accurate global scene representation.
arXiv Detail & Related papers (2024-11-15T15:39:04Z) - EPIC Fields: Marrying 3D Geometry and Video Understanding [76.60638761589065]
EPIC Fields is an augmentation of EPIC-KITCHENS with 3D camera information.
It removes the complex and expensive step of reconstructing cameras using photogrammetry.
It reconstructs 96% of videos in EPIC-KITCHENS, registering 19M frames in 99 hours recorded in 45 kitchens.
arXiv Detail & Related papers (2023-06-14T20:33:49Z) - Bringing Image Scene Structure to Video via Frame-Clip Consistency of Object Tokens [93.98605636451806]
StructureViT shows how utilizing the structure of a small number of images only available during training can improve a video model.
SViT shows strong performance improvements on multiple video understanding tasks and datasets.
arXiv Detail & Related papers (2022-06-13T17:45:05Z) - Boundary-aware Self-supervised Learning for Video Scene Segmentation [20.713635723315527]
Video scene segmentation is a task of temporally localizing scene boundaries in a video.
We introduce three novel boundary-aware pretext tasks: Shot-Scene Matching, Contextual Group Matching and Pseudo-boundary Prediction.
We achieve the new state-of-the-art on the MovieNet-SSeg benchmark.
arXiv Detail & Related papers (2022-01-14T02:14:07Z) - Compositional Video Synthesis with Action Graphs [112.94651460161992]
Videos of actions are complex signals containing rich compositional structure in space and time.
We propose to represent the actions in a graph structure called Action Graph and present the new "Action Graph To Video" synthesis task.
Our generative model for this task (AG2Vid) disentangles motion and appearance features, and by incorporating a scheduling mechanism for actions facilitates a timely and coordinated video generation.
arXiv Detail & Related papers (2020-06-27T09:39:04Z) - Zero-Shot Video Object Segmentation via Attentive Graph Neural Networks [150.5425122989146]
This work proposes a novel attentive graph neural network (AGNN) for zero-shot video object segmentation (ZVOS).
AGNN builds a fully connected graph to efficiently represent frames as nodes, and relations between arbitrary frame pairs as edges.
Experimental results on three video segmentation datasets show that AGNN sets a new state-of-the-art in each case.
arXiv Detail & Related papers (2020-01-19T10:45:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.