ESCA: Contextualizing Embodied Agents via Scene-Graph Generation
- URL: http://arxiv.org/abs/2510.15963v2
- Date: Mon, 27 Oct 2025 17:51:21 GMT
- Title: ESCA: Contextualizing Embodied Agents via Scene-Graph Generation
- Authors: Jiani Huang, Amish Sethi, Matthew Kuo, Mayank Keoliya, Neelay Velingker, JungHo Jung, Ser-Nam Lim, Ziyang Li, Mayur Naik
- Abstract summary: We propose ESCA, a framework that contextualizes embodied agents by grounding their perception in spatial-temporal scene graphs. At its core is SGCLIP, a novel, open-domain, promptable foundation model for generating scene graphs. SGCLIP excels in both prompt-based inference and task-specific fine-tuning, achieving state-of-the-art results on scene graph generation and action localization benchmarks.
- Score: 47.008144510161486
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-modal large language models (MLLMs) are making rapid progress toward general-purpose embodied agents. However, existing MLLMs do not reliably capture fine-grained links between low-level visual features and high-level textual semantics, leading to weak grounding and inaccurate perception. To overcome this challenge, we propose ESCA, a framework that contextualizes embodied agents by grounding their perception in spatial-temporal scene graphs. At its core is SGCLIP, a novel, open-domain, promptable foundation model for generating scene graphs that is based on CLIP. SGCLIP is trained on 87K+ open-domain videos using a neurosymbolic pipeline that aligns automatically generated captions with scene graphs produced by the model itself, eliminating the need for human-labeled annotations. We demonstrate that SGCLIP excels in both prompt-based inference and task-specific fine-tuning, achieving state-of-the-art results on scene graph generation and action localization benchmarks. ESCA with SGCLIP improves perception for embodied agents based on both open-source and commercial MLLMs, achieving state-of-the-art performance across two embodied environments. Notably, ESCA significantly reduces agent perception errors and enables open-source models to surpass proprietary baselines. We release the source code for SGCLIP model training at https://github.com/video-fm/LASER and for the embodied agent at https://github.com/video-fm/ESCA.
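The abstract describes SGCLIP as a promptable, CLIP-based scene-graph generator. As a rough illustration of the prompt-based inference idea only (the model checkpoint, prompt template, and candidate triplets below are assumptions, not SGCLIP's actual design), off-the-shelf CLIP can rank candidate relation triplets for a frame:

```python
# Illustrative sketch (not SGCLIP itself): score candidate relation
# triplets for an image with off-the-shelf CLIP. SGCLIP extends this
# idea to open-domain video with neurosymbolic training (see the paper).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def score_triplets(image: Image.Image, triplets: list[tuple[str, str, str]]):
    """Rank (subject, predicate, object) candidates by CLIP similarity."""
    prompts = [f"a photo of a {s} {p} a {o}" for s, p, o in triplets]
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image.squeeze(0)  # one score per prompt
    probs = logits.softmax(dim=-1)
    return sorted(zip(triplets, probs.tolist()), key=lambda x: -x[1])

# Example: which relation best describes the frame? ("frame.jpg" is a placeholder path.)
frame = Image.open("frame.jpg")
candidates = [("person", "holding", "cup"), ("person", "riding", "bicycle")]
for triplet, p in score_triplets(frame, candidates):
    print(triplet, round(p, 3))
```

For the actual model and training pipeline, consult the linked repositories.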
Related papers
- TagaVLM: Topology-Aware Global Action Reasoning for Vision-Language Navigation [70.23578202012048]
Vision-Language Navigation (VLN) presents a unique challenge for Large Vision-Language Models (VLMs) due to their inherent architectural mismatch. We propose TagaVLM (Topology-Aware Global Action reasoning), an end-to-end framework that explicitly injects topological structures into the VLM backbone. To enhance topological node information, an Interleaved Navigation Prompt strengthens node-level visual-text alignment. With the embedded topological graph, the model is capable of global action reasoning, allowing for robust path correction.
arXiv Detail & Related papers (2026-03-03T13:28:07Z)
- Unleashing MLLMs on the Edge: A Unified Framework for Cross-Modal ReID via Adaptive SVD Distillation [48.88299242238335]
Cross-Modal Re-identification (CM-ReID) faces the challenge of maintaining a fragmented ecosystem of specialized cloud models. We propose MLLMEmbed-ReID, a unified framework based on a powerful cloud-edge architecture.
arXiv Detail & Related papers (2026-02-13T13:48:08Z)
- Bi-Modal Textual Prompt Learning for Vision-Language Models in Remote Sensing [23.747598435550504]
We present BiMoRS, a lightweight prompt learning framework tailored for remote sensing (RS) tasks. BiMoRS employs a frozen image captioning model (e.g., BLIP-2) to extract semantic summaries from RS images. A lightweight cross-attention module then conditions a learnable query prompt on the fused textual-visual representation, yielding contextualized prompts without altering the CLIP backbone. We evaluate BiMoRS on four RS datasets across three domain generalization (DG) tasks and observe consistent performance gains, outperforming strong baselines by up to 2% on average.
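A minimal PyTorch sketch of the kind of module this abstract describes: a learnable query prompt cross-attending over fused caption and visual features. The embedding sizes and concatenation-based fusion below are assumptions the abstract does not specify.

```python
# Illustrative sketch of BiMoRS-style prompt conditioning (assumptions:
# feature dimensions, fusion by concatenation; the paper's exact design may differ).
import torch
import torch.nn as nn

class PromptConditioner(nn.Module):
    def __init__(self, dim: int = 512, n_prompts: int = 8, n_heads: int = 8):
        super().__init__()
        self.query_prompt = nn.Parameter(torch.randn(n_prompts, dim))  # learnable queries
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, caption_tokens: torch.Tensor, visual_tokens: torch.Tensor):
        # Fuse frozen-captioner text features with visual features.
        context = torch.cat([caption_tokens, visual_tokens], dim=1)  # (B, T, D)
        q = self.query_prompt.unsqueeze(0).expand(context.size(0), -1, -1)
        prompts, _ = self.cross_attn(q, context, context)
        return prompts  # contextualized prompts; the CLIP backbone stays frozen

module = PromptConditioner()
out = module(torch.randn(2, 16, 512), torch.randn(2, 49, 512))
print(out.shape)  # torch.Size([2, 8, 512])
```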
arXiv Detail & Related papers (2026-01-28T14:58:14Z)
- CyberLLM-FINDS 2025: Instruction-Tuned Fine-tuning of Domain-Specific LLMs with Retrieval-Augmented Generation and Graph Integration for MITRE Evaluation [0.054619385369457214]
This work presents a methodology to fine-tune the Gemma-2B model into a domain-specific cybersecurity LLM. We detail the processes of dataset preparation, fine-tuning, and synthetic data generation, along with implications for real-world applications in threat detection, forensic investigation, and attack analysis.
arXiv Detail & Related papers (2026-01-11T05:07:57Z)
- GILT: An LLM-Free, Tuning-Free Graph Foundational Model for In-Context Learning [50.40400074353263]
Graph Neural Networks (GNNs) are powerful tools for processing relational data but often struggle to generalize to unseen graphs. We introduce the Graph In-context Learning Transformer (GILT), a framework built on an LLM-free and tuning-free architecture.
arXiv Detail & Related papers (2025-10-06T08:09:15Z)
- GRIL: Knowledge Graph Retrieval-Integrated Learning with Large Language Models [59.72897499248909]
We propose a novel graph retriever trained end-to-end with Large Language Models (LLMs). Within the extracted subgraph, structural knowledge and semantic features are encoded via soft tokens and the verbalized graph, respectively, which are infused into the LLM together. Our approach consistently achieves state-of-the-art performance, validating the strength of joint graph-LLM optimization for complex reasoning tasks.
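A toy sketch of the "verbalized graph" half of this recipe (the soft-token path and the end-to-end retriever training are omitted; the triples and prompt format are illustrative):

```python
# Illustrative sketch: "verbalizing" a retrieved knowledge subgraph into
# plain text for an LLM prompt. GRIL additionally encodes structure as
# soft tokens, which this text-only sketch omits.
def verbalize_subgraph(triples: list[tuple[str, str, str]]) -> str:
    """Turn (head, relation, tail) triples into one sentence per edge."""
    lines = [f"{h} {r.replace('_', ' ')} {t}." for h, r, t in triples]
    return "Known facts:\n" + "\n".join(lines)

# Hypothetical retrieved subgraph for a QA example.
subgraph = [
    ("Marie Curie", "born_in", "Warsaw"),
    ("Marie Curie", "awarded", "Nobel Prize in Physics"),
]
prompt = verbalize_subgraph(subgraph) + "\nQuestion: Where was Marie Curie born?"
print(prompt)
```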
arXiv Detail & Related papers (2025-09-20T02:38:00Z)
- Generalized Decoupled Learning for Enhancing Open-Vocabulary Dense Perception [71.26728044621458]
DeCLIP is a novel framework that enhances CLIP by decoupling the self-attention module to obtain "content" and "context" features respectively. It consistently achieves state-of-the-art performance across a broad spectrum of tasks, including 2D detection and segmentation, 3D instance segmentation, video instance segmentation, and 6D object pose estimation.
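The abstract does not give the decoupling formula. One speculative reading, shown purely for intuition, derives a "content" path from query-query similarity while keeping the usual query-key attention as the "context" path; DeCLIP's actual formulation may differ.

```python
# Speculative sketch of attention decoupling (NOT DeCLIP's published formula):
# a "content" path from query-query similarity and a "context" path from
# standard query-key attention, computed over the same projections.
import torch
import torch.nn.functional as F

def decoupled_attention(q, k, v):
    d = q.size(-1) ** 0.5
    content = F.softmax(q @ q.transpose(-2, -1) / d, dim=-1) @ v  # self-similarity path
    context = F.softmax(q @ k.transpose(-2, -1) / d, dim=-1) @ v  # standard attention path
    return content, context

q, k, v = (torch.randn(1, 50, 64) for _ in range(3))
content_feats, context_feats = decoupled_attention(q, k, v)
print(content_feats.shape, context_feats.shape)
```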
arXiv Detail & Related papers (2025-08-15T06:43:51Z)
- From Data to Modeling: Fully Open-vocabulary Scene Graph Generation [29.42202665594218]
OvSGTR is a transformer-based framework for fully open-vocabulary scene graph generation. Our approach jointly predicts objects (nodes) and their inter-relationships (edges) beyond predefined categories.
arXiv Detail & Related papers (2025-05-26T15:11:23Z)
- LLM Meets Scene Graph: Can Large Language Models Understand and Generate Scene Graphs? A Benchmark and Empirical Study [12.90392791734461]
Large Language Models (LLMs) have paved the way for their expanding applications in embodied AI, robotics, and other real-world tasks. Recent works have leveraged scene graphs, a structured representation that encodes entities, attributes, and their relationships in a scene. We introduce Text-Scene Graph (TSG) Bench, a benchmark designed to assess LLMs' ability to understand scene graphs.
arXiv Detail & Related papers (2025-05-26T04:45:12Z)
- Scalability Matters: Overcoming Challenges in InstructGLM with Similarity-Degree-Based Sampling [1.2805157669888096]
We propose SDM-InstructGLM, a novel instruction-tuned Graph Language Model (InstructGLM) framework that enhances scalability and efficiency without relying on GNNs. Our method introduces a similarity-degree-based biased random walk mechanism, which selectively samples and encodes graph information based on node-feature similarity and degree centrality. Our results demonstrate the feasibility of LLM-only graph processing, enabling scalable and interpretable Graph Language Models (GLMs) optimized through instruction-based fine-tuning.
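A small sketch of what a similarity-degree-biased random walk could look like (the mixing weight alpha and the normalization are assumptions; the paper's exact weighting may differ):

```python
# Illustrative sketch of a similarity-degree-biased random walk
# (the mixing and normalization choices here are assumptions).
import networkx as nx
import numpy as np

def biased_walk(G: nx.Graph, feats: dict, start, length: int = 10, alpha: float = 0.5):
    """Walk where P(next) mixes cosine feature similarity and degree centrality."""
    walk, node = [start], start
    for _ in range(length - 1):
        nbrs = list(G.neighbors(node))
        if not nbrs:
            break
        sim = np.array([
            feats[node] @ feats[n]
            / (np.linalg.norm(feats[node]) * np.linalg.norm(feats[n]) + 1e-8)
            for n in nbrs
        ])
        deg = np.array([G.degree(n) for n in nbrs], dtype=float)
        # Shift similarities to be nonnegative, mix with normalized degree.
        w = alpha * (sim - sim.min() + 1e-8) + (1 - alpha) * deg / deg.sum()
        node = np.random.choice(nbrs, p=w / w.sum())
        walk.append(node)
    return walk

G = nx.karate_club_graph()
feats = {n: np.random.randn(16) for n in G.nodes}  # toy node features
print(biased_walk(G, feats, start=0))
```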
arXiv Detail & Related papers (2025-05-02T06:08:21Z)
- GMLM: Bridging Graph Neural Networks and Language Models for Heterophilic Node Classification [0.0]
We propose a novel framework that enables effective fusion between pre-trained text encoders and Relational Graph Convolutional Networks (R-GCNs). Experiments on five heterophilic benchmarks demonstrate that our integration method achieves state-of-the-art results. These results highlight the effectiveness of our fusion strategy for advancing text-rich graph representation learning.
arXiv Detail & Related papers (2025-02-24T07:44:01Z)
- HIP: Hierarchical Point Modeling and Pre-training for Visual Information Extraction [24.46493675079128]
OCR-dependent methods rely on offline OCR engines, while OCR-free methods might produce outputs that lack interpretability or contain hallucinated content.
We propose HIP, which models entities as HIerarchical Points to better conform to the hierarchical nature of the end-to-end VIE task.
Specifically, such hierarchical points can be flexibly encoded and subsequently decoded into desired text transcripts, centers of various regions, and categories of entities.
arXiv Detail & Related papers (2024-11-02T05:00:13Z)
- Bridging Large Language Models and Graph Structure Learning Models for Robust Representation Learning [22.993015048941444]
Graph representation learning is crucial for real-world applications but often encounters pervasive noise.
We introduce LangGSL, a framework that integrates the complementary strengths of pre-trained language models and graph structure learning models.
arXiv Detail & Related papers (2024-10-15T22:43:32Z)
- A Pure Transformer Pretraining Framework on Text-attributed Graphs [50.833130854272774]
We introduce a feature-centric pretraining perspective by treating graph structure as a prior.
Our framework, Graph Sequence Pretraining with Transformer (GSPT), samples node contexts through random walks.
GSPT can be easily adapted to both node classification and link prediction, demonstrating promising empirical success on various datasets.
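A toy sketch of the feature-centric input construction this abstract suggests: a random walk picks the node order, and a plain transformer consumes only the resulting feature sequence. All sizes and the toy graph below are illustrative.

```python
# Illustrative sketch of feature-centric pretraining input construction:
# a random walk supplies the node order, and the transformer sees only
# the node feature sequence (structure acts as a prior, not as input).
import random
import torch
import torch.nn as nn

def walk_to_sequence(adj: dict, feats: torch.Tensor, start: int, length: int):
    node, walk = start, [start]
    for _ in range(length - 1):
        if not adj[node]:
            break
        node = random.choice(adj[node])
        walk.append(node)
    return feats[walk]  # (walk_len, feat_dim) token sequence

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}  # toy graph
feats = torch.randn(4, 32)  # toy node features
seq = walk_to_sequence(adj, feats, start=0, length=6).unsqueeze(0)
encoder = nn.TransformerEncoder(nn.TransformerEncoderLayer(32, 4, batch_first=True), 2)
print(encoder(seq).shape)  # (1, walk_len, 32)
```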
arXiv Detail & Related papers (2024-06-19T22:30:08Z)
- Constructing Holistic Spatio-Temporal Scene Graph for Video Semantic Role Labeling [96.64607294592062]
Video Semantic Role Labeling (VidSRL) aims to detect salient events from given videos.
Recent endeavors have put forth methods for VidSRL, but they suffer from two key drawbacks.
arXiv Detail & Related papers (2023-08-09T17:20:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.