LASER: A Neuro-Symbolic Framework for Learning Spatial-Temporal Scene Graphs with Weak Supervision
- URL: http://arxiv.org/abs/2304.07647v7
- Date: Mon, 27 Oct 2025 20:14:22 GMT
- Title: LASER: A Neuro-Symbolic Framework for Learning Spatial-Temporal Scene Graphs with Weak Supervision
- Authors: Jiani Huang, Ziyang Li, Mayur Naik, Ser-Nam Lim
- Abstract summary: We propose a neuro-symbolic framework to enable training STSG generators using only video captions. An alignment algorithm overcomes the challenges of weak supervision by leveraging a differentiable symbolic reasoner. We evaluate our method on three video datasets: OpenPVSG, 20BN, and MUGEN.
- Score: 58.6039004982056
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Supervised approaches for learning spatio-temporal scene graphs (STSG) from video are greatly hindered due to their reliance on STSG-annotated videos, which are labor-intensive to construct at scale. Is it feasible to instead use readily available video captions as weak supervision? To address this question, we propose LASER, a neuro-symbolic framework to enable training STSG generators using only video captions. LASER employs large language models to first extract logical specifications with rich spatio-temporal semantic information from video captions. LASER then trains the underlying STSG generator to align the predicted STSG with the specification. The alignment algorithm overcomes the challenges of weak supervision by leveraging a differentiable symbolic reasoner and using a combination of contrastive, temporal, and semantics losses. The overall approach efficiently trains low-level perception models to extract a fine-grained STSG that conforms to the video caption. In doing so, it enables a novel methodology for learning STSGs without tedious annotations. We evaluate our method on three video datasets: OpenPVSG, 20BN, and MUGEN. Our approach demonstrates substantial improvements over fully-supervised baselines, achieving a unary predicate prediction accuracy of 27.78% (+12.65%) and a binary recall@5 of 0.42 (+0.22) on OpenPVSG. Additionally, LASER exceeds baselines by 7% on 20BN and 5.2% on MUGEN in terms of overall predicate prediction accuracy.
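The abstract hinges on two ingredients it only names: a spatio-temporal scene graph and a logical specification to align it with. As a rough illustration only (the encoding, operator set, and all names below are hypothetical assumptions, and LASER's reasoner is differentiable rather than boolean), an STSG can be viewed as one set of predicate facts per frame, with temporal operators evaluated over the frame sequence:

```python
from typing import List, Set, Tuple

# Hypothetical encoding: one set of predicate facts per video frame.
Fact = Tuple[str, ...]        # e.g. ("holding", "person", "cup")
STSG = List[Set[Fact]]

def eventually(stsg: STSG, fact: Fact) -> bool:
    """Temporal operator F: the fact holds in at least one frame."""
    return any(fact in frame for frame in stsg)

def until(stsg: STSG, a: Fact, b: Fact) -> bool:
    """a U b: `a` holds in every frame before the first frame where `b` holds."""
    for i, frame in enumerate(stsg):
        if b in frame:
            return all(a in f for f in stsg[:i])
    return False

# Toy STSG for a caption like "a person picks up a cup and drinks".
stsg = [
    {("near", "person", "cup")},
    {("near", "person", "cup"), ("holding", "person", "cup")},
    {("drinking", "person")},
]
```

A specification extracted from the caption could then be checked against the predicted graph, e.g. `until(stsg, ("near", "person", "cup"), ("drinking", "person"))`; LASER's actual alignment replaces such hard boolean checks with a differentiable symbolic reasoner so that gradients reach the low-level perception model.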
Related papers
- STVG-R1: Incentivizing Instance-Level Reasoning and Grounding in Videos via Reinforcement Learning [65.36458157092207]
In vision-language models (VLMs), misalignment between textual descriptions and visual coordinates often induces hallucinations. We propose a novel visual prompting paradigm that avoids the difficult problem of aligning coordinates across modalities. We introduce STVG-R1, the first reinforcement learning framework for STVG, which employs a task-driven reward to jointly optimize temporal accuracy, spatial consistency, and structural format regularization.
arXiv Detail & Related papers (2026-02-12T08:53:32Z)
- Light Forcing: Accelerating Autoregressive Video Diffusion via Sparse Attention [28.598033369607723]
Light Forcing is the first sparse attention solution tailored for AR video generation models. It incorporates a Chunk-Aware Growth mechanism to quantitatively estimate the contribution of each chunk. We also introduce a sparse attention mechanism to capture informative historical and local context in a coarse-to-fine manner.
arXiv Detail & Related papers (2026-02-04T17:41:53Z)
- Entropy-Guided k-Guard Sampling for Long-Horizon Autoregressive Video Generation [22.973340187143616]
We propose Entropy-Guided k-Guard sampling (ENkG), a strategy that adapts sampling to token-wise dispersion. ENkG uses adaptive token candidate set sizes: for low-entropy regions, it employs fewer candidates to suppress redundant noise and preserve structural integrity. Experiments demonstrate consistent improvements in perceptual quality and structural stability compared to static top-k/top-p strategies.
arXiv Detail & Related papers (2026-01-27T11:19:53Z)
- ESCA: Contextualizing Embodied Agents via Scene-Graph Generation [47.008144510161486]
We propose ESCA, a framework that contextualizes embodied agents by grounding their perception in spatial-temporal scene graphs. At its core is SGCLIP, a novel, open-domain, promptable foundation model for generating scene graphs. SGCLIP excels in both prompt-based inference and task-specific fine-tuning, achieving state-of-the-art results on scene graph generation and action localization benchmarks.
arXiv Detail & Related papers (2025-10-11T20:13:59Z)
- Enhancing Spectral Graph Neural Networks with LLM-Predicted Homophily [48.135717446964385]
Spectral Graph Neural Networks (SGNNs) have achieved remarkable performance in tasks such as node classification. We propose a novel framework that leverages Large Language Models (LLMs) to estimate the homophily level of a graph. Our framework consistently improves performance over strong SGNN baselines.
arXiv Detail & Related papers (2025-06-17T06:17:19Z)
- Compile Scene Graphs with Reinforcement Learning [69.36723767339001]
Next token prediction is the fundamental principle for training large language models (LLMs).
We introduce R1-SGG, a multimodal LLM (M-LLM) trained via supervised fine-tuning (SFT) on the scene graph dataset.
We design a graph-centric reward function that integrates node-level rewards, edge-level rewards, and a format consistency reward.
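The reward decomposition described above can be pictured with a toy additive sketch; the weights, the exact-match criterion, and the function name here are illustrative assumptions, not the paper's definitions:

```python
def graph_reward(pred_nodes, gold_nodes, pred_edges, gold_edges,
                 well_formed, w_node=1.0, w_edge=1.0, w_fmt=0.5):
    """Toy graph-centric reward: node recall + edge recall + format bonus.

    All weights and the set-based matching rule are hypothetical.
    """
    node_r = len(set(pred_nodes) & set(gold_nodes)) / max(len(gold_nodes), 1)
    edge_r = len(set(pred_edges) & set(gold_edges)) / max(len(gold_edges), 1)
    fmt_r = w_fmt if well_formed else 0.0   # reward parseable scene graphs
    return w_node * node_r + w_edge * edge_r + fmt_r
```

For example, `graph_reward(["person", "cup"], ["person", "cup", "table"], [("person", "holds", "cup")], [("person", "holds", "cup")], well_formed=True)` credits the prediction for recovering two of three nodes, all edges, and a valid output format.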
arXiv Detail & Related papers (2025-04-18T10:46:22Z)
- Leveraging Joint Predictive Embedding and Bayesian Inference in Graph Self Supervised Learning [0.0]
Graph representation learning has emerged as a cornerstone for tasks like node classification and link prediction.
Current self-supervised learning (SSL) methods face challenges such as computational inefficiency, reliance on contrastive objectives, and representation collapse.
We propose a novel joint embedding predictive framework for graph SSL that eliminates contrastive objectives and negative sampling while preserving semantic and structural information.
arXiv Detail & Related papers (2025-02-02T07:42:45Z)
- SIGMA: Sinkhorn-Guided Masked Video Modeling [69.31715194419091]
Sinkhorn-guided Masked Video Modeling (SIGMA) is a novel video pretraining method.
We distribute features of space-time tubes evenly across a limited number of learnable clusters.
Experimental results on ten datasets validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations.
arXiv Detail & Related papers (2024-07-22T08:04:09Z)
- Enhancing Video-Language Representations with Structural Spatio-Temporal Alignment [130.15775113897553]
Finsta is a fine-grained structural spatio-temporal alignment learning method.
It consistently improves 13 existing strong video-language models.
arXiv Detail & Related papers (2024-06-27T15:23:36Z)
- OST: Refining Text Knowledge with Optimal Spatio-Temporal Descriptor for General Video Recognition [8.18503795495178]
We prioritize the refinement of text knowledge to facilitate generalizable video recognition.
To address the limitations of the less distinct semantic space of category names, we prompt a large language model (LLM).
Our best model achieves a state-of-the-art zero-shot accuracy of 75.1% on Kinetics-600.
arXiv Detail & Related papers (2023-11-30T13:32:43Z)
- DynPoint: Dynamic Neural Point For View Synthesis [45.44096876841621]
We propose DynPoint, an algorithm designed to facilitate the rapid synthesis of novel views for unconstrained monocular videos.
DynPoint concentrates on predicting the explicit 3D correspondence between neighboring frames to realize information aggregation.
Our method exhibits strong robustness in handling long-duration videos without learning a canonical representation of video content.
arXiv Detail & Related papers (2023-10-29T12:55:53Z)
- TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at Scale [59.01246141215051]
We analyze the factor that leads to degradation from the perspective of language supervision.
We propose a tunable-free pre-training strategy to retain the generalization ability of the text encoder.
We produce a series of models, dubbed TVTSv2, with up to one billion parameters.
arXiv Detail & Related papers (2023-05-23T15:44:56Z)
- Jointly Visual- and Semantic-Aware Graph Memory Networks for Temporal Sentence Localization in Videos [67.12603318660689]
We propose a novel Hierarchical Visual- and Semantic-Aware Reasoning Network (HVSARN).
HVSARN enables both visual- and semantic-aware query reasoning from object-level to frame-level.
Experiments on three datasets demonstrate that our HVSARN achieves a new state-of-the-art performance.
arXiv Detail & Related papers (2023-03-02T08:00:22Z)
- Algorithm and System Co-design for Efficient Subgraph-based Graph Representation Learning [16.170895692951]
Subgraph-based graph representation learning (SGRL) has been recently proposed to deal with some fundamental challenges encountered by canonical graph neural networks (GNNs).
We propose a novel framework SUREL for scalable SGRL by co-designing the learning algorithm and its system support.
arXiv Detail & Related papers (2022-02-28T04:29:22Z)
- TCGL: Temporal Contrastive Graph for Self-supervised Video Representation Learning [79.77010271213695]
We propose a novel video self-supervised learning framework named Temporal Contrastive Graph Learning (TCGL).
Our TCGL integrates prior knowledge about frame and snippet orders into graph structures, i.e., the intra-/inter-snippet Temporal Contrastive Graphs (TCG).
To generate supervisory signals for unlabeled videos, we introduce an Adaptive Snippet Order Prediction (ASOP) module.
arXiv Detail & Related papers (2021-12-07T09:27:56Z)
- Spatial-Temporal Correlation and Topology Learning for Person Re-Identification in Videos [78.45050529204701]
We propose a novel framework to pursue discriminative and robust representation by modeling cross-scale spatial-temporal correlation.
CTL utilizes a CNN backbone and a key-points estimator to extract semantic local features from the human body.
It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and physical connections of the human body.
arXiv Detail & Related papers (2021-04-15T14:32:12Z)
- TSPNet: Hierarchical Feature Learning via Temporal Semantic Pyramid for Sign Language Translation [101.6042317204022]
Sign language translation (SLT) aims to interpret sign video sequences into text-based natural language sentences.
Existing SLT models usually represent sign visual features in a frame-wise manner.
We develop a novel hierarchical sign video feature learning method via a temporal semantic pyramid network, called TSPNet.
arXiv Detail & Related papers (2020-10-12T05:58:09Z)
- Spatio-Temporal Graph for Video Captioning with Knowledge Distillation [50.034189314258356]
We propose a graph model for video captioning that exploits object interactions in space and time.
Our model builds interpretable links and is able to provide explicit visual grounding.
To avoid correlations caused by the variable number of objects, we propose an object-aware knowledge distillation mechanism.
arXiv Detail & Related papers (2020-03-31T03:58:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.