Compile Scene Graphs with Reinforcement Learning
- URL: http://arxiv.org/abs/2504.13617v2
- Date: Sun, 27 Apr 2025 11:12:54 GMT
- Title: Compile Scene Graphs with Reinforcement Learning
- Authors: Zuyao Chen, Jinlin Wu, Zhen Lei, Marc Pollefeys, Chang Wen Chen,
- Abstract summary: Next token prediction is the fundamental principle for training large language models (LLMs)<n>We introduce R1-SGG, a multimodal LLM (M-LLM) trained via supervised fine-tuning (SFT) on the scene graph dataset.<n>We design a graph-centric reward function that integrates node-level rewards, edge-level rewards, and a format consistency reward.
- Score: 69.36723767339001
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Next token prediction is the fundamental principle for training large language models (LLMs), and reinforcement learning (RL) further enhances their reasoning performance. As an effective way to model language, image, video, and other modalities, the use of LLMs for end-to-end extraction of structured visual representations, such as scene graphs, remains underexplored. It requires the model to accurately produce a set of objects and relationship triplets, rather than generating text token by token. To achieve this, we introduce R1-SGG, a multimodal LLM (M-LLM) initially trained via supervised fine-tuning (SFT) on the scene graph dataset and subsequently refined using reinforcement learning to enhance its ability to generate scene graphs in an end-to-end manner. The SFT follows a conventional prompt-response paradigm, while RL requires the design of effective reward signals. Given the structured nature of scene graphs, we design a graph-centric reward function that integrates node-level rewards, edge-level rewards, and a format consistency reward. Our experiments demonstrate that rule-based RL substantially enhances model performance in the SGG task, achieving a zero failure rate--unlike supervised fine-tuning (SFT), which struggles to generalize effectively. Our code is available at https://github.com/gpt4vision/R1-SGG.
Related papers
- PRISM-0: A Predicate-Rich Scene Graph Generation Framework for Zero-Shot Open-Vocabulary Tasks [51.31903029903904]
In Scene Graphs Generation (SGG) one extracts structured representation from visual inputs in the form of objects nodes and predicates connecting them.<n> PRISM-0 is a framework for zero-shot open-vocabulary SGG that bootstraps foundation models in a bottom-up approach.<n> PRIMS-0 generates semantically meaningful graphs that improve downstream tasks such as Image Captioning and Sentence-to-Graph Retrieval.
arXiv Detail & Related papers (2025-04-01T14:29:51Z) - LLMs as Zero-shot Graph Learners: Alignment of GNN Representations with LLM Token Embeddings [7.302176015732192]
We introduce a novel framework named Token Embedding-Aligned Graph Language Model (TEA-GLM)<n>We pretrain a GNN, aligning its representations with token embeddings of an LLM.<n>We then train a linear projector that transforms the GNN's representations into a fixed number of graph token embeddings.
arXiv Detail & Related papers (2024-08-25T04:32:45Z) - Parameter-Efficient Tuning Large Language Models for Graph Representation Learning [62.26278815157628]
We introduce Graph-aware.
Efficient Fine-Tuning - GPEFT, a novel approach for efficient graph representation learning.
We use a graph neural network (GNN) to encode structural information from neighboring nodes into a graph prompt.
We validate our approach through comprehensive experiments conducted on 8 different text-rich graphs, observing an average improvement of 2% in hit@1 and Mean Reciprocal Rank (MRR) in link prediction evaluations.
arXiv Detail & Related papers (2024-04-28T18:36:59Z) - GraSAME: Injecting Token-Level Structural Information to Pretrained Language Models via Graph-guided Self-Attention Mechanism [10.573861741540853]
We propose a graph-guided self-attention mechanism, GraSAME, for pretrained language models.
GraSAME seamlessly incorporates token-level structural information into PLMs without necessitating additional alignment or concatenation efforts.
Our experiments on the graph-to-text generation task demonstrate that GraSAME outperforms baseline models and achieves results comparable to state-of-the-art (SOTA) models on WebNLG datasets.
arXiv Detail & Related papers (2024-04-10T11:03:57Z) - Chasing Fairness in Graphs: A GNN Architecture Perspective [73.43111851492593]
We propose textsfFair textsfMessage textsfPassing (FMP) designed within a unified optimization framework for graph neural networks (GNNs)
In FMP, the aggregation is first adopted to utilize neighbors' information and then the bias mitigation step explicitly pushes demographic group node presentation centers together.
Experiments on node classification tasks demonstrate that the proposed FMP outperforms several baselines in terms of fairness and accuracy on three real-world datasets.
arXiv Detail & Related papers (2023-12-19T18:00:15Z) - SimTeG: A Frustratingly Simple Approach Improves Textual Graph Learning [131.04781590452308]
We present SimTeG, a frustratingly Simple approach for Textual Graph learning.
We first perform supervised parameter-efficient fine-tuning (PEFT) on a pre-trained LM on the downstream task.
We then generate node embeddings using the last hidden states of finetuned LM.
arXiv Detail & Related papers (2023-08-03T07:00:04Z) - Graph Ordering: Towards the Optimal by Learning [69.72656588714155]
Graph representation learning has achieved a remarkable success in many graph-based applications, such as node classification, prediction, and community detection.
However, for some kind of graph applications, such as graph compression and edge partition, it is very hard to reduce them to some graph representation learning tasks.
In this paper, we propose to attack the graph ordering problem behind such applications by a novel learning approach.
arXiv Detail & Related papers (2020-01-18T09:14:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.