GRAPHGPT-O: Synergistic Multimodal Comprehension and Generation on Graphs
- URL: http://arxiv.org/abs/2502.11925v2
- Date: Sat, 08 Mar 2025 02:59:52 GMT
- Title: GRAPHGPT-O: Synergistic Multimodal Comprehension and Generation on Graphs
- Authors: Yi Fang, Bowen Jin, Jiacheng Shen, Sirui Ding, Qiaoyu Tan, Jiawei Han,
- Abstract summary: Texts and images are usually interconnected, forming a multimodal attributed graph (MMAG)<n>It is underexplored how MLLMs can incorporate the relational information (textiti.e., graph structure) and semantic information (textiti.e. texts and images) on such graphs for multimodal comprehension and generation.<n>We propose GraphGPT-o, which supports omni-multimodal understanding and creation on MMAGs.
- Score: 34.076036577516895
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The rapid development of Multimodal Large Language Models (MLLMs) has enabled the integration of multiple modalities, including texts and images, within the large language model (LLM) framework. However, texts and images are usually interconnected, forming a multimodal attributed graph (MMAG). It is underexplored how MLLMs can incorporate the relational information (\textit{i.e.}, graph structure) and semantic information (\textit{i.e.,} texts and images) on such graphs for multimodal comprehension and generation. In this paper, we propose GraphGPT-o, which supports omni-multimodal understanding and creation on MMAGs. We first comprehensively study linearization variants to transform semantic and structural information as input for MLLMs. Then, we propose a hierarchical aligner that enables deep graph encoding, bridging the gap between MMAGs and MLLMs. Finally, we explore the inference choices, adapting MLLM to interleaved text and image generation in graph scenarios. Extensive experiments on three datasets from different domains demonstrate the effectiveness of our proposed method. Datasets and codes will be open-sourced upon acceptance.
Related papers
- Towards Visual Text Grounding of Multimodal Large Language Model [88.0588924255417]
We introduce TRIG, a novel task with a newly designed instruction dataset for benchmarking text-rich image grounding.
Specifically, we propose an OCR-LLM-human interaction pipeline to create 800 manually annotated question-answer pairs as a benchmark.
A comprehensive evaluation of various MLLMs on our proposed benchmark exposes substantial limitations in their grounding capability on text-rich images.
arXiv Detail & Related papers (2025-04-07T12:01:59Z) - Aligning Vision to Language: Text-Free Multimodal Knowledge Graph Construction for Enhanced LLMs Reasoning [10.761218096540976]
Multimodal reasoning in Large Language Models (LLMs) struggles with incomplete knowledge and hallucination artifacts.
We propose Vision-align-to-Language integrated Knowledge Graph (VaLiK), a novel approach for constructing Multimodal Knowledge Graphs.
arXiv Detail & Related papers (2025-03-17T09:31:14Z) - Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences.
We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries.
We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
arXiv Detail & Related papers (2025-02-18T12:00:47Z) - Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
multimodal large language models (MLLMs) have made significant strides by training on vast high-quality image-text datasets.
However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs.
This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
arXiv Detail & Related papers (2024-07-05T17:43:30Z) - Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs [77.86214400258473]
We propose a new training-free text-to-image generation/editing framework, namely Recaption, Plan and Generate (RPG)
RPG harnesses the powerful chain-of-thought reasoning ability of multimodal LLMs to enhance the compositionality of text-to-image diffusion models.
Our framework exhibits wide compatibility with various MLLM architectures.
arXiv Detail & Related papers (2024-01-22T06:16:29Z) - Large Language Models on Graphs: A Comprehensive Survey [77.16803297418201]
We provide a systematic review of scenarios and techniques related to large language models on graphs.
We first summarize potential scenarios of adopting LLMs on graphs into three categories, namely pure graphs, text-attributed graphs, and text-paired graphs.
We discuss the real-world applications of such methods and summarize open-source codes and benchmark datasets.
arXiv Detail & Related papers (2023-12-05T14:14:27Z) - Which Modality should I use -- Text, Motif, or Image? : Understanding Graphs with Large Language Models [14.251972223585765]
This paper introduces a new approach to encoding a graph with diverse modalities, such as text, image, and motif, and prompts to approximate a graph's global connectivity.
The study also presents GraphTMI, a novel benchmark for evaluating Large Language Models (LLMs) in graph structure analysis.
arXiv Detail & Related papers (2023-11-16T12:45:41Z) - Multimodal Graph Learning for Generative Tasks [89.44810441463652]
Multimodal learning combines multiple data modalities, broadening the types and complexity of data our models can utilize.
We propose Multimodal Graph Learning (MMGL), a framework for capturing information from multiple multimodal neighbors with relational structures among them.
arXiv Detail & Related papers (2023-10-11T13:25:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.