GRAPHGPT-O: Synergistic Multimodal Comprehension and Generation on Graphs
- URL: http://arxiv.org/abs/2502.11925v1
- Date: Mon, 17 Feb 2025 15:35:36 GMT
- Title: GRAPHGPT-O: Synergistic Multimodal Comprehension and Generation on Graphs
- Authors: Yi Fang, Bowen Jin, Jiacheng Shen, Sirui Ding, Qiaoyu Tan, Jiawei Han
- Abstract summary: Texts and images are usually interconnected, forming a multimodal attributed graph (MMAG).
It is underexplored how MLLMs can incorporate the relational information (i.e., graph structure) and semantic information (i.e., texts and images) on such graphs for multimodal comprehension and generation.
We propose GraphGPT-o, which supports omni-multimodal understanding and creation on MMAGs.
- Score: 34.076036577516895
- License:
- Abstract: The rapid development of Multimodal Large Language Models (MLLMs) has enabled the integration of multiple modalities, including texts and images, within the large language model (LLM) framework. However, texts and images are usually interconnected, forming a multimodal attributed graph (MMAG). It is underexplored how MLLMs can incorporate the relational information (i.e., graph structure) and semantic information (i.e., texts and images) on such graphs for multimodal comprehension and generation. In this paper, we propose GraphGPT-o, which supports omni-multimodal understanding and creation on MMAGs. We first comprehensively study linearization variants to transform semantic and structural information as input for MLLMs. Then, we propose a hierarchical aligner that enables deep graph encoding, bridging the gap between MMAGs and MLLMs. Finally, we explore the inference choices, adapting MLLM to interleaved text and image generation in graph scenarios. Extensive experiments on three datasets from different domains demonstrate the effectiveness of our proposed method. Datasets and codes will be open-sourced upon acceptance.
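The linearization step mentioned in the abstract can be pictured with a toy example. The following is only a minimal sketch, assuming a one-hop neighborhood flattened in sorted-ID order; the `Node` class, the `linearize` function, and the tag strings are hypothetical illustrations and are not taken from the paper or its code.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """A node of a multimodal attributed graph (hypothetical structure)."""
    node_id: str
    text: str
    image_path: str  # stand-in for the image content


def linearize(center, nodes, edges, max_neighbors=2):
    """Flatten the one-hop neighborhood of `center` into a flat sequence.

    Neighbors come first (sorted by ID, truncated to `max_neighbors`),
    each contributing a marker, its text, and an image placeholder.
    The target node's text comes last. This ordering is an assumption
    made for illustration, not the paper's chosen variant.
    """
    neighbor_ids = sorted(edges.get(center, []))[:max_neighbors]
    sequence = []
    for nid in neighbor_ids:
        node = nodes[nid]
        sequence.append(f"<neighbor id={nid}>")
        sequence.append(node.text)
        sequence.append(f"<image:{node.image_path}>")
    sequence.append(f"<target id={center}>")
    sequence.append(nodes[center].text)
    return sequence


# Tiny example graph: node "a" is linked to "b" and "c".
example_nodes = {
    "a": Node("a", "A red ceramic mug", "a.png"),
    "b": Node("b", "A matching red saucer", "b.png"),
    "c": Node("c", "A red teapot", "c.png"),
}
example_edges = {"a": ["c", "b"]}

if __name__ == "__main__":
    for item in linearize("a", example_nodes, example_edges):
        print(item)
```

A real system would replace the image placeholders with visual tokens from an image encoder; the sketch only shows how structure and semantics can share one input sequence.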
Related papers
- Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences.
We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries.
We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
- Rethinking Visual Prompting for Multimodal Large Language Models with External Knowledge [76.45868419402265]
Multimodal large language models (MLLMs) have made significant strides by training on vast high-quality image-text datasets.
However, the inherent difficulty in explicitly conveying fine-grained or spatially dense information in text, such as masks, poses a challenge for MLLMs.
This paper proposes a new visual prompt approach to integrate fine-grained external knowledge, gleaned from specialized vision models, into MLLMs.
arXiv Detail & Related papers (2024-07-05T17:43:30Z)
- Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want [58.091825321168514]
We introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting.
Specifically, we propose a new end-to-end trained Multimodal Large Language Model (MLLM) that connects a vision encoder, a visual prompt encoder and an LLM.
To advance visual prompting research for MLLMs, we introduce MDVP-Data and MDVP-Bench.
arXiv Detail & Related papers (2024-03-29T16:26:20Z)
- Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs [77.86214400258473]
We propose a new training-free text-to-image generation/editing framework, namely Recaption, Plan and Generate (RPG).
RPG harnesses the powerful chain-of-thought reasoning ability of multimodal LLMs to enhance the compositionality of text-to-image diffusion models.
Our framework exhibits wide compatibility with various MLLM architectures.
arXiv Detail & Related papers (2024-01-22T06:16:29Z)
- Large Language Models on Graphs: A Comprehensive Survey [77.16803297418201]
We provide a systematic review of scenarios and techniques related to large language models on graphs.
We first summarize potential scenarios of adopting LLMs on graphs into three categories, namely pure graphs, text-attributed graphs, and text-paired graphs.
We discuss the real-world applications of such methods and summarize open-source codes and benchmark datasets.
arXiv Detail & Related papers (2023-12-05T14:14:27Z)
- Which Modality should I use -- Text, Motif, or Image? : Understanding Graphs with Large Language Models [14.251972223585765]
This paper introduces a new approach to encoding a graph with diverse modalities, such as text, image, and motif, and prompts to approximate a graph's global connectivity.
The study also presents GraphTMI, a novel benchmark for evaluating Large Language Models (LLMs) in graph structure analysis.
arXiv Detail & Related papers (2023-11-16T12:45:41Z)
- Multimodal Graph Learning for Generative Tasks [89.44810441463652]
Multimodal learning combines multiple data modalities, broadening the types and complexity of data our models can utilize.
We propose Multimodal Graph Learning (MMGL), a framework for capturing information from multiple multimodal neighbors with relational structures among them.
arXiv Detail & Related papers (2023-10-11T13:25:03Z)
- MMGA: Multimodal Learning with Graph Alignment [8.349066399479938]
We propose MMGA, a novel multimodal pre-training framework to incorporate information from graph (social network), image and text modalities on social media.
In MMGA, a multi-step graph alignment mechanism is proposed to add the self-supervision from graph modality to optimize the image and text encoders.
We release our dataset, the first social media multimodal dataset with graph, of 60,000 users labeled with specific topics based on 2 million posts to facilitate future research.
arXiv Detail & Related papers (2022-10-18T15:50:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.