Structured and Abstractive Reasoning on Multi-modal Relational Knowledge Images
- URL: http://arxiv.org/abs/2510.21828v1
- Date: Wed, 22 Oct 2025 02:23:40 GMT
- Title: Structured and Abstractive Reasoning on Multi-modal Relational Knowledge Images
- Authors: Yichi Zhang, Zhuo Chen, Lingbing Guo, Lei Liang, Wen Zhang, Huajun Chen
- Abstract summary: This paper bridges the dual gaps in large-scale high-quality data and capability enhancement methodologies. We introduce STAR-64K, a dataset comprising 64K high-quality multi-modal instruction samples, and conduct experiments across 5 open-source MLLMs.
- Score: 58.553448128258566
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding and reasoning with abstractive information from the visual modality presents significant challenges for current multi-modal large language models (MLLMs). Among the various forms of abstractive information, Multi-Modal Relational Knowledge (MMRK), which represents abstract relational structures between multi-modal entities using node-edge formats, remains largely under-explored. In particular, STructured and Abstractive Reasoning (STAR) on such data has received little attention from the research community. To bridge the dual gaps in large-scale high-quality data and capability enhancement methodologies, this paper makes the following key contributions: (i) an automatic STAR data engine capable of synthesizing images with MMRK to build multi-modal instruction data with reliable chain-of-thought reasoning for various STAR tasks, and (ii) a comprehensive two-stage capability enhancement training framework, accompanied by a suite of evaluation protocols tailored to different STAR tasks. Based upon these contributions, we introduce STAR-64K, a dataset comprising 64K high-quality multi-modal instruction samples, and conduct experiments across 5 open-source MLLMs. Experimental results show that our two-stage enhancement framework enables smaller 3B/7B models to significantly outperform GPT-4o in STAR. Additionally, we provide in-depth analysis regarding the effectiveness of various designs, data transferability, and scalability.
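The abstract describes MMRK as node-edge structures over multi-modal entities, which the data engine renders into images and pairs with instruction data carrying chain-of-thought rationales. The following minimal Python sketch illustrates one way such a relational structure could be represented and serialized into a single instruction sample; the class names, fields, and question template are hypothetical illustrations under assumed semantics, not the authors' actual data engine.

```python
# A minimal sketch of a Multi-Modal Relational Knowledge (MMRK) instance and how a
# data engine might turn it into one multi-modal instruction sample.
# All names (Entity, Relation, build_star_sample, ...) are hypothetical illustrations.
from dataclasses import dataclass, field

@dataclass
class Entity:
    name: str
    modality: str               # e.g. "image" or "text"
    source: str | None = None   # e.g. path to an image crop rendered into the figure

@dataclass
class Relation:
    head: str
    relation: str
    tail: str

@dataclass
class MMRKGraph:
    entities: dict[str, Entity] = field(default_factory=dict)
    relations: list[Relation] = field(default_factory=list)

    def neighbors(self, name: str) -> list[Relation]:
        return [r for r in self.relations if r.head == name]

def build_star_sample(graph: MMRKGraph, start: str, hops: int = 2) -> dict:
    """Walk `hops` edges from `start` and emit a question, a chain-of-thought
    rationale that cites each traversed edge, and the final answer."""
    steps, current = [], start
    for _ in range(hops):
        edges = graph.neighbors(current)
        if not edges:
            break
        edge = edges[0]
        steps.append(f"{edge.head} --{edge.relation}--> {edge.tail}")
        current = edge.tail
    return {
        "question": f"Starting from '{start}', which entity is reached after {hops} relational hops?",
        "chain_of_thought": " ; ".join(steps),
        "answer": current,
    }

if __name__ == "__main__":
    g = MMRKGraph(
        entities={
            "Eiffel Tower": Entity("Eiffel Tower", "image", "crops/eiffel.png"),
            "Paris": Entity("Paris", "text"),
            "France": Entity("France", "text"),
        },
        relations=[
            Relation("Eiffel Tower", "located_in", "Paris"),
            Relation("Paris", "capital_of", "France"),
        ],
    )
    print(build_star_sample(g, "Eiffel Tower"))
```

In the paper's pipeline the graph itself is additionally rendered as an image in node-edge form; that rendering step is omitted here, and only the graph representation and instruction/rationale serialization are sketched.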
Related papers
- Graph4MM: Weaving Multimodal Learning with Structural Information [52.16646463590474]
Graphs provide powerful structural information for modeling intra- and inter-modal relationships. Previous works fail to distinguish multi-hop neighbors and treat the graph as a standalone modality. We propose Graph4MM, a graph-based multimodal learning framework.
arXiv Detail & Related papers (2025-10-19T20:13:03Z)
- Complementarity-driven Representation Learning for Multi-modal Knowledge Graph Completion [0.0]
We propose a novel framework named Mixture of Complementary Modality Experts (MoCME). MoCME consists of a Complementarity-guided Modality Knowledge Fusion (CMKF) module and an Entropy-guided Negative Sampling (EGNS) mechanism. Our MoCME achieves state-of-the-art performance, surpassing existing approaches.
arXiv Detail & Related papers (2025-07-28T08:35:11Z)
- Multi-modal Retrieval Augmented Multi-modal Generation: Datasets, Evaluation Metrics and Strong Baselines [63.22096609916707]
Multi-modal Retrieval Augmented Multi-modal Generation (M$^2$RAG) is a novel task that enables foundation models to process multi-modal web content. Despite its potential impact, M$^2$RAG remains understudied, lacking comprehensive analysis and high-quality data resources.
arXiv Detail & Related papers (2024-11-25T13:20:19Z)
- MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding [59.41495657570397]
We present a comprehensive dataset compiled from Nature Communications articles covering 72 scientific fields. We evaluated 19 proprietary and open-source models on two benchmark tasks, figure captioning and multiple-choice question answering, and conducted human expert annotation. Fine-tuning Qwen2-VL-7B with our task-specific data achieved better performance than GPT-4o and even human experts in multiple-choice evaluations.
arXiv Detail & Related papers (2024-07-06T00:40:53Z)
- IWISDM: Assessing instruction following in multimodal models at scale [1.2320972303448239]
We introduce the instructed-Virtual VISual Decision Making (iWISDM) environment, engineered to generate a limitless array of vision-language tasks.
Using iWISDM, we compiled three distinct benchmarks of instruction-following visual tasks across varying complexity levels.
Our findings establish iWISDM as a robust benchmark for assessing the instructional adherence of both existing and emergent multimodal models.
arXiv Detail & Related papers (2024-06-20T14:09:54Z)
- Multiple Heads are Better than One: Mixture of Modality Knowledge Experts for Entity Representation Learning [51.80447197290866]
Learning high-quality multi-modal entity representations is an important goal of multi-modal knowledge graph (MMKG) representation learning. Existing methods focus on crafting elegant entity-wise multi-modal fusion strategies. We introduce a novel framework with Mixture of Modality Knowledge experts (MoMoK) to learn adaptive multi-modal entity representations.
arXiv Detail & Related papers (2024-05-27T06:36:17Z)
- Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts [54.529880848937104]
We develop a unified MLLM with the MoE architecture, named Uni-MoE, that can handle a wide array of modalities.
Specifically, it features modality-specific encoders with connectors for a unified multimodal representation.
We evaluate the instruction-tuned Uni-MoE on a comprehensive set of multimodal datasets.
arXiv Detail & Related papers (2024-05-18T12:16:01Z)
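Uni-MoE's summary mentions modality-specific encoders joined by connectors that map everything into a unified multimodal representation. The sketch below shows the generic encoder-plus-connector pattern this refers to, assuming per-modality feature sequences produced by frozen external encoders; module names and dimensions are illustrative and do not reproduce Uni-MoE's actual architecture or its MoE layers.

```python
# Sketch of the generic "modality-specific encoder + connector" pattern that
# Uni-MoE-style unified MLLMs use to map every modality into one token space.
# Dimensions and module names are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn

class Connector(nn.Module):
    """Small MLP that projects one modality's features into the shared LLM width."""
    def __init__(self, in_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(in_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:  # (B, T, in_dim) -> (B, T, llm_dim)
        return self.proj(feats)

class UnifiedMultimodalFrontend(nn.Module):
    def __init__(self, encoder_dims: dict[str, int], llm_dim: int = 1024):
        super().__init__()
        # One connector per modality; the frozen encoders themselves are assumed external.
        self.connectors = nn.ModuleDict({m: Connector(d, llm_dim) for m, d in encoder_dims.items()})

    def forward(self, features: dict[str, torch.Tensor]) -> torch.Tensor:
        # Project each modality's feature sequence and concatenate along the token axis,
        # yielding one unified sequence the language model can attend over.
        projected = [self.connectors[m](x) for m, x in features.items()]
        return torch.cat(projected, dim=1)

frontend = UnifiedMultimodalFrontend({"image": 768, "audio": 512})
tokens = frontend({"image": torch.randn(2, 16, 768), "audio": torch.randn(2, 8, 512)})
print(tokens.shape)  # torch.Size([2, 24, 1024])
```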
- Multi-modal Semantic Understanding with Contrastive Cross-modal Feature Alignment [11.897888221717245]
This paper proposes a novel CLIP-guided contrastive-learning-based architecture to perform multi-modal feature alignment.
Our model is simple to implement without using task-specific external knowledge, and thus can easily migrate to other multi-modal tasks.
arXiv Detail & Related papers (2024-03-11T01:07:36Z)
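The contrastive cross-modal feature alignment described above is commonly implemented as a symmetric InfoNCE objective in the style of CLIP. The snippet below is a generic sketch of that objective under assumed batch shapes and temperature; it is not the paper's exact architecture or loss.

```python
# A generic sketch of contrastive cross-modal feature alignment: a symmetric
# InfoNCE loss that pulls matched image/text features together, as used in
# CLIP-style training. Hyperparameters and shapes are illustrative assumptions.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(img_feats: torch.Tensor,
                               txt_feats: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    # Normalize so the dot product is a cosine similarity.
    img = F.normalize(img_feats, dim=-1)          # (B, D)
    txt = F.normalize(txt_feats, dim=-1)          # (B, D)
    logits = img @ txt.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    # Matched pairs sit on the diagonal; align in both directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)

loss = contrastive_alignment_loss(torch.randn(8, 512), torch.randn(8, 512))
print(float(loss))
```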
- StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data [129.92449761766025]
We propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning.
This approach harnesses the power of generative models, marrying the abilities of ChatGPT with text-to-image generation.
Our research includes comprehensive experiments conducted on various datasets.
arXiv Detail & Related papers (2023-08-20T12:43:52Z)
- MESED: A Multi-modal Entity Set Expansion Dataset with Fine-grained Semantic Classes and Hard Negative Entities [25.059177235004952]
We propose Multi-modal Entity Set Expansion (MESE), where models integrate information from multiple modalities to represent entities.
We also propose MultiExpan, a powerful multi-modal model pre-trained on four multimodal pre-training tasks.
The MESED dataset is the first multi-modal dataset for ESE, built at large scale with elaborate manual calibration.
arXiv Detail & Related papers (2023-07-27T14:09:59Z)
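As a toy illustration of the entity set expansion task MESED targets, the sketch below fuses per-entity text and image embeddings and ranks candidates by similarity to the seed-set centroid. The embeddings are random placeholders and the fusion/ranking scheme is an assumption for illustration; it does not depict MultiExpan's pre-training or architecture.

```python
# Toy sketch of the general idea behind multi-modal entity set expansion:
# fuse each entity's text and image embeddings, then rank candidates by
# similarity to the centroid of the seed set. Illustrative only; this is not
# MultiExpan's actual method.
import numpy as np

def fuse(text_emb: np.ndarray, image_emb: np.ndarray) -> np.ndarray:
    # Concatenate modality embeddings and L2-normalize the fused vector.
    v = np.concatenate([text_emb, image_emb])
    return v / (np.linalg.norm(v) + 1e-12)

def expand(seeds: list[str], candidates: list[str],
           text_embs: dict[str, np.ndarray], image_embs: dict[str, np.ndarray],
           top_k: int = 3) -> list[str]:
    # Score each non-seed candidate by cosine similarity to the seed centroid.
    centroid = np.mean([fuse(text_embs[e], image_embs[e]) for e in seeds], axis=0)
    scored = sorted(
        (float(fuse(text_embs[c], image_embs[c]) @ centroid), c)
        for c in candidates if c not in seeds
    )
    return [name for _, name in scored[::-1][:top_k]]

rng = np.random.default_rng(0)
entities = ["lion", "tiger", "leopard", "oak", "pine"]
text_embs = {e: rng.normal(size=16) for e in entities}
image_embs = {e: rng.normal(size=16) for e in entities}
print(expand(["lion", "tiger"], entities, text_embs, image_embs))
```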