MMAPG: A Training-Free Framework for Multimodal Multi-hop Question Answering via Adaptive Planning Graphs
- URL: http://arxiv.org/abs/2508.16051v2
- Date: Fri, 19 Sep 2025 06:41:50 GMT
- Title: MMAPG: A Training-Free Framework for Multimodal Multi-hop Question Answering via Adaptive Planning Graphs
- Authors: Yiheng Hu, Xiaoyang Wang, Qing Liu, Xiwei Xu, Qian Fu, Wenjie Zhang, Liming Zhu
- Abstract summary: Multimodal question answering requires integrating information from diverse sources, such as images and texts, to derive answers. Existing methods typically rely on sequential retrieval and reasoning, where each step builds on the previous output. We propose a training-free framework guided by an Adaptive Planning Graph, which consists of planning, retrieval, and reasoning modules. Our approach preserves the characteristics of multimodal information without costly task-specific training, enabling seamless integration with up-to-date models.
- Score: 20.03107299445341
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal multi-hop question answering requires integrating information from diverse sources, such as images and texts, to derive answers. Existing methods typically rely on sequential retrieval and reasoning, where each step builds on the previous output. However, this single-path paradigm makes them vulnerable to errors from misleading intermediate steps. Moreover, developing multimodal models can be computationally expensive, often requiring extensive training. To address these limitations, we propose a training-free framework guided by an Adaptive Planning Graph, which consists of planning, retrieval, and reasoning modules. The planning module analyzes the current state of the Adaptive Planning Graph and determines the next action and where to expand the graph, enabling dynamic and flexible exploration of reasoning paths. To handle retrieval from text queries to unspecified target modalities, we devise modality-specific strategies that dynamically adapt to distinct data types. Our approach preserves the characteristics of multimodal information without costly task-specific training, enabling seamless integration with up-to-date models. Finally, experiments on MultimodalQA and WebQA show that our approach matches or outperforms existing models that rely on training.
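To make the abstract's control flow concrete, here is a minimal Python sketch of the plan-retrieve-reason loop it describes. Everything below (the Node structure, the action names, and the llm/retriever stubs) is hypothetical and inferred from the abstract, not taken from the paper's code.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    question: str                                       # sub-question at this node
    evidence: list = field(default_factory=list)        # retrieved text/image items
    answer: str | None = None                           # filled by the reasoning module

def llm(prompt: str) -> str:
    """Placeholder for a frozen, off-the-shelf LLM call (the framework is training-free)."""
    return "retrieve 0 text"                            # stub response for illustration

def retrieve(query: str, modality: str) -> list:
    """Modality-specific retrieval: dispatch on the target data type."""
    if modality == "image":
        return [f"<image hit for: {query}>"]            # stand-in for an image retriever
    return [f"<text hit for: {query}>"]                 # stand-in for a text retriever

def answer_question(question: str, max_steps: int = 8) -> str:
    graph = [Node(question)]                            # the Adaptive Planning Graph
    for _ in range(max_steps):
        # Planning module: inspect the graph state, choose the next action
        # and the node at which to expand the graph.
        state = "; ".join(f"{i}:{n.question}->{n.answer}" for i, n in enumerate(graph))
        action, idx, modality = llm(f"Graph state: {state}. Next action?").split()
        node = graph[int(idx)]
        if action == "retrieve":                        # grow evidence at this node
            node.evidence += retrieve(node.question, modality)
        elif action == "reason":                        # answer the node, expand the graph
            node.answer = llm(f"Answer '{node.question}' using {node.evidence}")
            graph.append(Node(llm("Propose the next sub-question.")))
        else:                                           # planner judges the graph complete
            break
    return llm(f"Final answer to '{question}' given "
               f"{[(n.question, n.answer) for n in graph]}")
```

Because the planner re-reads the whole graph at every step, a misleading intermediate answer does not lock the system into a single path; it can expand elsewhere, which is the abstract's stated advantage over sequential single-path pipelines.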
Related papers
- From Sparse Decisions to Dense Reasoning: A Multi-attribute Trajectory Paradigm for Multimodal Moderation [59.27094165576015]
We propose a novel learning paradigm (UniMod) that transitions from sparse decision-making to dense reasoning traces. By constructing structured trajectories encompassing evidence grounding, modality assessment, risk mapping, policy decision, and response generation, we reformulate monolithic decision tasks into a multi-dimensional boundary learning process. We introduce specialized optimization strategies to decouple task-specific parameters and rebalance training dynamics, effectively resolving interference between diverse objectives in multi-task learning.
arXiv Detail & Related papers (2026-01-28T09:29:40Z) - MMhops-R1: Multimodal Multi-hop Reasoning [89.68086555694084]
We introduce MMhops, a novel benchmark designed to evaluate and foster multi-modal multi-hop reasoning. The MMhops dataset comprises two challenging task formats, Bridging and Comparison. We propose MMhops-R1, a novel multi-modal Retrieval-Augmented Generation framework for dynamic reasoning.
arXiv Detail & Related papers (2025-12-15T17:29:02Z) - Plan Then Retrieve: Reinforcement Learning-Guided Complex Reasoning over Knowledge Graphs [52.16166558205338]
Graph-RFT is a novel two-stage reinforcement fine-tuning KGQA framework with a 'plan-KGsearch-and-Websearch-during-think' paradigm. It enables LLMs to perform autonomous planning and adaptive retrieval scheduling across KG and web sources under incomplete knowledge conditions.
arXiv Detail & Related papers (2025-10-23T16:04:13Z) - Multimodal RAG Enhanced Visual Description [3.2771631221674333]
Pre-trained large multimodal models (LMMs) encounter a modality gap, characterised by a misalignment between textual and visual representations. We propose a lightweight training-free approach utilising Retrieval-Augmented Generation (RAG) to extend across the modality gap. Experimental results on two benchmark multimodal datasets demonstrate significant improvements.
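As a point of reference for how such training-free RAG pipelines are typically wired, here is a generic retrieval sketch; the encoder, corpus, and prompt below are illustrative stand-ins, not the paper's actual components.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in encoder: a real system would use a pretrained text/image encoder."""
    rng = np.random.default_rng(abs(hash(text)) % 2**32)
    return rng.standard_normal(16)

def retrieve_topk(query: str, corpus: list, k: int = 2) -> list:
    """Rank corpus entries by cosine similarity to the query embedding."""
    q = embed(query)
    docs = [embed(d) for d in corpus]
    sims = [q @ d / (np.linalg.norm(q) * np.linalg.norm(d) + 1e-9) for d in docs]
    return [corpus[i] for i in np.argsort(sims)[::-1][:k]]

corpus = ["a dog running on grass", "a city skyline at night", "a red bicycle"]
context = retrieve_topk("describe the photo of a dog", corpus)
prompt = f"Context: {context}\nDescribe the image."   # passed to a frozen LMM
```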
arXiv Detail & Related papers (2025-08-06T19:04:38Z) - Anomaly Detection in Smart Power Grids with Graph-Regularized MS-SVDD: a Multimodal Subspace Learning Approach [14.794452134569474]
We address an anomaly detection problem in smart power grids using Multimodal Subspace Support Vector Data Description (MS-SVDD). This approach aims to leverage better feature relations by considering the data as coming from different modalities. We introduce novel multimodal graph-embedded regularizers that leverage graph information for every modality to enhance the training process.
arXiv Detail & Related papers (2025-02-18T16:47:54Z) - Multimodal Multihop Source Retrieval for Web Question Answering [0.0]
This work deals with the challenge of learning and reasoning over multi-modal multi-hop question answering (QA). We propose a graph reasoning network based on the semantic structure of the sentences to learn multi-source reasoning paths.
arXiv Detail & Related papers (2025-01-07T22:53:56Z) - AQA: Adaptive Question Answering in a Society of LLMs via Contextual Multi-Armed Bandit [59.10281630985958]
In question answering (QA), different questions can be effectively addressed with different answering strategies.
We develop a dynamic method that adaptively selects the most suitable QA strategy for each question.
Our experiments show that the proposed solution is viable for adaptive orchestration of a QA system with multiple modules.
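A contextual multi-armed bandit of the kind this entry describes can be sketched as follows; the strategy names, context feature, and epsilon-greedy learner are illustrative assumptions, since AQA's own algorithm and module set may differ.

```python
import random
from collections import defaultdict

STRATEGIES = ["direct", "retrieve-then-read", "multi-agent-debate"]  # example arms

class EpsilonGreedy:
    """Toy contextual bandit: per-(context, arm) running-mean reward estimates."""
    def __init__(self, eps: float = 0.1):
        self.eps = eps
        self.value = defaultdict(float)   # (context, arm) -> mean observed reward
        self.count = defaultdict(int)

    def select(self, context: str) -> str:
        if random.random() < self.eps:                                  # explore
            return random.choice(STRATEGIES)
        return max(STRATEGIES, key=lambda a: self.value[(context, a)])  # exploit

    def update(self, context: str, arm: str, reward: float):
        key = (context, arm)
        self.count[key] += 1
        self.value[key] += (reward - self.value[key]) / self.count[key]

bandit = EpsilonGreedy()
ctx = "multi-hop"                       # e.g., a coarse question-type feature
arm = bandit.select(ctx)                # pick a QA strategy for this question
bandit.update(ctx, arm, reward=1.0)     # reward = answer-correctness signal
```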
arXiv Detail & Related papers (2024-09-20T12:28:18Z) - A Practitioner's Guide to Continual Multimodal Pretraining [83.63894495064855]
Multimodal foundation models serve numerous applications at the intersection of vision and language. To keep models updated, research into continual pretraining mainly explores scenarios with either infrequent, indiscriminate updates on large-scale new data, or frequent, sample-level updates. We introduce FoMo-in-Flux, a continual multimodal pretraining benchmark with realistic compute constraints and practical deployment requirements.
arXiv Detail & Related papers (2024-08-26T17:59:01Z) - Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
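For intuition about the early-fusion, single-stream idea mentioned above, here is a generic PyTorch sketch; the dimensions, projections, and encoder are illustrative and not UmURL's actual architecture.

```python
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Project each modality into a shared width, concatenate along the token
    axis, and encode the merged sequence with a single shared encoder."""
    def __init__(self, dims=(64, 48), d_model=128):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(d, d_model) for d in dims)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, inputs):               # inputs: list of (B, T_i, d_i) tensors
        tokens = torch.cat([p(x) for p, x in zip(self.proj, inputs)], dim=1)
        return self.encoder(tokens)          # one stream jointly encodes all modalities

# Two toy modalities (e.g., joint and motion streams of a skeleton sequence)
out = EarlyFusion()([torch.randn(2, 10, 64), torch.randn(2, 10, 48)])
print(out.shape)                             # torch.Size([2, 20, 128])
```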
arXiv Detail & Related papers (2023-11-06T13:56:57Z) - Diffusion Model is an Effective Planner and Data Synthesizer for Multi-Task Reinforcement Learning [101.66860222415512]
Multi-Task Diffusion Model (MTDiff) is a diffusion-based method that incorporates Transformer backbones and prompt learning for generative planning and data synthesis.
For generative planning, we find MTDiff outperforms state-of-the-art algorithms across 50 tasks on Meta-World and 8 maps on Maze2D.
arXiv Detail & Related papers (2023-05-29T05:20:38Z) - A Survey of Graph Prompting Methods: Techniques, Applications, and Challenges [25.32529044997131]
"Pre-train, prompt, predict training" has gained popularity as a way to learn generalizable models with limited labeled data.
The design of prompts could be a challenging and time-consuming process in complex tasks.
This survey will bridge the gap between graphs and prompt design to facilitate future methodology development.
arXiv Detail & Related papers (2023-03-13T16:49:43Z)