Composition-Grounded Instruction Synthesis for Visual Reasoning
- URL: http://arxiv.org/abs/2510.15040v1
- Date: Thu, 16 Oct 2025 18:00:48 GMT
- Title: Composition-Grounded Instruction Synthesis for Visual Reasoning
- Authors: Xinyi Gu, Jiayuan Mao, Zhang-Wei Hong, Zhuoran Yu, Pengyuan Li, Dhiraj Joshi, Rogerio Feris, Zexue He,
- Abstract summary: COGS is a framework for equipping MLLMs with advanced reasoning abilities from a small set of seed questions.<n>We show that COGS substantially improves performance on unseen questions, with the largest gains on reasoning-heavy and compositional questions.
- Score: 29.72969046915743
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pretrained multi-modal large language models (MLLMs) demonstrate strong performance on diverse multimodal tasks, but remain limited in reasoning capabilities for domains where annotations are difficult to collect. In this work, we focus on artificial image domains such as charts, rendered documents, and webpages, which are abundant in practice yet lack large-scale human annotated reasoning datasets. We introduce COGS (COmposition-Grounded instruction Synthesis), a data-efficient framework for equipping MLLMs with advanced reasoning abilities from a small set of seed questions. The key idea is to decompose each seed question into primitive perception and reasoning factors, which can then be systematically recomposed with new images to generate large collections of synthetic question-answer pairs. Each generated question is paired with subquestions and intermediate answers, enabling reinforcement learning with factor-level process rewards. Experiments on chart reasoning show that COGS substantially improves performance on unseen questions, with the largest gains on reasoning-heavy and compositional questions. Moreover, training with a factor-level mixture of different seed data yields better transfer across multiple datasets, suggesting that COGS induces generalizable capabilities rather than dataset-specific overfitting. We further demonstrate that the framework extends beyond charts to other domains such as webpages.
Related papers
- Do Reasoning Models Ask Better Questions? A Formal Information-Theoretic Analysis on Multi-Turn LLM Games [0.0]
Large Language Models (LLMs) excel at many tasks but struggle with a critical ability for resolving ambiguity in user requests.<n>We propose a multi-turn dialogue framework that quantitatively measures how effectively LLMs gather information through yes/no questions.<n>Our experiments demonstrate that, among the evaluated models, the ones with explicit reasoning capabilities achieve higher IG per turn and reach solutions in fewer steps.
arXiv Detail & Related papers (2026-01-25T06:38:15Z) - S-DAG: A Subject-Based Directed Acyclic Graph for Multi-Agent Heterogeneous Reasoning [22.610518752034043]
Large Language Models (LLMs) have achieved impressive performance in complex reasoning problems.<n>This work proposes a novel framework that performs fine-grained analysis at subject level equipped with a designated multi-agent collaboration strategy.
arXiv Detail & Related papers (2025-11-10T05:40:02Z) - MIXRAG : Mixture-of-Experts Retrieval-Augmented Generation for Textual Graph Understanding and Question Answering [6.596018318578605]
Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by incorporating external knowledge sources during inference.<n>Most existing approaches rely on a single retriever to identify relevant subgraphs, which limits their ability to capture diverse aspects of complex queries.<n>We propose MIXRAG, a Mixture-of-Experts Graph-RAG framework that introduces multiple specialized graph retrievers and a dynamic routing controller.
arXiv Detail & Related papers (2025-09-24T02:44:57Z) - GRIL: Knowledge Graph Retrieval-Integrated Learning with Large Language Models [59.72897499248909]
We propose a novel graph retriever trained end-to-end with Large Language Models (LLMs)<n>Within the extracted subgraph, structural knowledge and semantic features are encoded via soft tokens and the verbalized graph, respectively, which are infused into the LLM together.<n>Our approach consistently achieves state-of-the-art performance, validating the strength of joint graph-LLM optimization for complex reasoning tasks.
arXiv Detail & Related papers (2025-09-20T02:38:00Z) - STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training [87.58996020705258]
Video Large Language Models (Video-LLMs) have recently shown strong derivation in basic video understanding tasks.<n>Video-LLMs struggle with compositional reasoning that requires multi-step explicit-temporal inference across object relations, interactions and events.<n>We propose STEP, a novel graph-guided self-training method that enables VideoLLMs to generate reasoning-rich finetuning data from any raw videos to improve itself.
arXiv Detail & Related papers (2024-11-29T11:54:55Z) - What are the Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets? Insights and Best Practices [91.71951459594074]
Long language models (LLMs) with extended context windows have significantly improved tasks such as information extraction, question answering, and complex planning scenarios.<n>Existing methods typically utilize the Self-Instruct framework to generate instruction tuning data for better long context capability improvement.<n>We propose the Multi-agent Interactive Multi-hop Generation framework, incorporating a Quality Verification Agent, a Single-hop Question Generation Agent, a Multiple Question Sampling Strategy, and a Multi-hop Question Merger Agent.<n>Our findings show that our synthetic high-quality long-context instruction data significantly enhances model performance, even surpassing models trained on larger amounts of human
arXiv Detail & Related papers (2024-09-03T13:30:00Z) - Optimizing Language Model's Reasoning Abilities with Weak Supervision [48.60598455782159]
We present textscPuzzleBen, a weakly supervised benchmark that comprises 25,147 complex questions, answers, and human-generated rationales.
A unique aspect of our dataset is the inclusion of 10,000 unannotated questions, enabling us to explore utilizing fewer supersized data to boost LLMs' inference capabilities.
arXiv Detail & Related papers (2024-05-07T07:39:15Z) - Enhancing Human-like Multi-Modal Reasoning: A New Challenging Dataset
and Comprehensive Framework [51.44863255495668]
Multimodal reasoning is a critical component in the pursuit of artificial intelligence systems that exhibit human-like intelligence.
We present Multi-Modal Reasoning(COCO-MMR) dataset, a novel dataset that encompasses an extensive collection of open-ended questions.
We propose innovative techniques, including multi-hop cross-modal attention and sentence-level contrastive learning, to enhance the image and text encoders.
arXiv Detail & Related papers (2023-07-24T08:58:25Z) - PGX: A Multi-level GNN Explanation Framework Based on Separate Knowledge
Distillation Processes [0.2005299372367689]
We propose a multi-level GNN explanation framework based on an observation that GNN is a multimodal learning process of multiple components in graph data.
The complexity of the original problem is relaxed by breaking into multiple sub-parts represented as a hierarchical structure.
We also aim for personalized explanations as the framework can generate different results based on user preferences.
arXiv Detail & Related papers (2022-08-05T10:14:48Z) - MGA-VQA: Multi-Granularity Alignment for Visual Question Answering [75.55108621064726]
Learning to answer visual questions is a challenging task since the multi-modal inputs are within two feature spaces.
We propose Multi-Granularity Alignment architecture for Visual Question Answering task (MGA-VQA)
Our model splits alignment into different levels to achieve learning better correlations without needing additional data and annotations.
arXiv Detail & Related papers (2022-01-25T22:30:54Z) - Feature Ranking for Semi-supervised Learning [3.1380888953704984]
We propose semi-supervised learning of feature ranking.
To the best of our knowledge, this is the first work that treats the task of feature ranking within the semi-supervised structured output prediction context.
The evaluation across 38 benchmark datasets reveals the following: Random Forests perform the best for the classification-like tasks, while for the regression-like tasks Extra-PCTs perform the best.
arXiv Detail & Related papers (2020-08-10T07:50:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.