COCO-Tree: Compositional Hierarchical Concept Trees for Enhanced Reasoning in Vision Language Models
- URL: http://arxiv.org/abs/2510.11012v1
- Date: Mon, 13 Oct 2025 05:07:13 GMT
- Title: COCO-Tree: Compositional Hierarchical Concept Trees for Enhanced Reasoning in Vision Language Models
- Authors: Sanchit Sinha, Guangzhi Xiong, Aidong Zhang
- Abstract summary: 'COCO-Tree' is a novel approach that augments VLM outputs with carefully designed neurosymbolic concept trees to improve VLMs' linguistic reasoning. COCO-Tree's beam-search-inspired reasoning process boosts compositional performance and provides a rationale behind VLM predictions.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Compositional reasoning remains a persistent weakness of modern vision language models (VLMs): they often falter when a task hinges on understanding how multiple objects, attributes, and relations interact within an image. Several works have attempted to improve compositional performance through techniques such as improved prompt structure and chain-of-thought reasoning. A more recent line of work imparts additional reasoning to VLMs using well-trained Large Language Models (LLMs), whose linguistic understanding far exceeds that of VLMs, to compensate for VLMs' limited linguistic prowess. However, these approaches are either resource-intensive or do not provide an interpretable reasoning process. In this paper, we present 'COCO-Tree', a novel approach that augments VLM outputs with carefully designed neurosymbolic concept trees learned from LLMs to improve VLMs' linguistic reasoning. COCO-Tree's beam-search-inspired reasoning process boosts compositional performance and provides a rationale behind VLM predictions. Empirical results on four compositionality benchmarks (Winoground, EqBench, ColorSwap, and SugarCrepe) across seven open-source VLMs of varying sizes demonstrate that COCO-Tree significantly improves compositional generalization by 5-10% over baselines.
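The abstract outlines a beam-search-inspired traversal of LLM-derived concept trees but gives no implementation details. The sketch below is a minimal, hypothetical illustration of what such a traversal could look like: the `ConceptNode` structure, the additive path score, and the `score_fn` image-text scorer (e.g., a CLIP-style similarity) are all assumptions of this sketch, not details from the paper.

```python
# Hypothetical sketch only: beam search over a concept tree, where each
# root-to-leaf path is scored by accumulated image-text alignment. The
# surviving path doubles as a human-readable rationale.
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class ConceptNode:
    concept: str                                   # e.g., "a dog chasing a ball"
    children: List["ConceptNode"] = field(default_factory=list)

def beam_search_reasoning(
    image,
    root: ConceptNode,
    score_fn: Callable[[object, str], float],      # image-text alignment score
    beam_width: int = 3,
) -> Tuple[float, List[str]]:
    """Expand the tree level by level, keeping the best `beam_width`
    partial paths; return the top path's score and concept sequence."""
    beam = [(score_fn(image, root.concept), [root])]
    while any(path[-1].children for _, path in beam):
        candidates = []
        for score, path in beam:
            node = path[-1]
            if not node.children:                  # leaf: carry forward as-is
                candidates.append((score, path))
                continue
            for child in node.children:
                candidates.append(
                    (score + score_fn(image, child.concept), path + [child])
                )
        beam = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    best_score, best_path = max(beam, key=lambda c: c[0])
    return best_score, [node.concept for node in best_path]
```

Under this reading, interpretability comes for free: the returned concept sequence is the rationale the abstract refers to.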
Related papers
- Decomposing Visual Classification: Assessing Tree-Based Reasoning in VLMs [1.4231678631753704]
Vision language models (VLMs) excel at zero-shot visual classification, but their performance on fine-grained tasks and large hierarchical label spaces is understudied. This paper investigates whether structured, tree-based reasoning can enhance VLM performance (a sketch of such a traversal follows below).
arXiv Detail & Related papers (2025-09-10T13:08:03Z)
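The abstract does not describe the tree-reasoning procedure itself; below is a minimal sketch of one common instantiation, a greedy root-to-leaf descent over a label hierarchy. The toy hierarchy and the `score_fn` scorer are invented for this example, not taken from the paper.

```python
# Illustrative sketch: classify by descending a label hierarchy, letting
# the VLM choose among a node's children instead of scoring all leaf
# labels at once. Hierarchy and scorer are invented for this example.
from typing import Callable, Dict, List

HIERARCHY: Dict[str, List[str]] = {        # toy hierarchy; leaves map to []
    "animal": ["bird", "dog"],
    "bird": ["sparrow", "eagle"],
    "dog": [],
    "sparrow": [],
    "eagle": [],
}

def classify_by_tree(
    image,
    root: str,
    score_fn: Callable[[object, str], float],  # image-label alignment score
) -> List[str]:
    """Greedy descent; the returned decision path is also an explanation."""
    path = [root]
    while HIERARCHY[path[-1]]:
        children = HIERARCHY[path[-1]]
        path.append(max(children, key=lambda label: score_fn(image, label)))
    return path                                # e.g., ["animal", "bird", "sparrow"]
```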
Re-ranking Reasoning Context with Tree Search Makes Large Vision-Language Models Stronger [51.01841635655944]
Recent advancements in Large Vision Language Models (LVLMs) have significantly improved performance in Visual Question Answering (VQA) tasks. Existing methods still face challenges, such as the scarcity of knowledge with reasoning examples and erratic responses from retrieved knowledge. We propose a multimodal RAG framework, termed RCTS, which enhances LVLMs by constructing a Reasoning Context-enriched knowledge base and a Tree Search re-ranking method.
arXiv Detail & Related papers (2025-06-09T14:00:57Z)
Integrating Visual Interpretation and Linguistic Reasoning for Math Problem Solving [61.992824291296444]
Current large vision-language models (LVLMs) typically employ a connector module to link visual features with the text embeddings of large language models (LLMs). This paper proposes a paradigm shift: instead of training end-to-end vision-language reasoning models, we advocate developing a decoupled reasoning framework (see the sketch below).
arXiv Detail & Related papers (2025-05-23T08:18:00Z)
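As a rough illustration of the decoupled idea (perception first, language-only reasoning second), the sketch below wires a vision model's description into a text-only LLM. The `describe` and `reason` callables and the prompt format are assumptions of this sketch; the paper's actual interfaces are not given in the abstract.

```python
# Hypothetical decoupled pipeline: a vision model verbalizes the image,
# then a text-only LLM performs the mathematical reasoning on that text.
from typing import Callable

def decoupled_math_solver(
    image,
    question: str,
    describe: Callable[[object], str],  # vision model -> textual description
    reason: Callable[[str], str],       # text-only LLM -> answer
) -> str:
    description = describe(image)       # visual interpretation step
    prompt = (
        f"Image description:\n{description}\n\n"
        f"Question: {question}\n"
        "Solve step by step, then state the final answer."
    )
    return reason(prompt)               # linguistic reasoning step
```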
CausalVLBench: Benchmarking Visual Causal Reasoning in Large Vision-Language Models [14.852783307793525]
Large vision-language models (LVLMs) have shown impressive performance in tasks such as recognition and visual question answering. We introduce a comprehensive causal reasoning benchmark for multi-modal in-context learning from LVLMs. We evaluate the ability of state-of-the-art open-source LVLMs on our causal reasoning tasks across three causal representation learning datasets.
arXiv Detail & Related papers (2025-05-21T00:45:15Z)
XCOMPS: A Multilingual Benchmark of Conceptual Minimal Pairs [43.45666129711046]
XCOMPS is a multilingual conceptual minimal pair dataset covering 17 languages. We evaluate LLMs' multilingual conceptual understanding through metalinguistic prompting, direct probability measurement, and neurolinguistic probing (a sketch of the probability-measurement idea follows below).
arXiv Detail & Related papers (2025-02-27T04:02:13Z)
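"Direct probability measurement" on a minimal pair is typically done by comparing sequence log-probabilities under the model. The sketch below shows that standard recipe with the Hugging Face transformers API and an invented example pair; XCOMPS's exact prompts, models, and scoring may differ.

```python
# Minimal-pair scoring: the acceptable sentence should receive a higher
# total log-probability than its minimally different counterpart.
# The example pair is illustrative, not from the XCOMPS dataset.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sentence_logprob(sentence: str) -> float:
    """Sum of token log-probabilities under the causal LM."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # log P(token_t | tokens_<t): shift logits left, targets right
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    target = ids[:, 1:]
    return log_probs.gather(2, target.unsqueeze(-1)).sum().item()

acceptable = "A violin is something a person can play."
unacceptable = "A violin is something a person can drink."
print(sentence_logprob(acceptable) > sentence_logprob(unacceptable))
```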
MMCOMPOSITION: Revisiting the Compositionality of Pre-trained Vision-Language Models [85.10375181040436]
We propose MMCOMPOSITION, a novel human-annotated benchmark for comprehensively and accurately evaluating Vision-Language Models.
We find GPT-4o's compositionality inferior to the best open-source model, and we analyze the underlying reasons.
arXiv Detail & Related papers (2024-10-13T05:35:09Z)
Enhancing Advanced Visual Reasoning Ability of Large Language Models [20.32900494896848]
Recent advancements in Vision-Language (VL) research have sparked new benchmarks for complex visual reasoning.
We propose Complex Visual Reasoning Large Language Models (CVR-LLM). Our approach transforms images into detailed, context-aware descriptions using an iterative self-refinement loop. We also introduce a novel multi-modal in-context learning (ICL) methodology to enhance LLMs' contextual understanding and reasoning (a sketch of the self-refinement loop follows below).
arXiv Detail & Related papers (2024-09-21T02:10:19Z)
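The abstract sketches an iterative self-refinement loop for turning images into context-aware descriptions. Below is a minimal, hypothetical rendering of such a loop; the `describe`, `critique`, and `revise` callables are illustrative stand-ins rather than CVR-LLM's actual components.

```python
# Hypothetical self-refinement loop: draft a description, critique it
# against the task context, and revise until the critic is satisfied
# (or a round limit is hit).
from typing import Callable, Optional

def refine_description(
    image,
    task_context: str,
    describe: Callable[[object], str],
    critique: Callable[[str, str], Optional[str]],  # returns None if acceptable
    revise: Callable[[object, str, str], str],
    max_rounds: int = 3,
) -> str:
    description = describe(image)
    for _ in range(max_rounds):
        feedback = critique(description, task_context)
        if feedback is None:          # critic found nothing to fix
            break
        description = revise(image, description, feedback)
    return description
```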
Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models [73.40350756742231]
Visually-conditioned language models (VLMs) have seen growing adoption in applications such as visual dialogue, scene understanding, and robotic task planning.
Despite the volume of new releases, key design decisions around image preprocessing, architecture, and optimization are under-explored.
arXiv Detail & Related papers (2024-02-12T18:21:14Z)
3VL: Using Trees to Improve Vision-Language Models' Interpretability [40.678288227161936]
Vision-Language models (VLMs) have proven to be effective at aligning image and text representations, producing superior zero-shot results when transferred to many downstream tasks. These representations suffer from some key shortcomings in understanding Compositional Language Concepts (CLC), such as recognizing objects' attributes, states, and relations between different objects. In this work, we introduce the architecture and training technique of the Tree-augmented Vision-Language (3VL) model, accompanied by our proposed Anchor inference method and Differential Relevance (DiRe) interpretability tool.
arXiv Detail & Related papers (2023-12-28T20:26:03Z)
CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding [66.52659447360104]
We propose CoVLM, which can guide the LLM to explicitly compose visual entities and relationships among the text.
arXiv Detail & Related papers (2023-11-06T18:59:44Z)
Large Language Models are Visual Reasoning Coordinators [144.67558375045755]
We propose a novel paradigm that coordinates multiple vision-language models for visual reasoning.
We show that our instruction tuning variant, Cola-FT, achieves state-of-the-art performance on visual question answering.
We also show that our in-context learning variant, Cola-Zero, exhibits competitive performance in zero-shot and few-shot settings (a sketch of the coordination paradigm follows below).
arXiv Detail & Related papers (2023-10-23T17:59:31Z)
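The coordination paradigm, as described, has a natural minimal form: query several VLMs independently and let an LLM adjudicate among their answers. The sketch below shows that form with invented call signatures; the paper's actual prompting schemes (Cola-FT, Cola-Zero) are not detailed in this abstract.

```python
# Hypothetical LLM-as-coordinator sketch: each VLM answers the question,
# and an LLM aggregates the (possibly conflicting) candidate answers.
from typing import Callable, List

def coordinate(
    image,
    question: str,
    vlms: List[Callable[[object, str], str]],  # each: (image, question) -> answer
    llm: Callable[[str], str],                 # coordinator LLM
) -> str:
    candidate_answers = [vlm(image, question) for vlm in vlms]
    prompt = f"Question: {question}\n" + "".join(
        f"Model {i + 1} says: {ans}\n" for i, ans in enumerate(candidate_answers)
    ) + "Considering all models' answers, give the single best answer."
    return llm(prompt)
```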