Decomposing Visual Classification: Assessing Tree-Based Reasoning in VLMs
- URL: http://arxiv.org/abs/2509.09732v1
- Date: Wed, 10 Sep 2025 13:08:03 GMT
- Title: Decomposing Visual Classification: Assessing Tree-Based Reasoning in VLMs
- Authors: Sary Elmansoury, Islam Mesabah, Gerrit Großmann, Peter Neigel, Raj Bhalwankar, Daniel Kondermann, Sebastian J. Vollmer,
- Abstract summary: Vision language models (VLMs) excel at zero-shot visual classification, but their performance on fine-grained tasks and large hierarchical label spaces is understudied.<n>This paper investigates whether structured, tree-based reasoning can enhance VLM performance.
- Score: 1.4231678631753704
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision language models (VLMs) excel at zero-shot visual classification, but their performance on fine-grained tasks and large hierarchical label spaces is understudied. This paper investigates whether structured, tree-based reasoning can enhance VLM performance. We introduce a framework that decomposes classification into interpretable decisions using decision trees and evaluates it on fine-grained (GTSRB) and coarse-grained (CIFAR-10) datasets. Although the model achieves 98.2% accuracy in understanding the tree knowledge, tree-based reasoning consistently underperforms standard zero-shot prompting. We also explore enhancing the tree prompts with LLM-generated classes and image descriptions to improve alignment. The added description enhances the performance of the tree-based and zero-shot methods. Our findings highlight limitations of structured reasoning in visual classification and offer insights for designing more interpretable VLM systems.
Related papers
- Learning Order Forest for Qualitative-Attribute Data Clustering [52.612779710298526]
This paper discovers a tree-like distance structure to flexibly represent the local order relationship among intra-attribute qualitative values.<n>A joint learning mechanism is proposed to iteratively obtain more appropriate tree structures and clusters.<n>Experiments demonstrate that the joint learning adapts the forest to the clustering task to yield accurate results.
arXiv Detail & Related papers (2026-03-03T07:49:50Z) - VCU-Bridge: Hierarchical Visual Connotation Understanding via Semantic Bridging [49.55286536996476]
We present VCU-Bridge, a framework that operationalizes a human-like hierarchy of visual connotation understanding.<n>Building on this framework, we construct HVCU-Bench, a benchmark for hierarchical visual connotation understanding with explicit, level-wise diagnostics.
arXiv Detail & Related papers (2025-11-22T17:01:03Z) - COCO-Tree: Compositional Hierarchical Concept Trees for Enhanced Reasoning in Vision Language Models [45.48194499967696]
'COCO-Tree' is a novel approach that augments VLM outputs with carefully designed neurosymbolic concept trees to improve VLM's linguistic reasoning.<n>COCO-Tree's beam search-inspired reasoning process boosts compositionality performance and provides a rationale behind VLM predictions.
arXiv Detail & Related papers (2025-10-13T05:07:13Z) - ReTreever: Tree-based Coarse-to-Fine Representations for Retrieval [64.44265315244579]
We propose a tree-based method for organizing and representing reference documents at various granular levels.<n>Our method, called ReTreever, jointly learns a routing function per internal node of a binary tree such that query and reference documents are assigned to similar tree branches.<n>Our evaluations show that ReTreever generally preserves full representation accuracy.
arXiv Detail & Related papers (2025-02-11T21:35:13Z) - Zero-Shot Decision Tree Construction via Large Language Models [2.005837558796176]
We introduce an algorithm for constructing decision trees using large language models (LLMs) in a zero-shot manner based on Classification and Regression Trees (CART) principles.<n>Our approach leverages LLMs to perform operations essential for decision tree construction, including attribute discretization, probability calculation, and Gini index computation.
arXiv Detail & Related papers (2025-01-27T17:48:48Z) - GPT-HTree: A Decision Tree Framework Integrating Hierarchical Clustering and Large Language Models for Explainable Classification [0.0]
GPT-HTree is a framework combining hierarchical clustering, decision trees, and large language models (LLMs)<n>LLMs enhance the framework by generating human-readable cluster descriptions, bridging quantitative analysis with actionable insights.
arXiv Detail & Related papers (2025-01-23T15:18:22Z) - Language Model as Visual Explainer [72.88137795439407]
We present a systematic approach for interpreting vision models using a tree-structured linguistic explanation.<n>Our method provides human-understandable explanations in the form of attribute-laden trees.<n>To access the effectiveness of our approach, we introduce new benchmarks and conduct rigorous evaluations.
arXiv Detail & Related papers (2024-12-08T20:46:23Z) - Optimized Feature Generation for Tabular Data via LLMs with Decision Tree Reasoning [53.241569810013836]
We propose a novel framework that utilizes large language models (LLMs) to identify effective feature generation rules.
We use decision trees to convey this reasoning information, as they can be easily represented in natural language.
OCTree consistently enhances the performance of various prediction models across diverse benchmarks.
arXiv Detail & Related papers (2024-06-12T08:31:34Z) - 3VL: Using Trees to Improve Vision-Language Models' Interpretability [40.678288227161936]
Vision-Language models (VLMs) have proven to be effective at aligning image and text representations, producing superior zero-shot results when transferred to many downstream tasks.<n>These representations suffer from some key shortcomings in understanding Compositional Language Concepts (CLC), such as recognizing objects' attributes, states, and relations between different objects.<n>In this work, we introduce the architecture and training technique of Tree-augmented Vision-Language (3VL) model accompanied by our proposed Anchor inference method and Differential Relevance (DiRe) interpretability tool.
arXiv Detail & Related papers (2023-12-28T20:26:03Z) - Waffling around for Performance: Visual Classification with Random Words
and Broad Concepts [121.60918966567657]
WaffleCLIP is a framework for zero-shot visual classification which simply replaces LLM-generated descriptors with random character and word descriptors.
We conduct an extensive experimental study on the impact and shortcomings of additional semantics introduced with LLM-generated descriptors.
arXiv Detail & Related papers (2023-06-12T17:59:48Z) - Linguistically Driven Graph Capsule Network for Visual Question
Reasoning [153.76012414126643]
We propose a hierarchical compositional reasoning model called the "Linguistically driven Graph Capsule Network"
The compositional process is guided by the linguistic parse tree. Specifically, we bind each capsule in the lowest layer to bridge the linguistic embedding of a single word in the original question with visual evidence.
Experiments on the CLEVR dataset, CLEVR compositional generation test, and FigureQA dataset demonstrate the effectiveness and composition generalization ability of our end-to-end model.
arXiv Detail & Related papers (2020-03-23T03:34:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.