Taxonomy-Aware Representation Alignment for Hierarchical Visual Recognition with Large Multimodal Models
- URL: http://arxiv.org/abs/2603.00431v1
- Date: Sat, 28 Feb 2026 03:13:15 GMT
- Title: Taxonomy-Aware Representation Alignment for Hierarchical Visual Recognition with Large Multimodal Models
- Authors: Hulingxiao He, Zhi Tan, Yuxin Peng
- Abstract summary: A high-performing, general-purpose visual understanding model should map visual inputs to a taxonomic tree of labels. We propose Taxonomy-Aware Representation Alignment (TARA), a simple yet effective strategy to inject taxonomic knowledge into LMMs. TARA consistently enhances LMMs' hierarchical consistency and leaf node accuracy, enabling reliable recognition of both known and novel categories.
- Score: 47.868429337792314
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A high-performing, general-purpose visual understanding model should map visual inputs to a taxonomic tree of labels and identify novel categories beyond the training set, for which few or no publicly available images exist. Large Multimodal Models (LMMs) have achieved remarkable progress in fine-grained visual recognition (FGVR) for known categories. However, they remain limited in hierarchical visual recognition (HVR), which aims to predict consistent label paths from coarse to fine categories, especially for novel categories. To tackle these challenges, we propose Taxonomy-Aware Representation Alignment (TARA), a simple yet effective strategy to inject taxonomic knowledge into LMMs. TARA leverages representations from biology foundation models (BFMs) that encode rich biological relationships through hierarchical contrastive learning. By aligning the intermediate representations of visual features with those of BFMs, LMMs are encouraged to extract discriminative visual cues that are well structured within the taxonomy tree. Additionally, we align the representations of the first answer token with the ground-truth label, flexibly bridging the gap between contextualized visual features and categories of varying granularity according to user intent. Experiments demonstrate that TARA consistently enhances LMMs' hierarchical consistency and leaf node accuracy, enabling reliable recognition of both known and novel categories within complex biological taxonomies. Code is available at https://github.com/PKU-ICST-MIPL/TARA_CVPR2026.
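To make the two alignment objectives described in the abstract concrete, below is a minimal PyTorch sketch of how such auxiliary losses could look: one term pulls pooled intermediate LMM visual features toward frozen BFM representations, and another pulls the first answer token's hidden state toward an embedding of the ground-truth label. All module names, dimensions, and the cosine-distance form of the loss are illustrative assumptions, not the paper's actual formulation; the authors' implementation is in the repository linked above.

```python
import torch
import torch.nn.functional as F

def alignment_loss(student_feats: torch.Tensor, teacher_feats: torch.Tensor,
                   proj: torch.nn.Linear) -> torch.Tensor:
    """Cosine-distance alignment between two sets of pooled representations.

    student_feats: intermediate features from the LMM's visual pathway, (B, D_s)
    teacher_feats: frozen target representations (e.g., from a BFM, or a
                   ground-truth label embedding), (B, D_t)
    proj:          a learned projection mapping D_s -> D_t
    """
    s = F.normalize(proj(student_feats), dim=-1)
    t = F.normalize(teacher_feats.detach(), dim=-1)  # teacher side is kept frozen
    return (1.0 - (s * t).sum(dim=-1)).mean()

# Hypothetical shapes and tensors, for illustration only.
B, D_lmm, D_bfm, D_txt = 8, 4096, 768, 4096
proj_bfm = torch.nn.Linear(D_lmm, D_bfm)
proj_lbl = torch.nn.Linear(D_lmm, D_txt)

visual_feats = torch.randn(B, D_lmm)        # pooled intermediate LMM visual features
bfm_feats = torch.randn(B, D_bfm)           # representations from a biology foundation model
first_answer_token = torch.randn(B, D_lmm)  # hidden state of the first answer token
label_embedding = torch.randn(B, D_txt)     # embedding of the ground-truth label text

aux_loss = (alignment_loss(visual_feats, bfm_feats, proj_bfm)
            + alignment_loss(first_answer_token, label_embedding, proj_lbl))
# In training, a term like this would be added to the LMM's standard
# next-token prediction loss, possibly with weighting coefficients.
```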
Related papers
- Generalized Fine-Grained Category Discovery with Multi-Granularity Conceptual Experts [81.68203255687051]
Generalized Category Discovery is an open-world problem that clusters unlabeled data by leveraging knowledge from partially labeled categories. Existing approaches fail to exploit multi-granularity conceptual information in visual data. We propose a Multi-Granularity Experts framework that integrates multi-granularity knowledge for accurate category discovery.
arXiv Detail & Related papers (2025-09-30T13:25:11Z) - Vocabulary-free Fine-grained Visual Recognition via Enriched Contextually Grounded Vision-Language Model [52.01031460230826]
Traditional approaches rely heavily on fixed vocabularies and closed-set classification paradigms. Recent research has demonstrated that combining large language models with vision-language models (VLMs) makes open-set recognition possible. We propose our training-free method, Enriched-FineR, which demonstrates state-of-the-art results in fine-grained visual recognition.
arXiv Detail & Related papers (2025-07-30T20:06:01Z) - Cross-Hierarchical Bidirectional Consistency Learning for Fine-Grained Visual Classification [1.757077789361314]
We propose a novel Cross-Hierarchical Bidirectional Consistency Learning (CHBC) framework to improve classification accuracy and consistency. The CHBC framework extracts discriminative features across various hierarchies using a specially designed module to decompose and enhance attention masks. Experiments on three widely used FGVC datasets validate the effectiveness of the CHBC framework.
arXiv Detail & Related papers (2025-04-18T10:30:17Z) - Learning Semantic-Aware Representation in Visual-Language Models for Multi-Label Recognition with Partial Labels [19.740929527669483]
Multi-label recognition with partial labels (MLR-PL) is a practical task in computer vision. We introduce a semantic decoupling module and a category-specific prompt optimization method in a CLIP-based framework. Our method effectively separates information from different categories and achieves better performance compared to the CLIP-based baseline method.
arXiv Detail & Related papers (2024-12-14T14:31:36Z) - Category-Adaptive Cross-Modal Semantic Refinement and Transfer for Open-Vocabulary Multi-Label Recognition [59.203152078315235]
We propose a novel category-adaptive cross-modal semantic refinement and transfer (C$^2$SRT) framework to explore the semantic correlation. The proposed framework consists of two complementary modules, i.e., the intra-category semantic refinement (ISR) module and the inter-category semantic transfer (IST) module. Experiments on OV-MLR benchmarks clearly demonstrate that the proposed C$^2$SRT framework outperforms current state-of-the-art algorithms.
arXiv Detail & Related papers (2024-12-09T04:00:18Z) - Data-free Multi-label Image Recognition via LLM-powered Prompt Tuning [23.671999163027284]
This paper proposes a novel framework for multi-label image recognition without any training data.
It uses knowledge from a pre-trained Large Language Model to learn prompts that adapt a pretrained Vision-Language Model such as CLIP to multi-label classification.
Our framework presents a new way to explore the synergies between multiple pre-trained models for novel category recognition.
arXiv Detail & Related papers (2024-03-02T13:43:32Z) - Semantic Representation and Dependency Learning for Multi-Label Image Recognition [76.52120002993728]
We propose a novel and effective semantic representation and dependency learning (SRDL) framework to learn category-specific semantic representation for each category.
Specifically, we design a category-specific attentional regions (CAR) module to generate channel/spatial-wise attention matrices to guide the model.
We also design an object erasing (OE) module to implicitly learn semantic dependency among categories by erasing semantic-aware regions.
arXiv Detail & Related papers (2022-04-08T00:55:15Z) - Knowledge-Guided Multi-Label Few-Shot Learning for General Image
Recognition [75.44233392355711]
The KGGR framework exploits prior knowledge of statistical label correlations with deep neural networks.
It first builds a structured knowledge graph to correlate different labels based on statistical label co-occurrence (a minimal sketch of this idea follows this entry).
Then, it introduces label semantics to guide the learning of semantic-specific features.
It exploits a graph propagation network to explore graph node interactions.
arXiv Detail & Related papers (2020-09-20T15:05:29Z)
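As a rough illustration of the co-occurrence-graph idea summarized in the KGGR entry above (not that paper's actual implementation), the sketch below builds a conditional-probability adjacency matrix from binary multi-label annotations and applies one GCN-style propagation step to label embeddings. All shapes, names, and the single-layer propagation are assumptions made for the example.

```python
import torch

def cooccurrence_adjacency(label_matrix: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Build A[i, j] ~ P(label j | label i) from a binary annotation
    matrix of shape (num_images, num_labels)."""
    cooc = label_matrix.T.float() @ label_matrix.float()    # (L, L) raw co-occurrence counts
    counts = label_matrix.float().sum(dim=0, keepdim=True)  # (1, L) per-label frequencies
    return cooc / (counts.T + eps)                          # normalize each row by label frequency

def propagate(label_embeddings: torch.Tensor, adjacency: torch.Tensor,
              weight: torch.nn.Linear) -> torch.Tensor:
    """One GCN-style step: mix each label's embedding with its neighbors' embeddings."""
    deg = adjacency.sum(dim=-1, keepdim=True).clamp(min=1e-8)
    return torch.relu(weight((adjacency / deg) @ label_embeddings))

# Toy usage with made-up sizes.
annotations = (torch.rand(100, 20) > 0.8).int()  # 100 images, 20 labels
A = cooccurrence_adjacency(annotations)
emb = torch.randn(20, 300)                       # e.g., word embeddings for the 20 labels
layer = torch.nn.Linear(300, 300)
refined = propagate(emb, A, layer)               # (20, 300) label features after one hop
```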