Hierarchy-Aware Fine-Tuning of Vision-Language Models
- URL: http://arxiv.org/abs/2512.21529v1
- Date: Thu, 25 Dec 2025 06:44:33 GMT
- Title: Hierarchy-Aware Fine-Tuning of Vision-Language Models
- Authors: Jiayu Li, Rajesh Gangireddy, Samet Akcay, Wei Cheng, Juhua Hu,
- Abstract summary: Vision-Language Models learn powerful multimodal representations through large-scale image-text pretraining. Standard approaches treat labels as flat categories and require full fine-tuning, which is expensive and produces inconsistent predictions. We propose an efficient hierarchy-aware fine-tuning framework that updates a few parameters while enforcing structural consistency.
- Score: 18.244518940229202
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language Models (VLMs) learn powerful multimodal representations through large-scale image-text pretraining, but adapting them to hierarchical classification is underexplored. Standard approaches treat labels as flat categories and require full fine-tuning, which is expensive and produces inconsistent predictions across taxonomy levels. We propose an efficient hierarchy-aware fine-tuning framework that updates a few parameters while enforcing structural consistency. We combine two objectives: Tree-Path KL Divergence (TP-KL) aligns predictions along the ground-truth label path for vertical coherence, while Hierarchy-Sibling Smoothed Cross-Entropy (HiSCE) encourages consistent predictions among sibling classes. Both losses work in the VLM's shared embedding space and integrate with lightweight LoRA adaptation. Experiments across multiple benchmarks show consistent improvements in Full-Path Accuracy and Tree-based Inconsistency Error with minimal parameter overhead. Our approach provides an efficient strategy for adapting VLMs to structured taxonomies.
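The two objectives described in the abstract can be illustrated with a small numeric sketch. This is a hypothetical reconstruction, not the paper's code: the function names, input shapes, and the exact TP-KL/HiSCE formulations are assumptions. Here TP-KL is reduced to a per-level KL divergence against the one-hot ground-truth node on the label path (which, for a one-hot target, collapses to negative log-likelihood), and HiSCE is modeled as cross-entropy with a fraction of the label mass spread over the true class's siblings.

```python
# Hypothetical sketch of hierarchy-aware losses; formulations are assumptions,
# not taken from the paper's implementation.
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def tree_path_kl(level_logits, path_labels):
    """Average per-level KL(target || pred), where the target at each taxonomy
    level is the one-hot class on the ground-truth label path. KL against a
    one-hot target reduces to the negative log-likelihood of that class."""
    loss = 0.0
    for logits, y in zip(level_logits, path_labels):
        p = softmax(logits)
        loss += -np.log(p[y] + 1e-12)
    return loss / len(level_logits)

def sibling_smoothed_ce(logits, y, siblings, eps=0.1):
    """Cross-entropy where label mass eps is spread uniformly over the true
    class's siblings in the taxonomy, encouraging consistency among them."""
    p = softmax(logits)
    target = np.zeros_like(p)
    target[y] = 1.0 - eps
    if siblings:
        target[list(siblings)] = eps / len(siblings)
    return -(target * np.log(p + 1e-12)).sum()
```

In the paper, logits of this kind would come from similarities computed in the VLM's shared embedding space, with only LoRA adapter parameters updated; the sketch shows the loss arithmetic only.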
Related papers
- OmniVL-Guard: Towards Unified Vision-Language Forgery Detection and Grounding via Balanced RL [63.388513841293616]
Existing forgery detection methods fail to handle the interleaved text, images, and videos prevalent in real-world misinformation. To bridge this gap, this paper aims to develop a unified framework for omnibus vision-language forgery detection and grounding. We propose OmniVL-Guard, a balanced reinforcement learning framework for omnibus vision-language forgery detection and grounding.
arXiv Detail & Related papers (2026-02-11T09:41:36Z) - Learning Consistent Taxonomic Classification through Hierarchical Reasoning [61.372270953201955]
We propose a two-stage, hierarchy-based reasoning framework designed to improve leaf-level accuracy and hierarchical consistency in taxonomic classification. Our framework, implemented on the Qwen2.5-VL-7B model, outperforms its original 72B counterpart by over 10% in both leaf-level and hierarchical consistency accuracy.
arXiv Detail & Related papers (2026-01-21T03:00:00Z) - Modality Alignment across Trees on Heterogeneous Hyperbolic Manifolds [49.95082206008502]
Alignment across Trees is a method that constructs and aligns tree-like hierarchical features for both image and text modalities. We introduce a semantic-aware visual feature extraction framework that applies a cross-attention mechanism to visual class tokens from intermediate Transformer layers.
arXiv Detail & Related papers (2025-10-31T11:32:15Z) - Hierarchical Alignment: Surgical Fine-Tuning via Functional Layer Specialization in Large Language Models [4.935224714809964]
We introduce Hierarchical Alignment, a novel method that applies targeted DPO to distinct functional blocks of a model's layers. Specifically, aligning the local layers (Local-Align) enhances grammatical fluency. Aligning the global layers (Global-Align) improves factual consistency as hypothesized, but also proves to be the most effective strategy for enhancing logical coherence.
arXiv Detail & Related papers (2025-10-14T00:58:34Z) - Hierarchical Adaptive networks with Task vectors for Test-Time Adaptation [3.3834108313265916]
We propose Hierarchical Adaptive Networks with Task Vectors (Hi-Vec). Hi-Vec allows existing methods to adapt to shifts of varying complexity. We rigorously evaluate the performance of Hi-Vec in challenging scenarios and on multiple target datasets.
arXiv Detail & Related papers (2025-08-11T21:55:53Z) - GLiClass: Generalist Lightweight Model for Sequence Classification Tasks [49.2639069781367]
We propose GLiClass, a novel method that adapts the GLiNER architecture for sequence classification tasks. Our approach achieves strong accuracy and efficiency comparable to embedding-based methods, while maintaining the flexibility needed for zero-shot and few-shot learning scenarios.
arXiv Detail & Related papers (2025-08-11T06:22:25Z) - HieraRS: A Hierarchical Segmentation Paradigm for Remote Sensing Enabling Multi-Granularity Interpretation and Cross-Domain Transfer [5.2379726950251655]
HieraRS is a novel hierarchical interpretation paradigm that enables multi-granularity predictions. It supports the efficient transfer of LCLU models to cross-domain tasks with heterogeneous tree-structured hierarchies.
arXiv Detail & Related papers (2025-07-11T16:44:01Z) - TreeLoRA: Efficient Continual Learning via Layer-Wise LoRAs Guided by a Hierarchical Gradient-Similarity Tree [52.44403214958304]
In this paper, we introduce TreeLoRA, a novel approach that constructs layer-wise adapters by leveraging hierarchical gradient similarity. To reduce the computational burden of task similarity estimation, we employ bandit techniques to develop an algorithm based on lower confidence bounds. Experiments on both vision transformers (ViTs) and large language models (LLMs) demonstrate the effectiveness and efficiency of our approach.
arXiv Detail & Related papers (2025-06-12T05:25:35Z) - Enforcing Consistency and Fairness in Multi-level Hierarchical Classification with a Mask-based Output Layer [25.819440955594736]
We introduce a fair, model-agnostic layer designed to enforce taxonomy and optimize objectives, including consistency, fairness, and exact match. Our evaluations demonstrate that the proposed layer not only improves the fairness of predictions but also enforces the taxonomy, resulting in consistent predictions and superior performance.
arXiv Detail & Related papers (2025-03-19T06:30:04Z) - Learning and Evaluating Hierarchical Feature Representations [3.770103075126785]
We propose a novel framework, Hierarchical Composition of Orthogonal Subspaces (Hier-COS). Hier-COS learns to map deep feature embeddings into a vector space that is, by design, consistent with the structure of a given taxonomy tree. We demonstrate that Hier-COS achieves state-of-the-art hierarchical performance across all the datasets while simultaneously beating top-1 accuracy in all but one case.
arXiv Detail & Related papers (2025-03-10T20:59:41Z) - ProTeCt: Prompt Tuning for Taxonomic Open Set Classification [59.59442518849203]
Few-shot adaptation methods do not fare well in the taxonomic open set (TOS) setting.
We propose a prompt tuning technique that calibrates the hierarchical consistency of model predictions.
A new Prompt Tuning for Hierarchical Consistency (ProTeCt) technique is then proposed to calibrate classification across label set granularities.
arXiv Detail & Related papers (2023-06-04T02:55:25Z) - Weakly-supervised Action Localization via Hierarchical Mining [76.00021423700497]
Weakly-supervised action localization aims to localize and classify action instances in the given videos temporally with only video-level categorical labels.
We propose a hierarchical mining strategy under video-level and snippet-level manners, i.e., hierarchical supervision and hierarchical consistency mining.
We show that HiM-Net outperforms existing methods on THUMOS14 and ActivityNet1.3 datasets with large margins by hierarchically mining the supervision and consistency.
arXiv Detail & Related papers (2022-06-22T12:19:09Z) - Tree-structured Attention with Hierarchical Accumulation [103.47584968330325]
"Hierarchical Accumulation" encodes parse tree structures into self-attention at constant time complexity.
Our approach outperforms SOTA methods in four IWSLT translation tasks and the WMT'14 English-German translation task.
arXiv Detail & Related papers (2020-02-19T08:17:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.