Decoupling Vision and Language: Codebook Anchored Visual Adaptation
- URL: http://arxiv.org/abs/2602.19449v1
- Date: Mon, 23 Feb 2026 02:39:26 GMT
- Title: Decoupling Vision and Language: Codebook Anchored Visual Adaptation
- Authors: Jason Wu, Tianchen Zhao, Chang Liu, Jiarui Cai, Zheng Zhang, Zhuowei Li, Aaditya Singh, Xiang Xu, Mani Srivastava, Jonathan Wu
- Abstract summary: Large Vision-Language Models (LVLMs) use their vision encoders to translate images into representations for downstream reasoning. Existing adaptation methods modify the continuous feature interface between encoder and language model through projector tuning or other parameter-efficient updates. We introduce CRAFT, a lightweight method that fine-tunes the encoder using a discrete codebook that anchors visual representations to a stable token space.
- Score: 20.393987361723724
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Vision-Language Models (LVLMs) use their vision encoders to translate images into representations for downstream reasoning, but the encoders often underperform in domain-specific visual tasks such as medical image diagnosis or fine-grained classification, where representation errors can cascade through the language model, leading to incorrect responses. Existing adaptation methods modify the continuous feature interface between encoder and language model through projector tuning or other parameter-efficient updates, which still couples the two components and requires re-alignment whenever the encoder changes. We introduce CRAFT (Codebook RegulAted Fine-Tuning), a lightweight method that fine-tunes the encoder using a discrete codebook that anchors visual representations to a stable token space, achieving domain adaptation without modifying other parts of the model. This decoupled design allows the adapted encoder to seamlessly boost the performance of LVLMs with different language architectures, as long as they share the same codebook. Empirically, CRAFT achieves an average gain of 13.51% across 10 domain-specific benchmarks such as VQARAD and PlantVillage, while preserving the LLM's linguistic capabilities and outperforming peer methods that operate on continuous tokens.
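The abstract specifies the mechanism only at a high level, and no code ships with this listing. A minimal PyTorch-style sketch of the general idea, assuming a frozen shared codebook, a VQ-style nearest-neighbor lookup with a straight-through estimator, and a commitment term; every name and the loss weight below are hypothetical, not CRAFT's released implementation:

```python
import torch
import torch.nn.functional as F

class CodebookAnchor(torch.nn.Module):
    """Quantize encoder features against a frozen, shared codebook.

    Hypothetical sketch: CRAFT's actual objective may differ; this shows
    the generic VQ-style anchoring the abstract describes.
    """
    def __init__(self, codebook: torch.Tensor):
        super().__init__()
        # Frozen codebook (K, D): the stable token space shared across LVLMs.
        self.register_buffer("codebook", codebook)

    def forward(self, feats: torch.Tensor):
        # feats: (B, N, D) patch features from the vision encoder.
        B, N, D = feats.shape
        flat = feats.reshape(-1, D)
        # Squared Euclidean distance to every codebook entry.
        dist = (flat.pow(2).sum(1, keepdim=True)
                - 2 * flat @ self.codebook.t()
                + self.codebook.pow(2).sum(1))
        idx = dist.argmin(dim=1)
        quant = self.codebook[idx].reshape(B, N, D)
        # Commitment loss pulls encoder outputs toward their anchors.
        commit = F.mse_loss(feats, quant.detach())
        # Straight-through estimator: gradients reach the encoder only,
        # since the codebook itself is a frozen buffer.
        quant = feats + (quant - feats).detach()
        return quant, commit

# Usage sketch: fine-tune only the encoder; codebook and LLM stay fixed.
# anchor = CodebookAnchor(torch.randn(8192, 1024))
# quant, commit = anchor(encoder(images))
# loss = task_loss(llm(quant), labels) + 0.25 * commit  # weight is a guess
```

Because the codebook is the only interface contract here, any language model trained against the same token space should, on this reading, accept the adapted encoder without re-alignment.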
Related papers
- Video TokenCom: Textual Intent-Guided Multi-Rate Video Token Communications with UEP-Based Adaptive Source-Channel Coding [24.169863403324314]
Token Communication (TokenCom) is a new paradigm, motivated by the recent success of Large AI Models (LAMs) and Multimodal Large Language Models (MLLMs). We propose a novel Video TokenCom framework for textual intent-guided multi-rate video communication.
arXiv Detail & Related papers (2026-03-02T23:36:38Z)
- VL-JEPA: Joint Embedding Predictive Architecture for Vision-language [54.86811250366009]
We introduce VL-JEPA, a vision-language model built on a Joint Embedding Predictive Architecture (JEPA). By learning in an abstract representation space, the model focuses on task-relevant semantics while abstracting away surface-level linguistic variability. At inference time, a lightweight text decoder is invoked only when needed to translate VL-JEPA's predicted embeddings into text.
arXiv Detail & Related papers (2025-12-11T18:59:22Z)
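The summary above names the architecture without detail. Generically, a JEPA setup trains a predictor to regress target embeddings from context embeddings, keeping the text decoder out of the training loop; a schematic sketch in which every module name is a placeholder:

```python
import torch
import torch.nn.functional as F

def jepa_step(context, target, context_enc, target_enc, predictor):
    """One JEPA-style update: predict target embeddings from context.

    context/target could be paired views such as video frames and captions;
    target_enc is typically a frozen or EMA copy and gets no gradients.
    """
    z_ctx = context_enc(context)            # (B, D) context embeddings
    with torch.no_grad():
        z_tgt = target_enc(target)          # (B, D) regression targets
    z_pred = predictor(z_ctx)               # prediction in embedding space
    return F.smooth_l1_loss(z_pred, z_tgt)

# At inference, a lightweight text decoder would map z_pred to text
# only when a textual answer is actually required.
```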
- METEOR: Multi-Encoder Collaborative Token Pruning for Efficient Vision Language Models [92.37117312251755]
We propose a progressive pruning framework, namely Multi-Encoder collaboraTivE tOken pRuning (METEOR). For multi-vision encoding, we discard redundant tokens within each encoder via a rank-guided collaborative token assignment strategy. For multi-vision fusion, we combine the visual features from different encoders while reducing cross-encoder redundancy with cooperative pruning.
arXiv Detail & Related papers (2025-07-28T13:50:53Z)
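METEOR's rank-guided assignment is only named in the summary above; as a rough illustration of the pruning step, the sketch below keeps the top-scoring patch tokens per encoder using an L2-norm saliency proxy. The scoring rule and names are assumptions, not the paper's method:

```python
import torch

def prune_tokens(tokens: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Keep the highest-scoring visual tokens from one encoder.

    tokens: (B, N, D) patch embeddings. The L2-norm score is a stand-in;
    METEOR's collaborative, rank-guided criterion is more involved.
    """
    B, N, D = tokens.shape
    k = max(1, int(N * keep_ratio))
    scores = tokens.norm(dim=-1)              # (B, N) saliency proxy
    top = scores.topk(k, dim=1).indices       # indices of kept tokens
    top, _ = top.sort(dim=1)                  # preserve spatial order
    return tokens.gather(1, top.unsqueeze(-1).expand(-1, -1, D))

# e.g. prune each encoder's tokens before cross-encoder fusion:
# fused = torch.cat([prune_tokens(enc(img), 0.5) for enc in encoders], dim=1)
```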
- FOCUS: Unified Vision-Language Modeling for Interactive Editing Driven by Referential Segmentation [55.01077993490845]
Recent Large Vision Language Models (LVLMs) demonstrate promising capabilities in unifying visual understanding and generative modeling. We introduce FOCUS, a unified LVLM that integrates segmentation-aware perception and controllable object-centric generation within an end-to-end framework.
arXiv Detail & Related papers (2025-06-20T07:46:40Z)
- Advanced Sign Language Video Generation with Compressed and Quantized Multi-Condition Tokenization [20.063863466319326]
SignViP is a novel framework that incorporates multiple fine-grained conditions for improved generation fidelity. SignViP achieves state-of-the-art performance across metrics including video quality, temporal coherence, and semantic fidelity.
arXiv Detail & Related papers (2025-06-19T02:56:06Z)
- Kernel-based Unsupervised Embedding Alignment for Enhanced Visual Representation in Vision-language Models [18.02840698188587]
We propose a novel kernel-based method to align CLIP's visual representation with that of DINOv2. Our image-only alignment fine-tuning yields significant improvements in zero-shot object recognition and fine-grained spatial reasoning.
arXiv Detail & Related papers (2025-06-03T07:44:43Z)
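The kernel-based alignment above is described only at a high level. One label-free way to align two representation spaces is to match their batchwise Gram (kernel) matrices; a minimal sketch of that generic idea, with the loss form and names as assumptions rather than the paper's formulation:

```python
import torch
import torch.nn.functional as F

def gram_alignment_loss(student: torch.Tensor, teacher: torch.Tensor) -> torch.Tensor:
    """Match pairwise similarity structure between two encoders.

    student: (B, D1) CLIP-style image embeddings (trainable).
    teacher: (B, D2) DINOv2-style embeddings (frozen target).
    Feature dimensions may differ; only the B x B kernels are compared.
    """
    s = F.normalize(student, dim=-1)
    t = F.normalize(teacher, dim=-1)
    return F.mse_loss(s @ s.t(), t @ t.t())

# Image-only fine-tuning step (no text or labels involved):
# loss = gram_alignment_loss(clip_vision(imgs), dinov2(imgs).detach())
```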
- MoSE: Hierarchical Self-Distillation Enhances Early Layer Embeddings [2.1262605464247812]
Self-Distillation is a principled approach to trading inference cost for accuracy across various code understanding tasks. Our architecture improves text-to-code and code-to-code search by targeting specific encoder layers as exit heads. We release a new dataset created through code translation that extends text-to-code benchmarks with cross-language code-to-code pairs.
arXiv Detail & Related papers (2025-03-04T21:08:17Z)
- Fine-Grained Verifiers: Preference Modeling as Next-token Prediction in Vision-Language Alignment [57.0121616203175]
We propose FiSAO, a novel self-alignment method that uses the model's own visual encoder as a fine-grained verifier to improve vision-language alignment. By leveraging token-level feedback from the vision encoder, FiSAO significantly improves alignment, even surpassing traditional preference tuning methods that require additional data.
arXiv Detail & Related papers (2024-10-18T03:34:32Z)
- Do Vision and Language Encoders Represent the World Similarly? [22.70701869402434]
Aligned text-image encoders such as CLIP have become the de facto model for vision-language tasks.
We find that the representation spaces of unaligned and aligned encoders are semantically similar.
Although unaligned encoders lack the trained statistical alignment of encoders like CLIP, we show that a matching between them can be found without any training.
arXiv Detail & Related papers (2024-01-10T15:51:39Z)
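Representation similarity of this kind is commonly quantified with centered kernel alignment (CKA); the paper's exact protocol may differ, but linear CKA over paired inputs looks like this:

```python
import torch

def linear_cka(X: torch.Tensor, Y: torch.Tensor) -> torch.Tensor:
    """Linear CKA between two representation matrices.

    X: (n, d1) features from one encoder, Y: (n, d2) from another,
    rows paired over the same n inputs (e.g. image-caption pairs).
    Returns a scalar in [0, 1]; higher means more shared structure.
    """
    X = X - X.mean(dim=0, keepdim=True)   # center each feature column
    Y = Y - Y.mean(dim=0, keepdim=True)
    num = (X.t() @ Y).pow(2).sum()        # ||Y^T X||_F^2
    den = ((X.t() @ X).pow(2).sum().sqrt()
           * (Y.t() @ Y).pow(2).sum().sqrt())
    return num / den

# e.g. linear_cka(vision_feats, text_feats) to compare a vision encoder
# against a language encoder on the same batch of paired examples.
```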
- APoLLo: Unified Adapter and Prompt Learning for Vision Language Models [58.9772868980283]
We present APoLLo, a unified multi-modal approach that combines Adapter and Prompt learning for Vision-Language models.
APoLLo achieves a relative gain of up to 6.03% over MaPLe (SOTA) on novel classes across 10 diverse image recognition datasets.
arXiv Detail & Related papers (2023-12-04T01:42:09Z)
- Adversarial Neural Networks for Error Correcting Codes [76.70040964453638]
We introduce a general framework to boost the performance and applicability of machine learning (ML) models.
We propose to combine ML decoders with a competing discriminator network that tries to distinguish between codewords and noisy words.
Our framework is game-theoretic and motivated by generative adversarial networks (GANs).
arXiv Detail & Related papers (2021-12-21T19:14:44Z)
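As a rough illustration of the GAN-style setup described above, a discriminator can be trained to separate true codewords from a neural decoder's outputs while the decoder learns to reconstruct and fool it; the tiny architectures and loss weight below are assumptions, not the paper's models:

```python
import torch
import torch.nn.functional as F

# Hypothetical tiny models for 16-bit codewords over a noisy channel.
decoder = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.ReLU(),
                              torch.nn.Linear(64, 16))
disc = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.ReLU(),
                           torch.nn.Linear(64, 1))
opt_dec = torch.optim.Adam(decoder.parameters(), lr=1e-3)
opt_disc = torch.optim.Adam(disc.parameters(), lr=1e-3)

def adversarial_step(codewords: torch.Tensor, noisy: torch.Tensor, lam: float = 0.1):
    """One update of the game: discriminator first, then decoder."""
    ones = torch.ones(len(codewords), 1)
    zeros = torch.zeros(len(codewords), 1)
    decoded = decoder(noisy)
    # Discriminator learns: real codewords -> 1, decoder outputs -> 0.
    d_loss = (F.binary_cross_entropy_with_logits(disc(codewords), ones)
              + F.binary_cross_entropy_with_logits(disc(decoded.detach()), zeros))
    opt_disc.zero_grad(); d_loss.backward(); opt_disc.step()
    # Decoder learns to reconstruct the codeword and to fool the discriminator.
    g_loss = (F.mse_loss(decoded, codewords)
              + lam * F.binary_cross_entropy_with_logits(disc(decoded), ones))
    opt_dec.zero_grad(); g_loss.backward(); opt_dec.step()
```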