Prompt the Unseen: Evaluating Visual-Language Alignment Beyond Supervision
- URL: http://arxiv.org/abs/2509.00700v2
- Date: Tue, 09 Sep 2025 03:19:40 GMT
- Title: Prompt the Unseen: Evaluating Visual-Language Alignment Beyond Supervision
- Authors: Raehyuk Jung, Seungjun Yu, Hyunjung Shim
- Abstract summary: Vision-Language Models (VLMs) combine a vision encoder and a large language model (LLM) through alignment training. Despite its importance, the projection layer's ability to generalize to unseen visual concepts has not been systematically evaluated. This study introduces a new evaluation framework for alignment generalization.
- Score: 22.712690974750007
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language Models (VLMs) combine a vision encoder and a large language model (LLM) through alignment training, showing strong performance on multimodal tasks. A central component in this architecture is the projection layer, which maps visual features into the LLM's embedding space. Despite its importance, its ability to generalize to unseen visual concepts has not been systematically evaluated. To address this, we propose a benchmark for evaluating projection-layer generalization. We adapt object detection datasets (rich in fine-grained annotations) into a prompting format and design train/test splits with disjoint label sets, enabling precise control over seen and unseen concept separation. Experimental results show that the projection layer retains about 79 to 88 percent of the performance on unseen classes compared to seen ones across various settings, suggesting a non-trivial level of generalization even without explicit alignment supervision on those concepts. We further analyze this behavior through a mechanistic interpretability lens. Our findings indicate that the feed-forward network in the projection layer functions like a key-value memory, processing seen and unseen tokens in similar ways. This study introduces a new evaluation framework for alignment generalization and highlights the potential for efficient VLM training with limited aligned data.
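To make the setup concrete, below is a minimal sketch (not the authors' released code; `VisionProjector` and `make_disjoint_splits` are hypothetical names) of the two ingredients the abstract describes: a LLaVA-style MLP projection layer, whose feed-forward layers are the component the paper analyzes as a key-value memory, and a seen/unseen class split with disjoint label sets built from a detection dataset's class list.

```python
# Minimal sketch, assuming a LLaVA-style two-layer MLP projector;
# the paper's actual architecture and split protocol may differ.
import random

import torch
import torch.nn as nn


class VisionProjector(nn.Module):
    """Maps vision-encoder patch features into the LLM's embedding space."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # These feed-forward layers are the part the abstract describes
        # as behaving like a key-value memory over visual tokens.
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, visual_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (batch, num_patches, vision_dim)
        return self.net(visual_feats)  # -> (batch, num_patches, llm_dim)


def make_disjoint_splits(class_names: list[str], seen_frac: float = 0.5,
                         seed: int = 0) -> tuple[set[str], set[str]]:
    """Partition detection class labels into disjoint seen/unseen sets,
    so alignment training never observes the held-out concepts."""
    rng = random.Random(seed)
    shuffled = list(class_names)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * seen_frac)
    seen, unseen = set(shuffled[:cut]), set(shuffled[cut:])
    assert seen.isdisjoint(unseen)
    return seen, unseen


# Example: hold out half of a detection dataset's classes as unseen,
# then compare performance on the two sets (the abstract reports an
# unseen/seen retention ratio of roughly 0.79 to 0.88 across settings).
seen, unseen = make_disjoint_splits(["dog", "car", "kite", "zebra"], seed=42)
```

Under such a split, the benchmark prompts the VLM about both seen and unseen classes and compares accuracy across the two label sets.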
Related papers
- Stateful Cross-layer Vision Modulation [19.730096071316876]
Multimodal large language models (MLLMs) widely adopt multi-layer visual feature fusion to enhance visual representation. Existing approaches typically perform static concatenation or weighted aggregation after visual encoding, without intervening in the representation formation process itself. We propose a cross-layer memory-modulated vision framework (SCVM) to address these limitations.
arXiv Detail & Related papers (2026-02-28T13:57:19Z) - Revisiting Multi-Task Visual Representation Learning [52.93947931352643]
We introduce MTV, a principled multi-task visual pretraining framework. We leverage high-capacity "expert" models to synthesize dense, structured pseudo-labels at scale. Our results demonstrate that MTV achieves "best-of-both-worlds" performance.
arXiv Detail & Related papers (2026-01-20T11:59:19Z) - AUVIC: Adversarial Unlearning of Visual Concepts for Multi-modal Large Language Models [63.05306474002547]
Regulatory frameworks mandating the 'right to be forgotten' drive the need for machine unlearning. We introduce AUVIC, a novel visual concept unlearning framework for MLLMs. We show that AUVIC achieves state-of-the-art target forgetting rates while incurring minimal performance degradation on non-target concepts.
arXiv Detail & Related papers (2025-11-14T13:35:32Z) - ConceptScope: Characterizing Dataset Bias via Disentangled Visual Concepts [54.60525564599342]
ConceptScope is a scalable and automated framework for analyzing visual datasets. It categorizes concepts into target, context, and bias types based on their semantic relevance and statistical correlation to class labels. It reliably detects known biases and uncovers previously unannotated ones.
arXiv Detail & Related papers (2025-10-30T06:46:17Z) - Self-supervised structured object representation learning [2.747398258852965]
Self-supervised learning has emerged as a powerful technique for learning visual representations. We propose a self-supervised approach that builds structured visual representations by combining semantic grouping, instance-level separation, and hierarchical structuring. Our approach, based on a novel ProtoScale module, captures visual elements across multiple spatial scales.
arXiv Detail & Related papers (2025-08-27T13:28:05Z) - Generalized Decoupled Learning for Enhancing Open-Vocabulary Dense Perception [71.26728044621458]
DeCLIP is a novel framework that enhances CLIP by decoupling the self-attention module to obtain "content" and "context" features respectively. It consistently achieves state-of-the-art performance across a broad spectrum of tasks, including 2D detection and segmentation, 3D instance segmentation, video instance segmentation, and 6D object pose estimation.
arXiv Detail & Related papers (2025-08-15T06:43:51Z) - Instruction-Guided Fusion of Multi-Layer Visual Features in Large Vision-Language Models [50.98559225639266]
We investigate the contributions of visual features from different encoder layers using 18 benchmarks spanning 6 task categories. Our findings reveal that multi-layer features provide complementary strengths with varying task dependencies, and uniform fusion leads to suboptimal performance. We propose the instruction-guided vision aggregator, a module that dynamically integrates multi-layer visual features based on textual instructions.
arXiv Detail & Related papers (2024-12-26T05:41:31Z) - Control-oriented Clustering of Visual Latent Representation [3.9838014203847862]
We study the geometry of the visual representation space in an image-based control pipeline learned from behavior cloning. Inspired by the phenomenon of neural collapse, we show a similar law of clustering in the visual representation space. We show that such a law of clustering can be leveraged as an algorithmic tool to improve test-time performance.
arXiv Detail & Related papers (2024-10-07T14:21:51Z) - Refining Skewed Perceptions in Vision-Language Contrastive Models through Visual Representations [0.033483662989441935]
Large vision-language contrastive models (VLCMs) have become foundational, demonstrating remarkable success across a variety of downstream tasks. Despite their advantages, these models inherit biases from the disproportionate distribution of real-world data, leading to misconceptions about the actual environment. This study presents an investigation into how a simple linear probe can effectively distill task-specific core features from CLIP's embedding for downstream applications.
arXiv Detail & Related papers (2024-05-22T22:03:11Z) - Neural Clustering based Visual Representation Learning [61.72646814537163]
Clustering is one of the most classic approaches in machine learning and data analysis.
We propose feature extraction with clustering (FEC), which views feature extraction as a process of selecting representatives from data.
FEC alternates between grouping pixels into individual clusters to abstract representatives and updating the deep features of pixels with current representatives.
arXiv Detail & Related papers (2024-03-26T06:04:50Z) - Vi(E)va LLM! A Conceptual Stack for Evaluating and Interpreting Generative AI-based Visualizations [1.709620026135923]
Large language models (LLMs) have become an interesting option for supporting generative tasks related to visualization.
This paper addresses the problem of modeling the evaluation of a generated visualization through an LLM.
We propose a theoretical evaluation stack, EvaLLM, that decomposes the evaluation effort into its atomic components.
arXiv Detail & Related papers (2024-02-03T14:28:55Z) - Self-supervised Learning of Contextualized Local Visual Embeddings [0.0]
Contextualized Local Visual Embeddings (CLoVE) is a self-supervised, convolution-based method that learns representations suited for dense prediction tasks.
We benchmark CLoVE's pre-trained representations on multiple datasets.
CLoVE reaches state-of-the-art performance for CNN-based architectures in 4 dense prediction downstream tasks.
arXiv Detail & Related papers (2023-10-01T00:13:06Z) - Weakly-supervised Contrastive Learning for Unsupervised Object Discovery [52.696041556640516]
Unsupervised object discovery is promising due to its ability to locate objects in a generic manner.
We design a semantic-guided self-supervised learning model to extract high-level semantic features from images.
We employ Principal Component Analysis (PCA) to localize object regions.
arXiv Detail & Related papers (2023-07-07T04:03:48Z) - UniT: Unified Knowledge Transfer for Any-shot Object Detection and Segmentation [52.487469544343305]
Methods for object detection and segmentation rely on large-scale instance-level annotations for training.
We propose an intuitive and unified semi-supervised model that is applicable to a range of supervision levels.
arXiv Detail & Related papers (2020-06-12T22:45:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.