FuguReport

Steerable Visual Representations

Authors Jona Ruthardt, Manu Gaur, Deva Ramanan, Makarand Tapaswi, Yuki M. Asano
Affiliations Carnegie Mellon University / University of Technology Nuremberg / International Institute of Information Technology Hyderabad
Categories Method / Visual Representation Learning / Steerable global and local features, Application / Anomaly Detection / Enhanced performance with tailored approaches, Application / Personalized Recognition / Personalized target identification
License CC BY 4.0

Abstract Overview

This paper introduces SteerViT, a method that makes pretrained vision transformer (ViT) representations steerable via natural language by inserting lightweight gated cross-attention layers into frozen ViT blocks, enabling text to influence intermediate visual features through early fusion. The model is trained with a referential segmentation objective on a mixture of grounding and segmentation datasets (162k images, 2.28M image-text pairs), adding only ~21M trainable parameters. The authors propose benchmarks for measuring representational steerability, including CORE (conditional retrieval) and MOSAIC (localization via attention). Experiments demonstrate that SteerViT achieves high steerability while preserving the base ViT's representation quality for classification and segmentation, and that it generalizes zero-shot to tasks such as personalized object discrimination and industrial anomaly segmentation.

Novelty

The main novelty is injecting text into frozen ViT layers via lightweight, zero-initialized gated cross-attention (early fusion). This yields vision-centric multimodal representations that are steerable by language, inverting the typical multimodal LLM (MLLM) paradigm of conditioning language on vision. The paper also introduces two benchmarks, CORE and MOSAIC, designed specifically to measure representational steerability, and demonstrates that text specificity controls the semantic granularity of the resulting features.
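To make the zero-initialized gate concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a single attention head with no learned projections, and the names `cross_attention`, `gated_cross_attention`, and `alpha` are hypothetical. It shows only that a tanh gate initialized at zero leaves the frozen ViT features unchanged, while a nonzero gate mixes text information into each patch.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(patches, text_tokens):
    # Assumption: one head, no learned query/key/value projections.
    # Each image patch (query) attends over the text tokens (keys and values).
    dim = len(patches[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for q in patches:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in text_tokens]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, text_tokens))
                    for d in range(dim)])
    return out

def gated_cross_attention(patches, text_tokens, alpha):
    # Residual update scaled by tanh(alpha); alpha = 0 gives an identity mapping.
    gate = math.tanh(alpha)
    update = cross_attention(patches, text_tokens)
    return [[p + gate * u for p, u in zip(prow, urow)]
            for prow, urow in zip(patches, update)]

# Toy inputs (hypothetical values): two patches and two text tokens in 2-D.
patches = [[1.0, 0.0], [0.0, 1.0]]
text_tokens = [[0.5, 0.5], [1.0, -1.0]]

unchanged = gated_cross_attention(patches, text_tokens, alpha=0.0)
steered = gated_cross_attention(patches, text_tokens, alpha=1.0)
```

With `alpha=0.0` the output equals the input patches, which is why inserting such layers does not disturb the pretrained backbone at the start of training.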

Results

On the CORE conditional retrieval benchmark, SteerViT achieves 96.0% top-1 accuracy compared to 43.7% for DINOv2 and 81.3% for FLAIR, while preserving or slightly improving the base ViT's downstream classification and segmentation performance. On personalized object discrimination (PODS), detailed text conditioning boosts SteerViT to 58.1% PR-AUC, surpassing fine-tuned DINOv2 variants (48.0%) without task-specific training. On zero-shot anomaly segmentation (MVTec AD), SteerViT achieves 82.1 PRO, approaching the best dedicated method (FADE, 84.5) and outperforming several other specialized baselines.
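The top-1 figures above are retrieval accuracies. The exact CORE protocol is not described in this summary, so the following is only a generic sketch of top-1 retrieval accuracy under cosine similarity. The function names and toy vectors are hypothetical. In a conditional setting, each query vector would be the text-conditioned image embedding.

```python
import math

def cosine(a, b):
    # Cosine similarity between two non-zero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top1_accuracy(queries, gallery, targets):
    # A query counts as a hit when its nearest gallery item is the target index.
    hits = 0
    for q, target in zip(queries, targets):
        best = max(range(len(gallery)), key=lambda i: cosine(q, gallery[i]))
        hits += int(best == target)
    return hits / len(queries)

# Toy data (hypothetical): three gallery embeddings and three queries.
gallery = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.0]]
targets = [0, 1, 2]  # the third query is deliberately mismatched

accuracy = top1_accuracy(queries, gallery, targets)
```

In this toy setup two of the three queries retrieve their target, so `accuracy` is 2/3.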

Key Points

  1. SteerViT inserts lightweight gated cross-attention layers (~21M parameters) into frozen ViT blocks, enabling early fusion of text into visual encoding; this achieves 96.0% conditional retrieval accuracy on CORE versus 43.7% for vanilla DINOv2 and consistently outperforms late fusion across DINOv2, SigLIP, and MAE backbones.
  2. A tanh-gated scaling mechanism allows continuous interpolation between the unaltered ViT and fully text-conditioned representations at inference, with an optimal operating point (ω=0.6) that preserves or slightly improves the base ViT's classification and segmentation performance while enabling high steerability.
  3. Text prompt specificity directly controls the semantic granularity of the steered features, enabling zero-shot transfer to personalized object discrimination (58.1% PR-AUC on PODS, surpassing fine-tuned DINOv2 at 48.0%) and industrial anomaly segmentation (82.1 PRO on MVTec AD) without task-specific training.
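The inference-time interpolation in point 2 can be illustrated with a small sketch. This summary does not state how ω enters the computation, so the code assumes ω is a multiplicative factor on the learned tanh gate. The function `steer`, the variable `text_update`, and all numeric values are hypothetical.

```python
import math

def steer(base, text_update, alpha, omega):
    # Assumption: omega scales the learned gate tanh(alpha).
    # omega = 0 recovers the base ViT feature; omega = 1 applies the full update.
    gate = omega * math.tanh(alpha)
    return [b + gate * u for b, u in zip(base, text_update)]

# Toy feature and text-conditioned update (hypothetical values).
base = [1.0, 2.0]
text_update = [0.5, -1.0]
alpha = 2.0  # stands in for a learned gate parameter

unsteered = steer(base, text_update, alpha, omega=0.0)
partial = steer(base, text_update, alpha, omega=0.6)
full = steer(base, text_update, alpha, omega=1.0)
```

Under this assumption, sweeping ω from 0 to 1 moves each feature linearly from the base representation to the fully text-conditioned one, with ω=0.6 lying in between.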
