Let ViT Speak: Generative Language-Image Pre-training
Abstract Overview
The paper introduces GenLIP, a minimalist generative language-image pretraining framework for Vision Transformers designed to serve as vision encoders in multimodal large language models. GenLIP uses a single transformer over concatenated image patches and text tokens, with prefix-LM attention and multimodal rotary position encoding, and trains the model to predict text tokens directly with a standard autoregressive language-modeling objective. To address attention sink behavior that harms visual representations, the authors add gated attention and pretrain in two stages: 8B samples at fixed 224 resolution from Recap-DataComp-1B, followed by 37M higher-resolution, native-aspect-ratio caption samples. The study evaluates direct caption generation, patch-semantic readout, frozen and standard LLaVA-NeXT-based multimodal benchmarks, scaling behavior, ablations, and discriminative transfer tasks.
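The prefix-LM attention pattern over the concatenated sequence can be illustrated with a small mask builder. This is a minimal sketch, assuming the image patches form a bidirectional prefix and the text tokens attend causally (the usual prefix-LM convention); the function name `prefix_lm_mask` and the list-of-lists representation are hypothetical and not taken from the paper.

```python
def prefix_lm_mask(num_image: int, num_text: int) -> list[list[bool]]:
    """Return mask[q][k] == True when query position q may attend to key position k.

    Positions 0..num_image-1 are image patches (assumed prefix),
    positions num_image.. are text tokens.
    """
    total = num_image + num_text
    mask = []
    for q in range(total):
        row = []
        for k in range(total):
            if k < num_image:
                # Image keys are visible to every query, image or text.
                row.append(True)
            else:
                # Text keys are visible only to queries at or after them (causal),
                # which also hides all text from the image queries.
                row.append(q >= k)
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Two image patches followed by three text tokens.
    for row in prefix_lm_mask(2, 3):
        print("".join("x" if allowed else "." for allowed in row))
```

Under this assumption, image patches never see caption tokens, so the visual features cannot depend on the text they are trained to predict.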
Novelty
The distinctive idea is to pretrain a ViT-based vision encoder by letting it predict language tokens directly from visual tokens using a single transformer and a single autoregressive objective, without a contrastive two-tower setup and without an auxiliary text decoder. The work also introduces gated attention to mitigate attention sink effects that degrade spatial diversity in visual representations during mixed visual-text modeling.
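Written out, the single objective takes roughly the following form. This is a sketch under assumed notation, none of which appears in the summary: $v_{1:N}$ denotes the $N$ image patch tokens, $w_{1:T}$ the $T$ caption tokens, and $\theta$ the parameters of the single transformer.

```latex
\mathcal{L}(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta\left(w_t \mid v_{1:N},\, w_{<t}\right)
```

In this form only text positions contribute to the loss. Whether the paper averages or sums over tokens is not stated here, so the $1/T$ normalization is an assumption.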
Results
Across frozen-feature multimodal evaluations, GenLIP consistently outperforms or matches strong baselines trained on larger corpora (up to 40B pairs), with especially strong gains on document and OCR benchmarks. With Qwen2.5-1.5B, GenLIP reaches ALL AVG scores of 61.5, 62.6, and 65.2 at L/16, So/16, and g/16, compared with 58.7, 60.6, and 61.5 for SigLIP2. With Qwen2.5-7B, GenLIP reaches 69.0, 71.8, and 73.6 at the same three sizes, versus 69.4 and 68.9 for SigLIP2 at So/16 and g/16. Gated attention improves convergence, data efficiency, and discriminative transfer: for So/16, ImageNet top-1 accuracy rises from 76.2 without gated attention to 84.3 with it.
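The margins implied by these figures are easy to tabulate. The snippet below is a small arithmetic sketch that uses only the ALL AVG numbers quoted above; the dictionary layout and the helper name `margins` are illustrative choices. The 7B L/16 entry is omitted because no SigLIP2 counterpart is given for it.

```python
# ALL AVG scores copied from the Results paragraph, as (GenLIP, SigLIP2) pairs.
reported = {
    "Qwen2.5-1.5B": {
        "L/16": (61.5, 58.7),
        "So/16": (62.6, 60.6),
        "g/16": (65.2, 61.5),
    },
    "Qwen2.5-7B": {
        "So/16": (71.8, 69.4),
        "g/16": (73.6, 68.9),
    },
}


def margins(scores):
    """Return GenLIP minus SigLIP2 per LLM and encoder size, rounded to 0.1."""
    return {
        llm: {size: round(genlip - siglip2, 1) for size, (genlip, siglip2) in sizes.items()}
        for llm, sizes in scores.items()
    }


gaps = margins(reported)
for llm, sizes in gaps.items():
    for size, gap in sizes.items():
        print(f"{llm} {size}: GenLIP minus SigLIP2 = {gap:+.1f}")
```

On these numbers the gap is largest at g/16 for both language models (3.7 points with Qwen2.5-1.5B and 4.7 with Qwen2.5-7B).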
Key Points
- GenLIP replaces multi-component vision-language pretraining designs with a single transformer and a single autoregressive language-modeling objective over concatenated visual and text tokens, eliminating the need for contrastive losses or separate text decoders.
- A gated attention mechanism is introduced to reduce attention sink behavior, in which the first token absorbs a disproportionate share of the attention mass. The authors show that this behavior causes training instability and degrades discriminative visual features in the mixed-modality setting.
- Empirically, GenLIP demonstrates strong data efficiency, surpassing SigLIP2 (trained on 40B pairs) and other baselines despite using only 8B pretraining samples, with particularly large gains on Doc&OCR benchmarks (e.g., +5.9 points average over SigLIP2 at g/16 scale with Qwen2.5-1.5B).
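To make the gated attention idea concrete, here is a minimal single-head sketch in plain Python. The summary does not specify the exact gate, so this assumes one common form: an elementwise sigmoid gate computed from each token's own input and multiplied into that token's attention output. The names `gated_attention`, `Wg`, and the helper functions are hypothetical, and masking, multiple heads, and positional encoding are left out.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def gated_attention(X, Wq, Wk, Wv, Wg):
    """Single-head attention whose output is scaled by a per-token sigmoid gate.

    X is a list of token vectors; Wq, Wk, Wv, Wg are projection matrices.
    The gate form is an assumption, not the paper's confirmed design.
    """
    Q = [matvec(Wq, x) for x in X]
    K = [matvec(Wk, x) for x in X]
    V = [matvec(Wv, x) for x in X]
    # Gate values in (0, 1), one per output channel, computed from the token itself.
    G = [[1.0 / (1.0 + math.exp(-g)) for g in matvec(Wg, x)] for x in X]
    d = len(Q[0])
    out = []
    for q, gate in zip(Q, G):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        attended = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        # The gate can shrink the output toward zero without touching the weights.
        out.append([g * a for g, a in zip(gate, attended)])
    return out


identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0]]
half_open = gated_attention(tokens, identity, identity, identity, [[0.0, 0.0], [0.0, 0.0]])
closed = gated_attention(tokens, identity, identity, identity, [[-50.0, -50.0], [-50.0, -50.0]])
```

One intuition often given for this kind of gate is that softmax weights must sum to one, so a token with nothing useful to attend to tends to park its attention on a sink token. A gate offers another way to suppress the output. Whether this matches the mechanism analyzed in the paper should be checked against the original text.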