VL-JEPA: Joint Embedding Predictive Architecture for Vision-language
- URL: http://arxiv.org/abs/2512.10942v1
- Date: Thu, 11 Dec 2025 18:59:22 GMT
- Title: VL-JEPA: Joint Embedding Predictive Architecture for Vision-language
- Authors: Delong Chen, Mustafa Shukor, Theo Moutakanni, Willy Chung, Jade Yu, Tejaswi Kasarla, Allen Bolourchi, Yann LeCun, Pascale Fung
- Abstract summary: We introduce VL-JEPA, a vision-language model built on a Joint Embedding Predictive Architecture (JEPA). By learning in an abstract representation space, the model focuses on task-relevant semantics while abstracting away surface-level linguistic variability. At inference time, a lightweight text decoder is invoked only when needed to translate VL-JEPA's predicted embeddings into text.
- Score: 54.86811250366009
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce VL-JEPA, a vision-language model built on a Joint Embedding Predictive Architecture (JEPA). Instead of autoregressively generating tokens as in classical VLMs, VL-JEPA predicts continuous embeddings of the target texts. By learning in an abstract representation space, the model focuses on task-relevant semantics while abstracting away surface-level linguistic variability. In a strictly controlled comparison against standard token-space VLM training with the same vision encoder and training data, VL-JEPA achieves stronger performance while having 50% fewer trainable parameters. At inference time, a lightweight text decoder is invoked only when needed to translate VL-JEPA's predicted embeddings into text. We show that VL-JEPA natively supports selective decoding, which reduces the number of decoding operations by 2.85x relative to non-adaptive uniform decoding while maintaining similar performance. Beyond generation, VL-JEPA's embedding space naturally supports open-vocabulary classification, text-to-video retrieval, and discriminative VQA without any architecture modification. On eight video classification and eight video retrieval datasets, the average performance of VL-JEPA surpasses that of CLIP, SigLIP2, and Perception Encoder. At the same time, the model achieves performance comparable to classical VLMs (InstructBLIP, QwenVL) on four VQA datasets (GQA, TallyQA, POPE, and POPEv2), despite having only 1.6B parameters.
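To make the two mechanisms in the abstract concrete, here is a minimal PyTorch-style sketch of (a) training by predicting continuous target-text embeddings rather than tokens, and (b) selective decoding that skips the text decoder when the predicted embedding has barely changed. The module interfaces, the cosine objective, and the similarity threshold are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of JEPA-style vision-language training and selective
# decoding (illustrative assumptions, not the paper's released code).
import torch
import torch.nn.functional as F

class VLJEPASketch(torch.nn.Module):
    def __init__(self, vision_encoder, text_encoder, predictor, text_decoder):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g., a pretrained ViT
        self.text_encoder = text_encoder      # produces target-text embeddings
        self.predictor = predictor            # trainable embedding predictor
        self.text_decoder = text_decoder      # lightweight, invoked on demand

    def training_loss(self, video: torch.Tensor, target_text_ids: torch.Tensor):
        # Predict the continuous embedding of the target text from vision alone.
        z_pred = self.predictor(self.vision_encoder(video))
        with torch.no_grad():
            z_tgt = self.text_encoder(target_text_ids)
        # Hypothetical objective: cosine distance in embedding space, so the
        # model matches semantics rather than exact token sequences.
        return 1.0 - F.cosine_similarity(z_pred, z_tgt, dim=-1).mean()

    @torch.no_grad()
    def selective_decode(self, clips, threshold: float = 0.9):
        """Decode to text only when the predicted embedding differs enough
        from the last decoded one; the threshold is an assumed heuristic."""
        outputs, last = [], None
        for clip in clips:
            z = self.predictor(self.vision_encoder(clip))
            if last is not None and F.cosine_similarity(z, last, dim=-1).mean() > threshold:
                outputs.append(None)  # content unchanged: skip the decoder call
            else:
                outputs.append(self.text_decoder(z))
                last = z
        return outputs
```

In this reading, the reported 2.85x reduction in decoding operations would come from the skipped branches, where no autoregressive decoding is run at all.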
Related papers
- Decoupling Vision and Language: Codebook Anchored Visual Adaptation [20.393987361723724]
Large Vision-Language Models (LVLMs) use their vision encoders to translate images into representations for downstream reasoning. Existing adaptation methods modify the continuous feature interface between encoder and language model through projector tuning or other parameter-efficient updates. We introduce CRAFT, a lightweight method that fine-tunes the encoder using a discrete codebook that anchors visual representations to a stable token space.
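The summary leaves the codebook mechanism abstract. One common way to anchor continuous features to a discrete token space is vector quantization with a straight-through gradient; the sketch below shows that generic pattern under the assumption that CRAFT does something similar, and is not the paper's actual method.

```python
# Generic vector-quantization sketch (an assumption about what anchoring
# visual features to a discrete codebook could look like; not CRAFT's code).
import torch

def quantize_to_codebook(features: torch.Tensor, codebook: torch.Tensor):
    """features: (N, D) continuous visual features.
    codebook: (K, D) learned discrete anchors.
    Returns the nearest codebook entries and their indices."""
    dists = torch.cdist(features, codebook, p=2)   # (N, K) pairwise L2 distances
    indices = dists.argmin(dim=1)                  # nearest code per feature
    quantized = codebook[indices]                  # (N, D) anchored features
    # Straight-through estimator: gradients bypass the non-differentiable argmin.
    quantized = features + (quantized - features).detach()
    return quantized, indices
```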
arXiv Detail & Related papers (2026-02-23T02:39:26Z)
- Cost-Aware Model Selection for Text Classification: Multi-Objective Trade-offs Between Fine-Tuned Encoders and LLM Prompting in Production [0.0]
Large language models (LLMs) have demonstrated strong capabilities in open-ended reasoning and generative language tasks. For structured text classification problems with fixed label spaces, model selection is often driven by predictive performance alone. We show that fine-tuned encoder-based models from the BERT family achieve competitive, and often superior, classification performance.
arXiv Detail & Related papers (2026-02-06T03:54:28Z)
- 1 + 1 > 2: Detector-Empowered Video Large Language Model for Spatio-Temporal Grounding and Reasoning [53.28271278708241]
We present DEViL, a Detector-Empowered Video LLM. DEViL couples a Video LLM with an open-vocabulary detector (OVD). Unlike tokens that merely serve as spatial prompts or segmentor switches, the RST functions as both a control signal and a replacement for the OVD's text embedding.
arXiv Detail & Related papers (2025-12-07T06:11:15Z)
- LLaVA-UHD v3: Progressive Visual Compression for Efficient Native-Resolution Encoding in MLLMs [52.24096832965001]
We present LLaVA-UHD v3, an MLLM centered upon our proposed Progressive Visual Compression (PVC) method. PVC can be seamlessly integrated into a standard Vision Transformer (ViT) to enable efficient native-resolution encoding. Building upon ViT-UHD, LLaVA-UHD v3 achieves performance competitive with Qwen2-VL, while further reducing TTFT (time to first token) by 1.9x.
arXiv Detail & Related papers (2025-11-26T08:11:10Z)
- CogVLA: Cognition-Aligned Vision-Language-Action Model via Instruction-Driven Routing & Sparsification [48.81250395291505]
Recent Vision-Language-Action models require extensive post-training, resulting in high computational overhead. We propose CogVLA, a framework that leverages instruction-driven routing and sparsification to improve both efficiency and performance. CogVLA achieves state-of-the-art performance with success rates of 97.4% and 70.0%, respectively, while reducing training costs by 2.5-fold and decreasing inference latency by 2.8-fold compared to OpenVLA.
arXiv Detail & Related papers (2025-08-28T17:50:58Z)
- LVC: A Lightweight Compression Framework for Enhancing VLMs in Long Video Understanding [29.719450799231705]
Vision-Language Models (VLMs) obtain frame-level understanding capabilities through multi-frame input. Video Large Language Models (Video-LLMs) capture temporal relationships within visual features but are limited by the scarcity of high-quality video-text datasets. We propose Lightweight Video Compression (LVC), a novel method featuring the Query-Attention Video Compression mechanism.
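As a generic illustration of query-based video compression (the paper's exact Query-Attention design is not detailed in this summary), a small set of learnable queries can cross-attend to the full set of frame tokens and emit a fixed-size summary:

```python
# Sketch of compressing many per-frame tokens into a few learned queries via
# cross-attention (illustrative; not LVC's exact Query-Attention mechanism).
import torch

class QueryCompressor(torch.nn.Module):
    def __init__(self, dim: int = 768, num_queries: int = 32, num_heads: int = 8):
        super().__init__()
        self.queries = torch.nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = torch.nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (B, T*P, D), tokens from all frames flattened together.
        B = frame_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)   # (B, Q, D)
        compressed, _ = self.attn(q, frame_tokens, frame_tokens)
        return compressed                                  # (B, Q, D), Q << T*P
```

With the output length fixed at `num_queries`, the token budget passed to the language model no longer scales with video length.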
arXiv Detail & Related papers (2025-04-09T12:51:10Z)
- VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models [84.84277196012907]
We propose VLsI: Verbalized Layers-to-Interactions, a new VLM family in 2B and 7B model sizes. We validate VLsI across ten challenging vision-language benchmarks, achieving notable performance gains (11.0% for 2B and 17.4% for 7B) over GPT-4V.
arXiv Detail & Related papers (2024-12-02T18:58:25Z)
- Enhancing JEPAs with Spatial Conditioning: Robust and Efficient Representation Learning [7.083341587100975]
Image-based Joint-Embedding Predictive Architecture (IJEPA) offers an attractive alternative to the Masked Autoencoder (MAE). IJEPA drives representations to capture useful semantic information by predicting in latent rather than input space. Our "conditional" encoders show performance gains on several image classification benchmark datasets.
arXiv Detail & Related papers (2024-10-14T17:46:24Z)
- Adapting Pre-trained Language Models to Vision-Language Tasks via Dynamic Visual Prompting [83.21164539349273]
Pre-trained language models (PLMs) have played an increasing role in multimedia research. In this paper, we focus on exploring PLMs as stand-alone models for vision-language reasoning tasks. We propose a novel transfer learning approach for PLMs, termed Dynamic Visual Prompting (DVP).
arXiv Detail & Related papers (2023-06-01T07:19:28Z)