Efficient Discriminative Joint Encoders for Large Scale Vision-Language Reranking
- URL: http://arxiv.org/abs/2510.06820v1
- Date: Wed, 08 Oct 2025 09:46:09 GMT
- Title: Efficient Discriminative Joint Encoders for Large Scale Vision-Language Reranking
- Authors: Mitchell Keren Taraday, Shahaf Wagner, Chaim Baskin
- Abstract summary: Multimodal retrieval still leans on embedding-based models like CLIP for fast vector search over pre-computed image embeddings. Unlike text retrieval, where joint-encoder rerankers are standard, comparable vision-language rerankers are largely absent. We introduce EDJE, an Efficient Discriminative Joint Encoder that precomputes vision tokens offline and compresses them via a lightweight attention-based adapter.
- Score: 8.189266513060621
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Multimodal retrieval still leans on embedding-based models like CLIP for fast vector search over pre-computed image embeddings. Yet, unlike text retrieval, where joint-encoder rerankers are standard, comparable vision-language rerankers are largely absent. We find that seminal joint encoders such as BLIP are severely bottlenecked by an expensive visual feature-extraction stage, preventing practical deployment at scale. Motivated by this bottleneck, we introduce EDJE, an Efficient Discriminative Joint Encoder that precomputes vision tokens offline and compresses them via a lightweight attention-based adapter, so online inference runs only a compact joint encoder over a small set of visual tokens plus the text. EDJE preserves strong retrieval performance while drastically reducing storage and online compute, enabling high-throughput inference. Specifically, EDJE processes 50k image-text pairs/second while requiring 49kB of disk storage per image, matching prior art on Flickr (zero-shot) and COCO (fine-tuned) retrieval. The implementation and checkpoints will be made publicly available shortly.
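The offline/online split described in the abstract is easy to picture in code. Below is a minimal PyTorch sketch of an EDJE-style pipeline, not the authors' implementation: the module sizes, the 16-token budget, and the first-token scoring head are all illustrative assumptions.

```python
# Minimal sketch of an EDJE-style two-stage reranker. Dimensions, the
# token budget k, and the scoring head are illustrative assumptions.
import torch
import torch.nn as nn

class AttentionAdapter(nn.Module):
    """Compress N precomputed vision tokens into k learned query slots."""
    def __init__(self, dim=768, k=16, heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(k, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vision_tokens):                   # (B, N, dim)
        q = self.queries.expand(vision_tokens.size(0), -1, -1)
        out, _ = self.attn(q, vision_tokens, vision_tokens)
        return out                                      # (B, k, dim)

class JointReranker(nn.Module):
    """Compact joint encoder over [compressed vision tokens; text tokens]."""
    def __init__(self, dim=768, depth=4, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.score = nn.Linear(dim, 1)                  # relevance head

    def forward(self, vis_tokens, text_tokens):
        x = torch.cat([vis_tokens, text_tokens], dim=1)
        x = self.encoder(x)
        return self.score(x[:, 0]).squeeze(-1)          # one score per pair

# Offline: run the vision backbone once per image, compress, store to disk.
adapter = AttentionAdapter()
backbone_tokens = torch.randn(1, 196, 768)   # stand-in for ViT patch tokens
compact = adapter(backbone_tokens)           # (1, 16, 768) -> saved per image

# Online: rerank candidates using only the compact tokens plus the query.
reranker = JointReranker()
text_tokens = torch.randn(1, 12, 768)        # stand-in for text embeddings
print(reranker(compact, text_tokens))        # higher score = more relevant
```

Offline, the adapter output is what gets written to disk; online, only the small joint encoder runs, which is what makes reranking large candidate lists per query feasible.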
Related papers
- PHOTON: Hierarchical Autoregressive Modeling for Lightspeed and Memory-Efficient Language Generation [5.553946791700077]
We propose a hierarchical autoregressive model that replaces flat scanning with vertical, multi-resolution context access. Experimental results show that PHOTON outperforms competitive Transformer-based language models on the throughput-quality trade-off.
arXiv Detail & Related papers (2025-12-22T19:26:59Z)
- CORE: Compact Object-centric REpresentations as a New Paradigm for Token Merging in LVLMs [29.08277140543501]
We introduce CORE (Compact Object-centric REpresentations), a new paradigm for visual token compression. CORE leverages an efficient segmentation decoder to generate object masks, which serve as a high-level semantic prior to guide the merging of visual tokens. Experiments show that CORE not only establishes a new state of the art on six authoritative benchmarks for fixed-rate compression, but also achieves dramatic efficiency gains in adaptive-rate settings.
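As a rough illustration of mask-guided token merging, the fragment below average-pools patch tokens under each object mask; the pooling rule and the mask source are assumptions, not CORE's actual segmentation decoder.

```python
# Illustrative mask-guided token merging (assumed average-pooling rule;
# CORE's actual decoder and merge strategy may differ).
import torch

def merge_tokens_by_mask(tokens, mask):
    """Average-pool visual tokens that fall under the same object mask.

    tokens: (N, D) patch tokens; mask: (N,) integer object id per patch.
    Returns one token per object id, preserving object-level semantics.
    """
    ids = mask.unique()
    return torch.stack([tokens[mask == i].mean(dim=0) for i in ids])

tokens = torch.randn(196, 768)               # 14x14 ViT patch tokens
mask = torch.randint(0, 5, (196,))           # stand-in: 5 object masks
merged = merge_tokens_by_mask(tokens, mask)  # one token per object id
print(merged.shape)
```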
arXiv Detail & Related papers (2025-11-18T03:02:23Z)
- Vision-Free Retrieval: Rethinking Multimodal Search with Textual Scene Descriptions [81.33113485830711]
We introduce a vision-free, single-encoder retrieval pipeline for vision-language models. We migrate to a text-to-text paradigm with the assistance of VLLM-generated structured image descriptions. Our approach achieves state-of-the-art zero-shot performance on multiple retrieval and compositionality benchmarks.
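The pipeline reduces image search to ordinary text retrieval. A toy sketch of that shape follows; the stored descriptions and the `embed` function (a hashed bag-of-words, purely for illustration) stand in for VLLM-generated captions and a real sentence encoder.

```python
# Sketch of a vision-free retrieval loop: images are indexed by generated
# text descriptions, so search is text-to-text with a single text encoder.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder text encoder: hash words into a bag-of-words vector."""
    v = np.zeros(512)
    for w in text.lower().split():
        v[hash(w) % 512] += 1.0
    return v / (np.linalg.norm(v) + 1e-9)

# Offline: one structured description per image (hand-written stand-ins).
corpus = {
    "img_001": "a brown dog catching a red frisbee on a beach",
    "img_002": "two people riding bicycles through a city street at night",
}
index = {k: embed(v) for k, v in corpus.items()}

# Online: encode the query with the same text encoder, rank by cosine.
q = embed("dog playing with a frisbee")
ranked = sorted(index, key=lambda k: -float(q @ index[k]))
print(ranked)  # img_001 should rank first
```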
arXiv Detail & Related papers (2025-09-23T16:22:27Z)
- METEOR: Multi-Encoder Collaborative Token Pruning for Efficient Vision Language Models [92.37117312251755]
We propose a progressive pruning framework, namely Multi-Encoder collaboraTivE tOken pRuning (METEOR). For multi-vision encoding, we discard redundant tokens within each encoder via a rank-guided collaborative token assignment strategy. For multi-vision fusion, we combine the visual features from different encoders while reducing cross-encoder redundancy with cooperative pruning.
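METEOR's rank-guided, cross-encoder strategy is more involved, but the primitive it builds on is scoring tokens and dropping the redundant ones. A simplified stand-in using an assumed L2-norm importance score:

```python
# Simplified token-pruning primitive (norm-based importance is an assumed
# stand-in; METEOR's rank-guided, cross-encoder strategy is more involved).
import torch

def prune_tokens(tokens, keep_ratio=0.5):
    """Keep the top keep_ratio fraction of tokens by L2-norm importance."""
    scores = tokens.norm(dim=-1)                           # (B, N)
    k = max(1, int(tokens.size(1) * keep_ratio))
    idx = scores.topk(k, dim=1).indices.sort(dim=1).values # keep token order
    return torch.gather(tokens, 1,
                        idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))

x = torch.randn(2, 196, 768)        # tokens from one vision encoder
print(prune_tokens(x, 0.25).shape)  # (2, 49, 768): 75% of tokens dropped
```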
arXiv Detail & Related papers (2025-07-28T13:50:53Z)
- End-to-End Semantic Preservation in Text-Aware Image Compression Systems [42.76781276416154]
We present an end-to-end compression framework that retains text-specific features for Optical Character Recognition (OCR). Experiments show significant improvements in text extraction accuracy at low bitrates, even outperforming OCR on uncompressed images. We extend this study to general-purpose encoders, exploring their capacity to preserve hidden semantics under extreme compression.
arXiv Detail & Related papers (2025-03-25T09:36:13Z)
- Vision-centric Token Compression in Large Language Model [51.92055188780033]
Vision Centric Token Compression (Vist) is a slow-fast compression framework that mirrors human reading. On eleven in-context learning benchmarks, Vist achieves the same accuracy with 2.3 times fewer tokens, cutting FLOPs by 16% and memory by 50%.
arXiv Detail & Related papers (2025-02-02T13:10:06Z)
- Stable Diffusion is a Natural Cross-Modal Decoder for Layered AI-generated Image Compression [7.643300240138419]
We introduce a scalable cross-modal compression framework that incorporates multiple human-comprehensible modalities. Our framework encodes images into a layered bitstream consisting of a semantic layer that delivers high-level semantic information. Our method proficiently restores both semantic and visual details, competing against baseline approaches at extremely low bitrates.
arXiv Detail & Related papers (2024-12-17T15:01:35Z)
- Efficient Multi-modal Large Language Models via Visual Token Grouping [55.482198808206284]
High-resolution images and videos pose a barrier to the broader adoption of MLLMs. Compressing vision tokens in MLLMs has emerged as a promising approach to reduce inference costs. We introduce VisToG, a novel grouping mechanism that leverages the capabilities of pre-trained vision encoders to group similar image segments.
arXiv Detail & Related papers (2024-11-26T09:36:02Z)
- UNIT: Unifying Image and Text Recognition in One Vision Encoder [51.140564856352825]
UNIT is a novel training framework aimed at UNifying Image and Text recognition within a single model.
We show that UNIT significantly outperforms existing methods on document-related tasks.
Notably, UNIT retains the original vision encoder architecture, making it cost-free in terms of inference and deployment.
arXiv Detail & Related papers (2024-09-06T08:02:43Z)
- Expediting Contrastive Language-Image Pretraining via Self-distilled Encoders [10.649402840032138]
ECLIPSE features a distinctive distillation architecture wherein a single text encoder is shared between an online image encoder and a momentum image encoder.
Based on the unified text embedding space, ECLIPSE compensates for the additional computational cost of the momentum image encoder by expediting the online image encoder.
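A minimal sketch of the momentum-encoder machinery this setup relies on, assuming the standard exponential-moving-average update; the shared text encoder and the contrastive loss are omitted.

```python
# Sketch of a momentum-encoder update: the online image encoder is trained,
# the momentum copy is an EMA of it, and (per the paper's idea) one shared
# text encoder anchors both in the same embedding space. The EMA rule below
# is the standard one; the rest of training is omitted.
import copy
import torch
import torch.nn as nn

online = nn.Linear(512, 256)        # stand-in online image encoder
momentum = copy.deepcopy(online)    # momentum image encoder (not trained)
for p in momentum.parameters():
    p.requires_grad_(False)

@torch.no_grad()
def ema_update(online, momentum, m=0.995):
    """momentum <- m * momentum + (1 - m) * online (per parameter)."""
    for po, pm in zip(online.parameters(), momentum.parameters()):
        pm.mul_(m).add_(po, alpha=1.0 - m)

ema_update(online, momentum)        # call once per training step
```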
arXiv Detail & Related papers (2023-12-19T23:11:06Z)
- LexLIP: Lexicon-Bottlenecked Language-Image Pre-Training for Large-Scale Image-Text Retrieval [71.01982683581572]
The conventional dense retrieval paradigm relies on encoding images and texts into dense representations using dual-stream encoders.
We propose the lexicon-weighting paradigm, where sparse representations in vocabulary space are learned for images and texts.
We introduce a novel pre-training framework that learns importance-aware lexicon representations.
Our framework achieves 5.5-221.3X faster retrieval speed and 13.2-48.8X less index storage memory.
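In the lexicon-weighting paradigm, each item becomes a sparse map of term weights, so retrieval runs on an inverted index rather than dense vectors. A toy version with hand-picked weights (LexLIP learns them end-to-end):

```python
# Toy lexicon-weighted retrieval: each item is a sparse {term: weight} map,
# and relevance is the dot product over shared terms, as in an inverted
# index. Weights here are hand-picked; LexLIP learns them end-to-end.
from collections import defaultdict

image_lexicons = {
    "img_001": {"dog": 2.1, "frisbee": 1.8, "beach": 0.9},
    "img_002": {"bicycle": 2.4, "street": 1.2, "night": 1.0},
}

# Build an inverted index: term -> [(image_id, weight), ...]
inverted = defaultdict(list)
for img, lex in image_lexicons.items():
    for term, w in lex.items():
        inverted[term].append((img, w))

def search(query_lexicon):
    """Score = sum over shared terms of query_weight * image_weight."""
    scores = defaultdict(float)
    for term, qw in query_lexicon.items():
        for img, iw in inverted.get(term, []):
            scores[img] += qw * iw
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(search({"dog": 1.0, "frisbee": 0.5}))  # img_001 ranks first
```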
arXiv Detail & Related papers (2023-02-06T16:24:41Z)