Efficient Discriminative Joint Encoders for Large Scale Vision-Language Reranking
- URL: http://arxiv.org/abs/2510.06820v1
- Date: Wed, 08 Oct 2025 09:46:09 GMT
- Title: Efficient Discriminative Joint Encoders for Large Scale Vision-Language Reranking
- Authors: Mitchell Keren Taraday, Shahaf Wagner, Chaim Baskin
- Abstract summary: Multimodal retrieval still leans on embedding-based models like CLIP for fast vector search over pre-computed image embeddings. Unlike text retrieval, where joint-encoder rerankers are standard, comparable vision-language rerankers are largely absent. We introduce EDJE, an Efficient Discriminative Joint Encoder that precomputes vision tokens offline and compresses them via a lightweight attention-based adapter.
- Score: 8.189266513060621
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Multimodal retrieval still leans on embedding-based models like CLIP for fast vector search over pre-computed image embeddings. Yet, unlike text retrieval, where joint-encoder rerankers are standard, comparable vision-language rerankers are largely absent. We find that seminal joint encoders such as BLIP are severely bottlenecked by an expensive visual feature-extraction stage, preventing practical deployment at scale. Motivated by this bottleneck, we introduce EDJE, an Efficient Discriminative Joint Encoder that precomputes vision tokens offline and compresses them via a lightweight attention-based adapter, so online inference runs only a compact joint encoder over a small set of visual tokens plus the text. EDJE preserves strong retrieval performance while drastically reducing storage and online compute, enabling high-throughput inference. Specifically, EDJE processes 50k image-text pairs/second while requiring 49kB of disk storage per image, matching prior art on Flickr (zero-shot) and COCO (fine-tuned) retrieval. The implementation and checkpoints will be made publicly available shortly.
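The offline/online split described in the abstract is easy to picture in code. Below is a minimal PyTorch sketch of an EDJE-style pipeline, not the authors' implementation: the module sizes, the 16-token budget, and the first-token scoring head are all illustrative assumptions.

```python
# Minimal sketch of an EDJE-style two-stage reranker. Dimensions, the
# token budget k, and the scoring head are illustrative assumptions.
import torch
import torch.nn as nn

class AttentionAdapter(nn.Module):
    """Compress N precomputed vision tokens into k learned query slots."""
    def __init__(self, dim=768, k=16, heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(k, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vision_tokens):                   # (B, N, dim)
        q = self.queries.expand(vision_tokens.size(0), -1, -1)
        out, _ = self.attn(q, vision_tokens, vision_tokens)
        return out                                      # (B, k, dim)

class JointReranker(nn.Module):
    """Compact joint encoder over [compressed vision tokens; text tokens]."""
    def __init__(self, dim=768, depth=4, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.score = nn.Linear(dim, 1)                  # relevance head

    def forward(self, vis_tokens, text_tokens):
        x = torch.cat([vis_tokens, text_tokens], dim=1)
        x = self.encoder(x)
        return self.score(x[:, 0]).squeeze(-1)          # one score per pair

# Offline: run the vision backbone once per image, compress, store to disk.
adapter = AttentionAdapter()
backbone_tokens = torch.randn(1, 196, 768)   # stand-in for ViT patch tokens
compact = adapter(backbone_tokens)           # (1, 16, 768) -> saved per image

# Online: rerank candidates using only the compact tokens plus the query.
reranker = JointReranker()
text_tokens = torch.randn(1, 12, 768)        # stand-in for text embeddings
print(reranker(compact, text_tokens))        # higher score = more relevant
```

Offline, the adapter output is what gets written to disk; online, only the small joint encoder runs, which is what makes reranking large candidate lists per query feasible.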
Related papers
- PHOTON: Hierarchical Autoregressive Modeling for Lightspeed and Memory-Efficient Language Generation [5.553946791700077]
We propose a hierarchical autoregressive model that replaces flat scanning with vertical, multi-resolution context access. Experimental results show that PHOTON outperforms competitive Transformer-based language models on the throughput-quality trade-off.
arXiv Detail & Related papers (2025-12-22T19:26:59Z)
- CORE: Compact Object-centric REpresentations as a New Paradigm for Token Merging in LVLMs [29.08277140543501]
We introduce CORE (Compact Object-centric REpresentations), a new paradigm for visual token compression. CORE leverages an efficient segmentation decoder to generate object masks, which serve as a high-level semantic prior to guide the merging of visual tokens. Experiments show that CORE not only establishes a new state of the art on six authoritative benchmarks for fixed-rate compression, but also achieves dramatic efficiency gains in adaptive-rate settings.
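As a rough illustration of mask-guided token merging, the fragment below average-pools patch tokens under each object mask; the pooling rule and the mask source are assumptions, not CORE's actual segmentation decoder.

```python
# Illustrative mask-guided token merging (assumed average-pooling rule;
# CORE's actual decoder and merge strategy may differ).
import torch

def merge_tokens_by_mask(tokens, mask):
    """Average-pool visual tokens that fall under the same object mask.

    tokens: (N, D) patch tokens; mask: (N,) integer object id per patch.
    Returns one token per object id, preserving object-level semantics.
    """
    ids = mask.unique()
    return torch.stack([tokens[mask == i].mean(dim=0) for i in ids])

tokens = torch.randn(196, 768)               # 14x14 ViT patch tokens
mask = torch.randint(0, 5, (196,))           # stand-in: 5 object masks
merged = merge_tokens_by_mask(tokens, mask)  # one token per object id
print(merged.shape)
```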
arXiv Detail & Related papers (2025-11-18T03:02:23Z)
- Vision-Free Retrieval: Rethinking Multimodal Search with Textual Scene Descriptions [81.33113485830711]
We introduce a vision-free, single-encoder retrieval pipeline for vision-language models. We migrate to a text-to-text paradigm with the assistance of VLLM-generated structured image descriptions. Our approach achieves state-of-the-art zero-shot performance on multiple retrieval and compositionality benchmarks.
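The pipeline reduces image search to ordinary text retrieval. A toy sketch of that shape follows; the stored descriptions and the `embed` function (a hashed bag-of-words, purely for illustration) stand in for VLLM-generated captions and a real sentence encoder.

```python
# Sketch of a vision-free retrieval loop: images are indexed by generated
# text descriptions, so search is text-to-text with a single text encoder.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder text encoder: hash words into a bag-of-words vector."""
    v = np.zeros(512)
    for w in text.lower().split():
        v[hash(w) % 512] += 1.0
    return v / (np.linalg.norm(v) + 1e-9)

# Offline: one structured description per image (hand-written stand-ins).
corpus = {
    "img_001": "a brown dog catching a red frisbee on a beach",
    "img_002": "two people riding bicycles through a city street at night",
}
index = {k: embed(v) for k, v in corpus.items()}

# Online: encode the query with the same text encoder, rank by cosine.
q = embed("dog playing with a frisbee")
ranked = sorted(index, key=lambda k: -float(q @ index[k]))
print(ranked)  # img_001 should rank first
```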
arXiv Detail & Related papers (2025-09-23T16:22:27Z)
- METEOR: Multi-Encoder Collaborative Token Pruning for Efficient Vision Language Models [92.37117312251755]
We propose a progressive pruning framework, namely Multi-Encoder collaboraTivE tOken pRuning (METEOR). For multi-vision encoding, we discard redundant tokens within each encoder via a rank-guided collaborative token assignment strategy. For multi-vision fusion, we combine the visual features from different encoders while reducing cross-encoder redundancy with cooperative pruning.
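METEOR's rank-guided, cross-encoder strategy is more involved, but the primitive it builds on is scoring tokens and dropping the redundant ones. A simplified stand-in using an assumed L2-norm importance score:

```python
# Simplified token-pruning primitive (norm-based importance is an assumed
# stand-in; METEOR's rank-guided, cross-encoder strategy is more involved).
import torch

def prune_tokens(tokens, keep_ratio=0.5):
    """Keep the top keep_ratio fraction of tokens by L2-norm importance."""
    scores = tokens.norm(dim=-1)                           # (B, N)
    k = max(1, int(tokens.size(1) * keep_ratio))
    idx = scores.topk(k, dim=1).indices.sort(dim=1).values # keep token order
    return torch.gather(tokens, 1,
                        idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))

x = torch.randn(2, 196, 768)        # tokens from one vision encoder
print(prune_tokens(x, 0.25).shape)  # (2, 49, 768): 75% of tokens dropped
```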
arXiv Detail & Related papers (2025-07-28T13:50:53Z)
- End-to-End Semantic Preservation in Text-Aware Image Compression Systems [42.76781276416154]
We present an end-to-end compression framework that retains text-specific features for Optical Character Recognition (OCR). Experiments show significant improvements in text extraction accuracy at low bitrates, even outperforming OCR on uncompressed images. We extend this study to general-purpose encoders, exploring their capacity to preserve hidden semantics under extreme compression.
arXiv Detail & Related papers (2025-03-25T09:36:13Z)
- Vision-centric Token Compression in Large Language Model [51.92055188780033]
Vision Centric Token Compression (Vist) is a slow-fast compression framework that mirrors human reading. On eleven in-context learning benchmarks, Vist achieves the same accuracy with 2.3 times fewer tokens, cutting FLOPs by 16% and memory by 50%.
arXiv Detail & Related papers (2025-02-02T13:10:06Z)
- Stable Diffusion is a Natural Cross-Modal Decoder for Layered AI-generated Image Compression [7.643300240138419]
We introduce a scalable cross-modal compression framework that incorporates multiple human-comprehensible modalities. Our framework encodes images into a layered bitstream consisting of a semantic layer that delivers high-level semantic information. Our method proficiently restores both semantic and visual details, competing against baseline approaches at extremely low bitrates.
arXiv Detail & Related papers (2024-12-17T15:01:35Z)
- Efficient Multi-modal Large Language Models via Visual Token Grouping [55.482198808206284]
High-resolution images and videos pose a barrier to the broader adoption of MLLMs. Compressing vision tokens in MLLMs has emerged as a promising approach to reduce inference costs. We introduce VisToG, a novel grouping mechanism that leverages the capabilities of pre-trained vision encoders to group similar image segments.
arXiv Detail & Related papers (2024-11-26T09:36:02Z)
- UNIT: Unifying Image and Text Recognition in One Vision Encoder [51.140564856352825]
UNIT is a novel training framework aimed at UNifying Image and Text recognition within a single model.
We show that UNIT significantly outperforms existing methods on document-related tasks.
Notably, UNIT retains the original vision encoder architecture, making it cost-free in terms of inference and deployment.
arXiv Detail & Related papers (2024-09-06T08:02:43Z)
- Expediting Contrastive Language-Image Pretraining via Self-distilled Encoders [10.649402840032138]
ECLIPSE features a distinctive distillation architecture wherein a single text encoder is shared between an online image encoder and a momentum image encoder.
Based on the unified text embedding space, ECLIPSE compensates for the additional computational cost of the momentum image encoder by expediting the online image encoder.
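A minimal sketch of the momentum-encoder machinery this setup relies on, assuming the standard exponential-moving-average update; the shared text encoder and the contrastive loss are omitted.

```python
# Sketch of a momentum-encoder update: the online image encoder is trained,
# the momentum copy is an EMA of it, and (per the paper's idea) one shared
# text encoder anchors both in the same embedding space. The EMA rule below
# is the standard one; the rest of training is omitted.
import copy
import torch
import torch.nn as nn

online = nn.Linear(512, 256)        # stand-in online image encoder
momentum = copy.deepcopy(online)    # momentum image encoder (not trained)
for p in momentum.parameters():
    p.requires_grad_(False)

@torch.no_grad()
def ema_update(online, momentum, m=0.995):
    """momentum <- m * momentum + (1 - m) * online (per parameter)."""
    for po, pm in zip(online.parameters(), momentum.parameters()):
        pm.mul_(m).add_(po, alpha=1.0 - m)

ema_update(online, momentum)        # call once per training step
```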
arXiv Detail & Related papers (2023-12-19T23:11:06Z)
- LexLIP: Lexicon-Bottlenecked Language-Image Pre-Training for Large-Scale Image-Text Retrieval [71.01982683581572]
The conventional dense retrieval paradigm relies on encoding images and texts into dense representations using dual-stream encoders.
We propose the lexicon-weighting paradigm, where sparse representations in vocabulary space are learned for images and texts.
We introduce a novel pre-training framework that learns importance-aware lexicon representations.
Our framework achieves 5.5-221.3X faster retrieval speed and 13.2-48.8X less index storage memory.
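In the lexicon-weighting paradigm, each item becomes a sparse map of term weights, so retrieval runs on an inverted index rather than dense vectors. A toy version with hand-picked weights (LexLIP learns them end-to-end):

```python
# Toy lexicon-weighted retrieval: each item is a sparse {term: weight} map,
# and relevance is the dot product over shared terms, as in an inverted
# index. Weights here are hand-picked; LexLIP learns them end-to-end.
from collections import defaultdict

image_lexicons = {
    "img_001": {"dog": 2.1, "frisbee": 1.8, "beach": 0.9},
    "img_002": {"bicycle": 2.4, "street": 1.2, "night": 1.0},
}

# Build an inverted index: term -> [(image_id, weight), ...]
inverted = defaultdict(list)
for img, lex in image_lexicons.items():
    for term, w in lex.items():
        inverted[term].append((img, w))

def search(query_lexicon):
    """Score = sum over shared terms of query_weight * image_weight."""
    scores = defaultdict(float)
    for term, qw in query_lexicon.items():
        for img, iw in inverted.get(term, []):
            scores[img] += qw * iw
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(search({"dog": 1.0, "frisbee": 0.5}))  # img_001 ranks first
```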
arXiv Detail & Related papers (2023-02-06T16:24:41Z)