Related papers: ABC: Achieving Better Control of Multimodal Embeddings using VLMs

ABC: Achieving Better Control of Multimodal Embeddings using VLMs

URL: http://arxiv.org/abs/2503.00329v2
Date: Wed, 20 Aug 2025 19:09:06 GMT
Title: ABC: Achieving Better Control of Multimodal Embeddings using VLMs
Authors: Benjamin Schneider, Florian Kerschbaum, Wenhu Chen,
Abstract summary: Visual embedding models excel at zero-shot tasks like visual retrieval and classification.<n>These models cannot be used for tasks that contain ambiguity or require user instruction.<n>We introduce ABC, an open-source multimodal embedding model that uses a vision-language model backbone.
Score: 61.396457715710774
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Visual embedding models excel at zero-shot tasks like visual retrieval and classification. However, these models cannot be used for tasks that contain ambiguity or require user instruction. These tasks necessitate an embedding model which outputs can use a natural language instruction to control the representation of a visual embedding. Existing CLIP-based approaches embed images and text independently, and fuse the result. We find that this results in weak interactions between modalities, and poor user control over the representation. We introduce ABC, an open-source multimodal embedding model that uses a vision-language model backbone to deeply integrate image features with natural language instructions. ABC achieves best-for-size performance on MSCOCO image-to-text retrieval and is the top performing model on classification and VQA tasks in the Massive Multimodal Embedding Benchmark. With a strongly unified vision-language representation, ABC can use natural language to solve subtle and potentially ambiguous visual retrieval problems. To evaluate this capability, we design CtrlBench, a benchmark that requires interleaving textual instructions with image content for correct retrieval. ABC advances the state of visual embeddings, outputting high-quality visual representations with natural language control. Our model and datasets are available at our project page: https://tiger-ai-lab.github.io/ABC/

Related papers

Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries. We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images [5.587329786636647]
Contrastive Language-Image Pretraining (CLIP) is a highly effective method for aligning images and texts in a shared embedding space.<n>CLIP models often struggle with text-only tasks, underperforming compared to specialized text models.<n>In this work, we build upon our previous model, jina-clip-v1, by introducing a refined framework that utilizes multi-task, multi-stage contrastive learning across multiple languages.<n>The resulting model, jina-clip-v2, outperforms its predecessor on text-only and multimodal tasks, while adding multilingual support, better understanding of complex visual documents and efficiency gains.
arXiv Detail & Related papers (2024-12-11T22:28:12Z)
CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models [58.95889895912716]
We introduce a new benchmark, named as CODIS, designed to assess the ability of models to use context provided in free-form text to enhance visual comprehension. Our findings indicate that MLLMs consistently fall short of human performance on this benchmark. This underscores the pressing need to enhance the ability of MLLMs to comprehend visuals in a context-dependent manner.
arXiv Detail & Related papers (2024-02-21T08:21:12Z)
Bootstrapping Vision-Language Learning with Decoupled Language Pre-training [46.570154746311935]
We present a novel methodology aimed at optimizing the application of frozen large language models (LLMs) for resource-intensive vision-language pre-training. Our approach diverges by concentrating on the language component, specifically identifying the optimal prompts to align with visual features. Our framework is modality-agnostic and flexible in terms of architectural design, as validated by its successful application in a video learning task.
arXiv Detail & Related papers (2023-07-13T21:08:15Z)
Generating Images with Multimodal Language Models [78.6660334861137]
We propose a method to fuse frozen text-only large language models with pre-trained image encoder and decoder models. Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue.
arXiv Detail & Related papers (2023-05-26T19:22:03Z)
Images in Language Space: Exploring the Suitability of Large Language Models for Vision & Language Tasks [17.97052348690598]
Large language models have demonstrated robust performance on various language tasks using zero-shot or few-shot learning paradigms. multimodal models that can additionally handle images as input have yet to catch up in size and generality with language-only models. We make visual information accessible to the language model using separate verbalisation models.
arXiv Detail & Related papers (2023-05-23T07:50:36Z)
Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment [81.73717488887938]
Language-Quantized AutoEncoder (LQAE) learns to align text-image data in an unsupervised manner by leveraging pretrained language models. LQAE learns to represent similar images with similar clusters of text tokens, thereby aligning these two modalities without the use of aligned text-image pairs. This enables few-shot image classification with large language models (e.g., GPT-3) as well as linear classification of images based on BERT text features.
arXiv Detail & Related papers (2023-02-02T06:38:44Z)
Multimodal Knowledge Alignment with Reinforcement Learning [103.68816413817372]
ESPER extends language-only zero-shot models to unseen multimodal tasks, like image and audio captioning. Our key novelty is to use reinforcement learning to align multimodal inputs to language model generations without direct supervision. Experiments demonstrate that ESPER outperforms baselines and prior work on a variety of zero-shot tasks.
arXiv Detail & Related papers (2022-05-25T10:12:17Z)
On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs. Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
Unifying Vision-and-Language Tasks via Text Generation [81.3910771082967]
We propose a unified framework that learns different tasks in a single architecture. Our models learn to generate labels in text based on the visual and textual inputs. Our generative approach shows better generalization ability on answering questions that have rare answers.
arXiv Detail & Related papers (2021-02-04T17:59:30Z)

This list is automatically generated from the titles and abstracts of the papers in this site.