MIEB: Massive Image Embedding Benchmark
- URL: http://arxiv.org/abs/2504.10471v1
- Date: Mon, 14 Apr 2025 17:54:28 GMT
- Title: MIEB: Massive Image Embedding Benchmark
- Authors: Chenghao Xiao, Isaac Chung, Imene Kerboua, Jamie Stirling, Xin Zhang, Márton Kardos, Roman Solomatin, Noura Al Moubayed, Kenneth Enevoldsen, Niklas Muennighoff,
- Abstract summary: We introduce the Massive Image Embedding Benchmark (MIEB) to evaluate the performance of image and image-text embedding models.<n>MIEB spans 38 languages across 130 individual tasks, which we group into 8 high-level categories.<n>We benchmark 50 models across our benchmark, finding that no single method dominates across all task categories.
- Score: 12.080155288744594
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image representations are often evaluated through disjointed, task-specific protocols, leading to a fragmented understanding of model capabilities. For instance, it is unclear whether an image embedding model adept at clustering images is equally good at retrieving relevant images given a piece of text. We introduce the Massive Image Embedding Benchmark (MIEB) to evaluate the performance of image and image-text embedding models across the broadest spectrum to date. MIEB spans 38 languages across 130 individual tasks, which we group into 8 high-level categories. We benchmark 50 models across our benchmark, finding that no single method dominates across all task categories. We reveal hidden capabilities in advanced vision models such as their accurate visual representation of texts, and their yet limited capabilities in interleaved encodings and matching images and texts in the presence of confounders. We also show that the performance of vision encoders on MIEB correlates highly with their performance when used in multimodal large language models. Our code, dataset, and leaderboard are publicly available at https://github.com/embeddings-benchmark/mteb.
Related papers
- ABC: Achieving Better Control of Multimodal Embeddings using VLMs [61.396457715710774]
Visual embedding models excel at zero-shot tasks like visual retrieval and classification.<n>Existing CLIP-based approaches embed images and text independently, and fuse the result.<n>We introduce ABC, an open-source multimodal embedding model that uses a vision-language model backbone.
arXiv Detail & Related papers (2025-03-01T03:29:02Z) - FLAIR: VLM with Fine-grained Language-informed Image Representations [49.2684130383925]
FLAIR is an approach that utilizes long and detailed image descriptions to learn localized image embeddings.<n>Our experiments demonstrate the effectiveness of FLAIR trained on 30M image-text pairs in capturing fine-grained visual information.
arXiv Detail & Related papers (2024-12-04T18:56:04Z) - ARMADA: Attribute-Based Multimodal Data Augmentation [93.05614922383822]
Attribute-based Multimodal Data Augmentation (ARMADA) is a novel multimodal data augmentation method via knowledge-guided manipulation of visual attributes.
ARMADA is a novel multimodal data generation framework that: (i) extracts knowledge-grounded attributes from symbolic KBs for semantically consistent yet distinctive image-text pair generation.
This also highlights the need to leverage external knowledge proxies for enhanced interpretability and real-world grounding.
arXiv Detail & Related papers (2024-08-19T15:27:25Z) - TRINS: Towards Multimodal Language Models that Can Read [61.17806538631744]
TRINS is a Text-Rich image INStruction dataset.
It contains 39,153 text-rich images, captions, and 102,437 questions.
We introduce a Language-vision Reading Assistant (LaRA) which is good at understanding textual content within images.
arXiv Detail & Related papers (2024-06-10T18:52:37Z) - CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models [58.95889895912716]
We introduce a new benchmark, named as CODIS, designed to assess the ability of models to use context provided in free-form text to enhance visual comprehension.
Our findings indicate that MLLMs consistently fall short of human performance on this benchmark.
This underscores the pressing need to enhance the ability of MLLMs to comprehend visuals in a context-dependent manner.
arXiv Detail & Related papers (2024-02-21T08:21:12Z) - Generating Images with Multimodal Language Models [78.6660334861137]
We propose a method to fuse frozen text-only large language models with pre-trained image encoder and decoder models.
Our model demonstrates a wide suite of multimodal capabilities: image retrieval, novel image generation, and multimodal dialogue.
arXiv Detail & Related papers (2023-05-26T19:22:03Z) - Rethinking Benchmarks for Cross-modal Image-text Retrieval [44.31783230767321]
Cross-modal semantic understanding and matching is a major challenge in image-text retrieval.
In this paper, we review the two common benchmarks and observe that they are insufficient to assess the true capability of models on fine-grained cross-modal semantic matching.
We propose a novel semi-automatic renovation approach to refine coarse-grained sentences into finer-grained ones with little human effort.
The results show that even the state-of-the-art models have much room for improvement in fine-grained semantic understanding.
arXiv Detail & Related papers (2023-04-21T09:07:57Z) - Deep Multimodal Image-Text Embeddings for Automatic Cross-Media
Retrieval [0.0]
We introduce an end-to-end deep multimodal convolutional-recurrent network for learning both vision and language representations simultaneously.
The model learns which pairs are a match (positive) and which ones are a mismatch (negative) using a hinge-based triplet ranking.
arXiv Detail & Related papers (2020-02-23T23:58:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.