Entity Image and Mixed-Modal Image Retrieval Datasets
- URL: http://arxiv.org/abs/2506.02291v1
- Date: Mon, 02 Jun 2025 22:04:06 GMT
- Title: Entity Image and Mixed-Modal Image Retrieval Datasets
- Authors: Cristian-Ioan Blaga, Paul Suganthan, Sahil Dua, Krishna Srinivasan, Enrique Alfonseca, Peter Dornbach, Tom Duerig, Imed Zitouni, Zhe Dong
- Abstract summary: This paper introduces a novel benchmark to rigorously evaluate image retrieval that demands deep cross-modal contextual understanding. We present two new datasets: the Entity Image dataset (EI), providing canonical images for Wikipedia entities, and the Mixed-Modal Image Retrieval dataset (MMIR), derived from the WIT dataset. We empirically validate the benchmark's utility as both a training corpus and an evaluation set for mixed-modal retrieval.
- Score: 9.6977953463099
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Despite advances in multimodal learning, challenging benchmarks for mixed-modal image retrieval that combines visual and textual information are lacking. This paper introduces a novel benchmark to rigorously evaluate image retrieval that demands deep cross-modal contextual understanding. We present two new datasets: the Entity Image Dataset (EI), providing canonical images for Wikipedia entities, and the Mixed-Modal Image Retrieval Dataset (MMIR), derived from the WIT dataset. The MMIR benchmark features two challenging query types requiring models to ground textual descriptions in the context of provided visual entities: single entity-image queries (one entity image with descriptive text) and multi-entity-image queries (multiple entity images with relational text). We empirically validate the benchmark's utility as both a training corpus and an evaluation set for mixed-modal retrieval. The quality of both datasets is further affirmed through crowd-sourced human annotations. The datasets are accessible through the GitHub page: https://github.com/google-research-datasets/wit-retrieval.
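As an illustration only, the sketch below shows one way a single entity-image query (one entity image plus descriptive text) could be scored against candidate images with an off-the-shelf CLIP-style dual encoder. The checkpoint name, the averaging-based fusion of image and text embeddings, and the file paths are assumptions for the sketch, not the method described in the paper.

```python
# Illustrative sketch: scoring a mixed-modal query (entity image + text)
# against candidate images with a generic CLIP-style dual encoder.
# Checkpoint, fusion-by-averaging, and file names are assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

entity_image = Image.open("entity.jpg")                  # canonical entity image (EI-style)
query_text = "the bridge near this landmark at night"    # descriptive text grounding the entity
candidates = [Image.open(p) for p in ["cand1.jpg", "cand2.jpg", "cand3.jpg"]]

with torch.no_grad():
    inputs = processor(text=[query_text], images=[entity_image] + candidates,
                       return_tensors="pt", padding=True)
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])

# Normalize, then fuse the entity-image and text embeddings by simple averaging.
image_emb = torch.nn.functional.normalize(image_emb, dim=-1)
text_emb = torch.nn.functional.normalize(text_emb, dim=-1)
query_emb = torch.nn.functional.normalize(image_emb[0] + text_emb[0], dim=-1)

# Rank candidate images by cosine similarity to the fused query.
scores = image_emb[1:] @ query_emb
print(scores.argsort(descending=True))
```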
Related papers
- JourneyDB: A Benchmark for Generative Image Understanding [89.02046606392382]
We introduce a comprehensive dataset, referred to as JourneyDB, that caters to the domain of generative images.
Our meticulously curated dataset comprises 4 million distinct and high-quality generated images.
On our dataset, we have devised four benchmarks to assess the performance of generated image comprehension.
arXiv Detail & Related papers (2023-07-03T02:39:08Z)
- EDIS: Entity-Driven Image Search over Multimodal Web Content [95.40238328527931]
We introduce Entity-Driven Image Search (EDIS), a dataset for cross-modal image search in the news domain.
EDIS consists of 1 million web images from actual search engine results and curated datasets, with each image paired with a textual description.
arXiv Detail & Related papers (2023-05-23T02:59:19Z)
- AToMiC: An Image/Text Retrieval Test Collection to Support Multimedia Content Creation [42.35572014527354]
The AToMiC dataset is designed to advance research in image/text cross-modal retrieval.
We leverage hierarchical structures and diverse domains of texts, styles, and types of images, as well as large-scale image-document associations embedded in Wikipedia.
AToMiC offers a testbed for scalable, diverse, and reproducible multimedia retrieval research.
arXiv Detail & Related papers (2023-04-04T17:11:34Z)
- Named Entity and Relation Extraction with Multi-Modal Retrieval [51.660650522630526]
Multi-modal named entity recognition (NER) and relation extraction (RE) aim to leverage relevant image information to improve the performance of NER and RE.
We propose a novel Multi-modal Retrieval based framework (MoRe)
MoRe contains a text retrieval module and an image-based retrieval module, which retrieve related knowledge of the input text and image in the knowledge corpus respectively.
arXiv Detail & Related papers (2022-12-03T13:11:32Z)
- Re-Imagen: Retrieval-Augmented Text-to-Image Generator [58.60472701831404]
Retrieval-Augmented Text-to-Image Generator (Re-Imagen)
arXiv Detail & Related papers (2022-09-29T00:57:28Z)
- Progressive Learning for Image Retrieval with Hybrid-Modality Queries [48.79599320198615]
Image retrieval with hybrid-modality queries is also known as composing text and image for image retrieval (CTI-IR).
We decompose the CTI-IR task into a three-stage learning problem to progressively learn the complex knowledge for image retrieval with hybrid-modality queries.
Our proposed model significantly outperforms state-of-the-art methods in mean Recall@K by 24.9% and 9.5% on the Fashion-IQ and Shoes benchmark datasets, respectively.
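Recall@K counts a query as a hit when a relevant item appears among the top-K retrieved candidates; a minimal sketch of how the metric is typically computed follows. The function and data layout are illustrative assumptions, not taken from the paper.

```python
# Illustrative Recall@K computation for retrieval: the fraction of queries
# whose relevant candidate appears among the top-K ranked results.
import numpy as np

def recall_at_k(similarity, relevant_idx, k=10):
    """similarity: (num_queries, num_candidates) score matrix;
    relevant_idx: (num_queries,) index of the single relevant candidate per query."""
    top_k = np.argsort(-similarity, axis=1)[:, :k]                 # highest scores first
    hits = (top_k == np.asarray(relevant_idx)[:, None]).any(axis=1)
    return hits.mean()

# Example: 3 queries over 5 candidates, relevant candidates at indices 2, 0, 4.
sims = np.random.rand(3, 5)
print(recall_at_k(sims, [2, 0, 4], k=2))
```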
arXiv Detail & Related papers (2022-04-24T08:10:06Z)
- ARTEMIS: Attention-based Retrieval with Text-Explicit Matching and Implicit Similarity [16.550790981646276]
Current approaches combine the features of the two query elements into a single representation.
Our work aims at shedding new light on the task by looking at it through the prism of two familiar and related frameworks: text-to-image and image-to-image retrieval.
arXiv Detail & Related papers (2022-03-15T17:29:20Z)
- Deep Multimodal Image-Text Embeddings for Automatic Cross-Media Retrieval [0.0]
We introduce an end-to-end deep multimodal convolutional-recurrent network for learning both vision and language representations simultaneously.
The model learns which pairs are a match (positive) and which are a mismatch (negative) using a hinge-based triplet ranking loss.
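A hinge-based triplet ranking loss of this kind is standard in cross-modal retrieval; the sketch below shows one common formulation that sums hinge penalties over in-batch negatives with a fixed margin. The margin value and batch construction are assumptions for illustration, not details from the paper.

```python
# Minimal sketch of a hinge-based triplet ranking loss over image-text pairs.
# Margin and in-batch negative sampling are illustrative assumptions.
import torch
import torch.nn.functional as F

def triplet_ranking_loss(image_emb, text_emb, margin=0.2):
    """image_emb, text_emb: (batch, dim) embeddings; row i of each is a matching pair."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    scores = image_emb @ text_emb.t()                 # cosine similarities, (batch, batch)
    positives = scores.diag().unsqueeze(1)            # matched-pair scores, (batch, 1)
    # Hinge: penalize any negative that comes within `margin` of the positive score.
    cost_text = (margin + scores - positives).clamp(min=0)        # image vs. wrong texts
    cost_image = (margin + scores - positives.t()).clamp(min=0)   # text vs. wrong images
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    cost_text = cost_text.masked_fill(mask, 0)
    cost_image = cost_image.masked_fill(mask, 0)
    return (cost_text + cost_image).sum() / scores.size(0)
```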
arXiv Detail & Related papers (2020-02-23T23:58:04Z)
- Expressing Objects just like Words: Recurrent Visual Embedding for Image-Text Matching [102.62343739435289]
Existing image-text matching approaches infer the similarity of an image-text pair by capturing and aggregating the affinities between the text and each independent object of the image.
We propose a Dual Path Recurrent Neural Network (DP-RNN), which processes images and sentences symmetrically with recurrent neural networks (RNNs).
Our model achieves state-of-the-art performance on the Flickr30K dataset and competitive performance on the MS-COCO dataset.
arXiv Detail & Related papers (2020-02-20T00:51:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.