EDIS: Entity-Driven Image Search over Multimodal Web Content
- URL: http://arxiv.org/abs/2305.13631v2
- Date: Mon, 23 Oct 2023 05:42:51 GMT
- Title: EDIS: Entity-Driven Image Search over Multimodal Web Content
- Authors: Siqi Liu, Weixi Feng, Tsu-jui Fu, Wenhu Chen, William Yang Wang
- Abstract summary: We introduce Entity-Driven Image Search (EDIS), a dataset for cross-modal image search in the news domain.
EDIS consists of 1 million web images from actual search engine results and curated datasets, with each image paired with a textual description.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Making image retrieval methods practical for real-world search applications
requires significant progress in dataset scales, entity comprehension, and
multimodal information fusion. In this work, we introduce
\textbf{E}ntity-\textbf{D}riven \textbf{I}mage \textbf{S}earch (EDIS), a
challenging dataset for cross-modal image search in the news domain. EDIS
consists of 1 million web images from actual search engine results and curated
datasets, with each image paired with a textual description. Unlike datasets
that assume a small set of single-modality candidates, EDIS reflects real-world
web image search scenarios by including a million multimodal image-text pairs
as candidates. EDIS encourages the development of retrieval models that
simultaneously address cross-modal information fusion and matching. To achieve
accurate ranking results, a model must: 1) understand named entities and events
from text queries, 2) ground entities onto images or text descriptions, and 3)
effectively fuse textual and visual representations. Our experimental results
show that EDIS challenges state-of-the-art methods with dense entities and a
large-scale candidate set. The ablation study also proves that fusing textual
features with visual features is critical in improving retrieval results.
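The abstract's three requirements (entity understanding, grounding, and cross-modal fusion) boil down to scoring a text query against candidates that each carry an image and a textual description. The following is a minimal sketch of that candidate-scoring setup, not the paper's implementation: the encoders are random-vector placeholders standing in for trained ones (e.g., CLIP-style), and the convex-combination fusion is just one possible scheme.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 512  # embedding dimensionality (placeholder choice)

def encode_text(text: str) -> np.ndarray:
    """Placeholder text encoder; a real system would use a trained model."""
    v = rng.standard_normal(DIM)
    return v / np.linalg.norm(v)

def encode_image(image_id: str) -> np.ndarray:
    """Placeholder image encoder; a real system would use a trained model."""
    v = rng.standard_normal(DIM)
    return v / np.linalg.norm(v)

def fuse(img_vec: np.ndarray, txt_vec: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Late fusion by convex combination of the candidate's two modalities."""
    fused = alpha * img_vec + (1.0 - alpha) * txt_vec
    return fused / np.linalg.norm(fused)

# Candidates are multimodal (image, description) pairs, as in EDIS.
candidates = [
    ("img_001", "The president addresses the UN General Assembly."),
    ("img_002", "Flood waters rise in a coastal town after the storm."),
]

query_vec = encode_text("UN speech by the president")
scores = [
    float(fuse(encode_image(img), encode_text(desc)) @ query_vec)
    for img, desc in candidates
]
ranking = sorted(zip(scores, candidates), reverse=True)  # rank by fused similarity
```

The ablation finding is visible in the structure: setting alpha to 0 or 1 in `fuse` collapses each candidate to a single-modality representation, which is exactly the setting the paper reports as weaker.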
Related papers
- Towards Text-Image Interleaved Retrieval
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences.
We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries.
We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
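The summary does not say how MME compresses visual tokens; one plausible reading of the Matryoshka idea is nested pooling of a token sequence into progressively coarser budgets. The sketch below illustrates only that nesting, with mean-pooling as an assumed stand-in for the paper's actual mechanism.

```python
import numpy as np

def compress_tokens(tokens: np.ndarray, num_groups: int) -> np.ndarray:
    """Mean-pool T visual tokens (T x D) into num_groups coarser tokens."""
    groups = np.array_split(tokens, num_groups, axis=0)
    return np.stack([g.mean(axis=0) for g in groups])

tokens = np.random.default_rng(1).standard_normal((64, 256))  # 64 tokens, dim 256
# Nested granularities: a retriever can pick a budget to trade accuracy for speed.
for k in (1, 4, 16, 64):
    print(k, compress_tokens(tokens, k).shape)  # -> (k, 256)
```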
arXiv Detail & Related papers (2025-02-18T12:00:47Z) - Unified Text-to-Image Generation and Retrieval [96.72318842152148]
We propose a unified framework in the context of Multimodal Large Language Models (MLLMs).
We first explore the intrinsic discriminative abilities of MLLMs and introduce a generative retrieval method to perform retrieval in a training-free manner.
We then unify generation and retrieval in an autoregressive generation way and propose an autonomous decision module to choose the best-matched one between generated and retrieved images.
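A speculative sketch of the decision-module idea above: score the generated image and the retrieved image against the text query and keep whichever matches better. The paper's module is part of an MLLM pipeline; this placeholder only compares cosine similarities of stand-in embeddings.

```python
import numpy as np

rng = np.random.default_rng(4)
DIM = 256

def embed(_: str) -> np.ndarray:
    """Placeholder embedder for both text and images."""
    v = rng.standard_normal(DIM)
    return v / np.linalg.norm(v)

def choose(query: str, generated_id: str, retrieved_id: str) -> str:
    """Return whichever image scores higher against the query."""
    q = embed(query)
    gen_score = float(embed(generated_id) @ q)
    ret_score = float(embed(retrieved_id) @ q)
    return generated_id if gen_score >= ret_score else retrieved_id

best = choose("a red bicycle on a beach", "generated.png", "retrieved.png")
```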
arXiv Detail & Related papers (2024-06-09T15:00:28Z) - MVAM: Multi-View Attention Method for Fine-grained Image-Text Matching [65.87255122130188]
We propose a Multi-view Attention Method (MVAM) for image-text matching.
We also incorporate an objective to explicitly encourage attention heads to focus on distinct aspects of the input data.
Our method allows models to encode images and text from different perspectives and focus on more critical details, leading to better matching performance.
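One common way to realize an objective that pushes attention heads toward distinct aspects is to penalize overlap between their attention distributions. The sketch below shows such a penalty; MVAM's actual loss is not given in this summary, so treat this as an illustrative stand-in.

```python
import numpy as np

def head_diversity_penalty(attn: np.ndarray) -> float:
    """attn: (H, T) attention weights, one distribution per head over T positions.
    Returns the mean pairwise overlap between heads (lower = more diverse)."""
    overlap = attn @ attn.T                        # (H, H) pairwise dot products
    off_diag = overlap[~np.eye(attn.shape[0], dtype=bool)]
    return float(off_diag.mean())

attn = np.random.default_rng(2).random((4, 10))
attn /= attn.sum(axis=1, keepdims=True)            # normalize rows to distributions
loss_div = head_diversity_penalty(attn)            # added to the matching loss
```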
arXiv Detail & Related papers (2024-02-27T06:11:54Z) - A Dual-way Enhanced Framework from Text Matching Point of View for Multimodal Entity Linking [17.847936914174543]
Multimodal Entity Linking (MEL) aims to link ambiguous mentions, together with their multimodal information, to entities in a Knowledge Graph (KG) such as Wikipedia.
We formulate multimodal entity linking as a neural text matching problem in which each piece of multimodal information (text and image) is treated as a query.
This paper introduces a dual-way enhanced (DWE) framework for MEL.
arXiv Detail & Related papers (2023-12-19T03:15:50Z) - JourneyDB: A Benchmark for Generative Image Understanding [89.02046606392382]
We introduce a comprehensive dataset, referred to as JourneyDB, that caters to the domain of generative images.
Our meticulously curated dataset comprises 4 million distinct and high-quality generated images.
On our dataset, we have devised four benchmarks to assess the performance of generated image comprehension.
arXiv Detail & Related papers (2023-07-03T02:39:08Z) - AToMiC: An Image/Text Retrieval Test Collection to Support Multimedia
Content Creation [42.35572014527354]
The AToMiC dataset is designed to advance research in image/text cross-modal retrieval.
We leverage hierarchical structures and diverse domains of texts, styles, and types of images, as well as large-scale image-document associations embedded in Wikipedia.
AToMiC offers a testbed for scalable, diverse, and reproducible multimedia retrieval research.
arXiv Detail & Related papers (2023-04-04T17:11:34Z) - HGAN: Hierarchical Graph Alignment Network for Image-Text Retrieval [13.061063817876336]
We propose a novel Hierarchical Graph Alignment Network (HGAN) for image-text retrieval.
First, to capture the comprehensive multimodal features, we construct the feature graphs for the image and text modality respectively.
Then, a multi-granularity shared space is established with a designed Multi-granularity Feature Aggregation and Rearrangement (MFAR) module.
Finally, the ultimate image and text features are further refined through three-level similarity functions to achieve the hierarchical alignment.
arXiv Detail & Related papers (2022-12-16T05:08:52Z) - Named Entity and Relation Extraction with Multi-Modal Retrieval [51.660650522630526]
Multi-modal named entity recognition (NER) and relation extraction (RE) aim to leverage relevant image information to improve the performance of NER and RE.
We propose a novel Multi-modal Retrieval based framework (MoRe).
MoRe contains a text retrieval module and an image-based retrieval module, which retrieve related knowledge of the input text and image in the knowledge corpus respectively.
arXiv Detail & Related papers (2022-12-03T13:11:32Z)