EDIS: Entity-Driven Image Search over Multimodal Web Content
- URL: http://arxiv.org/abs/2305.13631v2
- Date: Mon, 23 Oct 2023 05:42:51 GMT
- Title: EDIS: Entity-Driven Image Search over Multimodal Web Content
- Authors: Siqi Liu, Weixi Feng, Tsu-jui Fu, Wenhu Chen, William Yang Wang
- Abstract summary: We introduce Entity-Driven Image Search (EDIS), a dataset for cross-modal image search in the news domain.
EDIS consists of 1 million web images from actual search engine results and curated datasets, with each image paired with a textual description.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Making image retrieval methods practical for real-world search applications
requires significant progress in dataset scales, entity comprehension, and
multimodal information fusion. In this work, we introduce
\textbf{E}ntity-\textbf{D}riven \textbf{I}mage \textbf{S}earch (EDIS), a
challenging dataset for cross-modal image search in the news domain. EDIS
consists of 1 million web images from actual search engine results and curated
datasets, with each image paired with a textual description. Unlike datasets
that assume a small set of single-modality candidates, EDIS reflects real-world
web image search scenarios by including a million multimodal image-text pairs
as candidates. EDIS encourages the development of retrieval models that
simultaneously address cross-modal information fusion and matching. To achieve
accurate ranking results, a model must: 1) understand named entities and events
from text queries, 2) ground entities onto images or text descriptions, and 3)
effectively fuse textual and visual representations. Our experimental results
show that EDIS challenges state-of-the-art methods with dense entities and a
large-scale candidate set. The ablation study also proves that fusing textual
features with visual features is critical in improving retrieval results.
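The abstract's three requirements (entity understanding, grounding, and cross-modal fusion) boil down to scoring a text query against candidates that each carry an image and a textual description. The following is a minimal sketch of that candidate-scoring setup, not the paper's implementation: the encoders are random-vector placeholders standing in for trained ones (e.g., CLIP-style), and the convex-combination fusion is just one possible scheme.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 512  # embedding dimensionality (placeholder choice)

def encode_text(text: str) -> np.ndarray:
    """Placeholder text encoder; a real system would use a trained model."""
    v = rng.standard_normal(DIM)
    return v / np.linalg.norm(v)

def encode_image(image_id: str) -> np.ndarray:
    """Placeholder image encoder; a real system would use a trained model."""
    v = rng.standard_normal(DIM)
    return v / np.linalg.norm(v)

def fuse(img_vec: np.ndarray, txt_vec: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Late fusion by convex combination of the candidate's two modalities."""
    fused = alpha * img_vec + (1.0 - alpha) * txt_vec
    return fused / np.linalg.norm(fused)

# Candidates are multimodal (image, description) pairs, as in EDIS.
candidates = [
    ("img_001", "The president addresses the UN General Assembly."),
    ("img_002", "Flood waters rise in a coastal town after the storm."),
]

query_vec = encode_text("UN speech by the president")
scores = [
    float(fuse(encode_image(img), encode_text(desc)) @ query_vec)
    for img, desc in candidates
]
ranking = sorted(zip(scores, candidates), reverse=True)  # rank by fused similarity
```

The ablation finding is visible in the structure: setting alpha to 0 or 1 in `fuse` collapses each candidate to a single-modality representation, which is exactly the setting the paper reports as weaker.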
Related papers
- Towards Text-Image Interleaved Retrieval
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences.
We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries.
We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
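The summary does not say how MME compresses visual tokens; one plausible reading of the Matryoshka idea is nested pooling of a token sequence into progressively coarser budgets. The sketch below illustrates only that nesting, with mean-pooling as an assumed stand-in for the paper's actual mechanism.

```python
import numpy as np

def compress_tokens(tokens: np.ndarray, num_groups: int) -> np.ndarray:
    """Mean-pool T visual tokens (T x D) into num_groups coarser tokens."""
    groups = np.array_split(tokens, num_groups, axis=0)
    return np.stack([g.mean(axis=0) for g in groups])

tokens = np.random.default_rng(1).standard_normal((64, 256))  # 64 tokens, dim 256
# Nested granularities: a retriever can pick a budget to trade accuracy for speed.
for k in (1, 4, 16, 64):
    print(k, compress_tokens(tokens, k).shape)  # -> (k, 256)
```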
arXiv Detail & Related papers (2025-02-18T12:00:47Z) - Unified Text-to-Image Generation and Retrieval [96.72318842152148]
We propose a unified framework in the context of Multimodal Large Language Models (MLLMs).
We first explore the intrinsic discriminative abilities of MLLMs and introduce a generative retrieval method to perform retrieval in a training-free manner.
We then unify generation and retrieval in an autoregressive generation way and propose an autonomous decision module to choose the best-matched one between generated and retrieved images.
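A speculative sketch of the decision-module idea above: score the generated image and the retrieved image against the text query and keep whichever matches better. The paper's module is part of an MLLM pipeline; this placeholder only compares cosine similarities of stand-in embeddings.

```python
import numpy as np

rng = np.random.default_rng(4)
DIM = 256

def embed(_: str) -> np.ndarray:
    """Placeholder embedder for both text and images."""
    v = rng.standard_normal(DIM)
    return v / np.linalg.norm(v)

def choose(query: str, generated_id: str, retrieved_id: str) -> str:
    """Return whichever image scores higher against the query."""
    q = embed(query)
    gen_score = float(embed(generated_id) @ q)
    ret_score = float(embed(retrieved_id) @ q)
    return generated_id if gen_score >= ret_score else retrieved_id

best = choose("a red bicycle on a beach", "generated.png", "retrieved.png")
```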
arXiv Detail & Related papers (2024-06-09T15:00:28Z) - MVAM: Multi-View Attention Method for Fine-grained Image-Text Matching [65.87255122130188]
We propose a Multi-view Attention Method (MVAM) for image-text matching.
We also incorporate an objective to explicitly encourage attention heads to focus on distinct aspects of the input data.
Our method allows models to encode images and text from different perspectives and focus on more critical details, leading to better matching performance.
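One common way to realize an objective that pushes attention heads toward distinct aspects is to penalize overlap between their attention distributions. The sketch below shows such a penalty; MVAM's actual loss is not given in this summary, so treat this as an illustrative stand-in.

```python
import numpy as np

def head_diversity_penalty(attn: np.ndarray) -> float:
    """attn: (H, T) attention weights, one distribution per head over T positions.
    Returns the mean pairwise overlap between heads (lower = more diverse)."""
    overlap = attn @ attn.T                        # (H, H) pairwise dot products
    off_diag = overlap[~np.eye(attn.shape[0], dtype=bool)]
    return float(off_diag.mean())

attn = np.random.default_rng(2).random((4, 10))
attn /= attn.sum(axis=1, keepdims=True)            # normalize rows to distributions
loss_div = head_diversity_penalty(attn)            # added to the matching loss
```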
arXiv Detail & Related papers (2024-02-27T06:11:54Z) - A Dual-way Enhanced Framework from Text Matching Point of View for Multimodal Entity Linking [17.847936914174543]
Multimodal Entity Linking (MEL) aims to link ambiguous mentions, together with their multimodal information, to entities in a Knowledge Graph (KG) such as Wikipedia.
We formulate multimodal entity linking as a neural text matching problem in which each piece of multimodal information (text and image) is treated as a query.
This paper introduces a dual-way enhanced (DWE) framework for MEL.
arXiv Detail & Related papers (2023-12-19T03:15:50Z) - JourneyDB: A Benchmark for Generative Image Understanding [89.02046606392382]
We introduce a comprehensive dataset, referred to as JourneyDB, that caters to the domain of generative images.
Our meticulously curated dataset comprises 4 million distinct and high-quality generated images.
On our dataset, we have devised four benchmarks to assess the performance of generated image comprehension.
arXiv Detail & Related papers (2023-07-03T02:39:08Z) - AToMiC: An Image/Text Retrieval Test Collection to Support Multimedia
Content Creation [42.35572014527354]
The AToMiC dataset is designed to advance research in image/text cross-modal retrieval.
We leverage hierarchical structures and diverse domains of texts, styles, and types of images, as well as large-scale image-document associations embedded in Wikipedia.
AToMiC offers a testbed for scalable, diverse, and reproducible multimedia retrieval research.
arXiv Detail & Related papers (2023-04-04T17:11:34Z) - HGAN: Hierarchical Graph Alignment Network for Image-Text Retrieval [13.061063817876336]
We propose a novel Hierarchical Graph Alignment Network (HGAN) for image-text retrieval.
First, to capture the comprehensive multimodal features, we construct the feature graphs for the image and text modality respectively.
Then, a multi-granularity shared space is established with a designed Multi-granularity Feature Aggregation and Rearrangement (MFAR) module.
Finally, the ultimate image and text features are further refined through three-level similarity functions to achieve the hierarchical alignment.
arXiv Detail & Related papers (2022-12-16T05:08:52Z) - Named Entity and Relation Extraction with Multi-Modal Retrieval [51.660650522630526]
Multi-modal named entity recognition (NER) and relation extraction (RE) aim to leverage relevant image information to improve the performance of NER and RE.
We propose a novel Multi-modal Retrieval based framework (MoRe).
MoRe contains a text retrieval module and an image-based retrieval module, which retrieve related knowledge of the input text and image in the knowledge corpus respectively.
arXiv Detail & Related papers (2022-12-03T13:11:32Z)