AToMiC: An Image/Text Retrieval Test Collection to Support Multimedia
Content Creation
- URL: http://arxiv.org/abs/2304.01961v1
- Date: Tue, 4 Apr 2023 17:11:34 GMT
- Title: AToMiC: An Image/Text Retrieval Test Collection to Support Multimedia
Content Creation
- Authors: Jheng-Hong Yang, Carlos Lassance, Rafael Sampaio de Rezende, Krishna
Srinivasan, Miriam Redi, Stéphane Clinchant, Jimmy Lin
- Abstract summary: The AToMiC dataset is designed to advance research in image/text cross-modal retrieval.
We leverage hierarchical structures and diverse domains of texts, styles, and types of images, as well as large-scale image-document associations embedded in Wikipedia.
AToMiC offers a testbed for scalable, diverse, and reproducible multimedia retrieval research.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents the AToMiC (Authoring Tools for Multimedia Content)
dataset, designed to advance research in image/text cross-modal retrieval.
While vision-language pretrained transformers have led to significant
improvements in retrieval effectiveness, existing research has relied on
image-caption datasets that feature only simplistic image-text relationships
and underspecified user models of retrieval tasks. To address the gap between
these oversimplified settings and real-world applications for multimedia
content creation, we introduce a new approach for building retrieval test
collections. We leverage hierarchical structures and diverse domains of texts,
styles, and types of images, as well as large-scale image-document associations
embedded in Wikipedia. We formulate two tasks based on a realistic user model
and validate our dataset through retrieval experiments using baseline models.
AToMiC offers a testbed for scalable, diverse, and reproducible multimedia
retrieval research. Finally, the dataset provides the basis for a dedicated
track at the 2023 Text Retrieval Conference (TREC), and is publicly available
at https://github.com/TREC-AToMiC/AToMiC.
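The abstract mentions validating the dataset through retrieval experiments with baseline models but does not name them here. The following is a minimal sketch, assuming a CLIP-style dual-encoder baseline built with Hugging Face transformers for the text-to-image direction; the checkpoint, the toy section texts and image files, and the relevance assumption (image i matches section i) are illustrative stand-ins, not the authors' actual baselines or data.

```python
# Minimal sketch of a dual-encoder (CLIP-style) text-to-image retrieval baseline.
# Assumptions: the checkpoint, the toy texts/images, and the recall@1 helper are
# illustrative only; they are not the baselines or data fields used in the paper.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

# Toy "sections" (queries) and candidate images (hypothetical local files).
section_texts = [
    "History of the Eiffel Tower, a wrought-iron lattice tower in Paris.",
    "Habitat and diet of the emperor penguin in Antarctica.",
]
image_paths = ["eiffel_tower.jpg", "emperor_penguin.jpg"]  # assumed to exist
images = [Image.open(p).convert("RGB") for p in image_paths]

with torch.no_grad():
    text_inputs = processor(text=section_texts, return_tensors="pt",
                            padding=True, truncation=True)
    text_emb = model.get_text_features(**text_inputs)
    image_inputs = processor(images=images, return_tensors="pt")
    image_emb = model.get_image_features(**image_inputs)

# Cosine similarity between every section and every candidate image.
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
scores = text_emb @ image_emb.T            # [num_sections, num_images]
ranking = scores.argsort(dim=-1, descending=True)

# Recall@1 under the assumption that image i is relevant to section i.
recall_at_1 = (ranking[:, 0] == torch.arange(len(section_texts))).float().mean()
print(ranking.tolist(), recall_at_1.item())
```

In the actual AToMiC tasks the candidate pools are Wikipedia-scale, so the brute-force similarity matrix above would be replaced by an approximate nearest-neighbour index; the sketch only illustrates the scoring direction from section text (query) to candidate images.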
Related papers
- Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences.
We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries.
We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularities.
arXiv Detail & Related papers (2025-02-18T12:00:47Z) - Unified Text-to-Image Generation and Retrieval [96.72318842152148]
We propose a unified framework in the context of Multimodal Large Language Models (MLLMs).
We first explore the intrinsic discriminative abilities of MLLMs and introduce a generative retrieval method to perform retrieval in a training-free manner.
We then unify generation and retrieval within a single autoregressive generation process and propose an autonomous decision module to choose the best match between generated and retrieved images.
arXiv Detail & Related papers (2024-06-09T15:00:28Z) - Enhancing Interactive Image Retrieval With Query Rewriting Using Large Language Models and Vision Language Models [17.171715290673678]
We propose an interactive image retrieval system capable of refining queries based on user relevance feedback.
This system incorporates a vision language model (VLM) based image captioner to enhance the quality of text-based queries.
To evaluate our system, we curate a new dataset by adapting the MSR-VTT video retrieval dataset to the image retrieval task. (A toy sketch of this query-rewriting loop appears after this list.)
arXiv Detail & Related papers (2024-04-29T14:46:35Z) - End-to-end Knowledge Retrieval with Multi-modal Queries [50.01264794081951]
ReMuQ requires a system to retrieve knowledge from a large corpus by integrating content from both text and image queries.
We introduce a retriever model, ReViz, that can directly process input text and images to retrieve relevant knowledge in an end-to-end fashion.
We demonstrate superior performance in retrieval on two datasets under zero-shot settings.
arXiv Detail & Related papers (2023-06-01T08:04:12Z) - DUBLIN -- Document Understanding By Language-Image Network [37.42637168606938]
We propose DUBLIN, which is pretrained on web pages using three novel objectives.
We show that DUBLIN is the first pixel-based model to achieve an EM of 77.75 and F1 of 84.25 on the WebSRC dataset.
We also achieve competitive performance on RVL-CDIP document classification.
arXiv Detail & Related papers (2023-05-23T16:34:09Z) - EDIS: Entity-Driven Image Search over Multimodal Web Content [95.40238328527931]
We introduce Entity-Driven Image Search (EDIS), a dataset for cross-modal image search in the news domain.
EDIS consists of 1 million web images from actual search engine results and curated datasets, with each image paired with a textual description.
arXiv Detail & Related papers (2023-05-23T02:59:19Z) - Named Entity and Relation Extraction with Multi-Modal Retrieval [51.660650522630526]
Multi-modal named entity recognition (NER) and relation extraction (RE) aim to leverage relevant image information to improve the performance of NER and RE.
We propose a novel Multi-modal Retrieval-based framework (MoRe).
MoRe contains a text retrieval module and an image-based retrieval module, which retrieve knowledge related to the input text and image from the knowledge corpus, respectively.
arXiv Detail & Related papers (2022-12-03T13:11:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.