AToMiC: An Image/Text Retrieval Test Collection to Support Multimedia
Content Creation
- URL: http://arxiv.org/abs/2304.01961v1
- Date: Tue, 4 Apr 2023 17:11:34 GMT
- Title: AToMiC: An Image/Text Retrieval Test Collection to Support Multimedia
Content Creation
- Authors: Jheng-Hong Yang, Carlos Lassance, Rafael Sampaio de Rezende, Krishna
Srinivasan, Miriam Redi, Stéphane Clinchant, Jimmy Lin
- Abstract summary: The AToMiC dataset is designed to advance research in image/text cross-modal retrieval.
We leverage hierarchical structures and diverse domains of texts, styles, and types of images, as well as large-scale image-document associations embedded in Wikipedia.
AToMiC offers a testbed for scalable, diverse, and reproducible multimedia retrieval research.
- Score: 42.35572014527354
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents the AToMiC (Authoring Tools for Multimedia Content)
dataset, designed to advance research in image/text cross-modal retrieval.
While vision-language pretrained transformers have led to significant
improvements in retrieval effectiveness, existing research has relied on
image-caption datasets that feature only simplistic image-text relationships
and underspecified user models of retrieval tasks. To address the gap between
these oversimplified settings and real-world applications for multimedia
content creation, we introduce a new approach for building retrieval test
collections. We leverage hierarchical structures and diverse domains of texts,
styles, and types of images, as well as large-scale image-document associations
embedded in Wikipedia. We formulate two tasks based on a realistic user model
and validate our dataset through retrieval experiments using baseline models.
AToMiC offers a testbed for scalable, diverse, and reproducible multimedia
retrieval research. Finally, the dataset provides the basis for a dedicated
track at the 2023 Text Retrieval Conference (TREC), and is publicly available
at https://github.com/TREC-AToMiC/AToMiC.
Related papers
- Unified Text-to-Image Generation and Retrieval [96.72318842152148]
We propose a unified framework in the context of Multimodal Large Language Models (MLLMs).
We first explore the intrinsic discriminative abilities of MLLMs and introduce a generative retrieval method to perform retrieval in a training-free manner.
We then unify generation and retrieval in an autoregressive manner and propose an autonomous decision module to choose the best-matched one between generated and retrieved images.
arXiv Detail & Related papers (2024-06-09T15:00:28Z)
- Enhancing Interactive Image Retrieval With Query Rewriting Using Large Language Models and Vision Language Models [17.171715290673678]
We propose an interactive image retrieval system capable of refining queries based on user relevance feedback.
This system incorporates a vision language model (VLM) based image captioner to enhance the quality of text-based queries.
To evaluate our system, we curate a new dataset by adapting the MSR-VTT video retrieval dataset to the image retrieval task.
arXiv Detail & Related papers (2024-04-29T14:46:35Z)
- LayoutLLM: Large Language Model Instruction Tuning for Visually Rich Document Understanding [0.0]
This paper proposes LayoutLLM, a more flexible document analysis method for understanding imaged documents.
Existing methods have been developed to enhance document comprehension by incorporating pre-training awareness of images, text, and layout structure.
Our experiments demonstrate improvement over the baseline model in various document analysis tasks.
arXiv Detail & Related papers (2024-03-21T09:25:24Z)
- End-to-end Knowledge Retrieval with Multi-modal Queries [50.01264794081951]
ReMuQ requires a system to retrieve knowledge from a large corpus by integrating contents from both text and image queries.
We introduce a retriever model, "ReViz", that can directly process input text and images to retrieve relevant knowledge in an end-to-end fashion.
We demonstrate superior performance in retrieval on two datasets under zero-shot settings.
arXiv Detail & Related papers (2023-06-01T08:04:12Z)
- DUBLIN -- Document Understanding By Language-Image Network [37.42637168606938]
We propose DUBLIN, which is pretrained on web pages using three novel objectives.
We show that DUBLIN is the first pixel-based model to achieve an EM of 77.75 and F1 of 84.25 on the WebSRC dataset.
We also achieve competitive performance on RVL-CDIP document classification.
arXiv Detail & Related papers (2023-05-23T16:34:09Z)
- EDIS: Entity-Driven Image Search over Multimodal Web Content [95.40238328527931]
We introduce Entity-Driven Image Search (EDIS), a dataset for cross-modal image search in the news domain.
EDIS consists of 1 million web images from actual search engine results and curated datasets, with each image paired with a textual description.
arXiv Detail & Related papers (2023-05-23T02:59:19Z)
- Named Entity and Relation Extraction with Multi-Modal Retrieval [51.660650522630526]
Multi-modal named entity recognition (NER) and relation extraction (RE) aim to leverage relevant image information to improve the performance of NER and RE.
We propose a novel Multi-modal Retrieval based framework (MoRe).
MoRe contains a text retrieval module and an image-based retrieval module, which retrieve related knowledge of the input text and image in the knowledge corpus respectively.
arXiv Detail & Related papers (2022-12-03T13:11:32Z)
- One-shot Key Information Extraction from Document with Deep Partial Graph Matching [60.48651298832829]
Key Information Extraction (KIE) from documents improves efficiency, productivity, and security in many industrial scenarios.
Existing supervised learning methods for the KIE task require a large number of labeled samples and learn separate models for different types of documents.
We propose a deep end-to-end trainable network for one-shot KIE using partial graph matching.
arXiv Detail & Related papers (2021-09-26T07:45:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.