Transformer-Based Multi-modal Proposal and Re-Rank for Wikipedia Image-Caption Matching
- URL: http://arxiv.org/abs/2206.10436v1
- Date: Tue, 21 Jun 2022 14:30:14 GMT
- Title: Transformer-Based Multi-modal Proposal and Re-Rank for Wikipedia Image-Caption Matching
- Authors: Nicola Messina, Davide Alessandro Coccomini, Andrea Esuli, Fabrizio Falchi
- Abstract summary: We present the system we designed for participating in the Wikipedia Image-Caption Matching challenge on Kaggle.
Our approach achieves remarkable results, obtaining a normalized Discounted Cumulative Gain (nDCG) value of 0.53 on the private leaderboard of the Kaggle challenge.
- Score: 9.56339585008373
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the increased accessibility of the web and of online encyclopedias, the amount of data to manage is constantly growing. In Wikipedia, for example, there are millions of pages written in multiple languages, and these pages contain images that often lack textual context, leaving them conceptually floating and therefore harder to find and manage. In this work, we present the system we
designed for participating in the Wikipedia Image-Caption Matching challenge on
Kaggle, whose objective is to use data associated with images (URLs and visual
data) to find the correct caption among a large pool of available ones. A
system able to perform this task would improve the accessibility and
completeness of multimedia content on large online encyclopedias. Specifically,
we propose a cascade of two models, both powered by the recent Transformer
model, able to efficiently and effectively infer a relevance score between the
query image data and the captions. We verify through extensive experimentation
that the proposed two-model approach is an effective way to handle a large pool of images and captions while keeping the overall computational complexity at inference time bounded. Our approach achieves remarkable results,
obtaining a normalized Discounted Cumulative Gain (nDCG) value of 0.53 on the
private leaderboard of the Kaggle challenge.
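To make the two-stage design concrete, here is a minimal sketch of the propose-then-re-rank cascade, assuming a fast dual encoder for the proposal stage and a slower joint scorer for re-ranking. The encoders below are random-projection placeholders, not the paper's actual Transformer models; only the control flow (a precomputable caption index, top-k proposal, bounded re-ranking) reflects the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 128

def embed_image(image_data: str) -> np.ndarray:
    """Placeholder for the fast image/URL encoder (stage 1)."""
    v = rng.standard_normal(DIM)
    return v / np.linalg.norm(v)

def embed_caption(caption: str) -> np.ndarray:
    """Placeholder for the fast caption encoder (stage 1)."""
    v = rng.standard_normal(DIM)
    return v / np.linalg.norm(v)

def cross_score(image_data: str, caption: str) -> float:
    """Placeholder for the slower joint relevance model (stage 2)."""
    return float(rng.random())

def match(image_data: str, captions: list[str], k: int = 100) -> list[tuple[str, float]]:
    # Stage 1: caption embeddings are query-independent, so the whole pool
    # can be embedded once and reused for every query image.
    cap_index = np.stack([embed_caption(c) for c in captions])
    query = embed_image(image_data)
    top_k = np.argsort(-(cap_index @ query))[:k]  # propose k candidates
    # Stage 2: the expensive model scores only k pairs, keeping inference
    # cost bounded even for a very large caption pool.
    rescored = [(captions[i], cross_score(image_data, captions[i])) for i in top_k]
    return sorted(rescored, key=lambda t: -t[1])
```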
Related papers
- Text Data-Centric Image Captioning with Interactive Prompts [20.48013600818985]
Supervised image captioning approaches have made great progress, but it is challenging to collect high-quality human-annotated image-text data.
This paper proposes a new Text data-centric approach with Interactive Prompts for image Captioning, named TIPCap.
arXiv Detail & Related papers (2024-03-28T07:43:49Z)
- xT: Nested Tokenization for Larger Context in Large Images [79.37673340393475]
xT is a framework for vision transformers which aggregates global context with local details.
We are able to increase accuracy by up to 8.6% on challenging classification tasks.
arXiv Detail & Related papers (2024-03-04T10:29:58Z)
- Improving Multimodal Datasets with Image Captioning [65.74736570293622]
We study how generated captions can increase the utility of web-scraped datapoints with nondescript text.
Our experiments with using generated captions at DataComp's large scale (1.28B image-text pairs) offer insights into the limitations of synthetic text.
arXiv Detail & Related papers (2023-07-19T17:47:12Z)
- Image Captioning with Multi-Context Synthetic Data [16.961112970612447]
Large models have excelled in producing high-quality images and text.
We present an innovative pipeline that introduces multi-context data generation.
Our model is exclusively trained on synthetic image-text pairs crafted through this process.
arXiv Detail & Related papers (2023-05-29T13:18:59Z)
- MuRAG: Multimodal Retrieval-Augmented Generator for Open Question Answering over Images and Text [58.655375327681774]
We propose the first Multimodal Retrieval-Augmented Transformer (MuRAG).
MuRAG accesses an external non-parametric multimodal memory to augment language generation (a sketch of this retrieve-then-generate pattern appears after this list).
Our results show that MuRAG achieves state-of-the-art accuracy, outperforming existing models by 10-20% absolute on both datasets.
arXiv Detail & Related papers (2022-10-06T13:58:03Z)
- ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval [51.588385824875886]
Cross-modal retrieval is the task of finding images relevant to a given query text, or vice versa.
Many recent methods propose effective solutions to the image-text matching problem, mostly using large vision-language (VL) Transformer networks.
This paper proposes an ALign And DIstill Network (ALADIN) to bridge the gap between effectiveness and efficiency (a sketch of this distillation idea appears after this list).
arXiv Detail & Related papers (2022-07-29T16:01:48Z)
- Exploring Semantic Relationships for Unpaired Image Captioning [40.401322131624866]
We achieve unpaired image captioning by bridging the vision and the language domains with high-level semantic information.
We propose the Semantic Relationship Explorer, which explores the relationships between semantic concepts for better understanding of the image.
The proposed approach boosts five strong baselines under the paired setting, where the most significant improvement in CIDEr score reaches 8%.
arXiv Detail & Related papers (2021-06-20T09:10:11Z)
- Length-Controllable Image Captioning [67.2079793803317]
We propose a simple length-level embedding to endow image captioning models with length controllability (a sketch of this idea appears after this list).
Because existing models are autoregressive, their computational complexity grows linearly with the length of the generated caption.
We further devise a non-autoregressive image captioning approach whose complexity is independent of caption length.
arXiv Detail & Related papers (2020-07-19T03:40:51Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
- Multi-Image Summarization: Textual Summary from a Set of Cohesive Images [17.688344968462275]
This paper proposes the new task of multi-image summarization.
It aims to generate a concise and descriptive textual summary given a coherent set of input images.
A dense average image feature aggregation network allows the model to focus on a coherent subset of attributes.
arXiv Detail & Related papers (2020-06-15T18:45:35Z)
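As referenced in the MuRAG entry above, retrieval-augmented generation follows a retrieve-then-generate pattern: a query embedding fetches the nearest items from an external non-parametric memory, and the retrieved items are prepended to the generator's input. Below is a minimal sketch of that pattern; the encoder, generator, and memory contents are placeholders, not MuRAG's actual components.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical external memory: item texts plus precomputed unit-norm vectors.
memory_items = [f"memory snippet {i}" for i in range(1000)]
memory_vecs = rng.standard_normal((1000, 64))
memory_vecs /= np.linalg.norm(memory_vecs, axis=1, keepdims=True)

def embed_query(question: str, image_feat: np.ndarray) -> np.ndarray:
    """Placeholder for the joint question+image encoder."""
    v = rng.standard_normal(64) + image_feat
    return v / np.linalg.norm(v)

def generate(prompt: str) -> str:
    """Stub for the language generator conditioned on retrieved context."""
    return f"answer grounded in: {prompt[:60]}..."

def answer(question: str, image_feat: np.ndarray, k: int = 4) -> str:
    q = embed_query(question, image_feat)
    top_k = np.argsort(-(memory_vecs @ q))[:k]          # non-parametric retrieval
    context = " ".join(memory_items[i] for i in top_k)  # augment generator input
    return generate(context + " | " + question)
```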
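The ALADIN entry hinges on distillation: an efficient student that scores image-text pairs with a single dot product over precomputable embeddings is trained to reproduce the fine-grained alignment scores of a slower teacher. A minimal sketch of that training signal, with random tensors standing in for real features:

```python
import torch
import torch.nn.functional as F

# Hypothetical stand-ins: in a real system these come from VL Transformers.
batch = 32
img_emb = torch.randn(batch, 256, requires_grad=True)  # student image embeddings
txt_emb = torch.randn(batch, 256, requires_grad=True)  # student text embeddings
teacher_scores = torch.randn(batch, batch)             # fine-grained alignment matrix

# Student similarity: one matrix product, instead of a costly joint
# forward pass per image-text pair.
student_scores = img_emb @ txt_emb.T

# Distillation objective: match the teacher's score matrix.
loss = F.mse_loss(student_scores, teacher_scores)
loss.backward()
```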
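Finally, the length-controllable captioning entry describes conditioning the decoder on a desired caption length through a length-level embedding. A minimal sketch of that idea, with illustrative sizes and hypothetical length buckets:

```python
import torch
import torch.nn as nn

VOCAB, D_MODEL, N_LEVELS = 10000, 512, 4  # illustrative sizes

token_emb = nn.Embedding(VOCAB, D_MODEL)
length_emb = nn.Embedding(N_LEVELS, D_MODEL)  # one vector per length bucket

def length_level(target_len: int) -> int:
    """Bucket a desired caption length into a coarse level (hypothetical bins)."""
    bins = [10, 15, 20]  # e.g. short / medium / long / very long
    return sum(target_len > b for b in bins)

def decoder_input(tokens: torch.Tensor, target_len: int) -> torch.Tensor:
    # The same length-level vector is added at every position, conditioning
    # the decoder on the desired caption length.
    lvl = torch.tensor(length_level(target_len))
    return token_emb(tokens) + length_emb(lvl)
```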
This list is automatically generated from the titles and abstracts of the papers on this site.