Temporal Image Caption Retrieval Competition -- Description and Results
- URL: http://arxiv.org/abs/2410.06314v1
- Date: Tue, 8 Oct 2024 19:45:53 GMT
- Title: Temporal Image Caption Retrieval Competition -- Description and Results
- Authors: Jakub Pokrywka, Piotr Wierzchoń, Kornel Weryszko, Krzysztof Jassem
- Abstract summary: This paper addresses the multimodal challenge of Text-Image retrieval and introduces a novel task that extends the modalities to include temporal data.
The Temporal Image Caption Retrieval Competition (TICRC) presented in this paper is based on the Chronicling America and Challenging America projects, which offer access to an extensive collection of digitized historic American newspapers spanning 274 years.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal models, which combine visual and textual information, have recently gained significant recognition. This paper addresses the multimodal challenge of Text-Image retrieval and introduces a novel task that extends the modalities to include temporal data. The Temporal Image Caption Retrieval Competition (TICRC) presented in this paper is based on the Chronicling America and Challenging America projects, which offer access to an extensive collection of digitized historic American newspapers spanning 274 years. In addition to the competition results, we provide an analysis of the delivered dataset and the process of its creation.
Related papers
- TRINS: Towards Multimodal Language Models that Can Read [61.17806538631744]
TRINS is a Text-Rich image INStruction dataset.
It contains 39,153 text-rich images, captions, and 102,437 questions.
We introduce a Language-vision Reading Assistant (LaRA) which is good at understanding textual content within images.
arXiv Detail & Related papers (2024-06-10T18:52:37Z) - You'll Never Walk Alone: A Sketch and Text Duet for Fine-Grained Image Retrieval [120.49126407479717]
We introduce a novel compositionality framework, effectively combining sketches and text using pre-trained CLIP models.
Our system extends to novel applications in composed image retrieval, domain transfer, and fine-grained generation.
arXiv Detail & Related papers (2024-03-12T00:27:18Z) - Semi-supervised multimodal coreference resolution in image narrations [44.66334603518387]
We study multimodal coreference resolution, specifically where a descriptive text is paired with an image.
This poses significant challenges due to fine-grained image-text alignment, inherent ambiguity present in narrative language, and unavailability of large annotated training sets.
We present a data efficient semi-supervised approach that utilizes image-narration pairs to resolve coreferences and narrative grounding in a multimodal context.
arXiv Detail & Related papers (2023-10-20T16:10:14Z) - NICE: CVPR 2023 Challenge on Zero-shot Image Captioning [149.28330263581012]
The NICE project is designed to challenge the computer vision community to develop robust image captioning models.
The report includes information on the newly proposed NICE dataset, evaluation methods, challenge results, and technical details of top-ranking entries.
arXiv Detail & Related papers (2023-09-05T05:32:19Z) - Out-of-Vocabulary Challenge Report [15.827931962904115]
The Out-Of-Vocabulary 2022 (OOV) challenge introduces the recognition of unseen scene text instances at training time.
The competition compiles a collection of public scene text datasets comprising 326,385 images with 4,864,405 scene text instances.
A thorough analysis of results from baselines and different participants is presented.
arXiv Detail & Related papers (2022-09-14T15:25:54Z) - NewsStories: Illustrating articles with visual summaries [49.924916589209374]
We introduce a large-scale multimodal dataset containing over 31M articles, 22M images and 1M videos.
We show that state-of-the-art image-text alignment methods are not robust to longer narratives with multiple images.
We introduce an intuitive baseline that outperforms these methods on zero-shot image-set retrieval by 10% on the GoodNews dataset.
arXiv Detail & Related papers (2022-07-26T17:34:11Z) - Deep Image Deblurring: A Survey [165.32391279761006]
Deblurring is a classic problem in low-level computer vision, which aims to recover a sharp image from a blurred input image.
Recent advances in deep learning have led to significant progress in solving this problem.
arXiv Detail & Related papers (2022-01-26T01:31:30Z) - From Show to Tell: A Survey on Image Captioning [48.98681267347662]
Connecting Vision and Language plays an essential role in Generative Intelligence.
Research in image captioning has not reached a conclusive answer yet.
This work aims at providing a comprehensive overview and categorization of image captioning approaches.
arXiv Detail & Related papers (2021-07-14T18:00:54Z) - ICFHR 2020 Competition on Image Retrieval for Historical Handwritten Fragments [11.154300222718879]
This competition continues a line of competitions on writer and style analysis of historical document images.
Although most teams submitted methods based on convolutional neural networks, the winning entry achieved an mAP below 40%.
arXiv Detail & Related papers (2020-10-20T11:12:35Z) - The Newspaper Navigator Dataset: Extracting And Analyzing Visual Content from 16 Million Historic Newspaper Pages in Chronicling America [10.446473806802578]
We introduce a visual content recognition model trained on bounding box annotations of photographs, illustrations, maps, comics, and editorial cartoons.
We describe our pipeline that utilizes this deep learning model to extract 7 classes of visual content.
We report the results of running the pipeline on 16.3 million pages from the Chronicling America corpus.
arXiv Detail & Related papers (2020-05-04T15:51:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.