IIITD-20K: Dense captioning for Text-Image ReID
- URL: http://arxiv.org/abs/2305.04497v1
- Date: Mon, 8 May 2023 06:46:56 GMT
- Title: IIITD-20K: Dense captioning for Text-Image ReID
- Authors: A V Subramanyam, Niranjan Sundararajan, Vibhu Dubey, Brejesh Lall
- Abstract summary: IIITD-20K comprises 20,000 unique identities captured in the wild.
Each image is densely captioned, with a minimum of 26 words per description.
We perform elaborate experiments using state-of-the-art text-to-image ReID models and vision-language pre-trained models.
- Score: 5.858839403963778
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-to-Image (T2I) ReID has attracted a lot of attention in the recent past.
CUHK-PEDES, RSTPReid and ICFG-PEDES are the three available benchmarks for
evaluating T2I ReID methods. RSTPReid and ICFG-PEDES comprise identities from
MSMT17, but their diversity is limited by the small number of unique persons.
CUHK-PEDES, on the other hand, comprises 13,003 identities but has relatively
shorter text descriptions on average. Further, these datasets are captured in
restricted environments with a limited number of cameras. To further diversify
the identities and provide dense captions, we propose a novel dataset called
IIITD-20K. IIITD-20K comprises 20,000 unique identities captured in the wild
and provides a rich dataset for text-to-image ReID. Each image is densely
captioned, with a minimum of 26 words per description. We further synthetically
generate images and fine-grained captions using Stable Diffusion and BLIP
models trained on our dataset. We perform elaborate experiments using
state-of-the-art text-to-image ReID models and vision-language pre-trained
models and present a comprehensive analysis of the dataset. Our experiments
also reveal that synthetically generated data leads to a substantial
performance improvement in both same-dataset and cross-dataset settings. Our
dataset is available at https://bit.ly/3pkA3Rj.
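The synthetic-data step described in the abstract (generating images with Stable Diffusion and fine-grained captions with BLIP) can be sketched with the public diffusers and transformers libraries. The snippet below is only an assumption about how such a pipeline could look; the checkpoints, the prompt, and the generation settings are placeholders, not the authors' IIITD-20K-trained models.
```python
# Minimal sketch: generate a person image with Stable Diffusion,
# then re-caption it with BLIP to obtain a synthetic (image, caption) pair.
# Checkpoints, prompt, and settings are illustrative stand-ins.

import torch
from diffusers import StableDiffusionPipeline
from transformers import BlipProcessor, BlipForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# Text-to-image generator (the paper fine-tunes Stable Diffusion on IIITD-20K;
# a public base checkpoint is used here as a placeholder).
sd = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=dtype
).to(device)

# Captioner (the paper trains BLIP on the dense IIITD-20K captions;
# a public base checkpoint is used here as a placeholder).
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
).to(device)

# Hypothetical identity description in the style of a dense ReID caption.
prompt = "a pedestrian wearing a red jacket, blue jeans and white sneakers, full body, street scene"

# 1) Synthesize an image from the description.
image = sd(prompt, num_inference_steps=30).images[0]

# 2) Re-caption the synthetic image to obtain a fine-grained description.
inputs = processor(images=image, return_tensors="pt").to(device)
caption_ids = blip.generate(**inputs, max_new_tokens=60)
caption = processor.decode(caption_ids[0], skip_special_tokens=True)

print(caption)  # one synthetic (image, caption) pair for T2I ReID training
```
Per the abstract, both models are trained on IIITD-20K before generation, and adding the resulting synthetic pairs improves T2I ReID performance in both same-dataset and cross-dataset settings.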
Related papers
- TRINS: Towards Multimodal Language Models that Can Read [61.17806538631744]
TRINS is a Text-Rich image INStruction dataset.
It contains 39,153 text-rich images, captions, and 102,437 questions.
We introduce a Language-vision Reading Assistant (LaRA) which is good at understanding textual content within images.
arXiv Detail & Related papers (2024-06-10T18:52:37Z)
- IndicSTR12: A Dataset for Indic Scene Text Recognition [33.194567434881314]
This paper proposes the largest and most comprehensive real dataset - IndicSTR12 - and benchmarks STR performance on 12 major Indian languages.
The size and complexity of the proposed dataset are comparable to those of existing Latin contemporaries.
The dataset contains over 27,000 word-images gathered from various natural scenes, with over 1,000 word-images for each language.
arXiv Detail & Related papers (2024-03-12T18:14:48Z)
- Improving Multimodal Datasets with Image Captioning [65.74736570293622]
We study how generated captions can increase the utility of web-scraped datapoints with nondescript text.
Our experiments using generated captions at DataComp's large scale (1.28B image-text pairs) offer insights into the limitations of synthetic text.
arXiv Detail & Related papers (2023-07-19T17:47:12Z)
- OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents [122.55393759474181]
We introduce OBELICS, an open web-scale filtered dataset of interleaved image-text documents.
We describe the dataset creation process, present comprehensive filtering rules, and provide an analysis of the dataset's content.
We train vision and language models of 9 and 80 billion parameters named IDEFICS, and obtain competitive performance on different multimodal benchmarks.
arXiv Detail & Related papers (2023-06-21T14:01:01Z)
- CoBIT: A Contrastive Bi-directional Image-Text Generation Model [72.1700346308106]
CoBIT employs a novel unicoder-decoder structure, which attempts to unify three pre-training objectives in one framework.
CoBIT achieves superior performance in image understanding, image-text understanding (Retrieval, Captioning, VQA, SNLI-VE) and text-based content creation, particularly in zero-shot scenarios.
arXiv Detail & Related papers (2023-03-23T17:24:31Z)
- Scaling Autoregressive Models for Content-Rich Text-to-Image Generation [95.02406834386814]
Parti treats text-to-image generation as a sequence-to-sequence modeling problem.
Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens.
PartiPrompts (P2) is a new holistic benchmark of over 1600 English prompts.
arXiv Detail & Related papers (2022-06-22T01:11:29Z)
- WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning [19.203716881791312]
We introduce the Wikipedia-based Image Text (WIT) dataset.
WIT is composed of a curated set of 37.6 million entity rich image-text examples with 11.5 million unique images across 108 Wikipedia languages.
WIT is the largest multimodal dataset by number of image-text examples, exceeding prior datasets by 3x.
arXiv Detail & Related papers (2021-03-02T18:13:54Z)
- TAP: Text-Aware Pre-training for Text-VQA and Text-Caption [75.44716665758415]
We propose Text-Aware Pre-training (TAP) for Text-VQA and Text-Caption tasks.
TAP explicitly incorporates scene text (generated from OCR engines) in pre-training.
Our approach outperforms the state of the art by large margins on multiple tasks.
arXiv Detail & Related papers (2020-12-08T18:55:21Z)
- Diverse Image Captioning with Context-Object Split Latent Spaces [22.95979735707003]
We introduce a novel factorization of the latent space, termed context-object split, to model diversity in contextual descriptions across images and texts.
Our framework not only enables diverse captioning through context-based pseudo supervision, but also extends this to images with novel objects and without paired captions in the training data.
arXiv Detail & Related papers (2020-11-02T13:33:20Z)