ImageInWords: Unlocking Hyper-Detailed Image Descriptions
- URL: http://arxiv.org/abs/2405.02793v2
- Date: Mon, 28 Oct 2024 21:15:36 GMT
- Title: ImageInWords: Unlocking Hyper-Detailed Image Descriptions
- Authors: Roopal Garg, Andrea Burns, Burcu Karagol Ayan, Yonatan Bitton, Ceslee Montgomery, Yasumasa Onoe, Andrew Bunner, Ranjay Krishna, Jason Baldridge, Radu Soricut
- Abstract summary: ImageInWords (IIW) is a human-in-the-loop framework for curating hyper-detailed image descriptions.
We show major gains compared to recent datasets in comprehensiveness, specificity, hallucinations, and more.
We also show that fine-tuning with IIW data improves these metrics by +31% against models trained with prior work, even with only 9k samples.
- Abstract: Despite the longstanding adage "an image is worth a thousand words," generating accurate hyper-detailed image descriptions remains unsolved. Trained on short web-scraped image text, vision-language models often generate incomplete descriptions with visual inconsistencies. We address this via a novel data-centric approach with ImageInWords (IIW), a carefully designed human-in-the-loop framework for curating hyper-detailed image descriptions. Human evaluations on IIW data show major gains compared to recent datasets (+66%) and GPT4V (+48%) across comprehensiveness, specificity, hallucinations, and more. We also show that fine-tuning with IIW data improves these metrics by +31% against models trained with prior work, even with only 9k samples. Lastly, we evaluate IIW models with text-to-image generation and vision-language reasoning tasks. Our generated descriptions result in the highest fidelity images, and boost compositional reasoning by up to 6% on ARO, SVO-Probes, and Winoground datasets. We release the IIW Eval benchmark with human judgement labels, object and image-level annotations from our framework, and existing image caption datasets enriched via IIW-model.
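The abstract reports compositional-reasoning gains on Winoground, a benchmark that pairs two images with two captions and credits a model only when it matches all pairs correctly. As a point of reference, the sketch below implements Winoground's published text, image, and group scores for a single example; the function name and score-argument convention are ours, not from the paper, and a real evaluation would obtain the scores from a vision-language model's image-text matching head.

```python
def winoground_scores(s00, s01, s10, s11):
    """Winoground-style metrics for one example of two captions and two images.

    Arguments follow the convention s<c><i> = model matching score for
    caption c with image i, e.g. s01 = score(caption 0, image 1).
    Returns (text_score, image_score, group_score) as booleans.
    """
    # Text score: for each image, the matching caption outscores the other one.
    text = s00 > s10 and s11 > s01
    # Image score: for each caption, the matching image outscores the other one.
    image = s00 > s01 and s11 > s10
    # Group score: the example counts only if both conditions hold.
    group = text and image
    return text, image, group
```

Dataset-level accuracy is then the fraction of examples where each boolean holds, which is why group scores are typically much lower than either individual score.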
Related papers
- VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models
We introduce a new, more demanding task known as Interleaved Image-Text Comprehension (IITC).
This task challenges models to discern and disregard superfluous elements in both images and text to accurately answer questions.
In support of this task, we further craft a new VEGA dataset, tailored for the IITC task on scientific content, and devise a subtask, Image-Text Association (ITA).
arXiv Detail & Related papers (2024-06-14T17:59:40Z)
- Image Textualization: An Automatic Framework for Creating Accurate and Detailed Image Descriptions
We propose an innovative framework termed Image Textualization (IT).
IT automatically produces high-quality image descriptions by leveraging existing multi-modal large language models (MLLMs) and multiple vision expert models.
We show that LLaVA-7B, benefiting from training on IT-curated descriptions, acquires an improved capability to generate richer image descriptions.
arXiv Detail & Related papers (2024-06-11T17:37:45Z)
- DOCCI: Descriptions of Connected and Contrasting Images
Descriptions of Connected and Contrasting Images (DOCCI) is a dataset with long, human-annotated English descriptions for 15k images.
We instruct human annotators to create comprehensive descriptions for each image.
We show that DOCCI is a useful testbed for text-to-image generation.
arXiv Detail & Related papers (2024-04-30T17:56:24Z)
- Learning from Models and Data for Visual Grounding
We introduce SynGround, a framework that combines data-driven learning and knowledge transfer from various large-scale pretrained models.
We finetune a pretrained vision-and-language model on this dataset by optimizing a mask-attention objective.
The resulting model improves the grounding capabilities of an off-the-shelf vision-and-language model.
arXiv Detail & Related papers (2024-03-20T17:59:43Z)
- Enhancing Vision-Language Pre-training with Rich Supervisions
We propose Strongly Supervised pre-training with ScreenShots (S4).
S4 is a novel pre-training paradigm for Vision-Language Models using data from large-scale web screenshot rendering.
We demonstrate that, compared to current screenshot pre-training objectives, our pre-training method significantly enhances the performance of image-to-text models on nine varied and popular downstream tasks.
arXiv Detail & Related papers (2024-03-05T22:14:58Z)
- Paragraph-to-Image Generation with Information-Enriched Diffusion Model
ParaDiffusion is an information-enriched diffusion model for the paragraph-to-image generation task.
It transfers the extensive semantic comprehension capabilities of large language models to the task of image generation.
The code and dataset will be released to foster community research on long-text alignment.
arXiv Detail & Related papers (2023-11-24T05:17:01Z)
- ASPIRE: Language-Guided Data Augmentation for Improving Robustness Against Spurious Correlations
ASPIRE (Language-guided Data Augmentation for SPurIous correlation REmoval) is a solution for supplementing the training dataset with images without spurious features.
It can generate non-spurious images without requiring any group labeling or existing non-spurious images in the training set.
It improves the worst-group classification accuracy of prior methods by 1% to 38%.
arXiv Detail & Related papers (2023-08-19T20:18:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.