JourneyDB: A Benchmark for Generative Image Understanding
- URL: http://arxiv.org/abs/2307.00716v2
- Date: Sat, 28 Oct 2023 11:46:07 GMT
- Title: JourneyDB: A Benchmark for Generative Image Understanding
- Authors: Keqiang Sun, Junting Pan, Yuying Ge, Hao Li, Haodong Duan, Xiaoshi Wu,
Renrui Zhang, Aojun Zhou, Zipeng Qin, Yi Wang, Jifeng Dai, Yu Qiao, Limin
Wang, Hongsheng Li
- Abstract summary: We introduce a comprehensive dataset, referred to as JourneyDB, that caters to the domain of generative images.
Our meticulously curated dataset comprises 4 million distinct and high-quality generated images.
On our dataset, we have devised four benchmarks to assess the performance of generated image comprehension.
- Score: 89.02046606392382
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While recent advancements in vision-language models have had a transformative
impact on multi-modal comprehension, the extent to which these models possess
the ability to comprehend generated images remains uncertain. Synthetic images,
in comparison to real data, encompass a higher level of diversity in terms of
both content and style, thereby presenting significant challenges for the
models to fully grasp. In light of this challenge, we introduce a comprehensive
dataset, referred to as JourneyDB, that caters to the domain of generative
images within the context of multi-modal visual understanding. Our meticulously
curated dataset comprises 4 million distinct and high-quality generated images,
each paired with the corresponding text prompts that were employed in their
creation. Furthermore, we introduce an external subset containing the results of
22 additional text-to-image generative models, which makes JourneyDB a
comprehensive benchmark for evaluating the comprehension of generated images.
On our dataset, we have devised four benchmarks to assess the performance of
generated image comprehension in relation to both content and style
interpretation. These benchmarks encompass prompt inversion, style retrieval,
image captioning, and visual question answering. Lastly, we evaluate the
performance of state-of-the-art multi-modal models when applied to the
JourneyDB dataset, providing a comprehensive analysis of their strengths and
limitations in comprehending generated content. We anticipate that the proposed
dataset and benchmarks will facilitate further research in the field of
generative content understanding. The dataset is publicly available at
https://journeydb.github.io.
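The abstract states that every image is paired with the prompt used to generate it and that four benchmarks (prompt inversion, style retrieval, image captioning, and visual question answering) are defined on top of this data. As a minimal sketch of how such a record and one of the evaluations might fit together, the hypothetical Python below defines an assumed sample schema and a toy exact-match VQA loop; the JourneyDBSample fields, the answer_fn interface, and the metric are illustrative assumptions, not the dataset's published format.

```python
# Hypothetical sketch only: the field names and evaluation metric below are
# assumptions for illustration, not JourneyDB's actual release format.
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class JourneyDBSample:
    """One generated image with the annotations the four benchmarks rely on."""
    image_path: str                 # path to the generated image
    prompt: str                     # text prompt used to create it (prompt-inversion target)
    style_tags: List[str] = field(default_factory=list)   # labels usable for style retrieval
    caption: str = ""                                      # reference caption for image captioning
    vqa: Dict[str, str] = field(default_factory=dict)      # question -> answer pairs for VQA


def evaluate_vqa(samples: List[JourneyDBSample],
                 answer_fn: Callable[[str, str], str]) -> float:
    """Toy exact-match VQA accuracy; real evaluations typically use more forgiving metrics."""
    correct, total = 0, 0
    for sample in samples:
        for question, gold in sample.vqa.items():
            prediction = answer_fn(sample.image_path, question)  # answer_fn stands in for any multi-modal model
            correct += int(prediction.strip().lower() == gold.strip().lower())
            total += 1
    return correct / max(total, 1)
```

A model under test would only need to supply answer_fn(image_path, question) -> str; swapping the loop body would give analogous toy harnesses for captioning or prompt inversion.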
Related papers
- Interleaved Scene Graph for Interleaved Text-and-Image Generation Assessment [53.45813302866466]
We present ISG, a comprehensive evaluation framework for interleaved text-and-image generation.
ISG evaluates responses on four levels of granularity: holistic, structural, block-level, and image-specific.
In conjunction with ISG, we introduce a benchmark, ISG-Bench, encompassing 1,150 samples across 8 categories and 21 subcategories.
arXiv Detail & Related papers (2024-11-26T07:55:57Z)
- Knowledge-Aware Reasoning over Multimodal Semi-structured Tables [85.24395216111462]
This study investigates whether current AI models can perform knowledge-aware reasoning on multimodal structured data.
We introduce MMTabQA, a new dataset designed for this purpose.
Our experiments highlight substantial challenges for current AI models in effectively integrating and interpreting multiple text and image inputs.
arXiv Detail & Related papers (2024-08-25T15:17:43Z)
- ARMADA: Attribute-Based Multimodal Data Augmentation [93.05614922383822]
Attribute-based Multimodal Data Augmentation (ARMADA) is a novel multimodal data augmentation method based on knowledge-guided manipulation of visual attributes.
The framework extracts knowledge-grounded attributes from symbolic KBs to generate semantically consistent yet distinctive image-text pairs.
This also highlights the need to leverage external knowledge proxies for enhanced interpretability and real-world grounding.
arXiv Detail & Related papers (2024-08-19T15:27:25Z)
- TRINS: Towards Multimodal Language Models that Can Read [61.17806538631744]
TRINS is a Text-Rich image INStruction dataset.
It contains 39,153 text-rich images, captions, and 102,437 questions.
We introduce a Language-vision Reading Assistant (LaRA), which is adept at understanding textual content within images.
arXiv Detail & Related papers (2024-06-10T18:52:37Z)
- ViCLEVR: A Visual Reasoning Dataset and Hybrid Multimodal Fusion Model for Visual Question Answering in Vietnamese [1.6340299456362617]
We introduce the ViCLEVR dataset, a pioneering collection for evaluating various visual reasoning capabilities in Vietnamese.
We conduct a comprehensive analysis of contemporary visual reasoning systems, offering valuable insights into their strengths and limitations.
We present PhoVIT, a comprehensive multimodal fusion model that identifies objects in images based on the given questions.
arXiv Detail & Related papers (2023-10-27T10:44:50Z)
- EDIS: Entity-Driven Image Search over Multimodal Web Content [95.40238328527931]
We introduce Entity-Driven Image Search (EDIS), a dataset for cross-modal image search in the news domain.
EDIS consists of 1 million web images from actual search engine results and curated datasets, with each image paired with a textual description.
arXiv Detail & Related papers (2023-05-23T02:59:19Z)