Understanding the World's Museums through Vision-Language Reasoning
- URL: http://arxiv.org/abs/2412.01370v1
- Date: Mon, 02 Dec 2024 10:54:31 GMT
- Title: Understanding the World's Museums through Vision-Language Reasoning
- Authors: Ada-Astrid Balauca, Sanjana Garai, Stefan Balauca, Rasesh Udayakumar Shetty, Naitik Agrawal, Dhwanil Subhashbhai Shah, Yuqian Fu, Xi Wang, Kristina Toutanova, Danda Pani Paudel, Luc Van Gool
- Abstract summary: Museums serve as vital repositories of cultural heritage and historical artifacts spanning diverse epochs, civilizations, and regions.
We collect and curate a large-scale dataset of 65M images and 200M question-answer pairs in the standard museum catalog format for exhibits from all around the world.
We train two VLMs from different categories: the BLIP model, with vision-language aligned embeddings, but lacking the expressive power of large language models, and the LLaVA model, a powerful instruction-tuned LLM enriched with vision-language reasoning capabilities.
- Score: 49.976422699906706
- License:
- Abstract: Museums serve as vital repositories of cultural heritage and historical artifacts spanning diverse epochs, civilizations, and regions, preserving well-documented collections. These records reveal key attributes such as age, origin, material, and cultural significance. Understanding museum exhibits from their images requires reasoning beyond visual features. In this work, we facilitate such reasoning by (a) collecting and curating a large-scale dataset of 65M images and 200M question-answer pairs in the standard museum catalog format for exhibits from all around the world; (b) training large vision-language models on the collected dataset; and (c) benchmarking their ability on five visual question answering tasks. The complete dataset is labeled by museum experts, ensuring both the quality and the practical significance of the labels. We train two VLMs from different categories: the BLIP model, with vision-language aligned embeddings but lacking the expressive power of large language models, and the LLaVA model, a powerful instruction-tuned LLM enriched with vision-language reasoning capabilities. Through exhaustive experiments, we provide several insights into the complex and fine-grained understanding of museum exhibits. In particular, we show that questions whose answers can often be derived directly from visual features are answered well by both types of models. In contrast, questions that require grounding the visual features in repositories of human knowledge are better answered by the large vision-language models, demonstrating their superior capacity to perform the desired reasoning. Find our dataset, benchmarks, and source code at: https://github.com/insait-institute/Museum-65
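To make the catalog-style question-answer format and the two model categories more concrete, the sketch below pairs a hypothetical catalog record with questions and queries a publicly available BLIP VQA checkpoint via Hugging Face transformers. The record fields, placeholder image URL, and checkpoint choice are illustrative assumptions, not the paper's released data or code (see the repository linked above for the latter).

```python
# Minimal, illustrative sketch (not the authors' released code): it shows how
# museum catalog attributes can be mapped to question-answer pairs and how a
# vision-language-aligned model such as BLIP can be queried with one of them.
# The record schema, image URL, and field names below are assumptions.
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

# Hypothetical museum-catalog record; the dataset's actual schema may differ.
record = {
    "image_url": "https://example.org/exhibit_001.jpg",  # placeholder URL
    "material": "bronze",
    "origin": "Han dynasty, China",
    "age": "circa 2nd century BCE",
}

# Catalog attributes rephrased as question-answer pairs.
qa_pairs = [
    ("What material is this exhibit made of?", record["material"]),
    ("Where does this exhibit originate from?", record["origin"]),
    ("How old is this exhibit?", record["age"]),
]

# Load a public BLIP VQA checkpoint (vision-language aligned embeddings,
# without a large language model behind them).
processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

image = Image.open(requests.get(record["image_url"], stream=True).raw).convert("RGB")
question, reference_answer = qa_pairs[0]

# Encode the image-question pair and generate an answer.
inputs = processor(image, question, return_tensors="pt")
output_ids = model.generate(**inputs)
prediction = processor.decode(output_ids[0], skip_special_tokens=True)

print(f"Q: {question}")
print(f"Predicted: {prediction} | Catalog answer: {reference_answer}")
```

A LLaVA-style model would be queried analogously through an instruction prompt (e.g., via LlavaForConditionalGeneration in the same library); knowledge-grounded questions such as origin and age are where the paper reports the larger instruction-tuned models doing better.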
Related papers
- WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines [74.25764182510295]
Vision Language Models (VLMs) often struggle with culture-specific knowledge, particularly in languages other than English.
We introduce WorldCuisines, a massive-scale benchmark for multilingual and multicultural, visually grounded language understanding.
This benchmark includes a visual question answering (VQA) dataset with text-image pairs across 30 languages and dialects, spanning 9 language families and featuring over 1 million data points.
arXiv Detail & Related papers (2024-10-16T16:11:49Z)
- Towards Vision-Language Geo-Foundation Model: A Survey [65.70547895998541]
Vision-Language Foundation Models (VLFMs) have made remarkable progress on various multimodal tasks.
This paper thoroughly reviews Vision-Language Geo-Foundation Models (VLGFMs), summarizing and analyzing recent developments in the field.
arXiv Detail & Related papers (2024-06-13T17:57:30Z)
- CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark [68.21939124278065]
CVQA is a culturally-diverse multilingual Visual Question Answering benchmark designed to cover a rich set of languages and cultures.
CVQA includes culturally-driven images and questions from across 30 countries on four continents, covering 31 languages with 13 scripts, providing a total of 10k questions.
We benchmark several Multimodal Large Language Models (MLLMs) on CVQA, and show that the dataset is challenging for the current state-of-the-art models.
arXiv Detail & Related papers (2024-06-10T01:59:00Z)
- WinoViz: Probing Visual Properties of Objects Under Different States [39.92628807477848]
We present a text-only evaluation dataset consisting of 1,380 examples that probe the reasoning abilities of language models regarding variant visual properties of objects under different contexts or states.
Our task is challenging since it requires pragmatic reasoning (finding intended meanings) and visual knowledge reasoning.
We also present multi-hop data, a more challenging version of our data, which requires multi-step reasoning chains to solve our task.
arXiv Detail & Related papers (2024-02-21T07:31:47Z)
- ViCLEVR: A Visual Reasoning Dataset and Hybrid Multimodal Fusion Model for Visual Question Answering in Vietnamese [1.6340299456362617]
We introduce the ViCLEVR dataset, a pioneering collection for evaluating various visual reasoning capabilities in Vietnamese.
We conduct a comprehensive analysis of contemporary visual reasoning systems, offering valuable insights into their strengths and limitations.
We present PhoVIT, a comprehensive multimodal fusion model that identifies objects in images based on the accompanying questions.
arXiv Detail & Related papers (2023-10-27T10:44:50Z)
- The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World [71.52132776748628]
We present the All-Seeing (AS) project: a large-scale dataset and model for recognizing and understanding everything in the open world.
We create a new dataset (AS-1B) with over 1 billion regions annotated with semantic tags, question-answering pairs, and detailed captions.
We develop the All-Seeing model (ASM), a unified framework for panoptic visual recognition and understanding.
arXiv Detail & Related papers (2023-08-03T17:59:47Z)
- Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions? [50.29862466940209]
We introduce InfoSeek, a visual question answering dataset tailored for information-seeking questions.
We analyze various pre-trained visual question answering models and gain insights into their characteristics.
We show that accurate visual entity recognition can be used to improve performance on InfoSeek by retrieving relevant documents.
arXiv Detail & Related papers (2023-02-23T00:33:54Z)
- VIPHY: Probing "Visible" Physical Commonsense Knowledge [22.00069189468524]
Vision-language models (VLMs) have shown remarkable performance on visual reasoning tasks.
We evaluate their ability to acquire "visible" physical knowledge.
Our results indicate a severe gap between model and human performance.
arXiv Detail & Related papers (2022-09-15T02:06:25Z)
- FashionVQA: A Domain-Specific Visual Question Answering System [2.6924405243296134]
We train a visual question answering (VQA) system to answer complex natural language questions about apparel in fashion photoshoot images.
The accuracy of the best model surpasses the human expert level, even when answering human-generated questions.
Our approach for generating a large-scale multimodal domain-specific dataset provides a path for training specialized models capable of communicating in natural language.
arXiv Detail & Related papers (2022-08-24T01:18:13Z)
- VisualMRC: Machine Reading Comprehension on Document Images [4.057968826847943]
Given a question and a document image, a machine reads and comprehends texts in the image to answer the question in natural language.
VisualMRC focuses more on developing natural language understanding and generation abilities.
It contains over 30,000 pairs of questions and abstractive answers for more than 10,000 document images sourced from multiple webpage domains.
arXiv Detail & Related papers (2021-01-27T09:03:06Z)