Explainable Search and Discovery of Visual Cultural Heritage Collections with Multimodal Large Language Models
- URL: http://arxiv.org/abs/2411.04663v1
- Date: Thu, 07 Nov 2024 12:48:39 GMT
- Title: Explainable Search and Discovery of Visual Cultural Heritage Collections with Multimodal Large Language Models
- Authors: Taylor Arnold, Lauren Tilton
- Abstract summary: We introduce a method for using state-of-the-art multimodal large language models (LLMs) to enable an open-ended, explainable search and discovery interface for visual collections.
We show how our approach can create novel clustering and recommendation systems that avoid common pitfalls of methods based directly on visual embeddings.
- Abstract: Many cultural institutions have made large digitized visual collections available online, often under permissible re-use licences. Creating interfaces for exploring and searching these collections is difficult, particularly in the absence of granular metadata. In this paper, we introduce a method for using state-of-the-art multimodal large language models (LLMs) to enable an open-ended, explainable search and discovery interface for visual collections. We show how our approach can create novel clustering and recommendation systems that avoid common pitfalls of methods based directly on visual embeddings. Of particular interest is the ability to offer concrete textual explanations of each recommendation without the need to preselect the features of interest. Together, these features can create a digital interface that is more open-ended and flexible while also being better suited to addressing privacy and ethical concerns. Through a case study using a collection of documentary photographs, we provide several metrics showing the efficacy and possibilities of our approach.
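As an illustration of how such a pipeline might be assembled (the abstract does not specify an implementation, so the model names, prompt, and sentence encoder below are assumptions rather than the authors' exact choices), a minimal Python sketch: generate an open-ended textual description of each image with a multimodal LLM, embed the descriptions rather than the pixels, and return the description text itself as the explanation attached to each recommendation.

```python
# Minimal sketch (not the authors' published pipeline): describe each image with
# a multimodal LLM, embed the generated descriptions, and recommend neighbours
# whose descriptions double as textual explanations.
# Assumes an OpenAI-compatible API key plus `openai`, `sentence-transformers`,
# and `numpy` installed; the model names and prompt are illustrative.
import base64
from pathlib import Path

import numpy as np
from openai import OpenAI
from sentence_transformers import SentenceTransformer

client = OpenAI()
encoder = SentenceTransformer("all-MiniLM-L6-v2")

PROMPT = ("Describe this documentary photograph: its setting, subjects, "
          "composition, and likely time period. Avoid speculation about "
          "identifiable individuals.")

def describe_image(path: Path) -> str:
    """Ask a multimodal LLM for an open-ended textual description."""
    b64 = base64.b64encode(path.read_bytes()).decode()
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable multimodal model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# 1. Generate descriptions for the whole collection (cache these in practice).
paths = sorted(Path("collection/").glob("*.jpg"))
descriptions = [describe_image(p) for p in paths]

# 2. Embed the text, not the pixels, so similarity reflects stated content.
emb = encoder.encode(descriptions, normalize_embeddings=True)

# 3. Recommend neighbours of a query image and explain each recommendation
#    by quoting its generated description.
def recommend(query_idx: int, k: int = 5):
    scores = emb @ emb[query_idx]          # cosine similarity (normalized)
    order = [i for i in np.argsort(-scores) if i != query_idx][:k]
    return [(paths[i].name, float(scores[i]), descriptions[i]) for i in order]

for name, score, why in recommend(0):
    print(f"{name} ({score:.2f}): {why[:120]}...")
```

Because similarity is computed over generated text, every recommended neighbour arrives with a human-readable rationale rather than an opaque embedding distance, which is the property the abstract emphasizes.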
Related papers
- MultiVENT 2.0: A Massive Multilingual Benchmark for Event-Centric Video Retrieval [57.891157692501345]
MultiVENT 2.0 is a large-scale, multilingual, event-centric video retrieval benchmark.
It features a collection of more than 218,000 news videos and 3,906 queries targeting specific world events.
Preliminary results show that state-of-the-art vision-language models struggle significantly with this task.
arXiv Detail & Related papers (2024-10-15T13:56:34Z)
- A Survey of Multimodal Composite Editing and Retrieval [7.966265020507201]
This survey is the first comprehensive review of the literature on multimodal composite retrieval.
It covers image-text composite editing, image-text composite retrieval, and other multimodal composite retrieval.
We systematically organize the application scenarios, methods, benchmarks, experiments, and future directions.
arXiv Detail & Related papers (2024-09-09T08:06:50Z)
- Leveraging Large Language Models for Multimodal Search [0.6249768559720121]
This paper introduces a novel multimodal search model that achieves a new performance milestone on the Fashion200K dataset.
We also propose a novel search interface integrating Large Language Models (LLMs) to facilitate natural language interaction.
arXiv Detail & Related papers (2024-04-24T10:30:42Z)
- Multi-Modal Proxy Learning Towards Personalized Visual Multiple Clustering [8.447067012487866]
Multi-MaP is a novel method employing a multi-modal proxy learning process.
It not only captures a user's interest via a keyword but also facilitates identifying relevant clusterings.
Our experiments show that Multi-MaP consistently outperforms state-of-the-art methods in all benchmark multi-clustering vision tasks.
arXiv Detail & Related papers (2024-04-24T05:20:42Z)
- DialCLIP: Empowering CLIP as Multi-Modal Dialog Retriever [83.33209603041013]
We propose a parameter-efficient prompt-tuning method named DialCLIP for multi-modal dialog retrieval.
Our approach introduces a multi-modal context generator to learn context features which are distilled into prompts within the pre-trained vision-language model CLIP.
To facilitate various types of retrieval, we also design multiple experts to learn mappings from CLIP outputs to multi-modal representation space.
arXiv Detail & Related papers (2024-01-02T07:40:12Z)
- Open Visual Knowledge Extraction via Relation-Oriented Multimodality Model Prompting [89.95541601837719]
We take a first step toward a new paradigm of open visual knowledge extraction.
OpenVik consists of an open relational region detector, which detects regions potentially containing relational knowledge, and a visual knowledge generator, which produces format-free knowledge by prompting a large multimodality model with the detected regions of interest.
arXiv Detail & Related papers (2023-10-28T20:09:29Z)
- Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object Detection [72.36017150922504]
We propose a multi-modal contextual knowledge distillation framework, MMC-Det, to transfer the learned contextual knowledge from a teacher fusion transformer to a student detector.
Diverse multi-modal masked language modeling is realized by imposing an object divergence constraint on traditional multi-modal masked language modeling (MLM).
arXiv Detail & Related papers (2023-08-30T08:33:13Z)
- StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data [129.92449761766025]
We propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning.
This approach harnesses generative models, combining the capabilities of ChatGPT with text-to-image generation.
Our research includes comprehensive experiments conducted on various datasets.
arXiv Detail & Related papers (2023-08-20T12:43:52Z)
- Multi-Modal Few-Shot Object Detection with Meta-Learning-Based Cross-Modal Prompting [77.69172089359606]
We study multi-modal few-shot object detection (FSOD) in this paper, using both few-shot visual examples and class semantic information for detection.
Our approach is motivated by the high-level conceptual similarity of (metric-based) meta-learning and prompt-based learning.
We comprehensively evaluate the proposed multi-modal FSOD models on multiple few-shot object detection benchmarks, achieving promising results.
arXiv Detail & Related papers (2022-04-16T16:45:06Z)
- Object Retrieval and Localization in Large Art Collections using Deep Multi-Style Feature Fusion and Iterative Voting [10.807131260367298]
We introduce an algorithm that allows users to search for image regions containing specific motifs or objects.
Our region-based voting with GPU-accelerated approximate nearest-neighbour search allows us to find and localize even small motifs within an extensive dataset in a few seconds.
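A minimal sketch of that retrieval step, assuming region descriptors have already been extracted by some feature backbone; the FAISS index, feature dimensionality, and score-weighted voting below are illustrative stand-ins for the paper's GPU-accelerated nearest-neighbour search and iterative voting, not its actual implementation.

```python
# Minimal sketch (not the paper's implementation): index region-level
# descriptors with FAISS and aggregate per-image votes for a query motif.
# `region_feats` and `region_to_image` are placeholder data; in practice they
# come from region extraction over the art collection.
from collections import Counter

import faiss
import numpy as np

d = 512                                    # descriptor dimensionality (assumed)
rng = np.random.default_rng(0)
region_feats = rng.standard_normal((100_000, d)).astype("float32")   # placeholder
region_to_image = rng.integers(0, 5_000, size=100_000)               # placeholder

faiss.normalize_L2(region_feats)           # cosine similarity via inner product
index = faiss.IndexFlatIP(d)               # exact search; swap for IVF/PQ or a GPU index
index.add(region_feats)

def search_motif(query_region: np.ndarray, k: int = 200, top_images: int = 10):
    """Return the images whose regions most strongly match the query motif."""
    q = query_region.astype("float32").reshape(1, -1)
    faiss.normalize_L2(q)
    scores, ids = index.search(q, k)       # k nearest region descriptors
    votes = Counter()
    for score, rid in zip(scores[0], ids[0]):
        votes[int(region_to_image[rid])] += float(score)   # score-weighted vote
    return votes.most_common(top_images)

print(search_motif(region_feats[0]))
```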
arXiv Detail & Related papers (2021-07-14T18:40:49Z)
- A unified framework based on graph consensus term for multi-view learning [5.168659132277719]
We propose a novel multi-view learning framework that unifies most existing graph embedding methods into a single formulation.
Our method explores the graph structure of each view independently, preserving the diversity of graph embedding methods.
In this way, the diversity and complementarity of information across different views can be considered simultaneously.
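Schematically, and only as an assumption about what such a unified formulation might look like (the summary above does not give the paper's actual objective), a graph consensus term typically couples per-view graph embeddings to a shared structure:

```latex
% Schematic only: a common shape for a multi-view graph-embedding objective
% with a consensus term, not necessarily this paper's exact formulation.
\begin{equation}
\min_{\{Y^{(v)}\},\, S}\;
\sum_{v=1}^{V} \operatorname{tr}\!\left( Y^{(v)\top} L^{(v)} Y^{(v)} \right)
\;+\; \lambda \sum_{v=1}^{V} \bigl\| W^{(v)} - S \bigr\|_F^2
\quad \text{s.t.}\quad Y^{(v)\top} Y^{(v)} = I,
\end{equation}
% where $W^{(v)}$ and $L^{(v)}$ are the affinity matrix and graph Laplacian of
% view $v$, $Y^{(v)}$ its embedding, and $S$ a consensus graph shared across views.
```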
arXiv Detail & Related papers (2021-05-25T09:22:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.