Towards Cross-modal Retrieval in Chinese Cultural Heritage Documents: Dataset and Solution
- URL: http://arxiv.org/abs/2505.10921v1
- Date: Fri, 16 May 2025 06:52:46 GMT
- Title: Towards Cross-modal Retrieval in Chinese Cultural Heritage Documents: Dataset and Solution
- Authors: Junyi Yuan, Jian Zhang, Fangyu Wu, Dongming Lu, Huanda Lu, Qiufeng Wang
- Abstract summary: Cross-modal retrieval plays a pivotal role in understanding and interpreting Chinese cultural heritage. However, there is a lack of specialized datasets dedicated to Chinese cultural heritage. We propose LACLIP, a training-free local alignment strategy built upon a fine-tuned Chinese-CLIP.
- Score: 13.303788484531779
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: China has a long and rich history, encompassing a vast cultural heritage that includes diverse multimodal information, such as silk patterns, Dunhuang murals, and their associated historical narratives. Cross-modal retrieval plays a pivotal role in understanding and interpreting Chinese cultural heritage by bridging visual and textual modalities to enable accurate text-to-image and image-to-text retrieval. However, despite the growing interest in multimodal research, there is a lack of specialized datasets dedicated to Chinese cultural heritage, limiting the development and evaluation of cross-modal learning models in this domain. To address this gap, we propose a multimodal dataset named CulTi, which contains 5,726 image-text pairs extracted from two series of professional documents, respectively related to ancient Chinese silk and Dunhuang murals. Compared to existing general-domain multimodal datasets, CulTi presents a challenge for cross-modal retrieval: the difficulty of local alignment between intricate decorative motifs and specialized textual descriptions. To address this challenge, we propose LACLIP, a training-free local alignment strategy built upon a fine-tuned Chinese-CLIP. LACLIP enhances the alignment of global textual descriptions with local visual regions by computing weighted similarity scores during inference. Experimental results on CulTi demonstrate that LACLIP significantly outperforms existing models in cross-modal retrieval, particularly in handling fine-grained semantic associations within Chinese cultural heritage.
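The abstract describes LACLIP as computing weighted similarity scores between global textual descriptions and local visual regions at inference time, without additional training. The paper does not give the exact formula, but a minimal sketch of this kind of training-free local alignment, assuming CLIP-style normalized embeddings and a hypothetical softmax weighting over region-text similarities, might look like:

```python
import numpy as np

def weighted_similarity(text_emb, region_embs, temperature=0.1):
    """Score one text query against local image regions.

    Hypothetical sketch, not the paper's exact method:
      text_emb:    (d,) global text embedding from a CLIP-style encoder
      region_embs: (k, d) embeddings of k local visual regions
    """
    # L2-normalize so dot products become cosine similarities
    t = text_emb / np.linalg.norm(text_emb)
    r = region_embs / np.linalg.norm(region_embs, axis=1, keepdims=True)
    sims = r @ t  # (k,) per-region similarity to the text

    # Softmax weights emphasize the best-matching regions,
    # so one well-aligned motif can dominate the image-level score
    w = np.exp(sims / temperature)
    w /= w.sum()
    return float(w @ sims)  # weighted image-text similarity
```

At retrieval time, such a score would replace the single global image-text similarity when ranking candidates, letting a query about one decorative motif match an image where only a small region is relevant. The `temperature` parameter and the softmax aggregation here are illustrative assumptions.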
Related papers
- Hanfu-Bench: A Multimodal Benchmark on Cross-Temporal Cultural Understanding and Transcreation [34.186793081759525]
Hanfu-Bench is a novel, expert-curated multimodal dataset. It comprises two core tasks: cultural visual understanding and cultural image transcreation. Our evaluation shows that closed VLMs perform comparably to non-experts on visual cultural understanding but fall short of human experts by 10%. For the transcreation task, multi-faceted human evaluation indicates that the best-performing model achieves a success rate of only 42%.
arXiv Detail & Related papers (2025-06-02T11:43:46Z)
- RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding [79.44246283490665]
We introduce RAVENEA, a new benchmark designed to advance visual culture understanding through retrieval. RAVENEA focuses on two tasks: culture-focused visual question answering (cVQA) and culture-informed image captioning (cIC). We train and evaluate seven multimodal retrievers for each image query, and measure the downstream impact of retrieval-augmented inputs across fourteen state-of-the-art vision-language models.
arXiv Detail & Related papers (2025-05-20T14:57:16Z)
- TCC-Bench: Benchmarking the Traditional Chinese Culture Understanding Capabilities of MLLMs [13.069833806549914]
TCC-Bench is a benchmark for assessing the understanding of traditional Chinese culture by MLLMs. It comprises culturally rich and visually diverse data, incorporating images from museum artifacts, everyday life scenes, comics, and other culturally significant contexts. We adopt a semi-automated pipeline that utilizes GPT-4o in text-only mode to generate candidate questions, followed by human curation to ensure data quality and avoid potential data leakage.
arXiv Detail & Related papers (2025-05-16T14:10:41Z)
- CAReDiO: Cultural Alignment of LLM via Representativeness and Distinctiveness Guided Data Optimization [50.90288681622152]
As Large Language Models (LLMs) integrate more deeply into human life across various regions, cultural alignment becomes increasingly important. Existing approaches develop culturally aligned LLMs through fine-tuning with culture-specific corpora. We introduce CAReDiO, a novel cultural data construction framework.
arXiv Detail & Related papers (2025-04-09T13:40:13Z)
- Crossroads of Continents: Automated Artifact Extraction for Cultural Adaptation with Large Multimodal Models [22.92083941222383]
We introduce DalleStreet, a large-scale dataset generated by DALL-E 3 and validated by humans.
We find disparities in cultural understanding at geographic sub-region levels with both open-source (LLaVA) and closed-source (GPT-4V) models.
Our findings reveal a nuanced picture of the cultural competence of LMMs, highlighting the need to develop culture-aware systems.
arXiv Detail & Related papers (2024-07-02T08:55:41Z)
- CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark [68.21939124278065]
CVQA is a culturally-diverse multilingual Visual Question Answering benchmark designed to cover a rich set of languages and cultures.
CVQA includes culturally-driven images and questions from across 30 countries on four continents, covering 31 languages with 13 scripts, providing a total of 10k questions.
We benchmark several Multimodal Large Language Models (MLLMs) on CVQA, and show that the dataset is challenging for the current state-of-the-art models.
arXiv Detail & Related papers (2024-06-10T01:59:00Z)
- CIVICS: Building a Dataset for Examining Culturally-Informed Values in Large Language Models [59.22460740026037]
The "CIVICS: Culturally-Informed & Values-Inclusive Corpus for Societal impacts" dataset is designed to evaluate the social and cultural variation of Large Language Models (LLMs).
We create a hand-crafted, multilingual dataset of value-laden prompts which address specific socially sensitive topics, including LGBTQI rights, social welfare, immigration, disability rights, and surrogacy.
arXiv Detail & Related papers (2024-05-22T20:19:10Z)
- CHisIEC: An Information Extraction Corpus for Ancient Chinese History [12.41912979618724]
We present the "Chinese Historical Information Extraction Corpus" (CHisIEC) dataset.
CHisIEC is a meticulously curated dataset designed to develop and evaluate named entity recognition (NER) and relation extraction (RE) tasks.
The dataset encompasses four distinct entity types and twelve relation types, all meticulously labeled.
arXiv Detail & Related papers (2024-03-22T10:12:10Z)
- Massively Multi-Cultural Knowledge Acquisition & LM Benchmarking [48.21982147529661]
This paper introduces a novel approach for massively multicultural knowledge acquisition.
Our method strategically navigates from densely informative Wikipedia documents on cultural topics to an extensive network of linked pages.
Our work marks an important step towards deeper understanding and bridging the gaps of cultural disparities in AI.
arXiv Detail & Related papers (2024-02-14T18:16:54Z)
- Multi3WOZ: A Multilingual, Multi-Domain, Multi-Parallel Dataset for Training and Evaluating Culturally Adapted Task-Oriented Dialog Systems [64.40789703661987]
Multi3WOZ is a novel multilingual, multi-domain, multi-parallel ToD dataset.
It is large-scale and offers culturally adapted dialogs in 4 languages.
We describe a complex bottom-up data collection process that yielded the final dataset.
arXiv Detail & Related papers (2023-07-26T08:29:42Z)
- ScrollTimes: Tracing the Provenance of Paintings as a Window into History [35.605930297790465]
The study of cultural artifact provenance, tracing ownership and preservation, holds significant importance in archaeology and art history.
In collaboration with art historians, we examined the handscroll, a traditional Chinese painting form that provides a rich source of historical data.
We present a three-tiered methodology encompassing artifact, contextual, and provenance levels, designed to create a "Biography" for each handscroll.
arXiv Detail & Related papers (2023-06-15T03:38:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.