Related papers: EgMM-Corpus: A Multimodal Vision-Language Dataset for Egyptian Culture

URL: http://arxiv.org/abs/2510.16198v1
Date: Fri, 17 Oct 2025 20:21:01 GMT
Title: EgMM-Corpus: A Multimodal Vision-Language Dataset for Egyptian Culture
Authors: Mohamed Gamil, Abdelrahman Elsayed, Abdelrahman Lila, Ahmed Gad, Hesham Abdelgawad, Mohamed Aref, Ahmed Fares
Abstract summary: We introduce EgMM-Corpus, a multimodal dataset dedicated to Egyptian culture. Each entry in the dataset is manually validated for cultural authenticity and multimodal coherence. We evaluate the zero-shot performance of Contrastive Language-Image Pre-training (CLIP) on EgMM-Corpus.
Score: 1.0170138197592686
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Despite recent advances in AI, multimodal culturally diverse datasets are still limited, particularly for regions in the Middle East and Africa. In this paper, we introduce EgMM-Corpus, a multimodal dataset dedicated to Egyptian culture. By designing and running a new data collection pipeline, we collected over 3,000 images, covering 313 concepts across landmarks, food, and folklore. Each entry in the dataset is manually validated for cultural authenticity and multimodal coherence. EgMM-Corpus aims to provide a reliable resource for evaluating and training vision-language models in an Egyptian cultural context. We further evaluate the zero-shot performance of Contrastive Language-Image Pre-training (CLIP) on EgMM-Corpus, on which it achieves 21.2% Top-1 accuracy and 36.4% Top-5 accuracy in classification. These results underscore the existing cultural bias in large-scale vision-language models and demonstrate the importance of EgMM-Corpus as a benchmark for developing culturally aware models.
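
Note on the metrics: Top-1 and Top-5 accuracy, as reported in the abstract, are usually computed by ranking the candidate concepts for each image by similarity score and checking whether the true concept falls within the first k. The sketch below illustrates that general definition only; it is not the authors' evaluation code. The function name `top_k_accuracy` and the toy scores are assumptions made up for this example.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes.

    scores: list of per-image score lists (one score per candidate concept).
    labels: list of true concept indices, one per image.
    """
    hits = 0
    for row, label in zip(scores, labels):
        # Concept indices sorted by descending similarity score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Toy image-to-concept similarity scores (made-up numbers: 3 images, 4 concepts).
scores = [
    [0.9, 0.1, 0.3, 0.2],  # true concept 0 is ranked first
    [0.2, 0.5, 0.4, 0.1],  # true concept 2 is ranked second
    [0.3, 0.6, 0.2, 0.1],  # true concept 3 is ranked last
]
labels = [0, 2, 3]

top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
print(f"Top-1: {top1:.3f}, Top-2: {top2:.3f}")
```

In a zero-shot CLIP setting, each row of `scores` would typically hold the image-to-text similarities between one image and a text prompt for every candidate concept, so with 313 concepts each row would have 313 entries.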

Related papers

CultureScope: A Dimensional Lens for Probing Cultural Understanding in LLMs [57.653830744706305]
CultureScope is the most comprehensive evaluation framework to date for assessing cultural understanding in large language models. Inspired by the cultural iceberg theory, we design a novel dimensional schema for cultural knowledge classification. Experimental results demonstrate that our method can effectively evaluate cultural understanding.
arXiv Detail & Related papers (2025-09-19T17:47:48Z)
Pearl: A Multimodal Culturally-Aware Arabic Instruction Dataset [28.016981736730617]
PEARL is a large-scale Arabic multimodal dataset and benchmark designed for cultural understanding. PEARL comprises over 309K examples spanning ten culturally significant domains covering all Arab countries.
arXiv Detail & Related papers (2025-05-28T05:14:47Z)
RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding [79.44246283490665]
We introduce RAVENEA, a new benchmark designed to advance visual culture understanding through retrieval. RAVENEA focuses on two tasks: culture-focused visual question answering (cVQA) and culture-informed image captioning (cIC). We train and evaluate seven multimodal retrievers for each image query, and measure the downstream impact of retrieval-augmented inputs across fourteen state-of-the-art vision-language models.
arXiv Detail & Related papers (2025-05-20T14:57:16Z)
CAReDiO: Cultural Alignment of LLM via Representativeness and Distinctiveness Guided Data Optimization [50.90288681622152]
Large Language Models (LLMs) are becoming more deeply integrated into human life across various regions. Existing approaches develop culturally aligned LLMs through fine-tuning with culture-specific corpora. We introduce CAReDiO, a novel cultural data construction framework.
arXiv Detail & Related papers (2025-04-09T13:40:13Z)
GIMMICK -- Globally Inclusive Multimodal Multitask Cultural Knowledge Benchmarking [29.664707739055068]
We introduce GIMMICK, an extensive benchmark designed to assess a broad spectrum of cultural knowledge across 144 countries. GIMMICK comprises six tasks built upon three new datasets that span 728 unique cultural events or facets. We examine (1) regional cultural biases, (2) the influence of model size, (3) input modalities, and (4) external cues.
arXiv Detail & Related papers (2025-02-19T14:27:40Z)
CultureVLM: Characterizing and Improving Cultural Understanding of Vision-Language Models for over 100 Countries [63.00147630084146]
Vision-language models (VLMs) have advanced human-AI interaction but struggle with cultural understanding. CultureVerse is a large-scale multimodal benchmark covering 19,682 cultural concepts, 188 countries/regions, 15 cultural concepts, and 3 question types. We propose CultureVLM, a series of VLMs fine-tuned on our dataset to achieve significant performance improvement in cultural understanding.
arXiv Detail & Related papers (2025-01-02T14:42:37Z)
Crossroads of Continents: Automated Artifact Extraction for Cultural Adaptation with Large Multimodal Models [22.92083941222383]
We introduce DalleStreet, a large-scale dataset generated by DALL-E 3 and validated by humans. We find disparities in cultural understanding at geographic sub-region levels with both open-source (LLaVA) and closed-source (GPT-4V) models. Our findings reveal a nuanced picture of the cultural competence of LMMs, highlighting the need to develop culture-aware systems.
arXiv Detail & Related papers (2024-07-02T08:55:41Z)
CulturePark: Boosting Cross-cultural Understanding in Large Language Models [63.452948673344395]
This paper introduces CulturePark, an LLM-powered multi-agent communication framework for cultural data collection. It generates high-quality cross-cultural dialogues encapsulating human beliefs, norms, and customs. We evaluate these models across three downstream tasks: content moderation, cultural alignment, and cultural education.
arXiv Detail & Related papers (2024-05-24T01:49:02Z)
On the Cultural Gap in Text-to-Image Generation [75.69755281031951]
One challenge in text-to-image (T2I) generation is the inadvertent reflection of culture gaps present in the training data. There is no benchmark to systematically evaluate a T2I model's ability to generate cross-cultural images. We propose a Challenging Cross-Cultural (C3) benchmark with comprehensive evaluation criteria, which can assess how well-suited a model is to a target culture.
arXiv Detail & Related papers (2023-07-06T13:17:55Z)

This list is automatically generated from the titles and abstracts of the papers on this site.