VideoNorms: Benchmarking Cultural Awareness of Video Language Models
- URL: http://arxiv.org/abs/2510.08543v1
- Date: Thu, 09 Oct 2025 17:54:55 GMT
- Title: VideoNorms: Benchmarking Cultural Awareness of Video Language Models
- Authors: Nikhil Reddy Varimalla, Yunfei Xu, Arkadiy Saakyan, Meng Fan Wang, Smaranda Muresan,
- Abstract summary: We introduce VideoNorms, a benchmark of over 1000 (video clip, norm) pairs from US and Chinese cultures. We use a human-AI collaboration framework, where a teacher model using theoretically-grounded prompting provides candidate annotations. We benchmark a variety of open-weight VideoLLMs on the new dataset, which highlights several common trends.
- Score: 19.29068943180369
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As Video Large Language Models (VideoLLMs) are deployed globally, they require understanding of and grounding in the relevant cultural background. To properly assess these models' cultural awareness, adequate benchmarks are needed. We introduce VideoNorms, a benchmark of over 1000 (video clip, norm) pairs from US and Chinese cultures annotated with socio-cultural norms grounded in speech act theory, norm adherence and violation labels, and verbal and non-verbal evidence. To build VideoNorms, we use a human-AI collaboration framework, in which a teacher model using theoretically-grounded prompting provides candidate annotations and a set of trained human experts validate and correct the annotations. We benchmark a variety of open-weight VideoLLMs on the new dataset, revealing several common trends: 1) models perform worse on norm violations than on adherence; 2) models perform worse on Chinese culture than on US culture; 3) models have more difficulty providing non-verbal evidence than verbal evidence for the norm adherence/violation label, and struggle to identify the exact norm corresponding to a speech act; and 4) unlike humans, models perform worse in formal, non-humorous contexts. Our findings emphasize the need for culturally-grounded video language model training - a gap our benchmark and framework begin to address.
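To make the benchmark's structure concrete, here is a minimal sketch of how a (video clip, norm) record and a simple accuracy evaluation might look, assuming hypothetical field names and a hypothetical `classify_norm` model interface; the paper's released data format and evaluation code may differ.

```python
from dataclasses import dataclass

@dataclass
class NormAnnotation:
    """One (video clip, norm) pair; all field names are illustrative guesses."""
    clip_id: str
    culture: str             # "US" or "Chinese"
    norm: str                # socio-cultural norm grounded in a speech act
    label: str               # "adherence" or "violation"
    verbal_evidence: str     # supporting quote from the dialogue
    nonverbal_evidence: str  # described gesture, expression, or action

def evaluate(model, dataset: list[NormAnnotation]) -> float:
    """Accuracy of a VideoLLM on the adherence/violation task (sketch only)."""
    correct = 0
    for ex in dataset:
        # `classify_norm` is a hypothetical interface: given a clip and a
        # candidate norm, the model predicts "adherence" or "violation".
        pred = model.classify_norm(ex.clip_id, ex.norm)
        correct += int(pred == ex.label)
    return correct / len(dataset)
```

Given the trends reported in the abstract, such an accuracy would be most informative when broken out by label (adherence vs. violation) and by culture, since those are the splits on which models diverge.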
Related papers
- AVMeme Exam: A Multimodal Multilingual Multicultural Benchmark for LLMs' Contextual and Cultural Knowledge and Thinking [59.15472057710525]
AVMeme Exam is a human-curated benchmark of over one thousand iconic Internet sounds and videos spanning speech, songs, music, and sound effects. Each meme is paired with a unique Q&A assessing levels of understanding from surface content to context and emotion to usage and world knowledge. We systematically evaluate state-of-the-art multimodal large language models (MLLMs) alongside human participants using this benchmark.
arXiv Detail & Related papers (2026-01-25T01:40:15Z) - CURVE: A Benchmark for Cultural and Multilingual Long Video Reasoning [58.73855961335903]
CURVE (Cultural Understanding and Reasoning in Video Evaluation) is a challenging benchmark for multicultural and multilingual video reasoning. It comprises high-quality, entirely human-generated annotations from diverse, region-specific cultural videos across 18 global locales. Our evaluations reveal that SoTA Video-LLMs struggle significantly, performing substantially below human-level accuracy.
arXiv Detail & Related papers (2026-01-15T18:15:06Z) - Where Culture Fades: Revealing the Cultural Gap in Text-to-Image Generation [43.352493955825736]
We show that current T2I models often produce culturally neutral or English-biased results under multilingual prompts. We propose a probing method that localizes culture-sensitive signals to a small set of neurons in a few fixed layers.
arXiv Detail & Related papers (2025-11-21T14:40:50Z) - Culture in Action: Evaluating Text-to-Image Models through Social Activities [40.874302288116304]
Text-to-image (T2I) models achieve impressive photorealism by training on large-scale web data, but they inherit cultural biases and fail to depict underrepresented regions faithfully. We introduce CULTIVate, a benchmark for evaluating T2I models on cross-cultural activities. We propose four metrics to measure cultural alignment, hallucination, exaggerated elements, and diversity.
arXiv Detail & Related papers (2025-11-07T19:51:11Z) - Identity-Aware Large Language Models require Cultural Reasoning [3.1866496693431934]
We define cultural reasoning as the capacity of a model to recognise culture-specific knowledge, values, and social norms. Because culture shapes interpretation, emotional resonance, and acceptable behaviour, cultural reasoning is essential for identity-aware AI. We argue that cultural reasoning must be treated as a foundational capability alongside factual accuracy and linguistic coherence.
arXiv Detail & Related papers (2025-10-21T10:50:51Z) - Hire Your Anthropologist! Rethinking Culture Benchmarks Through an Anthropological Lens [9.000522371422628]
We introduce a four-part framework that categorizes how benchmarks frame culture. We qualitatively examine 20 cultural benchmarks and identify six recurring methodological issues. Our aim is to guide the development of cultural benchmarks that go beyond static recall tasks.
arXiv Detail & Related papers (2025-10-07T13:42:44Z) - VideoVista-CulturalLingo: 360$^\circ$ Horizons-Bridging Cultures, Languages, and Domains in Video Comprehension [66.03062468036507]
We present VideoVista-CulturalLingo, the first video evaluation benchmark designed to bridge cultural, linguistic, and domain divides in video comprehension. VideoVista-CulturalLingo contains 1,389 videos and 3,134 QA pairs, and we have evaluated 24 recent open-source or proprietary video large models.
arXiv Detail & Related papers (2025-04-23T13:47:30Z) - Attributing Culture-Conditioned Generations to Pretraining Corpora [26.992883552982335]
We analyze how models associate entities with cultures based on pretraining data patterns. We find that high-frequency cultures in pretraining data yield more generations with memorized symbols, while some low-frequency cultures produce none.
arXiv Detail & Related papers (2024-12-30T07:09:25Z) - CulturePark: Boosting Cross-cultural Understanding in Large Language Models [63.452948673344395]
This paper introduces CulturePark, an LLM-powered multi-agent communication framework for cultural data collection.
It generates high-quality cross-cultural dialogues encapsulating human beliefs, norms, and customs.
We evaluate these models across three downstream tasks: content moderation, cultural alignment, and cultural education.
arXiv Detail & Related papers (2024-05-24T01:49:02Z) - Not All Countries Celebrate Thanksgiving: On the Cultural Dominance in Large Language Models [89.94270049334479]
This paper identifies a cultural dominance issue within large language models (LLMs).
LLMs often provide inappropriate English-culture-related answers that are not relevant to the expected culture when users ask in non-English languages.
arXiv Detail & Related papers (2023-10-19T05:38:23Z) - On the Cultural Gap in Text-to-Image Generation [75.69755281031951]
One challenge in text-to-image (T2I) generation is the inadvertent reflection of culture gaps present in the training data.
There is no benchmark to systematically evaluate a T2I model's ability to generate cross-cultural images.
We propose a Challenging Cross-Cultural (C3) benchmark with comprehensive evaluation criteria, which can assess how well-suited a model is to a target culture.
arXiv Detail & Related papers (2023-07-06T13:17:55Z)