VideoVista-CulturalLingo: 360$^\circ$ Horizons-Bridging Cultures, Languages, and Domains in Video Comprehension
- URL: http://arxiv.org/abs/2504.17821v2
- Date: Tue, 20 May 2025 09:15:16 GMT
- Title: VideoVista-CulturalLingo: 360$^\circ$ Horizons-Bridging Cultures, Languages, and Domains in Video Comprehension
- Authors: Xinyu Chen, Yunxin Li, Haoyuan Shi, Baotian Hu, Wenhan Luo, Yaowei Wang, Min Zhang
- Abstract summary: We present VideoVista-CulturalLingo, the first video evaluation benchmark designed to bridge cultural, linguistic, and domain divides in video comprehension. VideoVista-CulturalLingo contains 1,389 videos and 3,134 QA pairs, and we have evaluated 24 recent open-source and proprietary large video models.
- Score: 66.03062468036507
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Assessing the video comprehension capabilities of multimodal AI systems can effectively measure their understanding and reasoning abilities. Most video evaluation benchmarks are limited to a single language, typically English, and predominantly feature videos rooted in Western cultural contexts. In this paper, we present VideoVista-CulturalLingo, the first video evaluation benchmark designed to bridge cultural, linguistic, and domain divides in video comprehension. Our work differs from existing benchmarks in the following ways: 1) Cultural diversity, incorporating cultures from China, North America, and Europe; 2) Multilingualism, with questions presented in Chinese and English, two of the most widely spoken languages; and 3) Broad domain coverage, featuring videos sourced from hundreds of human-created domains. VideoVista-CulturalLingo contains 1,389 videos and 3,134 QA pairs, and we have evaluated 24 recent open-source and proprietary large video models. From the experimental results, we observe that: 1) Existing models perform worse on Chinese-centric questions than Western-centric ones, particularly those related to Chinese history; 2) Current open-source models still exhibit limitations in temporal understanding, especially in the Event Localization task, achieving a maximum score of only 45.2%; 3) Mainstream models demonstrate strong performance on general scientific questions, while open-source models perform weakly on mathematics.
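As a concrete, purely illustrative picture of how such a benchmark evaluation is typically scored, the sketch below computes overall and per-task accuracy over multiple-choice QA pairs. The file name, record fields, and the `model_predict` stub are assumptions for illustration only; the abstract does not specify the benchmark's data format or evaluation harness.

```python
import json
from collections import defaultdict


def model_predict(video_path: str, question: str, options: list[str]) -> str:
    """Placeholder for a video LLM call; returns one of the answer options.

    Hypothetical stub -- swap in an actual open-source or proprietary model.
    """
    return "A"


def evaluate(qa_file: str) -> None:
    # Assumed (hypothetical) record layout: video, question, options, answer, task.
    with open(qa_file, encoding="utf-8") as f:
        qa_pairs = json.load(f)

    correct, total = defaultdict(int), defaultdict(int)
    for qa in qa_pairs:
        pred = model_predict(qa["video"], qa["question"], qa["options"])
        total[qa["task"]] += 1
        correct[qa["task"]] += pred == qa["answer"]

    # Per-task scores (e.g. Event Localization) plus the overall average.
    for task in sorted(total):
        print(f"{task}: {100 * correct[task] / total[task]:.1f}%")
    print(f"Overall: {100 * sum(correct.values()) / sum(total.values()):.1f}%")


if __name__ == "__main__":
    evaluate("videovista_culturallingo_qa.json")  # hypothetical file name
```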
Related papers
- AVMeme Exam: A Multimodal Multilingual Multicultural Benchmark for LLMs' Contextual and Cultural Knowledge and Thinking [59.15472057710525]
AVMeme Exam is a human-curated benchmark of over one thousand iconic Internet sounds and videos spanning speech, songs, music, and sound effects.
Each meme is paired with a unique Q&A assessing levels of understanding from surface content to context and emotion to usage and world knowledge.
We systematically evaluate state-of-the-art multimodal large language models (MLLMs) alongside human participants using this benchmark.
arXiv Detail & Related papers (2026-01-25T01:40:15Z)
- CURVE: A Benchmark for Cultural and Multilingual Long Video Reasoning [58.73855961335903]
CURVE (Cultural Understanding and Reasoning in Video Evaluation) is a challenging benchmark for multicultural and multilingual video reasoning.
It comprises high-quality, entirely human-generated annotations from diverse, region-specific cultural videos across 18 global locales.
Our evaluations reveal that SoTA Video-LLMs struggle significantly, performing substantially below human-level accuracy.
arXiv Detail & Related papers (2026-01-15T18:15:06Z)
- ChineseVideoBench: Benchmarking Multi-modal Large Models for Chinese Video Question Answering [41.26674080292025]
ChineseVideoBench is a benchmark for evaluating Multimodal Large Language Models (MLLMs) in Chinese Video Question Answering.
It comprises 8 main classes and 12 sub-classes, encompassing tasks that demand both deep video understanding and nuanced Chinese linguistic and cultural awareness.
Gemini 2.5 Pro achieves the highest performance with an overall score of 77.9%, while InternVL-38B emerges as the most competitive open-source model.
arXiv Detail & Related papers (2025-11-23T10:56:22Z)
- IndicVisionBench: Benchmarking Cultural and Multilingual Understanding in VLMs [2.697578491761838]
IndicVisionBench is the first large-scale benchmark centered on the Indian subcontinent.
Our benchmark spans 3 multimodal tasks, including Optical Character Recognition (OCR), Multimodal Machine Translation (MMT), and Visual Question Answering (VQA).
In addition, we release a paired parallel corpus of annotations across 10 Indic languages, creating a unique resource for analyzing cultural and linguistic biases in VLMs.
arXiv Detail & Related papers (2025-11-06T18:01:22Z)
- Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm [73.4888880112019]
"Thinking with Video" paradigm bridges visual and textual reasoning in a unified temporal framework.<n>Sora-2 is established as a capable reasoner on vision-centric tasks.<n>On text-centric tasks, Sora-2 achieves 92% accuracy on MATH, and 75.53% accuracy on MMMU.
arXiv Detail & Related papers (2025-11-06T17:25:23Z)
- TCC-Bench: Benchmarking the Traditional Chinese Culture Understanding Capabilities of MLLMs [13.069833806549914]
We propose the Traditional Chinese Culture understanding Benchmark (TCC-Bench) for assessing the understanding of traditional Chinese culture.
TCC-Bench comprises culturally rich and visually diverse data, incorporating images from museum artifacts, everyday life scenes, comics, and other culturally significant contexts.
We adopt a semi-automated pipeline that utilizes GPT-4o in text-only mode to generate candidate questions, followed by human curation to ensure data quality and avoid potential data leakage.
arXiv Detail & Related papers (2025-05-16T14:10:41Z)
- All-in-one: Understanding and Generation in Multimodal Reasoning with the MAIA Benchmark [74.4821011648997]
MAIA is a benchmark designed for fine-grained investigation of the reasoning abilities of visual language models on videos.
It evaluates Vision Language Models (VLMs) on two aligned tasks.
It considers twelve reasoning categories that aim to disentangle language and vision relations.
arXiv Detail & Related papers (2025-02-24T09:25:51Z)
- WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines [74.25764182510295]
Vision Language Models (VLMs) often struggle with culture-specific knowledge, particularly in languages other than English.
We introduce WorldCuisines, a massive-scale benchmark for multilingual and multicultural, visually grounded language understanding.
This benchmark includes a visual question answering (VQA) dataset with text-image pairs across 30 languages and dialects, spanning 9 language families and featuring over 1 million data points.
arXiv Detail & Related papers (2024-10-16T16:11:49Z)
- Comparing diversity, negativity, and stereotypes in Chinese-language AI technologies: an investigation of Baidu, Ernie and Qwen [1.3354439722832292]
We study Chinese-based tools by investigating social biases embedded in the major Chinese search engine, Baidu.
We collect over 30k views encoded in the aforementioned tools by prompting them for candidate words describing such groups.
We find that language models exhibit a larger variety of embedded views compared to the search engine, although Baidu and Qwen generate negative content more often than Ernie.
arXiv Detail & Related papers (2024-08-28T10:51:18Z)
- CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark [68.21939124278065]
CVQA is a culturally-diverse multilingual Visual Question Answering benchmark designed to cover a rich set of languages and cultures.
CVQA includes culturally-driven images and questions from across 30 countries on four continents, covering 31 languages with 13 scripts, providing a total of 10k questions.
We benchmark several Multimodal Large Language Models (MLLMs) on CVQA, and show that the dataset is challenging for the current state-of-the-art models.
arXiv Detail & Related papers (2024-06-10T01:59:00Z)
- A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot [67.00455874279383]
We propose verbalizing long videos to generate descriptions in natural language, then performing video-understanding tasks on the generated story as opposed to the original video.
Our method, despite being zero-shot, achieves significantly better results than supervised baselines for video understanding.
To alleviate the lack of story-understanding benchmarks, we publicly release the first dataset on a crucial task in computational social science: persuasion strategy identification.
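The verbalize-then-reason recipe described above is essentially a two-stage pipeline: turn the video into a natural-language story, then run text-only reasoning over that story. The following is a minimal sketch of that flow; `caption_frames` and `answer_from_story` are hypothetical stand-ins for a captioning model and a text LLM, not the paper's actual components.

```python
def caption_frames(video_path: str, every_n_seconds: int = 10) -> list[str]:
    """Hypothetical stand-in for a frame/clip captioner sampled across the video."""
    return ["A presenter holds up a product.", "A chart compares prices of two brands."]


def answer_from_story(story: str, question: str) -> str:
    """Hypothetical stand-in for a text-only LLM reasoning over the verbalized story."""
    return "price comparison"


def verbalize_and_answer(video_path: str, question: str) -> str:
    # Stage 1: verbalize the long video into a natural-language story.
    story = " ".join(caption_frames(video_path))
    # Stage 2: answer the question from the story text instead of the raw video.
    return answer_from_story(story, question)


print(verbalize_and_answer("ad.mp4", "What persuasion strategy does the video use?"))
```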
arXiv Detail & Related papers (2023-05-16T19:13:11Z)
- Understanding Chinese Video and Language via Contrastive Multimodal Pre-Training [79.88705563918413]
We propose a novel video-language understanding framework named VICTOR, which stands for VIdeo-language understanding via Contrastive mulTimOdal pRe-training.
VICTOR is trained on a large-scale Chinese video-language dataset, including over 10 million complete videos with corresponding high-quality textual descriptions.
arXiv Detail & Related papers (2021-04-19T15:58:45Z)
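The VICTOR entry above centers on contrastive multimodal pre-training. As a rough, generic illustration of that idea (not VICTOR's actual objective or architecture), the sketch below computes a symmetric InfoNCE-style loss that pulls matched video and text embeddings together and pushes mismatched pairs apart; the random tensors merely stand in for encoder outputs.

```python
import torch
import torch.nn.functional as F


def video_text_contrastive_loss(video_emb: torch.Tensor,
                                text_emb: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss: matched video/text pairs attract, others repel."""
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(video_emb.size(0), device=logits.device)  # i-th video matches i-th text
    loss_v2t = F.cross_entropy(logits, targets)
    loss_t2v = F.cross_entropy(logits.t(), targets)
    return (loss_v2t + loss_t2v) / 2


# Toy usage with random embeddings standing in for video/text encoder outputs.
loss = video_text_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```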
This list is automatically generated from the titles and abstracts of the papers on this site.