Fugu-MT Paper Translation (Abstract): VideoNorms: Benchmarking Cultural Awareness of Video Language Models

Paper summary: VideoNorms: Benchmarking Cultural Awareness of Video Language Models

arxiv url: http://arxiv.org/abs/2510.08543v1
Date: Thu, 09 Oct 2025 17:54:55 GMT
Status: Translation complete
Last updated in system: 2025-10-10 17:54:15.289857
Title: VideoNorms: Benchmarking Cultural Awareness of Video Language Models
Title (reference translation): VideoNorms: Benchmarking the Cultural Awareness of Video Language Models
Authors: Nikhil Reddy Varimalla, Yunfei Xu, Arkadiy Saakyan, Meng Fan Wang, Smaranda Muresan
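Abstract summary: We introduce VideoNorms, a benchmark of over 1000 (video clip, norm) pairs from US and Chinese cultures. We use a human-AI collaboration framework in which a teacher model using theoretically grounded prompting provides candidate annotations. We benchmark a variety of open-weight VideoLLMs on the new dataset and highlight several common trends.
Reference score (independently computed attention metric): 19.29068943180369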
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As Video Large Language Models (VideoLLMs) are deployed globally, they require understanding of and grounding in the relevant cultural background. To properly assess these models' cultural awareness, adequate benchmarks are needed. We introduce VideoNorms, a benchmark of over 1000 (video clip, norm) pairs from US and Chinese cultures annotated with socio-cultural norms grounded in speech act theory, norm adherence and violation labels, and verbal and non-verbal evidence. To build VideoNorms, we use a human-AI collaboration framework, where a teacher model using theoretically grounded prompting provides candidate annotations and a set of trained human experts validates and corrects the annotations. We benchmark a variety of open-weight VideoLLMs on the new dataset, which highlights several common trends: 1) models perform worse on norm violation than on adherence; 2) models perform worse on Chinese culture than on US culture; 3) models have more difficulty providing non-verbal evidence than verbal evidence for the norm adherence/violation label, and struggle to identify the exact norm corresponding to a speech act; and 4) unlike humans, models perform worse in formal, non-humorous contexts. Our findings emphasize the need for culturally grounded video language model training - a gap our benchmark and framework begin to address.
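Abstract (reference translation): Because Video Large Language Models (VideoLLMs) are deployed worldwide, they need to understand the relevant cultural background. Properly evaluating these models' cultural awareness requires adequate benchmarks. We introduce VideoNorms, a benchmark of over 1000 (video clip, norm) pairs from US and Chinese cultures. To build VideoNorms, we use a human-AI collaboration framework in which a teacher model using theoretically grounded prompting provides candidate annotations and a set of trained human experts validates and corrects them. We benchmark a variety of open-weight VideoLLMs on the new dataset and highlight several common trends: 1) models perform worse on norm violation than on adherence; 2) models perform worse on Chinese culture than on US culture; 3) models have more difficulty providing non-verbal evidence than verbal evidence for the norm adherence/violation label, and struggle to identify the exact norm corresponding to a speech act; and 4) unlike humans, models perform worse in formal, non-humorous contexts. Our findings emphasize the need for culturally grounded video language model training.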

Related papers