Fugu-MT Paper Translation (Abstract): OSCBench: Benchmarking Object State Change in Text-to-Video Generation

Paper summary: OSCBench: Benchmarking Object State Change in Text-to-Video Generation

arxiv url: http://arxiv.org/abs/2603.11698v1
Date: Thu, 12 Mar 2026 09:08:01 GMT
Status: Translation complete
Last updated in system: 2026-03-13 14:46:25.983862
Title: OSCBench: Benchmarking Object State Change in Text-to-Video Generation
Title (reference translation): OSCBench: A Benchmark for Object State Change in Text-to-Video Generation
Authors: Xianjing Han, Bin Zhu, Shiqi Hu, Franklin Mingzhe Li, Patrick Carrington, Roger Zimmermann, Jingjing Chen
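Abstract summary: We introduce OSCBench, a benchmark specifically designed to assess object state change (OSC) performance in text-to-video (T2V) generation models. OSC refers to the change in an object's state caused by an action, such as peeling a potato or slicing a lemon. We evaluate open-source and proprietary T2V models using both a human user study and multimodal large language model (MLLM)-based automatic evaluation.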
Reference score (independently computed attention score): 47.72341406051041
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Text-to-video (T2V) generation models have made rapid progress in producing visually high-quality and temporally coherent videos. However, existing benchmarks primarily focus on perceptual quality, text-video alignment, or physical plausibility, leaving a critical aspect of action understanding largely unexplored: object state change (OSC) explicitly specified in the text prompt. OSC refers to the transformation of an object's state induced by an action, such as peeling a potato or slicing a lemon. In this paper, we introduce OSCBench, a benchmark specifically designed to assess OSC performance in T2V models. OSCBench is constructed from instructional cooking data and systematically organizes action-object interactions into regular, novel, and compositional scenarios to probe both in-distribution performance and generalization. We evaluate six representative open-source and proprietary T2V models using both a human user study and multimodal large language model (MLLM)-based automatic evaluation. Our results show that, despite strong performance on semantic and scene alignment, current T2V models consistently struggle with accurate and temporally consistent object state changes, especially in novel and compositional settings. These findings position OSC as a key bottleneck in text-to-video generation and establish OSCBench as a diagnostic benchmark for advancing state-aware video generation models.
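Abstract (reference translation): Text-to-video (T2V) generation models have advanced rapidly in producing visually high-quality and temporally coherent videos. However, existing benchmarks focus mainly on perceptual quality, text-video alignment, or physical plausibility, leaving a critical aspect of action understanding largely unexplored: object state change (OSC) explicitly specified in the text prompt. OSC refers to the change in an object's state caused by an action, such as peeling a potato or slicing a lemon. This paper introduces OSCBench, a benchmark for assessing OSC performance in T2V models. OSCBench is built from instructional cooking data and systematically organizes action-object interactions into regular, novel, and compositional scenarios to probe both in-distribution performance and generalization. We evaluate six representative open-source and proprietary T2V models using both a human user study and multimodal large language model (MLLM)-based automatic evaluation. The results show that, despite strong performance on semantic and scene alignment, current T2V models consistently struggle to produce accurate and temporally consistent object state changes, especially in novel and compositional settings. These findings position OSC as a key bottleneck in text-to-video generation and establish OSCBench as a diagnostic benchmark for advancing state-aware video generation models.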

Related papers