Fugu-MT Paper Translation (Abstract): UI2V-Bench: An Understanding-based Image-to-video Generation Benchmark

Paper summary: UI2V-Bench: An Understanding-based Image-to-video Generation Benchmark

arxiv url: http://arxiv.org/abs/2509.24427v1
Date: Mon, 29 Sep 2025 08:14:26 GMT
Status: Translation complete
Last updated in system: 2025-09-30 22:32:19.850351
Title: UI2V-Bench: An Understanding-based Image-to-video Generation Benchmark
Title (reference translation): UI2V-Bench: An Understanding-based Image-to-video Generation Benchmark
Authors: Ailing Zhang, Lina Lei, Dehong Kong, Zhixin Wang, Jiaqi Xu, Fenglong Song, Chun-Le Guo, Chang Liu, Fan Li, Jie Chen
Abstract summary: Image-to-Video (I2V) generation has become an important focus in the field of video synthesis. Existing evaluation benchmarks focus mainly on aspects such as video quality and temporal consistency. We propose UI2V-Bench, a novel benchmark for evaluating I2V models with a focus on semantic understanding and reasoning.
Reference score (independently computed attention score): 35.157850129371525
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Generative diffusion models are developing rapidly and attracting increasing attention due to their wide range of applications. Image-to-Video (I2V) generation has become a major focus in the field of video synthesis. However, existing evaluation benchmarks primarily focus on aspects such as video quality and temporal consistency, while largely overlooking the model's ability to understand the semantics of specific subjects in the input image or to ensure that the generated video aligns with physical laws and human commonsense. To address this gap, we propose UI2V-Bench, a novel benchmark for evaluating I2V models with a focus on semantic understanding and reasoning. It introduces four primary evaluation dimensions: spatial understanding, attribute binding, category understanding, and reasoning. To assess these dimensions, we design two evaluation methods based on Multimodal Large Language Models (MLLMs): an instance-level pipeline for fine-grained semantic understanding, and a feedback-based reasoning pipeline that enables step-by-step causal assessment for more accurate evaluation. UI2V-Bench includes approximately 500 carefully constructed text-image pairs and evaluates a range of both open-source and closed-source I2V models across all defined dimensions. We further incorporate human evaluations, which show strong alignment with the proposed MLLM-based metrics. Overall, UI2V-Bench fills a critical gap in I2V evaluation by emphasizing semantic comprehension and reasoning ability, offering a robust framework and dataset to support future research and model development in the field.
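Abstract (reference translation): Generative diffusion models are developing rapidly and drawing growing attention thanks to their broad range of applications. Image-to-Video (I2V) generation has become an important focus in the field of video synthesis. However, existing evaluation benchmarks concentrate mainly on aspects such as video quality and temporal consistency, and largely overlook a model's ability to understand the semantics of specific subjects in the input image or to ensure that the generated video is consistent with physical laws and human common sense. To address this gap, we propose UI2V-Bench, a new benchmark for evaluating I2V models that focuses on semantic understanding and reasoning. It introduces four primary evaluation dimensions: spatial understanding, attribute binding, category understanding, and reasoning. To assess these dimensions, we design two evaluation methods based on Multimodal Large Language Models (MLLMs). UI2V-Bench contains approximately 500 carefully constructed text-image pairs and evaluates a range of both open-source and closed-source I2V models across all defined dimensions. We also incorporate human evaluations, which show strong alignment with the proposed MLLM-based metrics. UI2V-Bench emphasizes semantic understanding and reasoning ability, and provides a robust framework and dataset to support future research and model development.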
