Fugu-MT Paper Translation (Summary): Are Video Reasoning Models Ready to Go Outside?

Paper summary: Are Video Reasoning Models Ready to Go Outside?

arxiv url: http://arxiv.org/abs/2603.10652v1
Date: Wed, 11 Mar 2026 11:10:52 GMT
Status: Translation complete
Last updated in system: 2026-03-12 16:22:32.909781
Title: Are Video Reasoning Models Ready to Go Outside?
Title (reference translation): Are video reasoning models ready to go outside?
Authors: Yangfan He, Changgyu Boo, Jaehong Yoon,
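Abstract summary: ROVA is a training framework that improves robustness under spatio-temporal corruptions. We also introduce PVRBench, a new benchmark that injects real-world perturbations into embodied video datasets. ROVA effectively mitigates performance degradation, boosting relative accuracy by at least 24% and reasoning by over 9% compared with baseline models.
Reference score (independently computed attention score): 24.177016151528637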
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In real-world deployment, vision-language models often encounter disturbances such as weather, occlusion, and camera motion. Under such conditions, their understanding and reasoning degrade substantially, revealing a gap between clean, controlled (i.e., unperturbed) evaluation settings and real-world robustness. To address this limitation, we propose ROVA, a novel training framework that improves robustness by modeling a robustness-aware consistency reward under spatio-temporal corruptions. ROVA introduces a difficulty-aware online training strategy that prioritizes informative samples based on the model's evolving capability. Specifically, it continuously re-estimates sample difficulty via self-reflective evaluation, enabling adaptive training with a robustness-aware consistency reward. We also introduce PVRBench, a new benchmark that injects real-world perturbations into embodied video datasets to assess both accuracy and reasoning quality under realistic disturbances. We evaluate ROVA and baselines on PVRBench, UrbanVideo, and VisBench, where open-source and proprietary models suffer up to 35% and 28% drops in accuracy and reasoning under realistic perturbations. ROVA effectively mitigates performance degradation, boosting relative accuracy by at least 24% and reasoning by over 9% compared with baseline models (QWen2.5/3-VL, InternVL2.5, Embodied-R). These gains transfer to clean standard benchmarks, yielding consistent improvements.
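Abstract (reference translation): In real-world deployment, vision-language models often encounter disturbances such as weather, occlusion, and camera motion. Under these conditions their understanding and reasoning degrade substantially, exposing a gap between clean, controlled (unperturbed) evaluation settings and real-world robustness. To address this limitation, we propose ROVA, a new training framework that improves robustness by modeling a robustness-aware consistency reward under spatio-temporal corruptions. ROVA introduces a difficulty-aware online training strategy that prioritizes informative samples based on the model's evolving capability. Specifically, it continuously re-estimates sample difficulty through self-reflective evaluation, enabling adaptive training with the robustness-aware consistency reward. We also introduce PVRBench, a new benchmark that injects real-world perturbations into embodied video datasets to assess both accuracy and reasoning quality under realistic disturbances. We evaluate ROVA and baselines on PVRBench, UrbanVideo, and VisBench, where open-source and proprietary models suffer drops of up to 35% and 28% in accuracy and reasoning under realistic perturbations. ROVA effectively mitigates this degradation, boosting relative accuracy by at least 24% and reasoning by over 9% compared with baseline models (QWen2.5/3-VL, InternVL2.5, Embodied-R). These gains transfer to clean standard benchmarks, yielding consistent improvements.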
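
The abstract describes two mechanisms only at a high level: a consistency reward computed across clean and perturbed inputs, and a difficulty-aware strategy that re-estimates sample difficulty during training. The sketch below is one possible way to picture how such pieces could fit together; it is not the authors' implementation. Every function name, the 1.0/0.5/0.0 reward levels, the moving-average update, and the mid-difficulty weighting are assumptions made purely for illustration.

```python
import random
from typing import Dict, List


def consistency_reward(clean_answer: str, perturbed_answer: str, gold: str) -> float:
    """Toy reward (assumed form): full credit only when the answer on the
    perturbed video is correct AND matches the answer on the clean video."""
    correct = perturbed_answer == gold
    consistent = perturbed_answer == clean_answer
    if correct and consistent:
        return 1.0
    if correct:
        return 0.5  # correct under perturbation, but the clean answer differed
    return 0.0


def update_difficulty(old: float, reward: float, momentum: float = 0.9) -> float:
    """Re-estimate difficulty as a moving average of failure (1 - reward).
    Hypothetical stand-in for the paper's self-reflective evaluation."""
    return momentum * old + (1.0 - momentum) * (1.0 - reward)


def sampling_weight(difficulty: float, floor: float = 0.05) -> float:
    """Assumed heuristic: favor samples of intermediate difficulty
    (weight peaks at difficulty 0.5); the floor keeps every sample reachable."""
    return floor + 4.0 * difficulty * (1.0 - difficulty)


def pick_batch(difficulties: Dict[str, float], k: int, rng: random.Random) -> List[str]:
    """Draw k sample ids (with replacement) in proportion to their weights."""
    ids = list(difficulties)
    weights = [sampling_weight(difficulties[i]) for i in ids]
    return rng.choices(ids, weights=weights, k=k)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical difficulty estimates in [0, 1] for three video samples.
    difficulties = {"vid_easy": 0.05, "vid_mid": 0.5, "vid_hard": 0.95}
    batch = pick_batch(difficulties, k=4, rng=rng)
    for sample_id in batch:
        # Placeholder answers; a real loop would query the model twice.
        r = consistency_reward("A", "A", "A")
        difficulties[sample_id] = update_difficulty(difficulties[sample_id], r)
    print(batch, difficulties)
```

In this toy version, samples the model always solves or always fails receive low sampling weight, so training effort concentrates where the reward signal is most informative. Whether ROVA uses a comparable weighting is not stated in the abstract.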

Related papers