Fugu-MT Paper Translation (Summary): V-DyKnow: A Dynamic Benchmark for Time-Sensitive Knowledge in Vision Language Models

Paper summary: V-DyKnow: A Dynamic Benchmark for Time-Sensitive Knowledge in Vision Language Models

arxiv url: http://arxiv.org/abs/2603.16581v1
Date: Tue, 17 Mar 2026 14:33:08 GMT
Status: Translation complete
Last updated in system: 2026-03-18 17:42:07.340552
Title: V-DyKnow: A Dynamic Benchmark for Time-Sensitive Knowledge in Vision Language Models
Title (reference translation): V-DyKnow: A dynamic benchmark for time-perceptive knowledge in vision-language models
Authors: Seyed Mahed Mousavi, Christian Moiola, Massimo Rizzoli, Simone Alghisi, Giuseppe Riccardi
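Abstract summary: Real-world facts are intrinsically time-sensitive and subject to erratic and periodic changes. V-DyKnow is a benchmark for evaluating time-sensitive factual knowledge in vision-language models.
Reference score (independently computed attention measure): 1.424507155580441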
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Vision-Language Models (VLMs) are trained on data snapshots of documents, including images and texts. Their training data and evaluation benchmarks are typically static, implicitly treating factual knowledge as time-invariant. However, real-world facts are intrinsically time-sensitive and subject to erratic and periodic changes, causing model predictions to become outdated. We present V-DyKnow, a Visual Dynamic Knowledge benchmark for evaluating time-sensitive factual knowledge in VLMs. Using V-DyKnow, we benchmark closed- and open-source VLMs and analyze a) the reliability (correctness and consistency) of model responses across modalities and input perturbations; b) the efficacy of knowledge editing and multi-modal RAG methods for knowledge updates across modalities; and c) the sources of outdated predictions, through data and mechanistic analysis. Our results show that VLMs frequently output outdated facts, reflecting outdated snapshots used in the (pre-)training phase. Factual reliability degrades from textual to visual stimuli, even when entities are correctly recognized. Besides, existing alignment approaches fail to consistently update the models' knowledge across modalities. Together, these findings highlight fundamental limitations in how current VLMs acquire and update time-sensitive knowledge across modalities. We release the benchmark, code, and evaluation data.
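Abstract (reference translation): Vision-Language Models (VLMs) are trained on data snapshots of documents that include images and text. Their training data and evaluation benchmarks are usually static and implicitly treat factual knowledge as time-invariant. However, real-world facts are intrinsically time-sensitive and undergo erratic and periodic changes, so model predictions become outdated. This paper introduces V-DyKnow, a Visual Dynamic Knowledge benchmark for evaluating time-sensitive factual knowledge in VLMs. Using V-DyKnow, we benchmark closed- and open-source VLMs and analyze a) the reliability (correctness and consistency) of model responses across modalities and input perturbations; b) the effectiveness of knowledge editing and multi-modal RAG methods for knowledge updates across modalities; and c) the sources of outdated predictions, through data and mechanistic analysis. The results show that VLMs frequently output outdated facts, reflecting the outdated snapshots used in the (pre-)training phase. Factual reliability degrades from textual to visual stimuli, even when entities are correctly recognized. In addition, existing alignment approaches fail to update the models' knowledge consistently across modalities. These findings highlight fundamental limitations in how current VLMs acquire and update time-sensitive knowledge across modalities. We release the benchmark, code, and evaluation data.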


List of related papers