Fugu-MT paper translation (abstract): Chronological Blindness: Benchmarking Temporal Reasoning in Vision-Language Models with CHRONOSIGHT

Paper overview: Chronological Blindness: Benchmarking Temporal Reasoning in Vision-Language Models with CHRONOSIGHT

arxiv url: http://arxiv.org/abs/2606.16334v1
Date: Mon, 15 Jun 2026 07:38:27 GMT
Status: translation complete
Last updated in system: 2026-06-16 16:21:34.15433
Title: Chronological Blindness: Benchmarking Temporal Reasoning in Vision-Language Models with CHRONOSIGHT
Title (reference translation): Chronological Blindness: Benchmarking Temporal Reasoning in Vision-Language Models with CHRONOSIGHT
Authors: Parthaw Goswami, Jaynto Goswami Deep
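Abstract summary: We introduce CHRONOSIGHT, a benchmark that evaluates five dimensions of visual temporal reasoning. The benchmark comprises 1,000 items across eight process families spanning timescales from minutes to millennia. We evaluate eight open-source VLMs (500M to 19B parameters) under two prompting regimes and collect human performance baselines.
Reference score (independently computed attention score): 0.0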
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Human perception of visual scenes is inherently temporal. We instinctively recognise whether a fruit is ripening or rotting, whether construction is progressing or being demolished, and approximately how much time separates two photographs of the same subject. Whether large vision-language models (VLMs) share this competence remains an open and practically important question. We introduce CHRONOSIGHT, a rigorously controlled benchmark evaluating five dimensions of visual temporal reasoning: CHRONORANK (chronological ordering of image sequences), CHRONOLOCATE (ordinal stage localisation from a single image), CHRONODELTA (estimation of time elapsed between two images on a logarithmic scale), CHRONOREVERSE (detection of temporally reversed sequences), and CHRONOODD (identification of a temporal outlier within a set). The benchmark comprises 1,000 items across eight process families (biological growth, food transformation, physical weathering, construction, environmental change, human ageing, astronomical phenomena, and urban dynamics) spanning timescales from minutes to millennia. We evaluate eight open-source VLMs (500M to 19B parameters) under two prompting regimes and collect human performance baselines. Human performance averages 0.89 across tasks; the best open model (Qwen2.5-VL-7B) reaches 0.40 under direct prompting, a gap we term chronological blindness. Lightweight LoRA fine-tuning on 151 examples raises CHRONODELTA accuracy from near-zero to 0.43, transferring zero-shot to related tasks (CHRONOODD: 0.37; CHRONOREVERSE: 0.64), suggesting the bottleneck is partly instruction following rather than visual perception. Benchmark, code, and predictions will be released upon acceptance.
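Abstract (reference translation): Human perception of visual scenes is inherently temporal. We instinctively recognise whether a fruit is ripening or rotting, whether construction is progressing or being demolished, and roughly how much time separates two photographs of the same subject. Whether large vision-language models (VLMs) share this ability remains an open and practically important question. We introduce CHRONOSIGHT, a rigorously controlled benchmark that evaluates five dimensions of visual temporal reasoning: CHRONORANK (chronological ordering of image sequences), CHRONOLOCATE (ordinal stage localisation from a single image), CHRONODELTA (estimation of the time elapsed between two images on a logarithmic scale), CHRONOREVERSE (detection of temporally reversed sequences), and CHRONOODD (identification of a temporal outlier within a set). The benchmark consists of 1,000 items across eight process families (biological growth, food transformation, physical weathering, construction, environmental change, human ageing, astronomical phenomena, and urban dynamics) spanning timescales from minutes to millennia. We evaluate eight open-source VLMs (500M to 19B parameters) under two prompting regimes and collect human performance baselines. Human performance averages 0.89 across tasks, while the best open model (Qwen2.5-VL-7B) reaches 0.40 under direct prompting, a gap we term chronological blindness. Lightweight LoRA fine-tuning on 151 examples raises CHRONODELTA accuracy from near zero to 0.43 and transfers zero-shot to related tasks (CHRONOODD: 0.37; CHRONOREVERSE: 0.64), suggesting that the bottleneck is partly instruction following rather than visual perception. The benchmark, code, and predictions will be released upon acceptance.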
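
The abstract says CHRONODELTA estimates elapsed time "on a logarithmic scale" but does not give the scoring rule. As a rough illustration only, the sketch below scores a predicted elapsed time by its distance from the true value in orders of magnitude. The function names, the use of seconds as the unit, and the 0.5-decade tolerance are all assumptions made for this example and are not taken from the paper.

```python
import math


def log_error(pred_seconds: float, true_seconds: float) -> float:
    """Absolute error between two elapsed times, in base-10 orders of magnitude."""
    if pred_seconds <= 0 or true_seconds <= 0:
        raise ValueError("elapsed times must be positive")
    return abs(math.log10(pred_seconds) - math.log10(true_seconds))


def within_tolerance(pred_seconds: float, true_seconds: float, tol: float = 0.5) -> bool:
    """True if the prediction is within `tol` decades of the truth.

    The 0.5-decade default is a hypothetical choice, not the benchmark's.
    """
    return log_error(pred_seconds, true_seconds) <= tol


def accuracy(pairs, tol: float = 0.5) -> float:
    """Fraction of (predicted, true) pairs that fall within the tolerance."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (predicted, true) pair")
    hits = sum(within_tolerance(p, t, tol) for p, t in pairs)
    return hits / len(pairs)


if __name__ == "__main__":
    # Two hours vs. one hour is close on a log scale; one day vs. one hour is not.
    examples = [(7200.0, 3600.0), (86400.0, 3600.0)]
    print(accuracy(examples))
```

Scoring in log space treats a factor-of-two error the same whether the true interval is minutes or centuries, which is one plausible reason to use a logarithmic scale for timescales spanning minutes to millennia.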
