Fugu-MT paper translation (abstract): Counting Circuits: Mechanistic Interpretability of Visual Reasoning in Large Vision-Language Models

Paper summary: Counting Circuits: Mechanistic Interpretability of Visual Reasoning in Large Vision-Language Models

arxiv url: http://arxiv.org/abs/2603.18523v1
Date: Thu, 19 Mar 2026 06:10:10 GMT
Status: translation complete
Last updated in system: 2026-03-20 17:19:05.977684
Title: Counting Circuits: Mechanistic Interpretability of Visual Reasoning in Large Vision-Language Models
Authors: Liwei Che, Zhiyu Xue, Yihao Quan, Benlin Liu, Zeru Shi, Michelle Hurst, Jacob Feldman, Ruixiang Tang, Ranjay Krishna, Vladimir Pavlovic
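Abstract summary: Counting serves as a powerful test of a Large Vision-Language Model's (LVLM's) reasoning. The results show that LVLMs display human-like counting behavior, with precise performance on small numerosities and noisy estimation for larger quantities. The paper proposes a lightweight intervention strategy that uses simple and abundantly available synthetic images to fine-tune arbitrary pretrained LVLMs exclusively on counting.
Reference score (independently computed attention score): 35.71430064413904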
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Counting serves as a simple but powerful test of a Large Vision-Language Model's (LVLM's) reasoning; it forces the model to identify each individual object and then add them all up. In this study, we investigate how LVLMs implement counting using controlled synthetic and real-world benchmarks, combined with mechanistic analyses. Our results show that LVLMs display a human-like counting behavior, with precise performance on small numerosities and noisy estimation for larger quantities. We introduce two novel interpretability methods, Visual Activation Patching and HeadLens, and use them to uncover a structured "counting circuit" that is largely shared across a variety of visual reasoning tasks. Building on these insights, we propose a lightweight intervention strategy that exploits simple and abundantly available synthetic images to fine-tune arbitrary pretrained LVLMs exclusively on counting. Despite the narrow scope of this fine-tuning, the intervention not only enhances counting accuracy on in-distribution synthetic data, but also yields an average improvement of +8.36% on out-of-distribution counting benchmarks and an average gain of +1.54% on complex, general visual reasoning tasks for Qwen2.5-VL. These findings highlight the central, influential role of counting in visual reasoning and suggest a potential pathway for improving overall visual reasoning capabilities through targeted enhancement of counting mechanisms.

Related papers