Fugu-MT Paper Translation (Abstract): LensWalk: Agentic Video Understanding by Planning How You See in Videos

Paper overview: LensWalk: Agentic Video Understanding by Planning How You See in Videos

arxiv url: http://arxiv.org/abs/2603.24558v1
Date: Wed, 25 Mar 2026 17:38:54 GMT
Status: Translation complete
Last updated in system: 2026-03-26 21:06:11.412619
Title: LensWalk: Agentic Video Understanding by Planning How You See in Videos
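Title (reference translation): LensWalk: Agentic Video Understanding by Planning How You See in Videos
Authors: Keliang Li, Yansong Li, Hongze Shen, Mengdi Liu, Hong Chang, Shiguang Shan
Abstract summary: We introduce LensWalk, a flexible agentic framework that lets a Large Language Model reasoner actively control its own visual observation. LensWalk establishes a tight reason-plan-observe loop in which the agent dynamically specifies, at each step, the temporal scope and sampling density of the video it observes.
Reference score (independently computed attention score): 45.81048261339695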
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: The dense, temporal nature of video presents a profound challenge for automated analysis. Despite the use of powerful Vision-Language Models, prevailing methods for video understanding are limited by the inherent disconnect between reasoning and perception: they rely on static, pre-processed information and cannot actively seek raw evidence from video as their understanding evolves. To address this, we introduce LensWalk, a flexible agentic framework that empowers a Large Language Model reasoner to control its own visual observation actively. LensWalk establishes a tight reason-plan-observe loop where the agent dynamically specifies, at each step, the temporal scope and sampling density of the video it observes. Using a suite of versatile, Vision-Language Model based tools parameterized by these specifications, the agent can perform broad scans for cues, focus on specific segments for fact extraction, and stitch evidence from multiple moments for holistic verification. This design allows for progressive, on-demand evidence gathering that directly serves the agent's evolving chain of thought. Without requiring any model fine-tuning, LensWalk delivers substantial, plug-and-play performance gains on multiple model recipes, boosting their accuracy by over 5% on challenging long-video benchmarks like LVBench and Video-MME. Our analysis reveals that enabling an agent to control how it sees is key to unlocking more accurate, robust, and interpretable video reasoning.
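Abstract (reference translation): The dense, temporal nature of video poses a major challenge for automated analysis. Despite the use of powerful Vision-Language Models, prevailing video understanding methods are limited by an inherent disconnect between reasoning and perception. To address this, we introduce LensWalk, a flexible agentic framework that lets a Large Language Model reasoner actively control its own visual observation. LensWalk establishes a tight reason-plan-observe loop in which the agent dynamically specifies, at each step, the temporal scope and sampling density of the video it observes. Using a suite of versatile Vision-Language Model based tools parameterized by these specifications, the agent can scan broadly for cues, focus on specific segments for fact extraction, and stitch together evidence from multiple moments for holistic verification. This design enables progressive, on-demand evidence gathering that directly serves the agent's evolving chain of thought. Without any model fine-tuning, LensWalk delivers plug-and-play performance gains across multiple model recipes, improving accuracy by over 5% on long-video benchmarks such as LVBench and Video-MME. The analysis shows that letting an agent control how it sees is key to more accurate, robust, and interpretable video reasoning.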
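
The reason-plan-observe loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: `ObservationRequest`, `sample_timestamps`, `run_loop`, `toy_plan`, and `toy_observe` are hypothetical names invented here, sampling density is assumed to be a simple frame count, and the planner and observer are plain callables standing in for the LLM reasoner and the Vision-Language Model tools.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ObservationRequest:
    """Hypothetical plan output: which span to look at and how densely."""
    start: float      # temporal scope start, in seconds
    end: float        # temporal scope end, in seconds
    num_frames: int   # sampling density, assumed here to be a frame count


def sample_timestamps(start: float, end: float, num_frames: int) -> List[float]:
    """Return num_frames evenly spaced timestamps (bin midpoints) in [start, end]."""
    if end <= start or num_frames < 1:
        raise ValueError("need end > start and num_frames >= 1")
    step = (end - start) / num_frames
    return [start + (i + 0.5) * step for i in range(num_frames)]


def run_loop(
    question: str,
    duration: float,
    plan: Callable[[str, List[str]], Optional[ObservationRequest]],
    observe: Callable[[List[float], str], str],
    max_steps: int = 5,
) -> List[str]:
    """Alternate planning and observing until the planner stops or max_steps is hit."""
    evidence: List[str] = []
    for _ in range(max_steps):
        request = plan(question, evidence)  # reason + plan (stand-in for the LLM)
        if request is None:                 # planner has enough evidence
            break
        # Clamp the requested scope to the actual video length.
        start = max(0.0, request.start)
        end = min(duration, request.end)
        times = sample_timestamps(start, end, request.num_frames)
        evidence.append(observe(times, question))  # observe (stand-in for a VLM tool)
    return evidence


def toy_plan(question: str, evidence: List[str]) -> Optional[ObservationRequest]:
    """Toy planner: one broad sparse scan, then one focused dense look, then stop."""
    if not evidence:
        return ObservationRequest(0.0, 120.0, 4)
    if len(evidence) == 1:
        return ObservationRequest(30.0, 40.0, 5)
    return None


def toy_observe(times: List[float], question: str) -> str:
    """Toy observer: report what was sampled instead of calling a real model."""
    return f"{len(times)} frames from {times[0]:.1f}s to {times[-1]:.1f}s"


if __name__ == "__main__":
    for line in run_loop("What happens at the door?", 120.0, toy_plan, toy_observe):
        print(line)
```

In this toy run the planner first requests a sparse scan of the whole two-minute clip and then a denser look at a ten-second segment, which mirrors the broad-scan and focused-extraction behaviours the abstract mentions. A real system would replace both callables with model calls.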

Related papers