Fugu-MT paper translation (abstract): VideoLucy: Deep Memory Backtracking for Long Video Understanding

Paper abstract: VideoLucy: Deep Memory Backtracking for Long Video Understanding

arxiv url: http://arxiv.org/abs/2510.12422v1
Date: Tue, 14 Oct 2025 11:59:19 GMT
Status: Translation complete
Last updated in system: 2025-10-15 19:02:32.303453
Title: VideoLucy: Deep Memory Backtracking for Long Video Understanding
Title (reference translation): VideoLucy: Deep Memory Backtracking for Long Video Understanding
Authors: Jialong Zuo, Yongtai Deng, Lingdong Kong, Jingkang Yang, Rui Jin, Yiwei Zhang, Nong Sang, Liang Pan, Ziwei Liu, Changxin Gao
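Abstract summary: We propose VideoLucy, a deep memory backtracking framework for long video understanding. Inspired by the coarse-to-fine human recollection process, VideoLucy uses a hierarchical memory structure with progressively finer granularity. VideoLucy significantly outperforms state-of-the-art methods on multiple long video understanding benchmarks.
Reference score (independently computed attention score): 102.37736560263649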
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Recent studies have shown that agent-based systems leveraging large language models (LLMs) for key information retrieval and integration have emerged as a promising approach for long video understanding. However, these systems face two major challenges. First, they typically perform modeling and reasoning on individual frames, struggling to capture the temporal context of consecutive frames. Second, to reduce the cost of dense frame-level captioning, they adopt sparse frame sampling, which risks discarding crucial information. To overcome these limitations, we propose VideoLucy, a deep memory backtracking framework for long video understanding. Inspired by the human recollection process from coarse to fine, VideoLucy employs a hierarchical memory structure with progressive granularity. This structure explicitly defines the detail level and temporal scope of memory at different hierarchical depths. Through an agent-based iterative backtracking mechanism, VideoLucy systematically mines video-wide, question-relevant deep memories until sufficient information is gathered to provide a confident answer. This design enables effective temporal understanding of consecutive frames while preserving critical details. In addition, we introduce EgoMem, a new benchmark for long video understanding. EgoMem is designed to comprehensively evaluate a model's ability to understand complex events that unfold over time and capture fine-grained details in extremely long videos. Extensive experiments demonstrate the superiority of VideoLucy. Built on open-source models, VideoLucy significantly outperforms state-of-the-art methods on multiple long video understanding benchmarks, achieving performance even surpassing the latest proprietary models such as GPT-4o. Our code and dataset will be made publicly available at https://videolucy.github.io
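Abstract (reference translation): Recently, information retrieval and integration by agent-based systems built on large language models (LLMs) has emerged as a promising approach to long video understanding. However, these systems face two major challenges. First, they model and reason over individual frames and struggle to capture the temporal context of consecutive frames. Second, to reduce the cost of dense frame-level captioning, they adopt sparse frame sampling, which risks discarding important information. To overcome these limitations, we propose VideoLucy, a deep memory backtracking framework for long video understanding. Inspired by the coarse-to-fine human recollection process, VideoLucy uses a hierarchical memory structure whose granularity becomes progressively finer. This structure explicitly defines the level of detail and the temporal scope of memory at each hierarchical depth. Through an agent-based iterative backtracking mechanism, VideoLucy systematically mines video-wide, question-relevant deep memories until it has gathered enough information to give a confident answer. This design enables effective temporal understanding of consecutive frames while preserving important details. In addition, we introduce EgoMem, a new benchmark for long video understanding. EgoMem is designed to comprehensively evaluate a model's ability to understand complex events that unfold over time and to capture fine-grained details in extremely long videos. Extensive experiments demonstrate the superiority of VideoLucy. Built on open-source models, VideoLucy significantly outperforms state-of-the-art methods on multiple long video understanding benchmarks and even surpasses the latest proprietary models such as GPT-4o. Our code and dataset will be released at https://videolucy.github.io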
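
Illustrative sketch (not from the paper): the abstract describes a hierarchical memory whose detail level and temporal scope depend on depth, searched iteratively until enough information is gathered. The toy Python below shows only that general coarse-to-fine idea. Every name in it (Memory, describe, split, relevance, coarse_to_fine, EVENTS) is hypothetical. Keyword overlap stands in for the LLM agent, word truncation stands in for captioning, and the greedy single-branch descent is a deliberate simplification that should not be read as the authors' algorithm.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    start: float  # segment start time in seconds
    end: float    # segment end time in seconds
    depth: int    # 0 is the coarsest level
    text: str     # description of the segment at this depth


def describe(events, start, end, depth):
    """Toy captioner (assumption): deeper levels reveal more words per event."""
    words_per_event = 1 + depth
    parts = [" ".join(desc.split()[:words_per_event])
             for t, desc in events if start <= t < end]
    return Memory(start, end, depth, "; ".join(parts))


def split(start, end, parts):
    """Cut [start, end) into equally sized sub-segments."""
    step = (end - start) / parts
    return [(start + i * step, start + (i + 1) * step) for i in range(parts)]


def relevance(memory, question_terms):
    """Toy relevance score (assumption): number of shared keywords."""
    words = set(memory.text.lower().replace(";", " ").split())
    return len(words & question_terms)


def coarse_to_fine(events, duration, question_terms, is_sufficient,
                   max_depth=3, fanout=2):
    """Descend into the most relevant segment until the memory suffices."""
    start, end = 0.0, duration
    trace = []
    for depth in range(max_depth + 1):
        candidates = [describe(events, s, e, depth)
                      for s, e in split(start, end, fanout)]
        best = max(candidates, key=lambda m: relevance(m, question_terms))
        trace.append(best)
        if is_sufficient(best.text):
            return best, trace
        # Narrow the temporal scope; the next level adds detail.
        start, end = best.start, best.end
    return None, trace


# Made-up timeline of (timestamp in seconds, full description) pairs.
EVENTS = [
    (5.0, "cook chops onions on board"),
    (42.0, "cook opens red fridge door"),
    (95.0, "dog sleeps near sofa"),
]

answer, trace = coarse_to_fine(
    EVENTS,
    duration=120.0,
    question_terms={"cook", "opens", "fridge"},
    is_sufficient=lambda text: "fridge" in text.split(),
)
print(answer)
for memory in trace:
    print(memory.depth, memory.start, memory.end, repr(memory.text))
```

In this toy, each step halves the time window and lengthens the description, so the coarse levels stay cheap and detail is spent only where the question points. A fuller version would keep several candidate segments per level rather than one and would replace the keyword checks with model calls.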
