Fugu-MT Paper Translation (Abstract): Static and Dynamic Graph Alignment Network for Temporal Video Grounding

Paper overview: Static and Dynamic Graph Alignment Network for Temporal Video Grounding

arxiv url: http://arxiv.org/abs/2605.00684v1
Date: Fri, 01 May 2026 14:16:42 GMT
Status: Translation complete
System update date: 2026-05-04 17:43:28.98232
Title: Static and Dynamic Graph Alignment Network for Temporal Video Grounding
Authors: Zhanjie Hu, Bolin Zhang, Jianhua Wang, Jianbo Zheng, Chenchen Yan, Takahiro Komamizu, Ichiro Ide, Jiangbo Qian
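Abstract summary: Temporal Video Grounding (TVG) aims to localize temporal moments in an untrimmed video that semantically correspond to a given natural language query. Graph Convolutional Networks (GCNs) have been widely adopted in TVG to model temporal relations among video clips.
Reference score (independently computed attention score): 17.14274381541407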
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Temporal Video Grounding (TVG) aims to localize temporal moments in an untrimmed video that semantically correspond to given natural language queries. Recently, Graph Convolutional Networks (GCNs) have been widely adopted in TVG to model temporal relations among video clips and enhance contextual reasoning by constructing clip-level graphs. Despite their effectiveness, existing GCN-based TVG methods encounter three critical bottlenecks: 1) most methods construct graph nodes using either static or dynamic features alone, resulting in incomplete visual representation and overlooking complementary semantics; 2) most methods construct temporal graphs in a query-agnostic manner, leading to inefficient feature interaction within the temporal graph representation; and 3) most methods suffer from single-granularity semantic matching, while direct training on the complex temporal localization task may lead to slow convergence and suboptimal precision. To address these challenges, we propose the Static and Dynamic Graph Alignment Network (SDGAN). First, SDGAN jointly exploits static and dynamic visual features to construct two complementary temporal graphs and performs Position-wise Nodes Alignment, enabling more expressive and robust visual representation. Second, SDGAN introduces Query-Clip Contrastive Learning and Adaptive Graph Modeling to explicitly align visual clips with their corresponding textual queries, yielding query-aware visual representations. Third, SDGAN incorporates multi-granularity temporal proposals within a Progressive Easy-to-Hard Training Strategy, effectively bridging coarse-grained semantic localization and fine-grained temporal boundary refinement. Extensive experiments on three benchmark datasets demonstrate that SDGAN achieves superior performance across complex TVG scenarios. Code and datasets are available at https://github.com/ZhanJieHu/SDGAN.
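
The clip-level graph idea mentioned in the abstract can be illustrated with a minimal, generic sketch. This is not SDGAN's implementation: it only shows one plain graph-convolution step, ReLU(A_norm X W), over a temporal graph in which each clip is linked to its neighbours. The function names, the `window` parameter, the toy features, and the identity weight matrix are all assumptions made for illustration.

```python
import math

def temporal_adjacency(num_clips, window=1):
    """Adjacency with self-loops: clip i is linked to clips within `window` steps."""
    return [[1.0 if abs(i - j) <= window else 0.0 for j in range(num_clips)]
            for i in range(num_clips)]

def normalize(adj):
    """Symmetric normalisation D^{-1/2} A D^{-1/2} (degrees are >= 1 due to self-loops)."""
    deg = [sum(row) for row in adj]
    n = len(adj)
    return [[adj[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def gcn_layer(features, adj_norm, weight):
    """One graph convolution: ReLU(A_norm @ X @ W)."""
    hidden = matmul(matmul(adj_norm, features), weight)
    return [[max(0.0, v) for v in row] for row in hidden]

# Toy input (hypothetical): four clips with 2-dimensional features.
clips = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
weight = [[1.0, 0.0], [0.0, 1.0]]  # identity weights, so only neighbour mixing is visible
adj_norm = normalize(temporal_adjacency(len(clips), window=1))
out = gcn_layer(clips, adj_norm, weight)
```

Each output row mixes a clip's features with those of its temporal neighbours, which is the sense in which such graphs add context. A query-aware variant, as the abstract describes, would make the edge weights depend on the text query instead of a fixed window.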


Related papers