Fugu-MT Paper Translation (Abstract): Encoding and Controlling Global Semantics for Long-form Video Question Answering

Paper overview: Encoding and Controlling Global Semantics for Long-form Video Question Answering

arxiv url: http://arxiv.org/abs/2405.19723v3
Date: Sat, 05 Oct 2024 14:02:31 GMT
Status: Translation complete
Last updated in system: 2024-12-02 21:52:44.93808
Title: Encoding and Controlling Global Semantics for Long-form Video Question Answering
Authors: Thong Thanh Nguyen, Zhiyuan Hu, Xiaobao Wu, Cong-Duy T Nguyen, See-Kiong Ng, Anh Tuan Luu
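Abstract summary: We introduce a state space layer (SSL) into a multi-modal Transformer to efficiently integrate the global semantics of the video. Our SSL includes a gating unit that enables control over the flow of global semantics into visual representations. To evaluate long-form videoQA capacity, we construct two new benchmarks, Ego-QA and MAD-QA.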
Reference score (independently computed attention score): 40.129800076300434
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Seeking answers effectively for long videos is essential to build video question answering (videoQA) systems. Previous methods adaptively select frames and regions from long videos to save computations. However, this fails to reason over the whole sequence of video, leading to sub-optimal performance. To address this problem, we introduce a state space layer (SSL) into multi-modal Transformer to efficiently integrate global semantics of the video, which mitigates the video information loss caused by frame and region selection modules. Our SSL includes a gating unit to enable controllability over the flow of global semantics into visual representations. To further enhance the controllability, we introduce a cross-modal compositional congruence (C^3) objective to encourage global semantics aligned with the question. To rigorously evaluate long-form videoQA capacity, we construct two new benchmarks Ego-QA and MAD-QA featuring videos of considerably long length, i.e. 17.5 minutes and 1.9 hours, respectively. Extensive experiments demonstrate the superiority of our framework on these new as well as existing datasets. The code, model, and data have been made available at https://nguyentthong.github.io/Long_form_VideoQA.
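The abstract describes a state space layer whose gating unit controls how much global video semantics flows into the visual representations. The following is a minimal sketch of that general idea only, not the authors' implementation: the diagonal linear recurrence, the residual form, the function name `gated_state_space_layer`, and the parameters `a`, `B`, `C`, and `W_gate` are all assumptions made for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_state_space_layer(frames, a, B, C, W_gate):
    """Hypothetical gated state space layer (illustrative assumption only).

    frames: (T, d) per-frame visual features
    a:      (n,)   diagonal state decay, values in (0, 1)
    B:      (n, d) input projection into the state
    C:      (d, n) readout from the state back to feature space
    W_gate: (d, 2d) gate weights over [frame feature, global summary]
    """
    T, d = frames.shape
    state = np.zeros(a.shape[0])
    out = np.empty((T, d), dtype=float)
    for t in range(T):
        # Linear recurrence: the state accumulates information from all
        # frames seen so far, in time linear in the video length.
        state = a * state + B @ frames[t]
        global_sem = C @ state
        # The gate decides, per feature, how much global context is added.
        gate = sigmoid(W_gate @ np.concatenate([frames[t], global_sem]))
        out[t] = frames[t] + gate * global_sem
    return out


rng = np.random.default_rng(0)
T, d, n = 6, 4, 3
frames = rng.normal(size=(T, d))
a = np.full(n, 0.9)
B = rng.normal(size=(n, d))
C = rng.normal(size=(d, n))
W_gate = rng.normal(size=(d, 2 * d))
enriched = gated_state_space_layer(frames, a, B, C, W_gate)
print(enriched.shape)
```

In this sketch a gate value near 0 leaves a frame feature almost unchanged, while a value near 1 adds the full global summary; a trained model would learn `W_gate` rather than sample it at random.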

Related papers