Fugu-MT Paper Translation (Abstract): Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object Interaction Anticipation

Paper overview: Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object Interaction Anticipation

arxiv url: http://arxiv.org/abs/2301.09209v4
Date: Sun, 10 Mar 2024 17:21:25 GMT
Status: Translation complete
Last updated in system: 2024-03-13 17:46:36.027862
Title: Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object Interaction Anticipation
Title (reference translation): Summarizing the Past to Predict the Future: Multimodal Object Interaction Anticipation via Natural Language Descriptions
Authors: Razvan-George Pasca, Alexey Gavryushin, Muhammad Hamza, Yen-Ling Kuo, Kaichun Mo, Luc Van Gool, Otmar Hilliges, Xi Wang
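Abstract summary: We propose TransFusion, a multimodal transformer-based architecture. It exploits the representational power of language by summarizing the action context. Our model enables more efficient end-to-end learning.
Reference score (independently computed attention score): 72.74191015833397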
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We study object interaction anticipation in egocentric videos. This task requires an understanding of the spatio-temporal context formed by past actions on objects, coined action context. We propose TransFusion, a multimodal transformer-based architecture. It exploits the representational power of language by summarizing the action context. TransFusion leverages pre-trained image captioning and vision-language models to extract the action context from past video frames. This action context together with the next video frame is processed by the multimodal fusion module to forecast the next object interaction. Our model enables more efficient end-to-end learning. The large pre-trained language models add common sense and a generalisation capability. Experiments on Ego4D and EPIC-KITCHENS-100 show the effectiveness of our multimodal fusion model. They also highlight the benefits of using language-based context summaries in a task where vision seems to suffice. Our method outperforms state-of-the-art approaches by 40.4% in relative terms in overall mAP on the Ego4D test set. We validate the effectiveness of TransFusion via experiments on EPIC-KITCHENS-100. Video and code are available at https://eth-ait.github.io/transfusion-proj/.
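Abstract (reference translation): We study the anticipation of object interactions in egocentric videos. This task requires understanding the spatio-temporal context created by past actions on objects. We propose TransFusion, a multimodal transformer-based architecture. It exploits the representational power of language by summarizing the action context. TransFusion uses pre-trained image captioning and vision-language models to extract the action context from past video frames. This action context and the next video frame are processed by a multimodal fusion module to predict the next object interaction. Our model enables more efficient end-to-end learning. Large pre-trained language models contribute common sense and generalization capability. Experiments on Ego4D and EPIC-KITCHENS-100 demonstrate the effectiveness of the multimodal fusion model. They also highlight the benefit of using language-based context summaries in a task where vision appears to be sufficient. The proposed method improves on state-of-the-art approaches by 40.4% in relative terms in overall mAP on the Ego4D test set. Experiments on EPIC-KITCHENS-100 validate the effectiveness of TransFusion. Video and code are available at https://eth-ait.github.io/transfusion-proj/.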
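
The abstract reports a 40.4% gain "in relative terms" in overall mAP. A minimal sketch of how such a relative figure is computed, assuming the usual definition (new score minus baseline, divided by baseline). The function name and the mAP values below are hypothetical illustrations, not numbers reported in the paper:

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Return the gain of `new` over `baseline` as a percentage of `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (new - baseline) / baseline * 100.0


# Hypothetical mAP values, chosen only to illustrate the arithmetic.
baseline_map = 2.50
new_map = 3.51

gain = relative_improvement(baseline_map, new_map)
print(f"{gain:.1f}% relative improvement")
```

A relative gain differs from an absolute one: in this illustration the absolute difference is about one mAP point, while the relative gain is roughly 40% because the baseline itself is small.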

