Fugu-MT paper translation (abstract): Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object Interaction

Paper overview: Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object Interaction

arxiv url: http://arxiv.org/abs/2301.09209v1
Date: Sun, 22 Jan 2023 21:30:12 GMT
Status: Translation complete
Last updated in system: 2023-01-24 14:27:28.748116
Title: Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object Interaction
Title (reference translation): Summarizing the Past to Predict the Future: Enhancing Multimodal Object Interaction with Natural Language
Authors: Razvan-George Pasca, Alexey Gavryushin, Yen-Ling Kuo, Otmar Hilliges, Xi Wang
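Abstract summary: Successfully predicting future actions and objects requires understanding the temporal context formed by past actions and objects. This paper proposes TransFusion, a multimodal transformer architecture that makes effective use of the expressive power of language. The model enables more efficient end-to-end learning by replacing dense video features with language representations.
Reference score (independently computed attention score): 31.720450334117086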
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We study the task of object interaction anticipation in egocentric videos. Successful prediction of future actions and objects requires an understanding of the spatio-temporal context formed by past actions and object relationships. We propose TransFusion, a multimodal transformer-based architecture, that effectively makes use of the representational power of language by summarizing past actions concisely. TransFusion leverages pre-trained image captioning models and summarizes the caption, focusing on past actions and objects. This action context together with a single input frame is processed by a multimodal fusion module to forecast the next object interactions. Our model enables more efficient end-to-end learning by replacing dense video features with language representations, allowing us to benefit from knowledge encoded in large pre-trained models. Experiments on Ego4D and EPIC-KITCHENS-100 show the effectiveness of our multimodal fusion model and the benefits of using language-based context summaries. Our method outperforms state-of-the-art approaches by 40.4% in overall mAP on the Ego4D test set. We show the generality of TransFusion via experiments on EPIC-KITCHENS-100. Video and code are available at: https://eth-ait.github.io/transfusion-proj/.
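Abstract (reference translation): We study the task of object interaction anticipation in egocentric videos. Successfully predicting future actions and objects requires understanding the spatio-temporal context formed by past actions and object relationships. We propose TransFusion, a multimodal transformer-based architecture that summarizes past actions concisely and thereby makes effective use of the expressive power of language. TransFusion leverages pre-trained image captioning models and summarizes the captions, focusing on past actions and objects. This action context, together with a single input frame, is processed by a multimodal fusion module to predict the next object interactions. By replacing dense video features with language representations, the model enables more efficient end-to-end learning. Experiments on Ego4D and EPIC-KITCHENS-100 show the effectiveness of the multimodal fusion model and the benefits of language-based context summaries. The method outperforms state-of-the-art approaches by 40.4% in overall mAP on the Ego4D test set. Experiments on EPIC-KITCHENS-100 show the generality of TransFusion. Video and code are available at https://eth-ait.github.io/transfusion-proj/.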
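
The data flow described in the abstract (captions of past frames, a concise summary of past actions and objects, then fusion with a single input frame) can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper or its code: the vocabularies, the keyword-matching summarizer, the names `summarize_caption`, `build_action_context`, `make_fusion_input`, and `FusionInput`, and the use of an integer frame ID in place of image data. No captioning model or transformer is involved; the captions are supplied as plain strings.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical toy vocabularies, chosen only for this illustration.
VERBS = {"take", "cut", "put", "open", "wash"}
NOUNS = {"knife", "onion", "board", "drawer", "pan"}


@dataclass
class FusionInput:
    """What a multimodal fusion module would receive in this sketch."""
    action_context: str  # language summary of past actions and objects
    frame_id: int        # stand-in for the single input frame


def summarize_caption(caption: str) -> Optional[Tuple[str, str]]:
    """Reduce one caption to a (verb, noun) pair, or None if either is missing."""
    words = caption.lower().replace(".", "").split()
    verb = next((w for w in words if w in VERBS), None)
    noun = next((w for w in words if w in NOUNS), None)
    if verb is None or noun is None:
        return None
    return (verb, noun)


def build_action_context(captions: List[str]) -> str:
    """Summarize past captions in order, dropping consecutive repeats."""
    pairs: List[Tuple[str, str]] = []
    for caption in captions:
        pair = summarize_caption(caption)
        if pair is not None and (not pairs or pairs[-1] != pair):
            pairs.append(pair)
    return ", ".join(f"{verb} {noun}" for verb, noun in pairs)


def make_fusion_input(captions: List[str], frame_id: int) -> FusionInput:
    """Pair the language context with the single frame to be processed."""
    return FusionInput(build_action_context(captions), frame_id)


if __name__ == "__main__":
    past_captions = [
        "hands open the drawer",
        "hands take a knife",
        "hands take a knife",
        "hands cut the onion.",
    ]
    print(make_fusion_input(past_captions, frame_id=42))
```

The point of the sketch is only that a short text string replaces a sequence of dense per-frame video features as the representation of the past; how the actual model produces and consumes that string is described in the paper itself.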

Related papers