Summarize the Past to Predict the Future: Natural Language Descriptions
of Context Boost Multimodal Object Interaction Anticipation
- URL: http://arxiv.org/abs/2301.09209v4
- Date: Sun, 10 Mar 2024 17:21:25 GMT
- Title: Summarize the Past to Predict the Future: Natural Language Descriptions
of Context Boost Multimodal Object Interaction Anticipation
- Authors: Razvan-George Pasca, Alexey Gavryushin, Muhammad Hamza, Yen-Ling Kuo,
Kaichun Mo, Luc Van Gool, Otmar Hilliges, Xi Wang
- Abstract summary: We propose TransFusion, a multimodal transformer-based architecture.
It exploits the representational power of language by summarizing the action context.
Our model enables more efficient end-to-end learning.
- Score: 72.74191015833397
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study object interaction anticipation in egocentric videos. This task
requires an understanding of the spatio-temporal context formed by past actions
on objects, which we term the action context. We propose TransFusion, a multimodal
transformer-based architecture. It exploits the representational power of
language by summarizing the action context. TransFusion leverages pre-trained
image captioning and vision-language models to extract the action context from
past video frames. This action context together with the next video frame is
processed by the multimodal fusion module to forecast the next object
interaction. Our model enables more efficient end-to-end learning. The large
pre-trained language models add common sense and a generalisation capability.
Experiments on Ego4D and EPIC-KITCHENS-100 show the effectiveness of our
multimodal fusion model. They also highlight the benefits of using
language-based context summaries in a task where vision seems to suffice. Our
method outperforms state-of-the-art approaches by 40.4% in relative terms in
overall mAP on the Ego4D test set. We validate the effectiveness of TransFusion
via experiments on EPIC-KITCHENS-100. Video and code are available at
https://eth-ait.github.io/transfusion-proj/.
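The abstract describes a two-stage pipeline: past frames are summarized into a natural-language action context by pre-trained captioning and vision-language models, and a multimodal fusion transformer combines that context with the next frame to forecast the coming object interaction. The snippet below is a minimal PyTorch sketch of such a fusion module; the class name, feature dimensions, layer sizes, and the noun/verb/time-to-contact heads are illustrative assumptions, not the authors' published configuration.
```python
# Minimal sketch of a TransFusion-style multimodal fusion module.
# Assumption: dimensions, heads, and output targets are illustrative only.
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_layers=4,
                 n_nouns=128, n_verbs=64):
        super().__init__()
        # Project language-context tokens (e.g. caption embeddings of past
        # frames) and visual features of the next frame into a shared space.
        self.text_proj = nn.Linear(768, d_model)    # e.g. BERT-sized caption embeddings
        self.vis_proj = nn.Linear(2048, d_model)    # e.g. CNN features of the next frame
        self.type_embed = nn.Embedding(2, d_model)  # 0 = text token, 1 = visual token
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, n_layers)
        # Anticipation heads: next interacted object (noun), action (verb),
        # and time to contact, as in short-term interaction anticipation.
        self.noun_head = nn.Linear(d_model, n_nouns)
        self.verb_head = nn.Linear(d_model, n_verbs)
        self.ttc_head = nn.Linear(d_model, 1)

    def forward(self, caption_tokens, frame_feats):
        # caption_tokens: (B, T_text, 768), frame_feats: (B, T_vis, 2048)
        text = self.text_proj(caption_tokens) + self.type_embed.weight[0]
        vis = self.vis_proj(frame_feats) + self.type_embed.weight[1]
        fused = self.fusion(torch.cat([text, vis], dim=1))
        pooled = fused.mean(dim=1)
        return self.noun_head(pooled), self.verb_head(pooled), self.ttc_head(pooled)

# Example with random stand-ins for caption embeddings and frame features.
model = MultimodalFusion()
nouns, verbs, ttc = model(torch.randn(2, 16, 768), torch.randn(2, 4, 2048))
```
In practice the caption embeddings would come from a frozen pre-trained language encoder and the visual features from a video backbone; only the fusion and prediction heads would need end-to-end training, which is one way to read the abstract's efficiency claim.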
Related papers
- Spatio-Temporal Side Tuning Pre-trained Foundation Models for Video-based Pedestrian Attribute Recognition [58.79807861739438]
Existing pedestrian attribute recognition (PAR) algorithms are mainly developed based on static images.
We propose to understand human attributes using video frames in order to fully exploit temporal information.
arXiv Detail & Related papers (2024-04-27T14:43:32Z)
- Efficient End-to-End Video Question Answering with Pyramidal Multimodal Transformer [13.71165050314854]
We present a new method for end-to-end Video Question Answering (VideoQA).
We achieve this with a pyramidal multimodal transformer (PMT) model, which simply incorporates a learnable word embedding layer.
We demonstrate better or on-par performance with high computational efficiency against state-of-the-art methods on five VideoQA benchmarks.
arXiv Detail & Related papers (2023-02-04T09:14:18Z)
- Holistic Interaction Transformer Network for Action Detection [15.667833703317124]
"HIT" network is a comprehensive bi-modal framework that comprises an RGB stream and a pose stream.
Our method significantly outperforms previous approaches on the J-HMDB, UCF101-24, and MultiSports datasets.
arXiv Detail & Related papers (2022-10-23T10:19:37Z)
- Align and Prompt: Video-and-Language Pre-training with Entity Prompts [111.23364631136339]
Video-and-language pre-training has shown promising improvements on various downstream tasks.
We propose Align and Prompt: an efficient and effective video-and-language pre-training framework with better cross-modal alignment.
Our code and pre-trained models will be released.
arXiv Detail & Related papers (2021-12-17T15:55:53Z)
- With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition [95.99542238790038]
We propose a method that learns to attend to surrounding actions in order to improve recognition performance.
To incorporate the temporal context, we propose a transformer-based multimodal model that ingests video and audio as input modalities.
We test our approach on the EPIC-KITCHENS and EGTEA datasets, reporting state-of-the-art performance.
arXiv Detail & Related papers (2021-11-01T15:27:35Z)
- MERLOT: Multimodal Neural Script Knowledge Models [74.05631672657452]
We introduce MERLOT, a model that learns multimodal script knowledge by watching millions of YouTube videos with transcribed speech.
MERLOT exhibits strong out-of-the-box representations of temporal commonsense, and achieves state-of-the-art performance on 12 different video QA datasets.
On Visual Commonsense Reasoning, MERLOT answers questions correctly with 80.6% accuracy, outperforming state-of-the-art models of similar size by over 3%.
arXiv Detail & Related papers (2021-06-04T17:57:39Z)
- Understanding Chinese Video and Language via Contrastive Multimodal Pre-Training [79.88705563918413]
We propose a novel video-language understanding framework named VICTOR, which stands for VIdeo-language understanding via Contrastive mulTimOdal pRe-training.
VICTOR is trained on a large-scale Chinese video-language dataset, including over 10 million complete videos with corresponding high-quality textual descriptions.
arXiv Detail & Related papers (2021-04-19T15:58:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the accuracy of the information presented and is not responsible for any consequences of its use.