With a Little Help from my Temporal Context: Multimodal Egocentric
Action Recognition
- URL: http://arxiv.org/abs/2111.01024v1
- Date: Mon, 1 Nov 2021 15:27:35 GMT
- Title: With a Little Help from my Temporal Context: Multimodal Egocentric
Action Recognition
- Authors: Evangelos Kazakos, Jaesung Huh, Arsha Nagrani, Andrew Zisserman, Dima
Damen
- Abstract summary: We propose a method that learns to attend to surrounding actions in order to improve recognition performance.
To incorporate the temporal context, we propose a transformer-based multimodal model that ingests video and audio as input modalities.
We test our approach on the EPIC-KITCHENS and EGTEA datasets, reporting state-of-the-art performance.
- Score: 95.99542238790038
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In egocentric videos, actions occur in quick succession. We capitalise on the
action's temporal context and propose a method that learns to attend to
surrounding actions in order to improve recognition performance. To incorporate
the temporal context, we propose a transformer-based multimodal model that
ingests video and audio as input modalities, with an explicit language model
providing action sequence context to enhance the predictions. We test our
approach on the EPIC-KITCHENS and EGTEA datasets, reporting state-of-the-art
performance. Our ablations showcase the advantage of utilising temporal context,
incorporating the audio input modality, and using a language model to rescore
predictions. Code and models at: https://github.com/ekazakos/MTCN.
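The sketch below is not the authors' MTCN implementation; it is a minimal, hedged illustration of the core idea described above: a transformer that attends over a temporal window of per-action video and audio features and classifies the centre action. The feature dimension, window size, class count, summation-based fusion, and all names are illustrative assumptions, and the language-model rescoring step is omitted.

```python
# Minimal sketch of temporal-context multimodal recognition (assumed design,
# not the released MTCN code). Pre-extracted video and audio features for each
# action in a window are encoded jointly; the centre action is classified.
import torch
import torch.nn as nn


class TemporalContextClassifier(nn.Module):
    def __init__(self, feat_dim=1024, d_model=512, num_classes=97,
                 window=9, num_layers=4, num_heads=8):
        super().__init__()
        self.video_proj = nn.Linear(feat_dim, d_model)
        self.audio_proj = nn.Linear(feat_dim, d_model)
        # One learnable position per action slot in the window, shared across modalities.
        self.pos_embed = nn.Parameter(torch.zeros(1, window, d_model))
        # Learnable embeddings distinguishing video tokens from audio tokens.
        self.modality_embed = nn.Parameter(torch.zeros(2, d_model))
        layer = nn.TransformerEncoderLayer(d_model, num_heads,
                                           dim_feedforward=2048, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.classifier = nn.Linear(d_model, num_classes)
        self.window = window

    def forward(self, video_feats, audio_feats):
        # video_feats, audio_feats: (batch, window, feat_dim), one feature per
        # action in the temporal window centred on the action to classify.
        v = self.video_proj(video_feats) + self.pos_embed + self.modality_embed[0]
        a = self.audio_proj(audio_feats) + self.pos_embed + self.modality_embed[1]
        tokens = torch.cat([v, a], dim=1)          # (batch, 2 * window, d_model)
        encoded = self.encoder(tokens)             # self-attention over all actions/modalities
        centre = self.window // 2
        # Fuse the centre action's video and audio tokens before classifying.
        fused = encoded[:, centre] + encoded[:, self.window + centre]
        return self.classifier(fused)


if __name__ == "__main__":
    model = TemporalContextClassifier()
    video = torch.randn(2, 9, 1024)   # e.g. pre-extracted clip features
    audio = torch.randn(2, 9, 1024)   # e.g. pre-extracted auditory features
    logits = model(video, audio)      # (2, num_classes) for the centre action
    print(logits.shape)
```

In this reading, "attending to surrounding actions" is realised simply by letting self-attention mix tokens from neighbouring actions before the centre action is classified; the paper additionally rescores the resulting action-sequence predictions with a language model, which is not shown here.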
Related papers
- Multi-Modal interpretable automatic video captioning [1.9874264019909988]
We introduce a novel video captioning method trained with a multi-modal contrastive loss.
Our approach is designed to capture the dependencies between modalities, resulting in more accurate and pertinent captions.
arXiv Detail & Related papers (2024-11-11T11:12:23Z)
- MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval [53.417646562344906]
Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query.
Existing methods often suffer from inadequate training annotations, i.e., the sentence typically matches only a fraction of the prominent foreground video content and has limited wording diversity.
This intrinsic modality imbalance leaves a considerable portion of the visual information unaligned with text.
In this work, we take an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and boosting the temporal localization.
arXiv Detail & Related papers (2024-06-25T18:39:43Z)
- Multimodal Fusion Method with Spatiotemporal Sequences and Relationship Learning for Valence-Arousal Estimation [9.93719767430551]
This paper presents our approach to the VA (Valence-Arousal) estimation task in the ABAW6 competition.
We devise a comprehensive model that preprocesses video frames and audio segments to extract visual and audio features.
We employ a Transformer encoder structure to learn long-range dependencies, thereby enhancing the model's performance and generalization ability.
arXiv Detail & Related papers (2024-03-19T04:25:54Z)
- Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding [108.79026216923984]
Video grounding aims to localize a spatio-temporal section in a video corresponding to an input text query.
This paper addresses a critical limitation in current video grounding methodologies by introducing an Open-Vocabulary Spatio-Temporal Video Grounding task.
arXiv Detail & Related papers (2023-12-31T13:53:37Z)
- READ: Recurrent Adapter with Partial Video-Language Alignment for Parameter-Efficient Transfer Learning in Low-Resource Video-Language Modeling [31.745255364708864]
We introduce lightweight adapters to the pre-trained model and only update them at fine-tuning time.
We propose a novel REcurrent ADapter (READ) that employs recurrent computation to enable temporal modeling capability.
We validate our READ framework through extensive experiments where READ significantly outperforms all existing fine-tuning strategies.
arXiv Detail & Related papers (2023-12-12T03:09:30Z)
- Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object Interaction Anticipation [72.74191015833397]
We propose TransFusion, a multimodal transformer-based architecture.
It exploits the representational power of language by summarizing the action context.
Our model enables more efficient end-to-end learning.
arXiv Detail & Related papers (2023-01-22T21:30:12Z)
- Align and Prompt: Video-and-Language Pre-training with Entity Prompts [111.23364631136339]
Video-and-language pre-training has shown promising improvements on various downstream tasks.
We propose Align and Prompt: an efficient and effective video-and-language pre-training framework with better cross-modal alignment.
Our code and pre-trained models will be released.
arXiv Detail & Related papers (2021-12-17T15:55:53Z)
- Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos [76.21297023629589]
We propose a novel method for learning pairwise modality interactions in order to better exploit complementary information for each pair of modalities in videos.
Our method achieves state-of-the-art performance on four standard benchmark datasets.
arXiv Detail & Related papers (2020-07-28T12:40:59Z)