Learning Modality Interaction for Temporal Sentence Localization and
Event Captioning in Videos
- URL: http://arxiv.org/abs/2007.14164v1
- Date: Tue, 28 Jul 2020 12:40:59 GMT
- Title: Learning Modality Interaction for Temporal Sentence Localization and
Event Captioning in Videos
- Authors: Shaoxiang Chen, Wenhao Jiang, Wei Liu, Yu-Gang Jiang
- Abstract summary: We propose a novel method for learning pairwise modality interactions to better exploit the complementary information in each pair of modalities in videos.
Our method achieves state-of-the-art performance on four standard benchmark datasets.
- Score: 76.21297023629589
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatically generating sentences to describe events and temporally
localizing sentences in a video are two important tasks that bridge language
and videos. Recent techniques leverage the multimodal nature of videos by using
off-the-shelf features to represent videos, but interactions between modalities
are rarely explored. Inspired by the fact that there exist cross-modal
interactions in the human brain, we propose a novel method for learning
pairwise modality interactions in order to better exploit the complementary
information in each pair of modalities in videos and thus improve performance
on both tasks. We model modality interaction at both the sequence and channel
levels in a pairwise fashion, and the pairwise interaction also provides some
explainability for the predictions of the target tasks. We demonstrate the
effectiveness of our method and validate specific design choices through
extensive ablation studies. Our method achieves state-of-the-art performance
on four standard benchmark datasets: MSVD and MSR-VTT (event captioning), and
Charades-STA and ActivityNet Captions (temporal sentence localization).
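For intuition, here is a minimal sketch of what a pairwise interaction block for two feature streams could look like, assuming the sequence-level interaction resembles cross-attention over time steps and the channel-level interaction resembles a learned channel gate; the class, module, and argument names below are illustrative assumptions, not the authors' implementation.

    # Minimal sketch (PyTorch), not the authors' code: pairwise interaction
    # between two modality streams, with a sequence-level cross-attention term
    # and a channel-level gating term. All names are illustrative.
    import torch
    import torch.nn as nn

    class PairwiseInteraction(nn.Module):
        def __init__(self, dim, num_heads=4):
            super().__init__()
            # Sequence level: modality A attends over the time steps of modality B.
            self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            # Channel level: gate A's channels with a pooled descriptor of B.
            self.channel_gate = nn.Sequential(
                nn.Linear(dim, dim), nn.ReLU(),
                nn.Linear(dim, dim), nn.Sigmoid(),
            )
            self.norm = nn.LayerNorm(dim)

        def forward(self, feat_a, feat_b):
            # feat_a: (batch, T_a, dim); feat_b: (batch, T_b, dim)
            seq_out, _ = self.cross_attn(query=feat_a, key=feat_b, value=feat_b)
            gate = self.channel_gate(feat_b.mean(dim=1)).unsqueeze(1)  # (batch, 1, dim)
            # Residual combination of the sequence-level and channel-level terms.
            return self.norm(feat_a + seq_out + gate * feat_a)

    # Example: one modality pair (e.g., appearance and motion features).
    appearance = torch.randn(2, 20, 512)
    motion = torch.randn(2, 20, 512)
    fused = PairwiseInteraction(dim=512)(appearance, motion)  # shape (2, 20, 512)

Following the abstract, such a block would be applied to every pair of available modalities before the captioning or localization head; how the pairwise outputs are aggregated is a detail of the paper not reflected in this sketch.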
Related papers
- Multi-Modal interpretable automatic video captioning [1.9874264019909988]
We introduce a novel video captioning method trained with multi-modal contrastive loss.
Our approach is designed to capture the dependencies between modalities, resulting in more accurate and more pertinent captions.
arXiv Detail & Related papers (2024-11-11T11:12:23Z) - Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding [112.3913646778859]
We propose a simple yet effective video-language modeling framework, S-ViLM.
It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote learning region-object alignment and temporal-aware features.
S-ViLM surpasses the state-of-the-art methods substantially on four representative downstream tasks.
arXiv Detail & Related papers (2023-03-28T22:45:07Z) - With a Little Help from my Temporal Context: Multimodal Egocentric
Action Recognition [95.99542238790038]
We propose a method that learns to attend to surrounding actions in order to improve recognition performance.
To incorporate the temporal context, we propose a transformer-based multimodal model that ingests video and audio as input modalities.
We test our approach on EPIC-KITCHENS and EGTEA datasets reporting state-of-the-art performance.
arXiv Detail & Related papers (2021-11-01T15:27:35Z) - Hierarchical Deep Residual Reasoning for Temporal Moment Localization [48.108468456043994]
We propose a Hierarchical Deep Residual Reasoning (HDRR) model, which decomposes the video and sentence into multi-level representations with different semantics.
We also design the simple yet effective Res-BiGRUs for feature fusion, which is able to grasp the useful information in a self-adapting manner.
arXiv Detail & Related papers (2021-10-31T07:13:34Z) - Multi-Modal Interaction Graph Convolutional Network for Temporal
Language Localization in Videos [55.52369116870822]
This paper focuses on tackling the problem of temporal language localization in videos.
It aims to identify the start and end points of a moment described by a natural language sentence in an untrimmed video.
arXiv Detail & Related papers (2021-10-12T14:59:25Z) - Neuro-Symbolic Representations for Video Captioning: A Case for
Leveraging Inductive Biases for Vision and Language [148.0843278195794]
We propose a new model architecture for learning multi-modal neuro-symbolic representations for video captioning.
Our approach uses a dictionary learning-based method of learning relations between videos and their paired text descriptions.
arXiv Detail & Related papers (2020-11-18T20:21:19Z)