Learning to Combine the Modalities of Language and Video for Temporal
Moment Localization
- URL: http://arxiv.org/abs/2109.02925v1
- Date: Tue, 7 Sep 2021 08:25:45 GMT
- Title: Learning to Combine the Modalities of Language and Video for Temporal
Moment Localization
- Authors: Jungkyoo Shin and Jinyoung Moon
- Abstract summary: Temporal moment localization aims to retrieve the best video segment matching a moment specified by a query.
We introduce a novel recurrent unit, cross-modal long short-term memory (CM-LSTM), which mimics the human cognitive process of localizing temporal moments.
We also devise a two-stream attention mechanism that handles video features both attended and unattended by the input query, preventing necessary visual information from being neglected.
- Score: 4.203274985072923
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Temporal moment localization aims to retrieve the best video segment matching
a moment specified by a query. The existing methods generate the visual and
semantic embeddings independently and fuse them without full consideration of
the long-term temporal relationship between them. To address these
shortcomings, we introduce a novel recurrent unit, cross-modal long short-term
memory (CM-LSTM), which mimics the human cognitive process of localizing
temporal moments: it focuses on the part of a video segment related to the
part of a query and recurrently accumulates contextual information across the
entire video. In addition, we devise a two-stream attention mechanism that
handles video features both attended and unattended by the input query,
preventing necessary visual information from being neglected. To obtain more precise
boundaries, we propose a two-stream attentive cross-modal interaction network
(TACI) that generates two 2D proposal maps, one obtained globally from the
integrated contextual features produced by CM-LSTM and the other obtained
locally from boundary score sequences, and then combines them into a final
2D map in an end-to-end manner. On the TML benchmark dataset,
ActivityNet-Captions, TACI outperforms state-of-the-art TML methods with R@1
of 45.50% and 27.23% for IoU@0.5 and IoU@0.7, respectively. In addition, we
show that existing state-of-the-art methods achieve performance gains when
their original LSTM is replaced with our CM-LSTM.
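To make the described architecture more concrete, below is a minimal sketch, assuming a PyTorch-style implementation with hypothetical module names, dimensions, and gating details (the paper's exact equations are not reproduced here), of a cross-modal LSTM-style cell that attends over the query words at each video time step, together with an illustrative attended/unattended two-stream split:

```python
# Sketch only (not the authors' implementation): a cross-modal LSTM-style
# cell that, at every video time step, attends over the query word
# embeddings, fuses the attended query context with the visual feature,
# and updates its recurrent state.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalLSTMCell(nn.Module):
    def __init__(self, video_dim: int, query_dim: int, hidden_dim: int):
        super().__init__()
        # Attention over query words, conditioned on the previous hidden state.
        self.attn = nn.Linear(hidden_dim + query_dim, 1)
        # Standard LSTM cell over the fused visual + attended-query feature.
        self.lstm = nn.LSTMCell(video_dim + query_dim, hidden_dim)

    def forward(self, v_t, q, state):
        # v_t:   (B, video_dim)      visual feature of the current clip
        # q:     (B, L, query_dim)   word-level query embeddings
        # state: (h, c), each of shape (B, hidden_dim)
        h, c = state
        # Score each query word against the previous hidden state.
        h_exp = h.unsqueeze(1).expand(-1, q.size(1), -1)      # (B, L, hidden)
        scores = self.attn(torch.cat([h_exp, q], dim=-1))     # (B, L, 1)
        alpha = F.softmax(scores, dim=1)                      # word weights
        q_att = (alpha * q).sum(dim=1)                        # (B, query_dim)
        # Fuse the clip feature with the query context; update the state.
        h, c = self.lstm(torch.cat([v_t, q_att], dim=-1), (h, c))
        return h, c


def two_stream_features(video, query_context):
    # Hypothetical two-stream split: one stream keeps query-attended clip
    # features, the other keeps the raw (unattended) clip features so that
    # visual information not covered by the query is not discarded.
    # video: (B, T, D), query_context: (B, D)
    gate = torch.sigmoid(
        (video * query_context.unsqueeze(1)).sum(-1, keepdim=True))
    attended = gate * video
    unattended = video  # left untouched in this sketch
    return attended, unattended


# Example: run the cell over T clips of a video.
cell = CrossModalLSTMCell(video_dim=512, query_dim=300, hidden_dim=256)
B, T, L = 2, 8, 12
video = torch.randn(B, T, 512)
query = torch.randn(B, L, 300)
h = c = torch.zeros(B, 256)
for t in range(T):
    h, c = cell(video[:, t], query, (h, c))
```

In the full TACI network described above, contextual features from such a recurrence would feed the global 2D proposal map while boundary score sequences would feed the local one; that combination step is omitted from this sketch.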
Related papers
- Self-Supervised Video Representation Learning via Latent Time Navigation [12.721647696921865]
Self-supervised video representation learning aims at maximizing similarity between different temporal segments of one video.
We propose Latent Time Navigation (LTN) to capture fine-grained motions.
Our experimental analysis suggests that learning video representations by LTN consistently improves performance of action classification.
arXiv Detail & Related papers (2023-05-10T20:06:17Z) - Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding [112.3913646778859]
We propose a simple yet effective video-language modeling framework, S-ViLM.
It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote learning region-object alignment and temporal-aware features.
S-ViLM surpasses the state-of-the-art methods substantially on four representative downstream tasks.
arXiv Detail & Related papers (2023-03-28T22:45:07Z) - HTNet: Anchor-free Temporal Action Localization with Hierarchical
Transformers [19.48000379201692]
Temporal action localization (TAL) is a task of identifying a set of actions in a video.
We present a novel anchor-free framework, known as HTNet, which predicts a set of <start time, end time, class> triplets from a video.
We demonstrate that our method localizes accurate action instances and achieves state-of-the-art performance on two TAL benchmark datasets.
arXiv Detail & Related papers (2022-07-20T05:40:03Z) - Exploiting long-term temporal dynamics for video captioning [40.15826846670479]
We propose a novel approach, namely temporal and spatial LSTM (TS-LSTM), which systematically exploits spatial and temporal dynamics within video sequences.
Experimental results obtained in two public video captioning benchmarks indicate that our TS-LSTM outperforms the state-of-the-art methods.
arXiv Detail & Related papers (2022-02-22T11:40:09Z) - Multimodal Transformer with Variable-length Memory for
Vision-and-Language Navigation [79.1669476932147]
Vision-and-Language Navigation (VLN) is a task in which an agent is required to follow a language instruction to navigate to the goal position.
Recent Transformer-based VLN methods have made great progress benefiting from the direct connections between visual observations and the language instruction.
We introduce Multimodal Transformer with Variable-length Memory (MTVM) for visually-grounded natural language navigation.
arXiv Detail & Related papers (2021-11-10T16:04:49Z) - Efficient Global-Local Memory for Real-time Instrument Segmentation of
Robotic Surgical Video [53.14186293442669]
We identify two important clues for surgical instrument perception, including local temporal dependency from adjacent frames and global semantic correlation in long-range duration.
We propose a novel dual-memory network (DMNet) to relate both global and local-temporal knowledge.
Our method largely outperforms the state-of-the-art works on segmentation accuracy while maintaining a real-time speed.
arXiv Detail & Related papers (2021-09-28T10:10:14Z) - Improving Video Instance Segmentation via Temporal Pyramid Routing [61.10753640148878]
Video Instance Segmentation (VIS) is a new and inherently multi-task problem, which aims to detect, segment and track each instance in a video sequence.
We propose a Temporal Pyramid Routing (TPR) strategy to conditionally align and conduct pixel-level aggregation from a feature pyramid pair of two adjacent frames.
Our approach is a plug-and-play module and can be easily applied to existing instance segmentation methods.
arXiv Detail & Related papers (2021-07-28T03:57:12Z) - Cross-modal Consensus Network for Weakly Supervised Temporal Action
Localization [74.34699679568818]
Weakly supervised temporal action localization (WS-TAL) is a challenging task that aims to localize action instances in the given video with video-level categorical supervision.
We propose a cross-modal consensus network (CO2-Net) to tackle this problem.
arXiv Detail & Related papers (2021-07-27T04:21:01Z) - Temporal Context Aggregation Network for Temporal Action Proposal
Refinement [93.03730692520999]
Temporal action proposal generation is a challenging yet important task in the video understanding field.
Current methods still suffer from inaccurate temporal boundaries and inferior confidence used for retrieval.
We propose TCANet to generate high-quality action proposals through "local and global" temporal context aggregation.
arXiv Detail & Related papers (2021-03-24T12:34:49Z)