Language Guided Networks for Cross-modal Moment Retrieval
- URL: http://arxiv.org/abs/2006.10457v2
- Date: Wed, 9 Sep 2020 05:19:24 GMT
- Title: Language Guided Networks for Cross-modal Moment Retrieval
- Authors: Kun Liu, Huadong Ma, and Chuang Gan
- Abstract summary: Cross-modal moment retrieval aims to localize a temporal segment from an untrimmed video described by a natural language query.
Existing methods independently extract the features of videos and sentences.
We present Language Guided Networks (LGN), a new framework that leverages the sentence embedding to guide the whole process of moment retrieval.
- Score: 66.49445903955777
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We address the challenging task of cross-modal moment retrieval, which aims
to localize a temporal segment from an untrimmed video described by a natural
language query. This task poses great challenges for proper semantic alignment
between the visual and linguistic domains. Existing methods independently extract
the features of videos and sentences and use the sentence embedding only in the
multi-modal fusion stage, which does not fully exploit the potential of
language. In this paper, we present Language Guided Networks (LGN), a new
framework that leverages the sentence embedding to guide the whole process of
moment retrieval. In the first feature extraction stage, we propose to jointly
learn visual and language features, capturing visual information rich enough to
cover the complex semantics of the sentence query. Specifically, the
early modulation unit is designed to modulate the visual feature extractor's
feature maps by a linguistic embedding. Then we adopt a multi-modal fusion
module in the second fusion stage. Finally, to obtain a precise localizer, the
sentence information is used to guide the prediction of temporal positions.
Specifically, the late guidance module linearly transforms the output of the
localization network via a channel attention mechanism. Experimental results on
two popular datasets demonstrate the superior performance of our method on
moment retrieval (improving Rank1@IoU0.5 by 5.8% on Charades-STA and by 5.2% on
TACoS). The source code for the complete system will be publicly available.
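The abstract describes two language-guidance mechanisms: an early modulation unit that modulates visual feature maps with a linguistic embedding, and a late guidance module that linearly re-weights the localizer's output through channel attention. Since no code accompanies this abstract, the PyTorch-style sketch below is only an illustrative reading of those two ideas; the module names (`EarlyModulation`, `LateGuidance`), tensor shapes, and layer choices are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class EarlyModulation(nn.Module):
    """FiLM-style modulation of visual feature maps by a sentence embedding.

    Hypothetical sketch of the paper's early modulation unit: the visual
    extractor's feature maps are scaled and shifted per channel, conditioned
    on the query embedding. Exact layers and shapes are assumptions.
    """

    def __init__(self, sent_dim: int, vis_channels: int):
        super().__init__()
        self.gamma = nn.Linear(sent_dim, vis_channels)  # channel-wise scale
        self.beta = nn.Linear(sent_dim, vis_channels)   # channel-wise shift

    def forward(self, vis_feat: torch.Tensor, sent_emb: torch.Tensor) -> torch.Tensor:
        # vis_feat: (B, C, T) clip-level feature maps; sent_emb: (B, D)
        gamma = self.gamma(sent_emb).unsqueeze(-1)  # (B, C, 1)
        beta = self.beta(sent_emb).unsqueeze(-1)    # (B, C, 1)
        return gamma * vis_feat + beta


class LateGuidance(nn.Module):
    """Channel-attention guidance of the localization head by the sentence."""

    def __init__(self, sent_dim: int, loc_channels: int):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Linear(sent_dim, loc_channels),
            nn.Sigmoid(),  # per-channel weights in (0, 1)
        )

    def forward(self, loc_feat: torch.Tensor, sent_emb: torch.Tensor) -> torch.Tensor:
        # loc_feat: (B, C, T) output of the localization network
        weights = self.attn(sent_emb).unsqueeze(-1)  # (B, C, 1)
        return weights * loc_feat  # linear re-weighting of channels


if __name__ == "__main__":
    # Toy tensors only; the multi-modal fusion stage between the two modules is omitted.
    B, C, T, D = 2, 256, 64, 300
    vis, sent = torch.randn(B, C, T), torch.randn(B, D)
    guided = LateGuidance(D, C)(EarlyModulation(D, C)(vis, sent), sent)
    print(guided.shape)  # torch.Size([2, 256, 64])
```

In this reading, the two modules bracket the fusion stage; the toy usage simply chains them to show the tensor flow.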
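For reference, the reported metric Rank1@IoU0.5 counts a query as correct when the top-ranked predicted segment overlaps the ground-truth segment with temporal IoU of at least 0.5. The minimal sketch below (not from the paper; function names are invented for illustration) shows how such a score is typically computed.

```python
def temporal_iou(pred, gt):
    """IoU between two temporal segments given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def rank1_at_iou(top1_preds, gts, threshold=0.5):
    """Fraction of queries whose top-1 prediction reaches IoU >= threshold."""
    hits = [temporal_iou(p, g) >= threshold for p, g in zip(top1_preds, gts)]
    return sum(hits) / len(hits)


# Toy example: two queries; the first top-1 prediction is a hit, the second a miss.
print(rank1_at_iou([(2.0, 8.0), (10.0, 12.0)], [(3.0, 9.0), (20.0, 25.0)]))  # 0.5
```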
Related papers
- MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval [53.417646562344906]
Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query.
Existing methods often suffer from inadequate training annotations, i.e., the sentence typically matches only a fraction of the prominent foreground video content and has limited wording diversity.
This intrinsic modality imbalance leaves a considerable portion of the visual information unaligned with text.
In this work, we take an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and boosting the temporal localization.
arXiv Detail & Related papers (2024-06-25T18:39:43Z) - Context-Aware Integration of Language and Visual References for Natural Language Tracking [27.3884348078998]
Tracking by natural language specification (TNL) aims to consistently localize a target in a video sequence given a linguistic description in the initial frame.
We propose a joint multi-modal tracking framework with a prompt module that leverages the complementarity between temporal visual templates and language expressions, enabling precise and context-aware appearance and linguistic cues.
This design ensures temporal consistency by leveraging historical visual information and provides an integrated solution that generates predictions in a single step.
arXiv Detail & Related papers (2024-03-29T04:58:33Z) - A Dual Semantic-Aware Recurrent Global-Adaptive Network for Vision-and-Language Navigation [3.809880620207714]
Vision-and-Language Navigation (VLN) is a realistic but challenging task that requires an agent to locate the target region using verbal and visual cues.
This work proposes a dual semantic-aware recurrent global-adaptive network (DSRG) to address the above problems.
arXiv Detail & Related papers (2023-05-05T15:06:08Z) - Cross-Lingual Cross-Modal Retrieval with Noise-Robust Learning [25.230786853723203]
We propose a noise-robust cross-lingual cross-modal retrieval method for low-resource languages.
We use Machine Translation to construct pseudo-parallel sentence pairs for low-resource languages.
We introduce a multi-view self-distillation method to learn noise-robust target-language representations.
arXiv Detail & Related papers (2022-08-26T09:32:24Z) - Modeling Motion with Multi-Modal Features for Text-Based Video Segmentation [56.41614987789537]
Text-based video segmentation aims to segment the target object in a video based on a describing sentence.
We propose a method to fuse and align appearance, motion, and linguistic features to achieve accurate segmentation.
arXiv Detail & Related papers (2022-04-06T02:42:33Z) - Progressive Localization Networks for Language-based Moment Localization [56.54450664871467]
This paper focuses on the task of language-based moment localization.
Most existing methods first sample a large set of candidate moments of various temporal lengths and then match them against the given query to determine the target moment.
We propose a novel multi-stage Progressive Localization Network (PLN) which progressively localizes the target moment in a coarse-to-fine manner.
arXiv Detail & Related papers (2021-02-02T03:45:59Z) - Fine-grained Iterative Attention Network for Temporal Language Localization in Videos [63.94898634140878]
Temporal language localization in videos aims to ground one video segment in an untrimmed video based on a given sentence query.
We propose a Fine-grained Iterative Attention Network (FIAN) that consists of an iterative attention module for bilateral query-video information extraction.
We evaluate the proposed method on three challenging public benchmarks: ActivityNet Captions, TACoS, and Charades-STA.
arXiv Detail & Related papers (2020-08-06T04:09:03Z) - Local-Global Video-Text Interactions for Temporal Grounding [77.5114709695216]
This paper addresses the problem of text-to-video temporal grounding, which aims to identify the time interval in a video semantically relevant to a text query.
We tackle this problem using a novel regression-based model that learns to extract a collection of mid-level features for semantic phrases in a text query.
The proposed method effectively predicts the target time interval by exploiting contextual information from local to global.
arXiv Detail & Related papers (2020-04-16T08:10:41Z)