You Need to Read Again: Multi-granularity Perception Network for Moment Retrieval in Videos
- URL: http://arxiv.org/abs/2205.12886v1
- Date: Wed, 25 May 2022 16:15:46 GMT
- Title: You Need to Read Again: Multi-granularity Perception Network for Moment Retrieval in Videos
- Authors: Xin Sun, Xuan Wang, Jialin Gao, Qiong Liu, Xi Zhou
- Abstract summary: We propose a novel Multi-Granularity Perception Network (MGPN) that perceives intra-modality and inter-modality information at a multi-granularity level.
Specifically, we formulate moment retrieval as a multi-choice reading comprehension task and integrate human reading strategies into our framework.
- Score: 19.711703590063976
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Moment retrieval in videos is a challenging task that aims to retrieve the
most relevant video moment in an untrimmed video given a sentence description.
Previous methods tend to perform self-modal learning and cross-modal
interaction in a coarse manner, neglecting fine-grained clues contained in
video content, query context, and their alignment. To this end, we propose a
novel Multi-Granularity Perception Network (MGPN) that perceives intra-modality
and inter-modality information at a multi-granularity level. Specifically, we
formulate moment retrieval as a multi-choice reading comprehension task and
integrate human reading strategies into our framework. A coarse-grained feature
encoder and a co-attention mechanism are utilized to obtain a preliminary
perception of intra-modality and inter-modality information. Then a
fine-grained feature encoder and a conditioned interaction module are
introduced to enhance the initial perception inspired by how humans address
reading comprehension problems. Moreover, to alleviate the huge computation
burden of some existing methods, we further design an efficient choice
comparison module and reduce the hidden size with imperceptible quality loss.
Extensive experiments on Charades-STA, TACoS, and ActivityNet Captions datasets
demonstrate that our solution outperforms existing state-of-the-art methods.
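As a rough illustration of the coarse-to-fine perception idea described above, the sketch below shows how a coarse co-attention pass, a fine-grained "read again" pass conditioned on it, and a lightweight choice-comparison step over candidate moments could be wired together in PyTorch. This is a minimal sketch under assumed module names and feature dimensions (e.g. CoAttention, MomentRetrievalSketch, hidden=128), not the authors' MGPN implementation.

```python
# Minimal sketch (not the authors' code) of the coarse-to-fine perception idea:
# a coarse co-attention pass over video and query, a fine-grained pass that
# "reads the video again" conditioned on the query, and a lightweight scoring
# of candidate moments ("choices"). All names and dimensions are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CoAttention(nn.Module):
    """Symmetric co-attention: each modality attends to the other."""
    def __init__(self, dim):
        super().__init__()
        self.scale = dim ** -0.5

    def forward(self, video, query):  # video: (B, Tv, D), query: (B, Tq, D)
        sim = torch.bmm(video, query.transpose(1, 2)) * self.scale          # (B, Tv, Tq)
        v2q = torch.bmm(F.softmax(sim, dim=-1), query)                      # video attends to query
        q2v = torch.bmm(F.softmax(sim.transpose(1, 2), dim=-1), video)      # query attends to video
        return video + v2q, query + q2v


class MomentRetrievalSketch(nn.Module):
    def __init__(self, vid_dim=1024, txt_dim=300, hidden=128):  # small hidden size on purpose
        super().__init__()
        self.vid_proj = nn.Linear(vid_dim, hidden)
        self.txt_proj = nn.Linear(txt_dim, hidden)
        self.coarse_coattn = CoAttention(hidden)
        # second, fine-grained encoder conditioned on the preliminary perception
        self.fine_video = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
        self.fine_fuse = nn.Linear(2 * hidden, hidden)
        self.choice_score = nn.Linear(hidden, 1)   # scores each candidate moment ("choice")

    def forward(self, video_feats, query_feats, proposals):
        # video_feats: (B, Tv, vid_dim), query_feats: (B, Tq, txt_dim)
        # proposals: list of (start, end) clip indices defining candidate moments
        v = self.vid_proj(video_feats)
        q = self.txt_proj(query_feats)
        v, q = self.coarse_coattn(v, q)            # preliminary perception
        v, _ = self.fine_video(v)                  # fine-grained re-reading of the video
        v = self.fine_fuse(v)
        sent = q.mean(dim=1, keepdim=True)         # pooled query representation
        v = v * sent                               # condition video features on the query
        scores = []
        for s, e in proposals:                     # compare candidate "choices"
            moment = v[:, s:e + 1].mean(dim=1)     # pooled moment representation
            scores.append(self.choice_score(moment))
        return torch.cat(scores, dim=1)            # (B, num_proposals)
```

Keeping the projected hidden size small, as in this sketch, mirrors the abstract's point about reducing the hidden size to cut computation with imperceptible quality loss.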
Related papers
- Detecting Misinformation in Multimedia Content through Cross-Modal Entity Consistency: A Dual Learning Approach [10.376378437321437]
We propose a Multimedia Misinformation Detection framework for detecting misinformation from video content by leveraging cross-modal entity consistency.
Our results demonstrate that MultiMD outperforms state-of-the-art baseline models.
arXiv Detail & Related papers (2024-08-16T16:14:36Z)
- The Surprising Effectiveness of Multimodal Large Language Models for Video Moment Retrieval [36.516226519328015]
Video-language tasks necessitate spatial and temporal comprehension and require significant compute.
This work demonstrates the surprising effectiveness of leveraging image-text pretrained MLLMs for moment retrieval.
We achieve a new state-of-the-art in moment retrieval on the widely used benchmarks Charades-STA, QVHighlights, and ActivityNet Captions.
arXiv Detail & Related papers (2024-06-26T06:59:09Z)
- MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval [53.417646562344906]
Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query.
Existing methods often suffer from inadequate training annotations, i.e., the sentence typically matches only a fraction of the prominent foreground video content and has limited wording diversity.
This intrinsic modality imbalance leaves a considerable portion of the visual information unaligned with text.
In this work, we take an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and boosting the temporal localization.
arXiv Detail & Related papers (2024-06-25T18:39:43Z)
- Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL).
GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval.
Our method achieves performance comparable to the SOTA while being nearly 220 times faster in terms of computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z)
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
- Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning [15.998149438353133]
We propose a two-stage retrieval architecture for text-to-video retrieval.
In the training phase, we design a parameter-free text-gated interaction block (TIB) for fine-grained video representation learning.
In the retrieval phase, coarse-grained video representations are used for fast recall of top-k candidates, which are then reranked with fine-grained video representations; a minimal sketch of this coarse-to-fine pattern appears after this list.
arXiv Detail & Related papers (2024-01-01T08:54:18Z)
- Cross-modal Contrastive Learning with Asymmetric Co-attention Network for Video Moment Retrieval [0.17590081165362778]
Video moment retrieval is a challenging task requiring fine-grained interactions between video and text modalities.
Recent work in image-text pretraining has demonstrated that most existing pretrained models suffer from information asymmetry due to the difference in length between visual and textual sequences.
We question whether the same problem also exists in the video-text domain with an auxiliary need to preserve both spatial and temporal information.
arXiv Detail & Related papers (2023-12-12T17:00:46Z)
- Multi-Modal Interaction Graph Convolutional Network for Temporal Language Localization in Videos [55.52369116870822]
This paper focuses on tackling the problem of temporal language localization in videos.
It aims to identify the start and end points of a moment described by a natural language sentence in an untrimmed video.
arXiv Detail & Related papers (2021-10-12T14:59:25Z)
- Multi-Granularity Network with Modal Attention for Dense Affective Understanding [11.076925361793556]
The recent EEV challenge proposes a dense affective understanding task that requires frame-level affective prediction.
We propose a multi-granularity network with modal attention (MGN-MA), which employs multi-granularity features for better description of the target frame.
The proposed method achieves the correlation score of 0.02292 in the EEV challenge.
arXiv Detail & Related papers (2021-06-18T07:37:06Z)
- Video Understanding as Machine Translation [53.59298393079866]
We tackle a wide variety of downstream video understanding tasks by means of a single unified framework.
We report performance gains over the state-of-the-art on several downstream tasks, including video classification (EPIC-Kitchens), question answering (TVQA), and captioning (TVC, YouCook2, and MSR-VTT).
arXiv Detail & Related papers (2020-06-12T14:07:04Z)
- Convolutional Hierarchical Attention Network for Query-Focused Video Summarization [74.48782934264094]
This paper addresses the task of query-focused video summarization, which takes a user's query and a long video as inputs.
We propose a method, named Convolutional Hierarchical Attention Network (CHAN), which consists of two parts: a feature encoding network and a query-relevance computing module.
In the encoding network, we employ a convolutional network with a local self-attention mechanism and a query-aware global attention mechanism to learn the visual information of each shot.
arXiv Detail & Related papers (2020-01-31T04:30:14Z)
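As referenced in the coarse-to-fine text-to-video retrieval entry above, the following is a minimal sketch of the recall-then-rerank pattern: pooled (coarse) representations recall the top-k candidate videos, which are then reranked with a token- and frame-level (fine-grained) score. The functions coarse_score and fine_score, and the max-over-frames, mean-over-tokens matching, are illustrative assumptions, not the paper's TIB design.

```python
# Minimal sketch of two-stage coarse-to-fine retrieval: cheap recall, then
# expensive reranking on only the recalled candidates. Illustrative only.
import torch


def coarse_score(text_emb, video_embs):
    # text_emb: (D,), video_embs: (N, D) -- one pooled vector per video
    return video_embs @ text_emb                         # (N,) similarity per video


def fine_score(text_tokens, video_frames):
    # text_tokens: (Lt, D), video_frames: (Tf, D) -- token/frame level features
    sim = text_tokens @ video_frames.T                   # (Lt, Tf)
    return sim.max(dim=1).values.mean()                  # max over frames, mean over tokens


def retrieve(text_emb, text_tokens, video_embs, video_frames_list, k=10):
    # Stage 1: fast recall of top-k candidates with pooled (coarse) representations.
    topk = torch.topk(coarse_score(text_emb, video_embs), k=min(k, video_embs.size(0)))
    # Stage 2: rerank only the recalled candidates with fine-grained representations.
    reranked = sorted(
        topk.indices.tolist(),
        key=lambda i: fine_score(text_tokens, video_frames_list[i]).item(),
        reverse=True,
    )
    return reranked  # video indices, best match first
```

The point of the split is that the expensive fine-grained score runs only on k candidates rather than the whole gallery, which is where the efficiency gain comes from.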
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.