Multi-modal Fusion and Query Refinement Network for Video Moment Retrieval and Highlight Detection
- URL: http://arxiv.org/abs/2501.10692v1
- Date: Sat, 18 Jan 2025 08:09:44 GMT
- Title: Multi-modal Fusion and Query Refinement Network for Video Moment Retrieval and Highlight Detection
- Authors: Yifang Xu, Yunzhuo Sun, Benxiang Zhai, Zien Xie, Youyao Jia, Sidan Du
- Abstract summary: Given a video and a linguistic query, video moment retrieval and highlight detection (MR&HD) aim to locate all the relevant spans while simultaneously predicting saliency scores. Most existing methods utilize RGB images as input, overlooking the inherent multi-modal visual signals like optical flow and depth maps. We propose a Multi-modal Fusion and Query Refinement Network (MRNet) to learn complementary information from multi-modal cues.
- Score: 1.5236380958983644
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Given a video and a linguistic query, video moment retrieval and highlight detection (MR&HD) aim to locate all the relevant spans while simultaneously predicting saliency scores. Most existing methods utilize RGB images as input, overlooking the inherent multi-modal visual signals like optical flow and depth. In this paper, we propose a Multi-modal Fusion and Query Refinement Network (MRNet) to learn complementary information from multi-modal cues. Specifically, we design a multi-modal fusion module to dynamically combine RGB, optical flow, and depth maps. Furthermore, to simulate human understanding of sentences, we introduce a query refinement module that merges text at word-, phrase-, and sentence-level granularities. Comprehensive experiments on the QVHighlights and Charades datasets indicate that MRNet outperforms current state-of-the-art methods, achieving notable improvements in MR-mAP@Avg (+3.41) and HD-HIT@1 (+3.46) on QVHighlights.
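To make the fusion idea concrete, the snippet below sketches one plausible way to dynamically weight RGB, optical flow, and depth features per clip. It is a minimal illustration under assumed names, shapes, and a gating scheme chosen for clarity, not the authors' implementation; MRNet's actual fusion module may differ.

```python
# Illustrative sketch only: a minimal gated fusion of RGB, optical flow, and depth
# features in the spirit of the multi-modal fusion module described in the abstract.
# The class name, gating scheme, and dimensions are assumptions, not MRNet's code.
import torch
import torch.nn as nn


class DynamicModalityFusion(nn.Module):
    """Fuses per-clip RGB, flow, and depth features with softmax-normalized gates."""

    def __init__(self, dim: int):
        super().__init__()
        # Predict one scalar gate per modality from the concatenated clip features.
        self.gate = nn.Linear(3 * dim, 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, rgb: torch.Tensor, flow: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        # rgb, flow, depth: (batch, num_clips, dim)
        stacked = torch.stack([rgb, flow, depth], dim=2)            # (B, T, 3, D)
        weights = self.gate(torch.cat([rgb, flow, depth], dim=-1))  # (B, T, 3)
        weights = weights.softmax(dim=-1).unsqueeze(-1)             # (B, T, 3, 1)
        fused = (weights * stacked).sum(dim=2)                      # (B, T, D)
        return self.proj(fused)


if __name__ == "__main__":
    # Example: 2 videos, 75 clips each, 256-dim features per modality.
    fusion = DynamicModalityFusion(dim=256)
    rgb, flow, depth = (torch.randn(2, 75, 256) for _ in range(3))
    print(fusion(rgb, flow, depth).shape)  # torch.Size([2, 75, 256])
```

A query refinement module could analogously pool word-, phrase-, and sentence-level text embeddings before interacting with the fused video features, but that part is omitted from this sketch.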
Related papers
- MMMORRF: Multimodal Multilingual Modularized Reciprocal Rank Fusion [44.45109614673675]
We create a search system that extracts text and features from both visual and audio modalities.
MMMORRF is both effective and efficient, demonstrating practicality in searching videos based on users' information needs.
arXiv Detail & Related papers (2025-03-26T16:28:04Z)
- Recurrence-Enhanced Vision-and-Language Transformers for Robust Multimodal Document Retrieval [44.008094698200026]
Cross-modal retrieval is gaining increasing efficacy and interest from the research community.
In this paper, we design an approach that allows for multimodal queries composed of both an image and a text.
Our model, ReT, employs multi-level representations extracted from different layers of both visual and textual backbones.
arXiv Detail & Related papers (2025-03-03T19:01:17Z)
- Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences.
We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries.
We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularities.
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
- Dynamic Multimodal Fusion via Meta-Learning Towards Micro-Video Recommendation [97.82707398481273]
We develop a novel meta-learning-based multimodal fusion framework called Meta Multimodal Fusion (MetaMMF). Based on the meta information extracted from the multimodal features of the input task, MetaMMF parameterizes a neural network as the item-specific fusion function via a meta learner. We perform extensive experiments on three benchmark datasets, demonstrating significant improvements over several state-of-the-art multimodal recommendation models.
arXiv Detail & Related papers (2025-01-13T07:51:43Z)
- VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval [8.908777234657046]
Large language and vision-language models (LLMs/LVLMs) have gained prominence across various domains. Here we propose VideoLights, a novel HD/MR framework addressing these limitations through (i) Convolutional Projection and Feature Refinement modules. Comprehensive experiments on the QVHighlights, TVSum, and Charades-STA benchmarks demonstrate state-of-the-art performance.
arXiv Detail & Related papers (2024-12-02T14:45:53Z)
- PoIFusion: Multi-Modal 3D Object Detection via Fusion at Points of Interest [65.48057241587398]
PoIFusion is a framework to fuse information from RGB images and LiDAR point clouds at the points of interest (PoIs).
Our approach maintains the view of each modality and obtains multi-modal features by computation-friendly projection and computation.
We conducted extensive experiments on nuScenes and Argoverse2 datasets to evaluate our approach.
arXiv Detail & Related papers (2024-03-14T09:28:12Z)
- M2HF: Multi-level Multi-modal Hybrid Fusion for Text-Video Retrieval [34.343617836027725]
We propose a multi-level multi-modal hybrid fusion network to explore comprehensive interactions between text queries and each modality content in videos.
Our framework provides two kinds of training strategies, including an ensemble manner and an end-to-end manner.
arXiv Detail & Related papers (2022-08-16T10:51:37Z)
- Dual Swin-Transformer based Mutual Interactive Network for RGB-D Salient Object Detection [67.33924278729903]
In this work, we propose a Dual Swin-Transformer based Mutual Interactive Network (DTMINet).
We adopt the Swin-Transformer as the feature extractor for both the RGB and depth modalities to model long-range dependencies in the visual inputs.
Comprehensive experiments on five standard RGB-D SOD benchmark datasets demonstrate the superiority of the proposed DTMINet method.
arXiv Detail & Related papers (2022-06-07T08:35:41Z)
- Multilevel Hierarchical Network with Multiscale Sampling for Video Question Answering [16.449212284367366]
We propose a novel Multilevel Hierarchical Network (MHN) with multiscale sampling for VideoQA.
MHN comprises two modules, namely Recurrent Multimodal Interaction (RMI) and Parallel Visual Reasoning (PVR).
With multiscale sampling, RMI iterates the interaction between the appearance-motion information at each scale and the question embeddings to build multilevel question-guided visual representations.
PVR infers the visual cues at each level in parallel to answer different question types that may rely on visual information at different levels.
arXiv Detail & Related papers (2022-05-09T06:28:56Z)
- Hybrid Transformer with Multi-level Fusion for Multimodal Knowledge Graph Completion [112.27103169303184]
Multimodal Knowledge Graphs (MKGs) organize visual-text factual knowledge.
MKGformer achieves SOTA performance on four datasets for multimodal link prediction, multimodal RE, and multimodal NER.
arXiv Detail & Related papers (2022-05-04T23:40:04Z)
- M2RNet: Multi-modal and Multi-scale Refined Network for RGB-D Salient Object Detection [1.002712867721496]
Methods based on RGB-D often suffer from the incompatibility of multi-modal feature fusion and the insufficiency of multi-scale feature aggregation.
We propose a novel multi-modal and multi-scale refined network (M2RNet).
Three essential components are presented in this network.
arXiv Detail & Related papers (2021-09-16T12:15:40Z)
- End-to-end Multi-modal Video Temporal Grounding [105.36814858748285]
We propose a multi-modal framework to extract complementary information from videos.
We adopt RGB images for appearance, optical flow for motion, and depth maps for image structure.
We conduct experiments on the Charades-STA and ActivityNet Captions datasets, and show that the proposed method performs favorably against state-of-the-art approaches.
arXiv Detail & Related papers (2021-07-12T17:58:10Z)