M2HF: Multi-level Multi-modal Hybrid Fusion for Text-Video Retrieval
- URL: http://arxiv.org/abs/2208.07664v1
- Date: Tue, 16 Aug 2022 10:51:37 GMT
- Title: M2HF: Multi-level Multi-modal Hybrid Fusion for Text-Video Retrieval
- Authors: Shuo Liu, Weize Quan, Ming Zhou, Sihong Chen, Jian Kang, Zhe Zhao,
Chen Chen, Dong-Ming Yan
- Abstract summary: We propose a multi-level multi-modal hybrid fusion network to explore comprehensive interactions between text queries and each modality content in videos.
Our framework provides two kinds of training strategies, including an ensemble manner and an end-to-end manner.
- Score: 34.343617836027725
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Videos contain multi-modal content, and exploring multi-level
cross-modal interactions with natural language queries can greatly benefit the
text-video retrieval (TVR) task. However, recent methods that apply the
large-scale pre-trained model CLIP to TVR do not focus on the multi-modal cues
in videos. Furthermore, traditional methods that simply concatenate multi-modal
features do not exploit fine-grained cross-modal information in videos. In this
paper, we propose a multi-level multi-modal hybrid fusion (M2HF) network to
explore comprehensive interactions between text queries and each modality
content in videos. Specifically, M2HF first utilizes visual features extracted
by CLIP to early fuse with audio and motion features extracted from videos,
obtaining audio-visual fusion features and motion-visual fusion features
respectively. The multi-modal alignment problem is also considered in this process.
Then, visual features, audio-visual fusion features, motion-visual fusion
features, and texts extracted from videos establish cross-modal relationships
with caption queries in a multi-level way. Finally, the retrieval outputs from
all levels are late fused to obtain final text-video retrieval results. Our
framework provides two kinds of training strategies, including an ensemble
manner and an end-to-end manner. Moreover, a novel multi-modal balance loss
function is proposed to balance the contributions of each modality for
efficient end-to-end training. M2HF achieves state-of-the-art results on
various benchmarks, e.g., Rank@1 of 64.9%, 68.2%, 33.2%, 57.1%, and 57.8% on
MSR-VTT, MSVD, LSMDC, DiDeMo, and ActivityNet, respectively.
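The abstract sketches the pipeline at a high level: CLIP visual features are early-fused with audio and motion features, each resulting stream is matched against the text query at its own level, and the per-level retrieval scores are late-fused, with a balance loss weighting the per-modality contributions during end-to-end training. The PyTorch-style sketch below only illustrates that flow; the module names, feature dimensions, mean-pooling, and the exact form of the balance loss are assumptions rather than the paper's implementation.

```python
# A minimal sketch of the multi-level hybrid fusion idea, under assumed
# module names, dimensions, and loss form (not the paper's implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

def sim_matrix(text, video):
    """Cosine-similarity matrix between all text and all video embeddings."""
    return F.normalize(text, dim=-1) @ F.normalize(video, dim=-1).t()

class EarlyFusion(nn.Module):
    """Early-fuse CLIP visual features with another modality stream."""
    def __init__(self, dim=512):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, visual, other):
        # visual, other: (batch, frames, dim), assumed temporally aligned
        return self.proj(torch.cat([visual, other], dim=-1))

class M2HFSketch(nn.Module):
    """Levels: visual, audio-visual, motion-visual, and video text."""
    def __init__(self, dim=512):
        super().__init__()
        self.av_fusion = EarlyFusion(dim)  # audio-visual early fusion
        self.mv_fusion = EarlyFusion(dim)  # motion-visual early fusion

    def forward(self, text, visual, audio, motion, video_text):
        # text, video_text: (batch, dim); visual/audio/motion: (batch, frames, dim)
        av = self.av_fusion(visual, audio).mean(dim=1)   # mean-pool over frames
        mv = self.mv_fusion(visual, motion).mean(dim=1)
        v = visual.mean(dim=1)
        levels = [sim_matrix(text, x) for x in (v, av, mv, video_text)]
        late_fused = torch.stack(levels).mean(dim=0)     # late fusion of levels
        return levels, late_fused

def balance_loss(levels, weights):
    """Weighted sum of per-level contrastive losses (balance-loss stand-in)."""
    labels = torch.arange(levels[0].size(0), device=levels[0].device)
    per_level = torch.stack([F.cross_entropy(s, labels) for s in levels])
    return (F.softmax(weights, dim=0) * per_level).sum()
```

Here `weights` would be a small learnable parameter (one entry per level), softmax-normalized so that no single modality dominates the end-to-end objective; the paper's actual balance loss may take a different form.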
Related papers
- MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions [69.9122231800796]
We present MMTrail, a large-scale multi-modality video-language dataset incorporating more than 20M trailer clips with visual captions.
We propose a systemic captioning framework, achieving various modality annotations with more than 27.1k hours of trailer videos.
Our dataset potentially paves the path for fine-grained large multimodal-language model training.
arXiv Detail & Related papers (2024-07-30T16:43:24Z)
- VIMI: Grounding Video Generation through Multi-modal Instruction [89.90065445082442]
Existing text-to-video diffusion models rely solely on text-only encoders for their pretraining.
We construct a large-scale multimodal prompt dataset by employing retrieval methods to pair in-context examples with the given text prompts.
We finetune the model from the first stage on three video generation tasks, incorporating multi-modal instructions.
arXiv Detail & Related papers (2024-07-08T18:12:49Z)
- Browse and Concentrate: Comprehending Multimodal Content via prior-LLM Context Fusion [70.9767518332692]
Multimodal Large Language Models (MLLMs) that incorporate LLMs with pre-trained vision models have recently demonstrated impressive performance across diverse vision-language tasks.
However, they fall short of comprehending contexts involving multiple images.
We propose a two-phase paradigm, browse-and-concentrate, to enable in-depth multimodal context fusion.
arXiv Detail & Related papers (2024-02-19T14:59:07Z)
- Emu: Generative Pretraining in Multimodality [43.759593451544546]
Emu is a Transformer-based multimodal foundation model that can seamlessly generate images and texts in a multimodal context.
Emu can serve as a generalist multimodal interface for both image-to-text and text-to-image tasks.
Emu demonstrates superb performance compared to state-of-the-art large multimodal models.
arXiv Detail & Related papers (2023-07-11T12:45:39Z)
- mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video [89.19867891570945]
mPLUG-2 is a new unified paradigm with modularized design for multi-modal pretraining.
It shares common universal modules for modality collaboration and disentangles different modality modules to deal with modality entanglement.
It is flexible to select different modules for different understanding and generation tasks across all modalities including text, image, and video.
arXiv Detail & Related papers (2023-02-01T12:40:03Z)
- Multimodal Frame-Scoring Transformer for Video Summarization [4.266320191208304]
The Multimodal Frame-Scoring Transformer (MFST) framework exploits visual, text, and audio features to score a video with respect to its frames.
The MFST framework first extracts the features of each modality (visual, text, audio) using pretrained encoders.
It then trains the multimodal frame-scoring transformer, which takes the video-text-audio representations as input and predicts frame-level scores.
arXiv Detail & Related papers (2022-07-05T05:14:15Z)
- Multilevel Hierarchical Network with Multiscale Sampling for Video Question Answering [16.449212284367366]
We propose a novel Multilevel Hierarchical Network (MHN) with multiscale sampling for VideoQA.
MHN comprises two modules, namely Recurrent Multimodal Interaction (RMI) and Parallel Visual Reasoning (PVR).
With multiscale sampling, RMI iterates the interaction between the appearance-motion information at each scale and the question embeddings to build multilevel question-guided visual representations.
PVR infers the visual cues at each level in parallel to fit with answering different question types that may rely on the visual information at relevant levels.
arXiv Detail & Related papers (2022-05-09T06:28:56Z)
- MHMS: Multimodal Hierarchical Multimedia Summarization [80.18786847090522]
We propose a multimodal hierarchical multimedia summarization (MHMS) framework that models interactions between the visual and language domains.
Our method contains video and textual segmentation and summarization modules.
It formulates a cross-domain alignment objective with an optimal transport distance to generate the representative and textual summary (a generic Sinkhorn sketch follows this list).
arXiv Detail & Related papers (2022-04-07T21:00:40Z)
- Self-Supervised MultiModal Versatile Networks [76.19886740072808]
We learn representations using self-supervision by leveraging three modalities naturally present in videos: visual, audio and language streams.
We demonstrate how such networks trained on large collections of unlabelled video data can be applied to video, video-text, image, and audio tasks.
arXiv Detail & Related papers (2020-06-29T17:50:23Z)
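The MHMS entry above mentions a cross-domain alignment objective based on an optimal transport distance between the video and text sides. The snippet below is a minimal, generic entropic-OT (Sinkhorn) sketch of such a distance; the cost construction, uniform marginals, and hyperparameters are illustrative assumptions, not the MHMS implementation.

```python
# Generic entropy-regularized optimal transport (Sinkhorn) cost between two
# sets of embeddings; an illustrative stand-in, not the MHMS implementation.
import torch

def sinkhorn_ot_distance(cost, eps=0.1, n_iters=50):
    """Entropic OT cost for an (n, m) cost matrix, e.g. 1 - cosine
    similarity between video-segment and sentence embeddings."""
    n, m = cost.shape
    mu = torch.full((n,), 1.0 / n)            # uniform mass on segments
    nu = torch.full((m,), 1.0 / m)            # uniform mass on sentences
    K = torch.exp(-cost / eps)                # Gibbs kernel
    u = torch.ones_like(mu)
    for _ in range(n_iters):                  # Sinkhorn iterations
        v = nu / (K.t() @ u)
        u = mu / (K @ v)
    transport = u.unsqueeze(1) * K * v.unsqueeze(0)   # optimal coupling
    return (transport * cost).sum()           # regularized transport cost
```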