Q2E: Query-to-Event Decomposition for Zero-Shot Multilingual Text-to-Video Retrieval
- URL: http://arxiv.org/abs/2506.10202v1
- Date: Wed, 11 Jun 2025 21:37:54 GMT
- Title: Q2E: Query-to-Event Decomposition for Zero-Shot Multilingual Text-to-Video Retrieval
- Authors: Shubhashis Roy Dipta, Francis Ferraro
- Abstract summary: Recent approaches have shown impressive proficiency in extracting and leveraging parametric knowledge from Large Language Models (LLMs) and Vision-Language Models (VLMs). We present Q2E: a Query-to-Event decomposition method for zero-shot multilingual text-to-video retrieval.
- Score: 9.230429417848393
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent approaches have shown impressive proficiency in extracting and leveraging parametric knowledge from Large Language Models (LLMs) and Vision-Language Models (VLMs). In this work, we consider how we can improve the identification and retrieval of videos related to complex real-world events by automatically extracting latent parametric knowledge about those events. We present Q2E: a Query-to-Event decomposition method for zero-shot multilingual text-to-video retrieval, adaptable across datasets, domains, LLMs, or VLMs. Our approach demonstrates that we can enhance the understanding of otherwise overly simplified human queries by decomposing the query using the knowledge embedded in LLMs and VLMs. We additionally show how to apply our approach to both visual and speech-based inputs. To combine this varied multimodal knowledge, we adopt entropy-based fusion scoring for zero-shot fusion. Through evaluations on two diverse datasets and multiple retrieval metrics, we demonstrate that Q2E outperforms several state-of-the-art baselines. Our evaluation also shows that integrating audio information can significantly improve text-to-video retrieval. We have released code and data for future research.
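The abstract names entropy-based fusion scoring but does not give its form. Below is a minimal sketch of one plausible reading, assuming each modality (e.g., visual and speech) yields a similarity score per candidate video and that a peaked, low-entropy score distribution signals a confident modality; the softmax normalization and the `1 - H/H_max` weight are assumptions, not the authors' released implementation.

```python
import numpy as np

def entropy_weight(scores: np.ndarray) -> float:
    """Confidence of one modality: a peaked (low-entropy) softmax over the
    candidate videos gets a weight near 1, a flat one near 0."""
    p = np.exp(scores - scores.max())
    p /= p.sum()
    h = -(p * np.log(p + 1e-12)).sum()
    return 1.0 - h / np.log(len(p))  # assumes >= 2 candidate videos

def entropy_fusion(modality_scores: dict[str, np.ndarray]) -> np.ndarray:
    """Zero-shot fusion: weight each modality's query-video scores by its
    entropy-based confidence, then sum."""
    weights = {m: entropy_weight(s) for m, s in modality_scores.items()}
    total = sum(weights.values()) + 1e-12
    return sum((w / total) * modality_scores[m] for m, w in weights.items())

# Hypothetical scores for 4 candidate videos from two modalities.
visual = np.array([0.71, 0.12, 0.30, 0.25])  # confident -> high weight
speech = np.array([0.40, 0.38, 0.41, 0.39])  # near-uniform -> low weight
ranking = np.argsort(-entropy_fusion({"visual": visual, "speech": speech}))
```

Under this reading, the near-uniform speech scores contribute little, so the fused ranking is driven by the more confident visual modality.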
Related papers
- MMMORRF: Multimodal Multilingual Modularized Reciprocal Rank Fusion [43.725594356981254]
We create a search system that extracts text and features from both visual and audio modalities.
MMMORRF is both effective and efficient, demonstrating practicality in searching videos based on users' information needs.
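The title's Reciprocal Rank Fusion is a standard zero-training technique for combining ranked lists; a minimal generic sketch follows, with `k = 60` being the conventional default from the original RRF formulation rather than a value confirmed by this paper.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Generic RRF: each modality's ranked list of video IDs contributes
    1 / (k + rank) to every video it contains."""
    scores: dict[str, float] = {}
    for ranked in rankings:
        for rank, vid in enumerate(ranked, start=1):
            scores[vid] = scores.get(vid, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical per-modality rankings (e.g., visual text vs. audio transcript).
fused = reciprocal_rank_fusion([["v3", "v1", "v2"], ["v1", "v3", "v4"]])
```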
arXiv Detail & Related papers (2025-03-26T16:28:04Z)
- Harnessing Large Language Models for Knowledge Graph Question Answering via Adaptive Multi-Aspect Retrieval-Augmentation [81.18701211912779]
We introduce an Adaptive Multi-Aspect Retrieval-augmented over KGs (Amar) framework.
This method retrieves knowledge including entities, relations, and subgraphs, and converts each piece of retrieved text into prompt embeddings.
Our method has achieved state-of-the-art performance on two common datasets.
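As a rough illustration of converting retrieved text into prompt embeddings, soft prompting usually projects retrieved-text encodings into the LLM's embedding space and prepends them to the question; the module below is a hypothetical sketch, not Amar's actual architecture.

```python
import torch
import torch.nn as nn

class RetrievalSoftPrompt(nn.Module):
    """Hypothetical module: project encodings of retrieved KG text into the
    LLM's embedding space and prepend them to the question embeddings."""
    def __init__(self, enc_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(enc_dim, llm_dim)

    def forward(self, retrieved_enc: torch.Tensor,  # (n_pieces, enc_dim)
                question_emb: torch.Tensor          # (q_len, llm_dim)
                ) -> torch.Tensor:
        prompts = self.proj(retrieved_enc)          # (n_pieces, llm_dim)
        return torch.cat([prompts, question_emb], dim=0)  # input to frozen LLM
```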
arXiv Detail & Related papers (2024-12-24T16:38:04Z)
- RoRA-VLM: Robust Retrieval-Augmented Vision Language Models [41.09545760534495]
RoRA-VLM is a novel and robust retrieval augmentation framework specifically tailored for vision-language models.
We conduct extensive experiments to validate the effectiveness and robustness of our proposed methods on three widely adopted benchmark datasets.
arXiv Detail & Related papers (2024-10-11T14:51:00Z)
- Bridging Information Asymmetry in Text-video Retrieval: A Data-centric Approach [56.610806615527885]
A key challenge in text-video retrieval (TVR) is the information asymmetry between video and text.
This paper introduces a data-centric framework to bridge this gap by enriching textual representations to better match the richness of video content.
We propose a query selection mechanism that identifies the most relevant and diverse queries, reducing computational cost while improving accuracy.
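The summary does not spell out the selection criterion; maximal marginal relevance (MMR) is one standard relevance-versus-diversity trade-off and is sketched here purely as an illustration, not as the paper's mechanism.

```python
import numpy as np

def mmr_select(query_embs: np.ndarray, video_emb: np.ndarray,
               k: int = 5, lam: float = 0.7) -> list[int]:
    """Greedy MMR: pick queries relevant to the video yet not redundant
    with the queries already selected."""
    relevance = query_embs @ video_emb
    selected: list[int] = []
    candidates = list(range(len(query_embs)))
    while candidates and len(selected) < k:
        if selected:
            redundancy = (query_embs[candidates] @ query_embs[selected].T).max(axis=1)
        else:
            redundancy = np.zeros(len(candidates))
        mmr = lam * relevance[candidates] - (1 - lam) * redundancy
        best = candidates[int(np.argmax(mmr))]
        selected.append(best)
        candidates.remove(best)
    return selected
```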
arXiv Detail & Related papers (2024-08-14T01:24:09Z)
- Multi-Modal Retrieval For Large Language Model Based Speech Recognition [15.494654232953678]
We propose multi-modal retrieval with two approaches: kNN-LM and cross-attention.
We show that speech-based multi-modal retrieval outperforms text-based retrieval.
We achieve state-of-the-art recognition results on the Spoken-Squad question answering dataset.
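kNN-LM here refers to the interpolation scheme of Khandelwal et al. (2020): mix the LM's next-token distribution with one induced by nearest neighbors in a datastore. A minimal sketch follows; using audio-derived context embeddings as datastore keys is an assumption about the multi-modal adaptation, not the paper's confirmed design.

```python
import numpy as np

def knn_lm_interpolate(p_lm: np.ndarray, query: np.ndarray,
                       keys: np.ndarray, values: np.ndarray,
                       vocab_size: int, lam: float = 0.25,
                       temperature: float = 1.0) -> np.ndarray:
    """Classic kNN-LM mixing: p = lam * p_knn + (1 - lam) * p_lm, where
    p_knn puts mass on the next-token ids stored with the nearest keys.
    Here `keys` would be (possibly audio-derived) context embeddings."""
    dists = np.linalg.norm(keys - query, axis=1)   # distance to each datastore key
    w = np.exp(-dists / temperature)               # closer neighbors weigh more
    p_knn = np.zeros(vocab_size)
    np.add.at(p_knn, values, w)                    # aggregate weight per token id
    p_knn /= p_knn.sum() + 1e-12
    return lam * p_knn + (1.0 - lam) * p_lm
```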
arXiv Detail & Related papers (2024-06-13T22:55:22Z)
- VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings.
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
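A minimal sketch of CLIP-score-guided frame selection using the public OpenAI CLIP package; the text used for scoring and VaQuitA's exact sampler are not specified in this summary, so treat this as an illustrative stand-in for uniform sampling.

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

def top_k_frames(frames: list[Image.Image], text: str, k: int = 8) -> list[int]:
    """Score every frame against the text with CLIP and keep the k best,
    as an alternative to uniform frame sampling."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)
    with torch.no_grad():
        images = torch.stack([preprocess(f) for f in frames]).to(device)
        img = model.encode_image(images)
        txt = model.encode_text(clip.tokenize([text]).to(device))
        img = img / img.norm(dim=-1, keepdim=True)
        txt = txt / txt.norm(dim=-1, keepdim=True)
        sims = (img @ txt.T).squeeze(1)            # one similarity per frame
    return sims.topk(min(k, len(frames))).indices.tolist()
```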
arXiv Detail & Related papers (2023-12-04T19:48:02Z)
- Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models [69.59125732317972]
We propose a simple yet effective Retrieving-to-Answer (R2A) framework for VideoQA.
R2A first retrieves a set of semantically similar texts from a generic text corpus using a pre-trained multi-modal model.
With both the question and the retrieved texts, an LLM can be directly used to yield a desired answer.
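A minimal retrieve-then-answer flow in the spirit of R2A, assuming pre-computed, L2-normalized embeddings in a shared multi-modal space and an arbitrary text-in/text-out LLM callable; the prompt format is an assumption.

```python
import numpy as np

def retrieve_to_answer(video_emb: np.ndarray, corpus_texts: list[str],
                       corpus_embs: np.ndarray, question: str,
                       llm_generate, k: int = 5) -> str:
    """Retrieve the k corpus texts closest to the video in the shared
    embedding space, then let a frozen LLM answer from them."""
    sims = corpus_embs @ video_emb                 # cosine sims if pre-normalized
    top = np.argsort(-sims)[:k]
    context = "\n".join(corpus_texts[i] for i in top)
    prompt = (f"Descriptions of the video:\n{context}\n\n"
              f"Question: {question}\nAnswer:")
    return llm_generate(prompt)                    # llm_generate: str -> str
```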
arXiv Detail & Related papers (2023-06-15T20:56:20Z)
- End-to-end Knowledge Retrieval with Multi-modal Queries [50.01264794081951]
ReMuQ requires a system to retrieve knowledge from a large corpus by integrating content from both text and image queries.
We introduce a retriever model, ReViz, that can directly process input text and images to retrieve relevant knowledge in an end-to-end fashion.
We demonstrate superior performance in retrieval on two datasets under zero-shot settings.
arXiv Detail & Related papers (2023-06-01T08:04:12Z)
- Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval [70.30052749168013]
Multi-channel video-language retrieval requires models to understand information from different channels.
Contrastive multimodal models are shown to be highly effective at aligning entities in images/videos and text.
There is no clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources.
arXiv Detail & Related papers (2022-06-05T01:43:52Z)
- VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation [124.02278735049235]
The VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels.
We evaluate various baseline methods with and without large-scale VidL pre-training.
The significant gap between our best model and human performance calls for future study on advanced VidL models.
arXiv Detail & Related papers (2021-06-08T18:34:21Z)