Learning to Retrieve Videos by Asking Questions
- URL: http://arxiv.org/abs/2205.05739v2
- Date: Fri, 13 May 2022 16:39:43 GMT
- Title: Learning to Retrieve Videos by Asking Questions
- Authors: Avinash Madasu, Junier Oliva, Gedas Bertasius
- Abstract summary: We propose a novel framework for Video Retrieval using Dialog (ViReD), which enables the user to interact with an AI agent via multiple rounds of dialog.
The key contribution of our framework is a novel multimodal question generator that learns to ask questions that maximize the subsequent video retrieval performance.
We validate the effectiveness of our interactive ViReD framework on the AVSD dataset, showing that our interactive method performs significantly better than traditional non-interactive video retrieval systems.
- Score: 29.046045230398708
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The majority of traditional text-to-video retrieval systems operate in static
environments, i.e., there is no interaction between the user and the agent
beyond the initial textual query provided by the user. This can be suboptimal
if the initial query is ambiguous, which can lead to many incorrectly
retrieved videos. To overcome this limitation, we propose a novel framework for
Video Retrieval using Dialog (ViReD), which enables the user to interact with
an AI agent via multiple rounds of dialog. The key contribution of our
framework is a novel multimodal question generator that learns to ask questions
that maximize the subsequent video retrieval performance. Our multimodal
question generator uses (i) the video candidates retrieved during the last
round of interaction with the user and (ii) the text-based dialog history
documenting all previous interactions, to generate questions that incorporate
both visual and linguistic cues relevant to video retrieval. Furthermore, to
generate maximally informative questions, we propose Information-Guided
Supervision (IGS), which guides the question generator to ask questions that
would boost subsequent video retrieval accuracy. We validate the effectiveness
of our interactive ViReD framework on the AVSD dataset, showing that our
interactive method performs significantly better than traditional
non-interactive video retrieval systems. Furthermore, we demonstrate that our
approach also generalizes to real-world settings involving interactions with
real humans, which underscores the robustness and generality of our framework.
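Below is a minimal, illustrative sketch (in Python) of the interactive retrieval loop and the IGS idea described in the abstract. The retriever, question generator, and answerer are placeholders for whatever models fill those roles, and the function names, the query-expansion rule, and the greedy IGS target selection are assumptions made for illustration rather than the authors' exact implementation.

```python
from typing import Callable, List, Sequence, Tuple

# Stand-in types: a "video" is any opaque object, a dialog turn is (question, answer) text.
Retriever = Callable[[str, Sequence[object]], List[object]]               # (query, corpus) -> ranked videos
QuestionGenerator = Callable[[List[object], List[Tuple[str, str]]], str]  # (top candidates, dialog) -> question
Answerer = Callable[[str], str]                                           # question -> answer (human or VideoQA model)


def interactive_retrieval(initial_query: str,
                          corpus: Sequence[object],
                          retrieve: Retriever,
                          generate_question: QuestionGenerator,
                          answer: Answerer,
                          rounds: int = 3,
                          top_k: int = 10) -> List[object]:
    """Run several rounds of dialog, refining the text query after each answer."""
    dialog: List[Tuple[str, str]] = []
    ranking = retrieve(initial_query, corpus)
    for _ in range(rounds):
        candidates = ranking[:top_k]
        question = generate_question(candidates, dialog)  # uses visual + dialog cues
        reply = answer(question)                          # user (or a simulator) answers
        dialog.append((question, reply))
        # Fold the new dialog turn back into the query and re-retrieve.
        query = initial_query + " " + " ".join(q + " " + a for q, a in dialog)
        ranking = retrieve(query, corpus)
    return ranking


def igs_question_target(candidate_questions: List[str],
                        oracle_answer: Answerer,
                        query: str,
                        corpus: Sequence[object],
                        target_video: object,
                        retrieve: Retriever) -> str:
    """Information-Guided Supervision (sketch): among candidate questions, pick the
    one whose (oracle) answer most improves the rank of the ground-truth video,
    and use it as the training target for the question generator."""
    def rank_of_target(q: str) -> int:
        expanded = query + " " + q + " " + oracle_answer(q)
        return retrieve(expanded, corpus).index(target_video)
    return min(candidate_questions, key=rank_of_target)
```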
Related papers
- GQE: Generalized Query Expansion for Enhanced Text-Video Retrieval [56.610806615527885]
This paper introduces a novel data-centric approach, Generalized Query Expansion (GQE), to address the inherent information imbalance between text and video.
By adaptively segmenting videos into short clips and employing zero-shot captioning, GQE enriches the training dataset with comprehensive scene descriptions.
GQE achieves state-of-the-art performance on several benchmarks, including MSR-VTT, MSVD, LSMDC, and VATEX.
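As a rough sketch of the data-centric expansion GQE describes: each training video is cut into short clips, a zero-shot captioner describes every clip, and the captions are added as extra text-video training pairs. The fixed clip length and the `caption` placeholder below are illustrative assumptions; the paper segments clips adaptively.

```python
from typing import Callable, List, Sequence, Tuple

Frame = object                                  # placeholder for a decoded video frame
Captioner = Callable[[Sequence[Frame]], str]    # zero-shot captioning model


def expand_training_pairs(video_id: str,
                          frames: Sequence[Frame],
                          fps: float,
                          caption: Captioner,
                          clip_seconds: float = 4.0) -> List[Tuple[str, str]]:
    """Return extra (caption text, video_id) pairs, one per short clip,
    to enrich the text side of the training set."""
    clip_len = max(1, int(clip_seconds * fps))
    pairs: List[Tuple[str, str]] = []
    for start in range(0, len(frames), clip_len):
        clip = frames[start:start + clip_len]
        pairs.append((caption(clip), video_id))
    return pairs
```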
arXiv Detail & Related papers (2024-08-14T01:24:09Z) - iRAG: Advancing RAG for Videos with an Incremental Approach [3.486835161875852]
One-time, upfront conversion of all content in a large corpus of videos into text descriptions entails high processing times.
We propose an incremental RAG system called iRAG, which augments RAG with a novel incremental workflow to enable interactive querying of video data.
iRAG is the first system to augment RAG with an incremental workflow to support efficient interactive querying of a large corpus of videos.
arXiv Detail & Related papers (2024-04-18T16:38:02Z) - VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings.
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
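A minimal sketch of CLIP-score-guided frame sampling of the kind VaQuitA describes: score each frame against the text with a CLIP-style dual encoder and keep the highest-ranked frames instead of a uniform subset. The encoder placeholders and the plain top-k rule are assumptions; the paper's exact ranking and sampling details may differ.

```python
import numpy as np
from typing import Callable, List, Sequence

ImageEncoder = Callable[[object], np.ndarray]   # frame -> unit-norm CLIP image embedding
TextEncoder = Callable[[str], np.ndarray]       # text  -> unit-norm CLIP text embedding


def sample_frames_by_clip_score(frames: Sequence[object],
                                text: str,
                                embed_image: ImageEncoder,
                                embed_text: TextEncoder,
                                num_keep: int = 8) -> List[int]:
    """Return indices of the frames most similar to the text, kept in temporal order."""
    t = embed_text(text)
    scores = np.array([float(embed_image(f) @ t) for f in frames])  # cosine similarity (unit-norm)
    keep = np.argsort(-scores)[:num_keep]                           # highest CLIP scores
    return sorted(int(i) for i in keep)                             # preserve temporal order
```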
arXiv Detail & Related papers (2023-12-04T19:48:02Z) - Social Commonsense-Guided Search Query Generation for Open-Domain
Knowledge-Powered Conversations [66.16863141262506]
We present a novel approach that focuses on generating internet search queries guided by social commonsense.
Our proposed framework addresses passive user interactions by integrating topic tracking, commonsense response generation and instruction-driven query generation.
arXiv Detail & Related papers (2023-10-22T16:14:56Z) - Uncovering Hidden Connections: Iterative Search and Reasoning for Video-grounded Dialog [83.63849872250651]
Video-grounded dialog requires profound understanding of both dialog history and video content for accurate response generation.
We present an iterative search and reasoning framework, which consists of a textual encoder, a visual encoder, and a generator.
arXiv Detail & Related papers (2023-10-11T07:37:13Z) - Simple Baselines for Interactive Video Retrieval with Questions and
Answers [33.17722358007974]
We propose several simple yet effective baselines for interactive video retrieval via question-answering.
We employ a VideoQA model to simulate user interactions and show that this enables the productive study of the interactive retrieval task.
Experiments on MSR-VTT, MSVD, and AVSD show that our framework using question-based interaction significantly improves the performance of text-based video retrieval systems.
arXiv Detail & Related papers (2023-08-21T00:32:19Z) - BridgeFormer: Bridging Video-text Retrieval with Multiple Choice
Questions [38.843518809230524]
We introduce a novel pretext task dubbed Multiple Choice Questions (MCQ).
A module called BridgeFormer is trained to answer the "questions" constructed from the text features by resorting to the video features.
Through this question-and-answer formulation, the semantic associations between local video and text features can be properly established.
Our method outperforms state-of-the-art methods on the popular text-to-video retrieval task across five datasets.
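At a high level, the MCQ pretext task can be pictured as erasing a content phrase from the caption to form a "question" and answering it from the video. The toy sketch below shows only that construction plus a dot-product answer selection over candidate phrases; the encoder and bridge-module placeholders are assumptions, and the actual training uses a contrastive objective rather than this accuracy-style check.

```python
import numpy as np
from typing import Callable, List, Tuple

TextEncoder = Callable[[str], np.ndarray]            # phrase -> unit-norm embedding
AnswerModule = Callable[[str, object], np.ndarray]   # (question text, video) -> predicted answer embedding


def make_question(caption: str, phrase: str) -> Tuple[str, str]:
    """Erase one content phrase from the caption to form a 'question'; the phrase is the answer."""
    return caption.replace(phrase, "[?]", 1), phrase


def answer_accuracy(samples: List[Tuple[str, str, object]],   # (caption, erased phrase, video)
                    candidate_phrases: List[str],
                    encode_text: TextEncoder,
                    bridge: AnswerModule) -> float:
    """Fraction of questions whose predicted answer embedding is closest to the true phrase."""
    cand = np.stack([encode_text(p) for p in candidate_phrases])   # (num_candidates, dim)
    correct = 0
    for caption, phrase, video in samples:
        question, answer = make_question(caption, phrase)
        pred = bridge(question, video)                              # answer derived from video features
        choice = candidate_phrases[int(np.argmax(cand @ pred))]
        correct += int(choice == answer)
    return correct / max(1, len(samples))
```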
arXiv Detail & Related papers (2022-01-13T09:33:54Z) - IntentVizor: Towards Generic Query Guided Interactive Video
Summarization Using Slow-Fast Graph Convolutional Networks [2.5234156040689233]
IntentVizor is an interactive video summarization framework guided by generic multi-modality queries.
We represent user inputs as a set of intents and design a new interactive visual analytic interface around them.
arXiv Detail & Related papers (2021-09-30T03:44:02Z) - Look Before you Speak: Visually Contextualized Utterances [88.58909442073858]
We create a task for predicting utterances in a video using both visual frames and transcribed speech as context.
By exploiting the large number of instructional videos online, we train a model to solve this task at scale, without the need for manual annotations.
Our model achieves state-of-the-art performance on a number of downstream VideoQA benchmarks.
arXiv Detail & Related papers (2020-12-10T14:47:02Z) - Multimodal Transformer with Pointer Network for the DSTC8 AVSD Challenge [48.905496060794114]
We describe our submission to the AVSD track of the 8th Dialogue System Technology Challenge.
We adopt dot-product attention to combine text and non-text features of the input video.
Our systems achieve high performance in automatic metrics and obtain 5th and 6th place in human evaluation.
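The dot-product attention used to combine text and non-text features can be sketched as standard scaled dot-product attention in which text tokens attend over video (or audio) features; the single-head form and dimensions below are simplifications, not the submission's exact architecture.

```python
import numpy as np


def dot_product_attention(queries: np.ndarray,   # (num_text_tokens, d)  e.g. dialog/text features
                          keys: np.ndarray,      # (num_video_feats, d)  e.g. video/audio features
                          values: np.ndarray     # (num_video_feats, d)
                          ) -> np.ndarray:
    """Scaled dot-product attention: each text token gathers relevant video information."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)                      # (text, video)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)              # softmax over video positions
    return weights @ values                                     # (num_text_tokens, d)


# Example: fuse 5 text tokens with 12 video features of dimension 64.
text_feats = np.random.randn(5, 64)
video_feats = np.random.randn(12, 64)
fused = dot_product_attention(text_feats, video_feats, video_feats)
assert fused.shape == (5, 64)
```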
arXiv Detail & Related papers (2020-02-25T06:41:07Z)