Learning to Retrieve Videos by Asking Questions
- URL: http://arxiv.org/abs/2205.05739v2
- Date: Fri, 13 May 2022 16:39:43 GMT
- Title: Learning to Retrieve Videos by Asking Questions
- Authors: Avinash Madasu, Junier Oliva, Gedas Bertasius
- Abstract summary: We propose a novel framework for Video Retrieval using Dialog (ViReD), which enables the user to interact with an AI agent via multiple rounds of dialog.
The key contribution of our framework is a novel multimodal question generator that learns to ask questions that maximize the subsequent video retrieval performance.
We validate the effectiveness of our interactive ViReD framework on the AVSD dataset, showing that our interactive method performs significantly better than traditional non-interactive video retrieval systems.
- Score: 29.046045230398708
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The majority of traditional text-to-video retrieval systems operate in static
environments, i.e., there is no interaction between the user and the agent
beyond the initial textual query provided by the user. This can be suboptimal
if the initial query is ambiguous, leading to many incorrectly retrieved
videos. To overcome this limitation, we propose a novel framework for
Video Retrieval using Dialog (ViReD), which enables the user to interact with
an AI agent via multiple rounds of dialog. The key contribution of our
framework is a novel multimodal question generator that learns to ask questions
that maximize the subsequent video retrieval performance. Our multimodal
question generator uses (i) the video candidates retrieved during the last
round of interaction with the user and (ii) the text-based dialog history
documenting all previous interactions, to generate questions that incorporate
both visual and linguistic cues relevant to video retrieval. Furthermore, to
generate maximally informative questions, we propose an Information-Guided
Supervision (IGS), which guides the question generator to ask questions that
would boost subsequent video retrieval accuracy. We validate the effectiveness
of our interactive ViReD framework on the AVSD dataset, showing that our
interactive method performs significantly better than traditional
non-interactive video retrieval systems. Furthermore, we show that our
proposed approach generalizes to real-world settings involving interactions
with real humans, demonstrating the robustness and generality of our
framework.
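The abstract describes an iterative retrieve-ask-answer loop: retrieve candidate videos, generate a question from the candidates and the dialog history, collect an answer, and retrieve again. Below is a minimal sketch of one plausible reading of that loop; the component interfaces (retrieve_top_k, generate_question, answer_question) are hypothetical placeholders, not the authors' implementation.

```python
# Minimal sketch of the interactive retrieval loop described in the abstract.
# All components are assumed stand-ins for the paper's retriever, question
# generator, and user (or user simulator).
from typing import Callable, List


def interactive_video_retrieval(
    query: str,
    retrieve_top_k: Callable[[str, int], List[str]],     # dialog text -> ranked video IDs
    generate_question: Callable[[List[str], str], str],  # (candidates, history) -> question
    answer_question: Callable[[str], str],               # user or simulator answer
    num_rounds: int = 3,
    k: int = 10,
) -> List[str]:
    """Refine video retrieval over several rounds of question answering."""
    dialog_history = query
    candidates = retrieve_top_k(dialog_history, k)
    for _ in range(num_rounds):
        # The question generator conditions on (i) the currently retrieved
        # candidates and (ii) the text dialog history, as stated in the abstract.
        question = generate_question(candidates, dialog_history)
        answer = answer_question(question)
        dialog_history += f" Q: {question} A: {answer}"
        candidates = retrieve_top_k(dialog_history, k)
    return candidates
```

Information-Guided Supervision would then, roughly, select as the training target for generate_question whichever candidate question most improves retrieval of the ground-truth video in the next round; this reading is inferred from the abstract rather than from released code.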
Related papers
- Enhancing Multimodal Query Representation via Visual Dialogues for End-to-End Knowledge Retrieval [26.585985828583304]
We propose an end-to-end multimodal retrieval system, Ret-XKnow, to endow a text retriever with the ability to understand multimodal queries.
To effectively learn multimodal interaction, we also introduce the Visual Dialogue-to-Retrieval dataset automatically constructed from visual dialogue datasets.
We demonstrate that our approach not only significantly improves retrieval performance in zero-shot settings but also achieves substantial improvements in fine-tuning scenarios.
arXiv Detail & Related papers (2024-11-13T04:32:58Z)
- GQE: Generalized Query Expansion for Enhanced Text-Video Retrieval [56.610806615527885]
This paper introduces a novel data-centric approach, Generalized Query Expansion (GQE), to address the inherent information imbalance between text and video.
By adaptively segmenting videos into short clips and employing zero-shot captioning, GQE enriches the training dataset with comprehensive scene descriptions.
GQE achieves state-of-the-art performance on several benchmarks, including MSR-VTT, MSVD, LSMDC, and VATEX.
arXiv Detail & Related papers (2024-08-14T01:24:09Z)
- iRAG: Advancing RAG for Videos with an Incremental Approach [3.486835161875852]
One-time, upfront conversion of all content in a large corpus of videos into text descriptions entails high processing times.
We propose an incremental RAG system called iRAG, which augments RAG with a novel incremental workflow to enable interactive querying of video data.
iRAG is the first system to augment RAG with an incremental workflow to support efficient interactive querying of a large corpus of videos.
arXiv Detail & Related papers (2024-04-18T16:38:02Z)
- VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings (see the sketch after this list).
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
arXiv Detail & Related papers (2023-12-04T19:48:02Z)
- Social Commonsense-Guided Search Query Generation for Open-Domain Knowledge-Powered Conversations [66.16863141262506]
We present a novel approach that focuses on generating internet search queries guided by social commonsense.
Our proposed framework addresses passive user interactions by integrating topic tracking, commonsense response generation and instruction-driven query generation.
arXiv Detail & Related papers (2023-10-22T16:14:56Z)
- Uncovering Hidden Connections: Iterative Search and Reasoning for Video-grounded Dialog [83.63849872250651]
Video-grounded dialog requires profound understanding of both dialog history and video content for accurate response generation.
We present an iterative search and reasoning framework, which consists of a textual encoder, a visual encoder, and a generator.
arXiv Detail & Related papers (2023-10-11T07:37:13Z)
- Simple Baselines for Interactive Video Retrieval with Questions and Answers [33.17722358007974]
We propose several simple yet effective baselines for interactive video retrieval via question-answering.
We employ a VideoQA model to simulate user interactions and show that this enables the productive study of the interactive retrieval task.
Experiments on MSR-VTT, MSVD, and AVSD show that our framework using question-based interaction significantly improves the performance of text-based video retrieval systems.
arXiv Detail & Related papers (2023-08-21T00:32:19Z)
- BridgeFormer: Bridging Video-text Retrieval with Multiple Choice Questions [38.843518809230524]
We introduce a novel pretext task dubbed Multiple Choice Questions (MCQ).
A module, BridgeFormer, is trained to answer the "questions" constructed from the text features by drawing on the video features.
In the form of questions and answers, the semantic associations between local video-text features can be properly established.
Our method outperforms state-of-the-art methods on the popular text-to-video retrieval task across five datasets.
arXiv Detail & Related papers (2022-01-13T09:33:54Z)
- Look Before you Speak: Visually Contextualized Utterances [88.58909442073858]
We create a task for predicting utterances in a video using both visual frames and transcribed speech as context.
By exploiting the large number of instructional videos online, we train a model to solve this task at scale, without the need for manual annotations.
Our model achieves state-of-the-art performance on a number of downstream VideoQA benchmarks.
arXiv Detail & Related papers (2020-12-10T14:47:02Z)
- Multimodal Transformer with Pointer Network for the DSTC8 AVSD Challenge [48.905496060794114]
We describe our submission to the AVSD track of the 8th Dialogue System Technology Challenge.
We adopt dot-product attention to combine text and non-text features of the input video.
Our systems achieve high performance in automatic metrics and obtain 5th and 6th place in human evaluation.
arXiv Detail & Related papers (2020-02-25T06:41:07Z)
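The CLIP-score-guided frame sampling mentioned in the VaQuitA entry above can be sketched generically: score each decoded frame against the query text with a CLIP-style similarity function and keep the top-scoring frames rather than sampling uniformly. The clip_similarity scorer below is an assumed placeholder, not VaQuitA's actual pipeline.

```python
# Hypothetical sketch of CLIP-score-guided frame sampling (assumed reading).
from typing import Callable, List, Sequence

import numpy as np


def sample_frames_by_clip_score(
    frames: Sequence[np.ndarray],                          # decoded video frames
    query: str,
    clip_similarity: Callable[[np.ndarray, str], float],   # assumed CLIP-style scorer
    num_frames: int = 8,
) -> List[int]:
    """Return indices of the frames most relevant to the query text."""
    scores = np.array([clip_similarity(frame, query) for frame in frames])
    top = np.argsort(-scores)[:num_frames]    # highest-scoring frames first
    return sorted(top.tolist())               # restore temporal order
```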
This list is automatically generated from the titles and abstracts of the papers on this site.