Simple Baselines for Interactive Video Retrieval with Questions and Answers
- URL: http://arxiv.org/abs/2308.10402v1
- Date: Mon, 21 Aug 2023 00:32:19 GMT
- Title: Simple Baselines for Interactive Video Retrieval with Questions and Answers
- Authors: Kaiqu Liang, Samuel Albanie
- Score: 33.17722358007974
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To date, the majority of video retrieval systems have been optimized for a
"single-shot" scenario in which the user submits a query in isolation, ignoring
previous interactions with the system. Recently, there has been renewed
interest in interactive systems to enhance retrieval, but existing approaches
are complex and deliver limited gains in performance. In this work, we revisit
this topic and propose several simple yet effective baselines for interactive
video retrieval via question-answering. We employ a VideoQA model to simulate
user interactions and show that this enables the productive study of the
interactive retrieval task without access to ground truth dialogue data.
Experiments on MSR-VTT, MSVD, and AVSD show that our framework using
question-based interaction significantly improves the performance of text-based
video retrieval systems.
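The interaction loop the abstract describes can be made concrete with a short sketch: ask a question about the target video, obtain an answer from a simulated user, fold the answer into the query, and re-rank. The following is a minimal, hypothetical Python illustration, not the authors' implementation; `text_encoder`, `videoqa_answer`, the candidate questions, and all other names are stand-ins for a real text-video retrieval encoder, a VideoQA model acting as the simulated user, and a question-generation component.

```python
# Hypothetical sketch of question-based interactive retrieval with a
# simulated user, loosely following the abstract. `text_encoder` stands in
# for a real text-video retrieval encoder (e.g. a CLIP-style model) and
# `videoqa_answer` for the VideoQA model that simulates the user's answers.
import numpy as np

def text_encoder(text: str, dim: int = 256) -> np.ndarray:
    """Toy hash-based bag-of-words embedding standing in for a real encoder."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def videoqa_answer(question: str, target_caption: str) -> str:
    """Simulated user: answers a question about the target video.
    Faked here by returning question words that appear in the caption."""
    overlap = set(question.lower().split()) & set(target_caption.lower().split())
    return " ".join(sorted(overlap)) if overlap else "unknown"

def interactive_retrieval(query, captions, target_idx, questions):
    """One round per question: ask, get a simulated answer, fold the answer
    into the query, re-rank. Returns the target's rank after each round."""
    video_embs = np.stack([text_encoder(c) for c in captions])
    ranks = []
    for question in questions:
        answer = videoqa_answer(question, captions[target_idx])
        query = f"{query} {answer}"            # augment the query
        scores = video_embs @ text_encoder(query)
        order = np.argsort(-scores)
        ranks.append(int(np.where(order == target_idx)[0][0]) + 1)
    return ranks

if __name__ == "__main__":
    captions = [
        "a dog catches a frisbee in a park",
        "a chef chops onions in a kitchen",
        "a dog swims across a lake",
    ]
    print(interactive_retrieval(
        query="a dog playing",
        captions=captions,
        target_idx=0,
        questions=["where is the dog playing?", "what does the dog catch?"],
    ))
```

In the paper's setting, the questions would come from a question-generation component and the answers from a VideoQA model run on the ground-truth video; that substitution is what allows the interactive task to be studied without collected dialogue data.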
Related papers
- VideoRAG: Retrieval-Augmented Generation over Video Corpus [57.68536380621672]
VideoRAG is a novel framework that dynamically retrieves videos based on their relevance to the query.
We experimentally validate the effectiveness of VideoRAG, showing that it outperforms relevant baselines.
arXiv Detail & Related papers (2025-01-10T11:17:15Z)
- Query-centric Audio-Visual Cognition Network for Moment Retrieval, Segmentation and Step-Captioning [56.873534081386]
The HIREST benchmark is presented, covering video retrieval, moment retrieval, moment segmentation, and step-captioning.
We propose a query-centric audio-visual cognition network that constructs a reliable multi-modal representation shared across the three tasks, capturing user-preferred content.
arXiv Detail & Related papers (2024-12-18T06:43:06Z)
- GQE: Generalized Query Expansion for Enhanced Text-Video Retrieval [56.610806615527885]
This paper introduces a novel data-centric approach, Generalized Query Expansion (GQE), to address the inherent information imbalance between text and video.
By adaptively segmenting videos into short clips and employing zero-shot captioning, GQE enriches the training dataset with comprehensive scene descriptions.
GQE achieves state-of-the-art performance on several benchmarks, including MSR-VTT, MSVD, LSMDC, and VATEX.
arXiv Detail & Related papers (2024-08-14T01:24:09Z)
- Interactive Text-to-Image Retrieval with Large Language Models: A Plug-and-Play Approach [33.231639257323536]
In this paper, we address dialogue-form context queries within the interactive text-to-image retrieval task.
By reformulating the dialogue-form context, we eliminate the need to fine-tune a retrieval model on existing visual dialogue data.
We construct an LLM questioner that generates non-redundant questions about the attributes of the target image.
arXiv Detail & Related papers (2024-06-05T16:09:01Z)
- ProCIS: A Benchmark for Proactive Retrieval in Conversations [21.23826888841565]
We introduce a large-scale dataset for proactive document retrieval that consists of over 2.8 million conversations.
We conduct crowdsourcing experiments to obtain high-quality and relatively complete relevance judgments.
We also collect annotations indicating which parts of the conversation are relevant to each document, enabling the evaluation of proactive retrieval systems.
arXiv Detail & Related papers (2024-05-10T13:11:07Z)
- DVIS-DAQ: Improving Video Segmentation via Dynamic Anchor Queries [60.09774333024783]
We introduce Dynamic Anchor Queries (DAQ) to shorten the transition gap between the anchor and target queries.
We also introduce a query-level object Emergence and Disappearance Simulation (EDS) strategy, which unleashes DAQ's potential without any additional cost.
Experiments demonstrate that DVIS-DAQ achieves a new state-of-the-art (SOTA) performance on five mainstream video segmentation benchmarks.
arXiv Detail & Related papers (2024-03-29T17:58:50Z)
- VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings.
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
arXiv Detail & Related papers (2023-12-04T19:48:02Z)
- Zero-shot Audio Topic Reranking using Large Language Models [42.774019015099704]
Multimodal Video Search by Examples (MVSE) investigates using video clips as the query term for information retrieval.
This work aims to compensate for any performance loss from this rapid archive search by examining reranking approaches.
Performance is evaluated for topic-based retrieval on a publicly available video archive, the BBC Rewind corpus.
arXiv Detail & Related papers (2023-09-14T11:13:36Z)
- Collaborative Reasoning on Multi-Modal Semantic Graphs for Video-Grounded Dialogue Generation [53.87485260058957]
We study video-grounded dialogue generation, where a response is generated based on the dialogue context and the associated video.
The primary challenges of this task include the difficulty of integrating video data into pre-trained language models (PLMs).
We propose a multi-agent reinforcement learning method to collaboratively perform reasoning on different modalities.
arXiv Detail & Related papers (2022-10-22T14:45:29Z)
- Learning to Retrieve Videos by Asking Questions [29.046045230398708]
We propose a novel framework for Video Retrieval using Dialog (ViReD), which enables the user to interact with an AI agent via multiple rounds of dialog.
The key contribution of our framework is a novel multimodal question generator that learns to ask questions that maximize the subsequent video retrieval performance.
We validate the effectiveness of our interactive ViReD framework on the AVSD dataset, showing that our interactive method performs significantly better than traditional non-interactive video retrieval systems.
arXiv Detail & Related papers (2022-05-11T19:14:39Z)
- Part2Whole: Iteratively Enrich Detail for Cross-Modal Retrieval with Partial Query [25.398090300086302]
We propose an interactive retrieval framework called Part2Whole to tackle retrieval with partial queries.
An Interactive Retrieval Agent is trained to build an optimal policy to refine the initial query.
We present a weakly-supervised reinforcement learning method that needs no human-annotated data other than the text-image dataset.
arXiv Detail & Related papers (2021-03-02T11:27:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.