Prompting Large Language Models to Reformulate Queries for Moment Localization
- URL: http://arxiv.org/abs/2306.03422v1
- Date: Tue, 6 Jun 2023 05:48:09 GMT
- Title: Prompting Large Language Models to Reformulate Queries for Moment Localization
- Authors: Wenfeng Yan, Shaoxiang Chen, Zuxuan Wu, Yu-Gang Jiang
- Abstract summary: The task of moment localization is to localize a temporal moment in an untrimmed video for a given natural language query.
We make early attempts at reformulating the moment queries into a set of instructions using large language models, making them friendlier to the localization models.
- Score: 79.57593838400618
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The task of moment localization is to localize a temporal moment in an untrimmed video for a given natural language query. Since untrimmed videos contain highly redundant content, the quality of the query is crucial for accurately localizing moments, i.e., the query should provide precise information about the target moment so that the localization model can understand what to look for in the videos. However, the natural language queries in current datasets may not be easy for existing models to understand. For example, the Ego4D dataset uses question sentences as queries to describe relatively complex moments. While natural and straightforward for humans, such question sentences are challenging for mainstream moment localization models like 2D-TAN to understand. Inspired by the recent success of large language models, especially their ability to understand and generate complex natural language content, in this extended abstract we make early attempts at reformulating the moment queries into a set of instructions using large language models, making them friendlier to the localization models.
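The abstract does not give the exact prompt or model used for the reformulation. As a rough illustration of the idea only, the sketch below assumes an OpenAI-style chat completions API; the prompt wording, model name, and example query are assumptions rather than the authors' actual setup. It rewrites an Ego4D-style question query into a declarative description that a localization model such as 2D-TAN can match against video content more easily.

```python
# Illustrative sketch only: the prompt template, model name, and API choice
# below are assumptions, not the setup described in the paper.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT_TEMPLATE = (
    "Rewrite the following question about an egocentric video as a short, "
    "declarative description of the moment to be found, so that a moment "
    "localization model can match it against the video content.\n\n"
    "Question: {query}\n"
    "Description:"
)

def reformulate_query(query: str) -> str:
    """Turn a question-style moment query into a declarative description."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model; any instruction-following LLM could be used
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(query=query)}],
        temperature=0.0,
    )
    return response.choices[0].message.content.strip()

# Hypothetical example: "Where did I put the scissors?" might become
# "The moment when the camera wearer puts the scissors down."
```

The reformulated description would then be passed to the localization model in place of, or alongside, the original question query.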
Related papers
- Context-Enhanced Video Moment Retrieval with Large Language Models [22.283367604425916]
Current methods for Video Moment Retrieval (VMR) struggle to align complex situations involving specific environmental details, character descriptions, and action narratives.
We propose a Large Language Model-guided Moment Retrieval (LMR) approach that employs the extensive knowledge of Large Language Models (LLMs) to improve video context representation.
Extensive experiments demonstrate that LMR achieves state-of-the-art results, outperforming the nearest competitor by up to 3.28% and 4.06% on the challenging QVHighlights and Charades-STA benchmarks.
arXiv Detail & Related papers (2024-05-21T07:12:27Z)
- LITA: Language Instructed Temporal-Localization Assistant [71.68815100776278]
We introduce time tokens that encode timestamps relative to the video length to better represent time in videos (a rough sketch of this idea appears after this list).
We also introduce SlowFast tokens in the architecture to capture temporal information at fine temporal resolution.
We show that our emphasis on temporal localization also substantially improves video-based text generation compared to existing Video LLMs.
arXiv Detail & Related papers (2024-03-27T22:50:48Z)
- Lost in the Middle: How Language Models Use Long Contexts [88.78803442320246]
We analyze the performance of language models on two tasks that require identifying relevant information in their input contexts.
We find that performance can degrade significantly when changing the position of relevant information.
Our analysis provides a better understanding of how language models use their input context and provides new evaluation protocols for future long-context language models.
arXiv Detail & Related papers (2023-07-06T17:54:11Z)
- Test of Time: Instilling Video-Language Models with a Sense of Time [42.290970800790184]
Seven existing video-language models struggle to understand simple temporal relations.
We propose a temporal adaptation recipe on top of one such model, VideoCLIP, based on post-pretraining on a small amount of video-text data.
We observe encouraging performance gains especially when the task needs higher time awareness.
arXiv Detail & Related papers (2023-01-05T14:14:36Z)
- Language-free Training for Zero-shot Video Grounding [50.701372436100684]
Video grounding aims to localize the time interval relevant to a given text query by understanding the text and video simultaneously.
One of the most challenging issues is the extremely time- and cost-consuming collection of annotations.
We present a simple yet novel training framework for video grounding in the zero-shot setting.
arXiv Detail & Related papers (2022-10-24T06:55:29Z)
- Internet-augmented language models through few-shot prompting for open-domain question answering [6.573232954655063]
We capitalize on the unique few-shot capabilities offered by large-scale language models to overcome some of their challenges with respect to grounding in factual and up-to-date information.
We use few-shot prompting to learn to condition language models on information returned from the web using Google Search.
We find that language models conditioned on the web surpass performance of closed-book models of similar, or even larger, model sizes in open-domain question answering.
arXiv Detail & Related papers (2022-03-10T02:24:14Z)
- VLG-Net: Video-Language Graph Matching Network for Video Grounding [57.6661145190528]
Grounding language queries in videos aims at identifying the time interval (or moment) semantically relevant to a language query.
We recast this challenge into an algorithmic graph matching problem.
We demonstrate superior performance over state-of-the-art grounding methods on three widely used datasets.
arXiv Detail & Related papers (2020-11-19T22:32:03Z)
- DORi: Discovering Object Relationship for Moment Localization of a Natural-Language Query in Video [98.54696229182335]
We study the task of temporal moment localization in a long untrimmed video using a natural language query.
Our key innovation is to learn a video feature embedding through a language-conditioned message-passing algorithm.
A temporal sub-graph captures the activities within the video through time.
arXiv Detail & Related papers (2020-10-13T09:50:29Z)
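The LITA entry above mentions time tokens that encode timestamps relative to the video length. As a rough illustration of that general idea only (not LITA's actual implementation; the token count and the "<t_k>" naming below are assumed), a relative time token can be obtained by quantizing the normalized timestamp into a fixed number of bins:

```python
# Illustrative sketch of relative time tokens in the spirit of LITA; the
# number of tokens and the "<t_k>" naming are assumptions, not LITA's
# actual vocabulary.

NUM_TIME_TOKENS = 100  # assumed granularity of the time vocabulary

def timestamp_to_token(t: float, video_length: float) -> str:
    """Map an absolute timestamp (seconds) to a relative time token."""
    frac = min(max(t / video_length, 0.0), 1.0)
    idx = min(int(frac * NUM_TIME_TOKENS), NUM_TIME_TOKENS - 1)
    return f"<t_{idx}>"

def token_to_timestamp(token: str, video_length: float) -> float:
    """Recover an approximate timestamp (bin midpoint) from a time token."""
    idx = int(token.strip("<>").split("_")[1])
    return (idx + 0.5) / NUM_TIME_TOKENS * video_length

# e.g. timestamp_to_token(30.0, 120.0) -> "<t_25>"
# and token_to_timestamp("<t_25>", 120.0) -> 30.6 (the bin midpoint)
```

Because the tokens are relative to the video length, the same fixed vocabulary covers videos of arbitrary duration, which is what allows a text decoder to emit them like ordinary words.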
This list is automatically generated from the titles and abstracts of the papers on this site.