Zero-shot Video Moment Retrieval via Off-the-shelf Multimodal Large Language Models
- URL: http://arxiv.org/abs/2501.07972v1
- Date: Tue, 14 Jan 2025 09:45:10 GMT
- Title: Zero-shot Video Moment Retrieval via Off-the-shelf Multimodal Large Language Models
- Authors: Yifang Xu, Yunzhuo Sun, Benxiang Zhai, Ming Li, Wenxin Liang, Yang Li, Sidan Du
- Abstract summary: This paper proposes Moment-GPT, a tuning-free pipeline for zero-shot VMR utilizing frozen MLLMs.
We first employ LLaMA-3 to correct and rephrase the query to mitigate language bias. Subsequently, we design a span generator combined with MiniGPT-v2 to produce candidate spans adaptively.
Our proposed method substantially outperforms the state-of-the-art MLLM-based and zero-shot models on several public datasets.
- Score: 7.213221003652941
- Abstract: The target of video moment retrieval (VMR) is predicting temporal spans within a video that semantically match a given linguistic query. Existing VMR methods based on multimodal large language models (MLLMs) overly rely on expensive high-quality datasets and time-consuming fine-tuning. Although some recent studies introduce a zero-shot setting to avoid fine-tuning, they overlook inherent language bias in the query, leading to erroneous localization. To tackle the aforementioned challenges, this paper proposes Moment-GPT, a tuning-free pipeline for zero-shot VMR utilizing frozen MLLMs. Specifically, we first employ LLaMA-3 to correct and rephrase the query to mitigate language bias. Subsequently, we design a span generator combined with MiniGPT-v2 to produce candidate spans adaptively. Finally, to leverage the video comprehension capabilities of MLLMs, we apply VideoChatGPT and a span scorer to select the most appropriate spans. Our proposed method substantially outperforms the state-of-the-art MLLM-based and zero-shot models on several public datasets, including QVHighlights, ActivityNet-Captions, and Charades-STA.
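To make the three-stage flow concrete, here is a minimal Python sketch of the pipeline as the abstract describes it. Every interface below (the llama3/minigpt_v2/videochatgpt handles, their generate/match/score methods, and the spans_from_scores helper) is an illustrative assumption, not the authors' released code.

```python
# Minimal sketch of the Moment-GPT pipeline described above. All three
# models stay frozen; only prompting and post-processing are involved.

def spans_from_scores(scores, fps=1.0):
    """Group contiguous above-average frames into (start_sec, end_sec) spans
    (an assumed stand-in for the paper's adaptive span generator)."""
    thresh = sum(scores) / len(scores)
    spans, start = [], None
    for i, s in enumerate(scores + [float("-inf")]):  # sentinel closes last run
        if s >= thresh and start is None:
            start = i
        elif s < thresh and start is not None:
            spans.append((start / fps, i / fps))
            start = None
    return spans

def moment_gpt(video, query, llama3, minigpt_v2, videochatgpt):
    # Stage 1: LLaMA-3 corrects and rephrases the query to mitigate language bias.
    clean_query = llama3.generate(f"Correct and rephrase this video query: {query}")
    # Stage 2: MiniGPT-v2 matches each sampled frame against the query; the
    # span generator turns frame-level scores into candidate (start, end) spans.
    frame_scores = [minigpt_v2.match(f, clean_query) for f in video.frames]
    candidates = spans_from_scores(frame_scores, fps=video.fps)
    # Stage 3: VideoChatGPT plus a span scorer rate each candidate clip;
    # the highest-scoring span is returned.
    return max(candidates, key=lambda s: videochatgpt.score(video.clip(s), clean_query))
```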
Related papers
- InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling [56.130911402831906]
This paper aims to improve the performance of video multimodal large language models (MLLMs) via long and rich context (LRC) modeling.
We develop a new version of InternVideo2.5 with a focus on enhancing the original MLLMs' ability to perceive fine-grained details in videos.
Experimental results demonstrate that this unique design of LRC greatly improves the results of video MLLMs on mainstream understanding benchmarks.
arXiv Detail & Related papers (2025-01-21T18:59:00Z) - Fine-grained Video-Text Retrieval: A New Benchmark and Method [25.2967056489715]
We present FIBER, a FIne-grained BEnchmark for text to video Retrieval, containing 1,000 videos sourced from the FineAction dataset.
Uniquely, our FIBER benchmark provides detailed human-annotated spatial and temporal annotations for each video.
Experiment results show that our Video Large Language Encoder (VLLE) performs comparably to CLIP-based models on traditional benchmarks.
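The summary mentions per-video spatial and temporal annotations; a hypothetical record shape, inferred only from that description (the benchmark's real schema may well differ), could look like:

```python
from dataclasses import dataclass, field

@dataclass
class FiberAnnotation:
    """Hypothetical FIBER-style record: one fine-grained caption grounded
    both in time and in space. Field names are illustrative guesses."""
    video_id: str
    caption: str                          # fine-grained textual description
    temporal_span: tuple                  # (start_sec, end_sec) of the event
    spatial_boxes: list = field(default_factory=list)  # per-frame (x1, y1, x2, y2)

# Example: an event from second 3.2 to 7.8, boxes omitted for brevity.
record = FiberAnnotation("v0001", "a gymnast dismounts the beam", (3.2, 7.8))
```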
arXiv Detail & Related papers (2024-12-31T15:53:50Z) - LLaVA-MR: Large Language-and-Vision Assistant for Video Moment Retrieval [14.136397687227111]
We propose the Large Language-and-Vision Assistant for Moment Retrieval (LLaVA-MR).
LLaVA-MR enables accurate moment retrieval and contextual grounding in videos using Multimodal Large Language Models (MLLMs).
Evaluations on benchmarks like Charades-STA and QVHighlights demonstrate that LLaVA-MR outperforms 11 state-of-the-art methods.
arXiv Detail & Related papers (2024-11-21T09:34:23Z) - Think Carefully and Check Again! Meta-Generation Unlocking LLMs for Low-Resource Cross-Lingual Summarization [108.6908427615402]
Cross-lingual summarization (CLS) aims to generate a summary for the source text in a different target language.
Currently, instruction-tuned large language models (LLMs) excel at various English tasks.
Recent studies have shown that LLMs' performance on CLS tasks remains unsatisfactory even in few-shot settings.
arXiv Detail & Related papers (2024-10-26T00:39:44Z) - TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning [42.928144657587325]
This paper proposes TimeSuite, a collection of new designs to adapt the existing short-form video MLLMs for long video understanding.
TimeSuite provides a successful solution to enhance the long video understanding capability of short-form MLLMs.
In addition, we introduce TimePro, a comprehensive grounding-centric instruction dataset composed of 9 tasks and 349k high-quality grounded annotations.
arXiv Detail & Related papers (2024-10-25T17:19:55Z) - The Surprising Effectiveness of Multimodal Large Language Models for Video Moment Retrieval [36.516226519328015]
Video-language tasks necessitate spatial and temporal comprehension and require significant compute.
This work demonstrates the surprising effectiveness of leveraging image-text pretrained MLLMs for moment retrieval.
We achieve a new state-of-the-art in moment retrieval on the widely used benchmarks Charades-STA, QVHighlights, and ActivityNet Captions.
arXiv Detail & Related papers (2024-06-26T06:59:09Z) - Context-Enhanced Video Moment Retrieval with Large Language Models [22.283367604425916]
Current methods for Video Moment Retrieval (VMR) struggle to align complex situations involving specific environmental details, character descriptions, and action narratives.
We propose a Large Language Model-guided Moment Retrieval (LMR) approach that employs the extensive knowledge of Large Language Models (LLMs) to improve video context representation.
Extensive experiments demonstrate that LMR achieves state-of-the-art results, outperforming the nearest competitor by up to 3.28% and 4.06% on the challenging QVHighlights and Charades-STA benchmarks.
arXiv Detail & Related papers (2024-05-21T07:12:27Z) - Found in the Middle: How Language Models Use Long Contexts Better via Plug-and-Play Positional Encoding [78.36702055076456]
This paper introduces Multi-scale Positional Encoding (Ms-PoE), a simple yet effective plug-and-play approach to enhance the capacity of LLMs to handle relevant information located in the middle of the context.
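As a rough illustration of the plug-and-play idea, the toy sketch below rescales rotary position indices with a different ratio per attention head; the ratio range and the simple linear head assignment are guesses on my part, not the paper's exact recipe.

```python
import torch

def ms_poe_position_ids(seq_len, num_heads, min_ratio=1.2, max_ratio=1.8):
    """Toy Ms-PoE-style rescaling: each attention head sees position indices
    divided by its own ratio, shrinking effective distances so middle-of-context
    tokens lose less attention. Ratio range and head assignment are illustrative."""
    base = torch.arange(seq_len, dtype=torch.float32)         # (seq_len,)
    ratios = torch.linspace(min_ratio, max_ratio, num_heads)  # one ratio per head
    return base.unsqueeze(0) / ratios.unsqueeze(1)            # (num_heads, seq_len)
```

These fractional indices would then replace the integer position ids fed to each head's rotary embedding.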
arXiv Detail & Related papers (2024-03-05T04:58:37Z) - LLMRefine: Pinpointing and Refining Large Language Models via Fine-Grained Actionable Feedback [65.84061725174269]
Recent large language models (LLMs) are leveraging human feedback to improve their generation quality.
We propose LLMRefine, an inference-time optimization method to refine an LLM's output.
We conduct experiments on three text generation tasks, including machine translation, long-form question answering (QA), and topical summarization.
LLMRefine consistently outperforms all baseline approaches, achieving improvements of up to 1.7 MetricX points on translation tasks, 8.1 ROUGE-L on ASQA, and 2.2 ROUGE-L on topical summarization.
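A plausible shape for such an inference-time loop is sketched below; the feedback-model and LLM interfaces, the prompt, and the annealing-style acceptance rule are assumptions, not the paper's verified algorithm.

```python
import math, random

def llmrefine_style(llm, feedback_model, draft, steps=10, temp=1.0, cooling=0.8):
    """Sketch of an LLMRefine-style loop: a feedback model pinpoints fine-grained
    errors, the LLM proposes a revision, and an annealing-style rule decides
    whether to keep it. All object interfaces here are assumptions."""
    current, cur_score = draft, feedback_model.score(draft)
    for _ in range(steps):
        errors = feedback_model.pinpoint(current)   # e.g. [(span, error_type), ...]
        if not errors:
            break                                   # nothing left to fix
        candidate = llm.generate(
            f"Revise the text to fix these errors: {errors}\n\nText: {current}"
        )
        cand_score = feedback_model.score(candidate)
        delta = cand_score - cur_score
        # Always accept improvements; accept regressions with decaying probability.
        if delta > 0 or random.random() < math.exp(delta / temp):
            current, cur_score = candidate, cand_score
        temp *= cooling
    return current
```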
arXiv Detail & Related papers (2023-11-15T19:52:11Z) - Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models [69.59125732317972]
We propose a simple yet effective Retrieving-to-Answer (R2A) framework for VideoQA.
R2A first retrieves a set of semantically similar texts from a generic text corpus using a pre-trained multi-modal model.
With both the question and the retrieved texts, an LLM can be directly used to yield a desired answer.
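The recipe is simple enough to sketch; the encoder and LLM interfaces below are assumed, and the corpus embeddings are assumed precomputed and L2-normalized.

```python
import torch

def retrieve_to_answer(frames, question, encoder, corpus_texts, corpus_embs, llm, k=5):
    """Sketch of the R2A recipe above: embed the video with a pre-trained
    multimodal encoder, retrieve the k nearest corpus texts, and let a frozen
    LLM answer from them. Interfaces (encode_images, generate) are assumed;
    corpus_embs is assumed L2-normalized, shape (num_texts, dim)."""
    with torch.no_grad():
        v = encoder.encode_images(frames).mean(dim=0)     # pool frame embeddings
        v = v / v.norm()
        top = (corpus_embs @ v).topk(k).indices.tolist()  # cosine similarity
    context = "\n".join(corpus_texts[i] for i in top)
    return llm.generate(f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")
```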
arXiv Detail & Related papers (2023-06-15T20:56:20Z) - Zero-Shot Video Question Answering via Frozen Bidirectional Language Models [89.71617065426146]
Video question answering (VideoQA) is a complex task that requires diverse multi-modal data for training.
Recent methods consider zero-shot settings with no manual annotation of visual question-answer pairs.
We build on frozen bidirectional language models (BiLM) and show that such an approach provides a stronger and cheaper alternative for zero-shot VideoQA.
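One common way to use a frozen bidirectional LM zero-shot is to cast answering as masked-token prediction; the sketch below uses standard Hugging Face masked-LM calls, assumes single-token candidate answers, and reduces the method's learned visual projection to a plain text prefix.

```python
import torch

def bilm_zero_shot_answer(bilm, tokenizer, visual_prefix, question, candidates):
    """Sketch of the frozen-BiLM idea: VideoQA as masked-token prediction.
    `visual_prefix` stands in for projected video features (the real method
    learns that projection); candidate answers are assumed single-token."""
    prompt = f"{visual_prefix} Question: {question} Answer: {tokenizer.mask_token}."
    inputs = tokenizer(prompt, return_tensors="pt")
    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        logits = bilm(**inputs).logits[0, mask_pos]   # vocab distribution at [MASK]
    # Rank candidate answers by their logit at the masked position.
    scores = {a: logits[tokenizer.convert_tokens_to_ids(a)].item() for a in candidates}
    return max(scores, key=scores.get)
```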
arXiv Detail & Related papers (2022-06-16T13:18:20Z)