V-Agent: An Interactive Video Search System Using Vision-Language Models
- URL: http://arxiv.org/abs/2512.16925v2
- Date: Wed, 07 Jan 2026 06:16:41 GMT
- Title: V-Agent: An Interactive Video Search System Using Vision-Language Models
- Authors: SunYoung Park, Jong-Hyeon Lee, Youngjune Kim, Daegyu Sung, Younghyun Yu, Young-rok Cha, Jeongho Ju
- Abstract summary: V-Agent is a novel multi-agent platform designed for advanced video search and interactive user-system conversations. Our proposed framework demonstrates state-of-the-art zero-shot performance on the MultiVENT 2.0 benchmark.
- Score: 5.245473886566199
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We introduce V-Agent, a novel multi-agent platform designed for advanced video search and interactive user-system conversations. By fine-tuning a vision-language model (VLM) with a small video preference dataset and enhancing it with a retrieval vector from an image-text retrieval model, we overcome the limitations of traditional text-based retrieval systems in multimodal scenarios. The VLM-based retrieval model independently embeds video frames and audio transcriptions from an automatic speech recognition (ASR) module into a shared multimodal representation space, enabling V-Agent to interpret both visual and spoken content for context-aware video search. This system consists of three agents (a routing agent, a search agent, and a chat agent) that work collaboratively to address user intents by refining search outputs and communicating with users. The search agent utilizes the VLM-based retrieval model together with an additional re-ranking module to further enhance video retrieval quality. Our proposed framework demonstrates state-of-the-art zero-shot performance on the MultiVENT 2.0 benchmark, highlighting its potential for both academic research and real-world applications. The retrieval model and demo videos are available at https://huggingface.co/NCSOFT/multimodal-embedding.
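To make the retrieval flow described in the abstract concrete, here is a minimal sketch of ours (not the released NCSOFT model): a toy encoder stands in for the fine-tuned VLM, frame and ASR-transcript embeddings are pooled into one shared vector per video, and videos are ranked against a text query by cosine similarity before any re-ranking. The embedding dimension, fusion rule, and all identifiers are assumptions made for illustration.

```python
# Minimal sketch of shared-embedding video retrieval, loosely following the
# pipeline described in the abstract. The encoder below is a stand-in
# (hash-seeded pseudo-embeddings); the actual system uses a fine-tuned VLM
# augmented with an image-text retrieval vector.
import hashlib
import numpy as np

DIM = 512  # assumed embedding size, not taken from the paper


def encode(item: str) -> np.ndarray:
    """Stand-in encoder: deterministic pseudo-embedding, L2-normalized."""
    seed = int.from_bytes(hashlib.sha256(item.encode()).digest()[:4], "little")
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(DIM)
    return v / np.linalg.norm(v)


def video_embedding(frame_ids: list[str], asr_transcript: str) -> np.ndarray:
    """Embed frames and the ASR transcript into one shared space and pool them."""
    frame_vec = np.stack([encode(f) for f in frame_ids]).mean(axis=0)
    asr_vec = encode(asr_transcript)
    fused = frame_vec + asr_vec  # simple additive fusion (our assumption)
    return fused / np.linalg.norm(fused)


def search(query: str, corpus: dict[str, np.ndarray], top_k: int = 3) -> list[tuple[str, float]]:
    """Cosine-similarity search over pre-computed video embeddings."""
    q = encode(query)
    scored = [(vid, float(q @ emb)) for vid, emb in corpus.items()]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:top_k]


# Toy corpus; in the full system the search agent would pass these candidates
# to a re-ranking module, and the chat agent would present or refine them.
corpus = {
    "vid_001": video_embedding(["frame_a", "frame_b"], "flood warnings issued downtown"),
    "vid_002": video_embedding(["frame_c"], "highlights from the championship match"),
}
print(search("news footage about flooding", corpus))
```

In the full system, the routing agent would first decide whether a query needs this search path at all; the sketch only covers the candidate-search step inside the search agent.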
Related papers
- LLandMark: A Multi-Agent Framework for Landmark-Aware Multimodal Interactive Video Retrieval [0.0]
LLandMark is a modular framework for landmark-aware multimodal video retrieval. The framework features specialized agents that collaborate across four stages: query parsing and planning, landmark reasoning, multimodal retrieval, and reranked answer synthesis. A key component, the Landmark Knowledge Agent, detects cultural or spatial landmarks and reformulates them into descriptive visual prompts, enhancing CLIP-based semantic matching for Vietnamese scenes.
arXiv Detail & Related papers (2026-03-03T11:36:34Z) - VIRTUE: Versatile Video Retrieval Through Unified Embeddings [6.517174336539377]
We present VIRTUE, an MLLM-based versatile video retrieval framework that integrates corpus- and moment-level retrieval capabilities. We use contrastive alignment of visual and textual embeddings generated using a shared MLLM backbone to facilitate efficient embedding-based candidate search. Our embedding model, trained efficiently using low-rank adaptation (LoRA) on 700K paired visual-text data samples, surpasses other MLLM-based methods on zero-shot video retrieval tasks.
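The "contrastive alignment of visual and textual embeddings" mentioned here is typically realized as a symmetric InfoNCE objective. The sketch below shows that standard formulation; the batch size, temperature, and toy data are our placeholder choices rather than VIRTUE's actual training code.

```python
# Symmetric InfoNCE-style contrastive objective over paired video/text embeddings.
# The i-th video and i-th text are the positive pair; all other batch entries
# act as in-batch negatives.
import numpy as np


def info_nce(video_emb: np.ndarray, text_emb: np.ndarray, temperature: float = 0.07) -> float:
    """Average of video->text and text->video cross-entropy over the similarity matrix."""
    logits = (video_emb @ text_emb.T) / temperature  # (B, B), rows assumed L2-normalized

    def log_softmax(x: np.ndarray) -> np.ndarray:
        x = x - x.max(axis=1, keepdims=True)  # numerical stability
        return x - np.log(np.exp(x).sum(axis=1, keepdims=True))

    labels = np.arange(len(logits))
    loss_v2t = -log_softmax(logits)[labels, labels].mean()    # video -> text
    loss_t2v = -log_softmax(logits.T)[labels, labels].mean()  # text -> video
    return float((loss_v2t + loss_t2v) / 2)


# Toy batch of 4 normalized embedding pairs
rng = np.random.default_rng(0)
v = rng.standard_normal((4, 128)); v /= np.linalg.norm(v, axis=1, keepdims=True)
t = rng.standard_normal((4, 128)); t /= np.linalg.norm(t, axis=1, keepdims=True)
print(info_nce(v, t))
```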
arXiv Detail & Related papers (2026-01-17T23:13:38Z) - UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist [107.04196084992907]
We introduce UniVA, an omni-capable multi-agent framework for next-generation video generalists. UniVA employs a Plan-and-Act dual-agent architecture that drives a highly automated and proactive workflow. We also introduce UniVA-Bench, a benchmark suite of multi-step video tasks spanning understanding, editing, segmentation, and generation.
arXiv Detail & Related papers (2025-11-11T17:58:13Z) - Omni-Embed-Nemotron: A Unified Multimodal Retrieval Model for Text, Image, Audio, and Video [5.732421858297378]
We present Omni-Embed-Nemotron, a unified multimodal retrieval embedding model developed to handle the increasing complexity of real-world information needs. We describe the architecture, training setup, and evaluation results of Omni-Embed-Nemotron, and demonstrate its effectiveness in text, image, and video retrieval.
arXiv Detail & Related papers (2025-10-03T19:29:50Z) - VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents [105.43882565434444]
We propose VLM2Vec-V2, a unified framework for learning embeddings across diverse visual forms. First, we introduce MMEB-V2, a comprehensive benchmark that extends MMEB with five new task types. Next, we train VLM2Vec-V2, a general-purpose embedding model that supports text, image, video, and visual document inputs.
arXiv Detail & Related papers (2025-07-07T00:51:57Z) - MMMORRF: Multimodal Multilingual Modularized Reciprocal Rank Fusion [43.725594356981254]
We create a search system that extracts text and features from both visual and audio modalities. MMMORRF is both effective and efficient, demonstrating practicality in searching videos based on users' information needs.
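Reciprocal rank fusion, named in the paper's title, is a standard way to merge ranked lists produced by different retrievers (here, for example, a text-based and a visual ranking). The sketch below implements the usual formula score(d) = sum over lists of 1 / (k + rank(d)); the example document IDs and the k = 60 default are our assumptions, not details taken from the paper.

```python
# Standard reciprocal rank fusion (RRF) over several ranked lists.
def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists: each document scores sum_i 1 / (k + rank_i(doc))."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# e.g. one ranking from ASR/OCR text search, one from visual-feature search
text_ranking   = ["vid_2", "vid_7", "vid_1"]
visual_ranking = ["vid_7", "vid_3", "vid_2"]
print(reciprocal_rank_fusion([text_ranking, visual_ranking]))
```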
arXiv Detail & Related papers (2025-03-26T16:28:04Z) - VIMI: Grounding Video Generation through Multi-modal Instruction [89.90065445082442]
Existing text-to-video diffusion models rely solely on text-only encoders for their pretraining.
We construct a large-scale multimodal prompt dataset by employing retrieval methods to pair in-context examples with the given text prompts.
We finetune the model from the first stage on three video generation tasks, incorporating multi-modal instructions.
arXiv Detail & Related papers (2024-07-08T18:12:49Z) - InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks [92.03764152132315]
We design a large-scale vision-language foundation model (InternVL), which scales up the vision foundation model to 6 billion parameters.
This model can be broadly applied to, and achieves state-of-the-art performance on, 32 generic visual-linguistic benchmarks.
It has powerful visual capabilities and can be a good alternative to the ViT-22B.
arXiv Detail & Related papers (2023-12-21T18:59:31Z) - Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration [50.94902442781148]
We propose a novel multi-modal large language model (LLM) that seamlessly integrates visual, audio, and textual information.
Macaw-LLM consists of three main components: a modality module for encoding multi-modal data, a cognitive module for harnessing pretrained LLMs, and an alignment module for harmonizing diverse representations.
We construct a large-scale multi-modal instruction dataset of multi-turn dialogues, including 69K image instances and 50K video instances.
arXiv Detail & Related papers (2023-06-15T12:45:25Z) - Contextual Object Detection with Multimodal Large Language Models [66.15566719178327]
We introduce a novel research problem of contextual object detection.
Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering.
We present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts.
arXiv Detail & Related papers (2023-05-29T17:50:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.