V-Agent: An Interactive Video Search System Using Vision-Language Models
- URL: http://arxiv.org/abs/2512.16925v2
- Date: Wed, 07 Jan 2026 06:16:41 GMT
- Title: V-Agent: An Interactive Video Search System Using Vision-Language Models
- Authors: SunYoung Park, Jong-Hyeon Lee, Youngjune Kim, Daegyu Sung, Younghyun Yu, Young-rok Cha, Jeongho Ju
- Abstract summary: V-Agent is a novel multi-agent platform designed for advanced video search and interactive user-system conversations. Our proposed framework demonstrates state-of-the-art zero-shot performance on the MultiVENT 2.0 benchmark.
- Score: 5.245473886566199
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We introduce V-Agent, a novel multi-agent platform designed for advanced video search and interactive user-system conversations. By fine-tuning a vision-language model (VLM) with a small video preference dataset and enhancing it with a retrieval vector from an image-text retrieval model, we overcome the limitations of traditional text-based retrieval systems in multimodal scenarios. The VLM-based retrieval model independently embeds video frames and audio transcriptions from an automatic speech recognition (ASR) module into a shared multimodal representation space, enabling V-Agent to interpret both visual and spoken content for context-aware video search. This system consists of three agents (a routing agent, a search agent, and a chat agent) that work collaboratively to address user intents by refining search outputs and communicating with users. The search agent utilizes the VLM-based retrieval model together with an additional re-ranking module to further enhance video retrieval quality. Our proposed framework demonstrates state-of-the-art zero-shot performance on the MultiVENT 2.0 benchmark, highlighting its potential for both academic research and real-world applications. The retrieval model and demo videos are available at https://huggingface.co/NCSOFT/multimodal-embedding.
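To make the retrieval flow described in the abstract concrete, here is a minimal sketch of ours (not the released NCSOFT model): a toy encoder stands in for the fine-tuned VLM, frame and ASR-transcript embeddings are pooled into one shared vector per video, and videos are ranked against a text query by cosine similarity before any re-ranking. The embedding dimension, fusion rule, and all identifiers are assumptions made for illustration.

```python
# Minimal sketch of shared-embedding video retrieval, loosely following the
# pipeline described in the abstract. The encoder below is a stand-in
# (hash-seeded pseudo-embeddings); the actual system uses a fine-tuned VLM
# augmented with an image-text retrieval vector.
import hashlib
import numpy as np

DIM = 512  # assumed embedding size, not taken from the paper


def encode(item: str) -> np.ndarray:
    """Stand-in encoder: deterministic pseudo-embedding, L2-normalized."""
    seed = int.from_bytes(hashlib.sha256(item.encode()).digest()[:4], "little")
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(DIM)
    return v / np.linalg.norm(v)


def video_embedding(frame_ids: list[str], asr_transcript: str) -> np.ndarray:
    """Embed frames and the ASR transcript into one shared space and pool them."""
    frame_vec = np.stack([encode(f) for f in frame_ids]).mean(axis=0)
    asr_vec = encode(asr_transcript)
    fused = frame_vec + asr_vec  # simple additive fusion (our assumption)
    return fused / np.linalg.norm(fused)


def search(query: str, corpus: dict[str, np.ndarray], top_k: int = 3) -> list[tuple[str, float]]:
    """Cosine-similarity search over pre-computed video embeddings."""
    q = encode(query)
    scored = [(vid, float(q @ emb)) for vid, emb in corpus.items()]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:top_k]


# Toy corpus; in the full system the search agent would pass these candidates
# to a re-ranking module, and the chat agent would present or refine them.
corpus = {
    "vid_001": video_embedding(["frame_a", "frame_b"], "flood warnings issued downtown"),
    "vid_002": video_embedding(["frame_c"], "highlights from the championship match"),
}
print(search("news footage about flooding", corpus))
```

In the full system, the routing agent would first decide whether a query needs this search path at all; the sketch only covers the candidate-search step inside the search agent.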
Related papers
- LLandMark: A Multi-Agent Framework for Landmark-Aware Multimodal Interactive Video Retrieval [0.0]
LLandMark is a modular framework for landmark-aware multimodal video retrieval. The framework features specialized agents that collaborate across four stages: query parsing and planning, landmark reasoning, multimodal retrieval, and reranked answer synthesis. A key component, the Landmark Knowledge Agent, detects cultural or spatial landmarks and reformulates them into descriptive visual prompts, enhancing CLIP-based semantic matching for Vietnamese scenes.
arXiv Detail & Related papers (2026-03-03T11:36:34Z) - VIRTUE: Versatile Video Retrieval Through Unified Embeddings [6.517174336539377]
We present VIRTUE, an MLLM-based versatile video retrieval framework that integrates corpus- and moment-level retrieval capabilities. We use contrastive alignment of visual and textual embeddings generated using a shared MLLM backbone to facilitate efficient embedding-based candidate search. Our embedding model, trained efficiently using low-rank adaptation (LoRA) on 700K paired visual-text data samples, surpasses other MLLM-based methods on zero-shot video retrieval tasks.
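The "contrastive alignment of visual and textual embeddings" mentioned here is typically realized as a symmetric InfoNCE objective. The sketch below shows that standard formulation; the batch size, temperature, and toy data are our placeholder choices rather than VIRTUE's actual training code.

```python
# Symmetric InfoNCE-style contrastive objective over paired video/text embeddings.
# The i-th video and i-th text are the positive pair; all other batch entries
# act as in-batch negatives.
import numpy as np


def info_nce(video_emb: np.ndarray, text_emb: np.ndarray, temperature: float = 0.07) -> float:
    """Average of video->text and text->video cross-entropy over the similarity matrix."""
    logits = (video_emb @ text_emb.T) / temperature  # (B, B), rows assumed L2-normalized

    def log_softmax(x: np.ndarray) -> np.ndarray:
        x = x - x.max(axis=1, keepdims=True)  # numerical stability
        return x - np.log(np.exp(x).sum(axis=1, keepdims=True))

    labels = np.arange(len(logits))
    loss_v2t = -log_softmax(logits)[labels, labels].mean()    # video -> text
    loss_t2v = -log_softmax(logits.T)[labels, labels].mean()  # text -> video
    return float((loss_v2t + loss_t2v) / 2)


# Toy batch of 4 normalized embedding pairs
rng = np.random.default_rng(0)
v = rng.standard_normal((4, 128)); v /= np.linalg.norm(v, axis=1, keepdims=True)
t = rng.standard_normal((4, 128)); t /= np.linalg.norm(t, axis=1, keepdims=True)
print(info_nce(v, t))
```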
arXiv Detail & Related papers (2026-01-17T23:13:38Z) - UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist [107.04196084992907]
We introduce UniVA, an omni-capable multi-agent framework for next-generation video generalists. UniVA employs a Plan-and-Act dual-agent architecture that drives a highly automated and proactive workflow. We also introduce UniVA-Bench, a benchmark suite of multi-step video tasks spanning understanding, editing, segmentation, and generation.
arXiv Detail & Related papers (2025-11-11T17:58:13Z) - Omni-Embed-Nemotron: A Unified Multimodal Retrieval Model for Text, Image, Audio, and Video [5.732421858297378]
We present Omni-Embed-Nemotron, a unified multimodal retrieval embedding model developed to handle the increasing complexity of real-world information needs. We describe the architecture, training setup, and evaluation results of Omni-Embed-Nemotron, and demonstrate its effectiveness in text, image, and video retrieval.
arXiv Detail & Related papers (2025-10-03T19:29:50Z) - VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents [105.43882565434444]
We propose VLM2Vec-V2, a unified framework for learning embeddings across diverse visual forms. First, we introduce MMEB-V2, a comprehensive benchmark that extends MMEB with five new task types. Next, we train VLM2Vec-V2, a general-purpose embedding model that supports text, image, video, and visual document inputs.
arXiv Detail & Related papers (2025-07-07T00:51:57Z) - MMMORRF: Multimodal Multilingual Modularized Reciprocal Rank Fusion [43.725594356981254]
We create a search system that extracts text and features from both visual and audio modalities. MMMORRF is both effective and efficient, demonstrating practicality in searching videos based on users' information needs.
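Reciprocal rank fusion, named in the paper's title, is a standard way to merge ranked lists produced by different retrievers (here, for example, a text-based and a visual ranking). The sketch below implements the usual formula score(d) = sum over lists of 1 / (k + rank(d)); the example document IDs and the k = 60 default are our assumptions, not details taken from the paper.

```python
# Standard reciprocal rank fusion (RRF) over several ranked lists.
def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists: each document scores sum_i 1 / (k + rank_i(doc))."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# e.g. one ranking from ASR/OCR text search, one from visual-feature search
text_ranking   = ["vid_2", "vid_7", "vid_1"]
visual_ranking = ["vid_7", "vid_3", "vid_2"]
print(reciprocal_rank_fusion([text_ranking, visual_ranking]))
```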
arXiv Detail & Related papers (2025-03-26T16:28:04Z) - VIMI: Grounding Video Generation through Multi-modal Instruction [89.90065445082442]
Existing text-to-video diffusion models rely solely on text-only encoders for their pretraining.
We construct a large-scale multimodal prompt dataset by employing retrieval methods to pair in-context examples with the given text prompts.
We finetune the model from the first stage on three video generation tasks, incorporating multi-modal instructions.
arXiv Detail & Related papers (2024-07-08T18:12:49Z) - InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks [92.03764152132315]
We design a large-scale vision-language foundation model (InternVL), which scales up the vision foundation model to 6 billion parameters.
This model can be broadly applied to, and achieves state-of-the-art performance on, 32 generic visual-linguistic benchmarks.
It has powerful visual capabilities and can be a good alternative to the ViT-22B.
arXiv Detail & Related papers (2023-12-21T18:59:31Z) - Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration [50.94902442781148]
We propose a novel multi-modal large language model (LLM) that seamlessly integrates visual, audio, and textual information.
Macaw-LLM consists of three main components: a modality module for encoding multi-modal data, a cognitive module for harnessing pretrained LLMs, and an alignment module for harmonizing diverse representations.
We construct a large-scale multi-modal instruction dataset of multi-turn dialogues, including 69K image instances and 50K video instances.
arXiv Detail & Related papers (2023-06-15T12:45:25Z) - Contextual Object Detection with Multimodal Large Language Models [66.15566719178327]
We introduce a novel research problem of contextual object detection.
Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering.
We present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts.
arXiv Detail & Related papers (2023-05-29T17:50:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.