VC-Agent: An Interactive Agent for Customized Video Dataset Collection
- URL: http://arxiv.org/abs/2509.21291v1
- Date: Thu, 25 Sep 2025 15:08:28 GMT
- Title: VC-Agent: An Interactive Agent for Customized Video Dataset Collection
- Authors: Yidan Zhang, Mutian Xu, Yiming Hao, Kun Zhou, Jiahao Chang, Xiaoqiang Liu, Pengfei Wan, Hongbo Fu, Xiaoguang Han
- Abstract summary: We propose VC-Agent, an interactive agent that understands users' queries and feedback, and accordingly retrieves/scales up relevant video clips with minimal user input. As for agent functions, we leverage existing multi-modal large language models to connect the user's requirements with the video content. We provide a new benchmark for personalized video dataset collection, and carefully conduct the user study to verify our agent's usage in various real scenarios.
- Score: 48.65498668743145
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Facing scaling laws, video data from the internet becomes increasingly important. However, collecting extensive videos that meet specific needs is extremely labor-intensive and time-consuming. In this work, we study the way to expedite this collection process and propose VC-Agent, the first interactive agent that is able to understand users' queries and feedback, and accordingly retrieve/scale up relevant video clips with minimal user input. Specifically, considering the user interface, our agent defines various user-friendly ways for the user to specify requirements based on textual descriptions and confirmations. As for agent functions, we leverage existing multi-modal large language models to connect the user's requirements with the video content. More importantly, we propose two novel filtering policies that can be updated when user interaction is continually performed. Finally, we provide a new benchmark for personalized video dataset collection, and carefully conduct the user study to verify our agent's usage in various real scenarios. Extensive experiments demonstrate the effectiveness and efficiency of our agent for customized video dataset collection. Project page: https://allenyidan.github.io/vcagent_page/.
Related papers
- UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist [107.04196084992907]
We introduce UniVA, an omni-capable multi-agent framework for next-generation video generalists. UniVA employs a Plan-and-Act dual-agent architecture that drives a highly automated and proactive workflow. We also introduce UniVA-Bench, a benchmark suite of multi-step video tasks spanning understanding, editing, segmentation, and generation.
arXiv Detail & Related papers (2025-11-11T17:58:13Z)
- CAViAR: Critic-Augmented Video Agentic Reasoning [90.48729440775223]
We ask: can perception capabilities be leveraged to perform more complex video reasoning? We develop a large language model agent given access to video modules as subagents or tools. We show that the combination of our agent and critic achieves strong performance on datasets.
arXiv Detail & Related papers (2025-09-09T17:59:39Z)
- AppAgent-Pro: A Proactive GUI Agent System for Multidomain Information Integration and User Assistance [64.78994124332989]
AppAgent-Pro is a proactive GUI agent system that actively integrates multi-domain information based on user instructions. AppAgent-Pro has the potential to fundamentally redefine information acquisition in daily life.
arXiv Detail & Related papers (2025-08-26T05:23:24Z)
- HIPPO-Video: Simulating Watch Histories with Large Language Models for Personalized Video Highlighting [27.92094212778288]
We introduce HIPPO-Video, a novel dataset for personalized video highlighting. The dataset includes 2,040 (watch history, saliency score) pairs, covering 20,400 videos across 170 semantic categories. To validate our dataset, we propose HiPHer, a method that leverages these personalized watch histories to predict preference-conditioned segment-wise saliency scores.
arXiv Detail & Related papers (2025-07-22T08:24:33Z)
- PersonaAgent: When Large Language Model Agents Meet Personalization at Test Time [87.99027488664282]
PersonaAgent is a framework designed to address versatile personalization tasks. It integrates a personalized memory module and a personalized action module. Test-time user-preference alignment strategy ensures real-time user preference alignment.
arXiv Detail & Related papers (2025-06-06T17:29:49Z)
- Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding [60.88843818016968]
Long-form video understanding presents significant challenges due to temporal-spatial complexity and difficulty of question answering. We propose the Deep Video Discovery (DVD) agent to leverage an agentic search strategy over segmented video clips. Our DVD agent achieves state-of-the-art performance on the challenging LVBench dataset, reaching an accuracy of 74.2%.
arXiv Detail & Related papers (2025-05-23T16:37:36Z)
- Agent-based Video Trimming [17.519404251018308]
We introduce a novel task called Video Trimming (VT). VT focuses on detecting wasted footage, selecting valuable segments, and composing them into a final video with a coherent story. AVT received more favorable evaluations in user studies and demonstrated superior mAP and precision on the YouTube Highlights, TVSum, and our own dataset for the highlight detection task.
arXiv Detail & Related papers (2024-12-12T17:59:28Z)
- Personalized Video Summarization by Multimodal Video Understanding [2.1372652192505703]
We present a pipeline called Video Summarization with Language (VSL) for user-preferred video summarization.
VSL is based on pre-trained visual language models (VLMs) to avoid the need to train a video summarization system on a large training dataset.
We show that our method is more adaptable across different datasets compared to supervised query-based video summarization models.
arXiv Detail & Related papers (2024-11-05T22:14:35Z)
- On Generative Agents in Recommendation [58.42840923200071]
Agent4Rec is a user simulator in recommendation based on Large Language Models.
Each agent interacts with personalized recommender models in a page-by-page manner.
arXiv Detail & Related papers (2023-10-16T06:41:16Z)
- Learning to Retrieve Videos by Asking Questions [29.046045230398708]
We propose a novel framework for Video Retrieval using Dialog (ViReD), which enables the user to interact with an AI agent via multiple rounds of dialog.
The key contribution of our framework is a novel multimodal question generator that learns to ask questions that maximize the subsequent video retrieval performance.
We validate the effectiveness of our interactive ViReD framework on the AVSD dataset, showing that our interactive method performs significantly better than traditional non-interactive video retrieval systems.
arXiv Detail & Related papers (2022-05-11T19:14:39Z)
- IntentVizor: Towards Generic Query Guided Interactive Video Summarization Using Slow-Fast Graph Convolutional Networks [2.5234156040689233]
IntentVizor is an interactive video summarization framework guided by generic multi-modality queries.
We use a set of intents to represent the inputs of users to design our new interactive visual analytic interface.
arXiv Detail & Related papers (2021-09-30T03:44:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.