VIoTGPT: Learning to Schedule Vision Tools in LLMs towards Intelligent Video Internet of Things
- URL: http://arxiv.org/abs/2312.00401v2
- Date: Sun, 22 Dec 2024 17:32:54 GMT
- Title: VIoTGPT: Learning to Schedule Vision Tools in LLMs towards Intelligent Video Internet of Things
- Authors: Yaoyao Zhong, Mengshi Qi, Rui Wang, Yuhan Qiu, Yang Zhang, Huadong Ma
- Abstract summary: Video Internet of Things (VIoT) has shown full potential in collecting an unprecedented volume of video data. How to schedule the domain-specific perceiving models and analyze the collected videos uniformly, efficiently, and especially intelligently is challenging. We build VIoTGPT, a framework based on LLMs that correctly interacts with humans, queries knowledge videos, and invokes vision models to analyze multimedia data collaboratively.
- Score: 33.95453848194472
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video Internet of Things (VIoT) has shown full potential in collecting an unprecedented volume of video data. How to schedule the domain-specific perceiving models and analyze the collected videos uniformly, efficiently, and especially intelligently to accomplish complicated tasks is challenging. To address this challenge, we build VIoTGPT, a framework based on LLMs that correctly interacts with humans, queries knowledge videos, and invokes vision models to analyze multimedia data collaboratively. To support VIoTGPT and related future work, we meticulously crafted the VIoT-Tool dataset, including a training set and a benchmark involving 11 representative vision models across three categories, based on semi-automatic annotations. To guide the LLM to act as an intelligent agent towards intelligent VIoT, we resort to the ReAct instruction tuning method based on VIoT-Tool to learn the tool capability. Quantitative and qualitative experiments and analyses demonstrate the effectiveness of VIoTGPT. We believe VIoTGPT contributes to improving human-centered experiences in VIoT applications. The project website is https://github.com/zhongyy/VIoTGPT.
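The ReAct-style tool scheduling described in the abstract can be pictured as a loop in which the LLM alternates between reasoning, choosing a vision tool, and reading the tool's output back as an observation. The sketch below is a minimal illustration of that pattern only: the tool names, the prompt template, and the `call_llm` helper are hypothetical placeholders, not the actual VIoT-Tool interface or VIoTGPT prompts (see the project repository for the real implementation).

```python
# Minimal sketch of a ReAct-style tool-scheduling loop, in the spirit of VIoTGPT.
# The tool registry, prompt template, and call_llm() are hypothetical placeholders;
# see https://github.com/zhongyy/VIoTGPT for the actual implementation.

def call_llm(prompt: str) -> str:
    """Placeholder for a call to an instruction-tuned LLM agent."""
    raise NotImplementedError("plug in your own LLM backend here")

# Hypothetical domain-specific vision tools, keyed by name.
TOOLS = {
    "face_recognition": lambda video_path: "identity: person_42",
    "gait_recognition": lambda video_path: "gait match: person_42",
    "crowd_counting":   lambda video_path: "count: 137",
}

REACT_TEMPLATE = (
    "Question: {question}\n"
    "Video: {video}\n"
    "Available tools: {tool_names}\n"
    "{scratchpad}"
    "Reply with 'Thought: ...' followed by either "
    "'Action: <tool>[<input>]' or 'Final Answer: ...'."
)

def run_agent(question: str, video: str, max_steps: int = 5) -> str:
    """Alternate between LLM reasoning and vision-tool calls until an answer emerges."""
    scratchpad = ""
    for _ in range(max_steps):
        prompt = REACT_TEMPLATE.format(
            question=question,
            video=video,
            tool_names=", ".join(TOOLS),
            scratchpad=scratchpad,
        )
        reply = call_llm(prompt)
        if "Final Answer:" in reply:
            return reply.split("Final Answer:", 1)[1].strip()
        if "Action:" not in reply:
            # The model produced only a thought; record it and continue.
            scratchpad += reply + "\n"
            continue
        # Parse "Action: tool_name[tool_input]" and invoke the chosen tool.
        action = reply.split("Action:", 1)[1].strip()
        tool_name = action.split("[", 1)[0].strip()
        tool = TOOLS.get(tool_name)
        observation = tool(video) if tool else f"unknown tool: {tool_name}"
        scratchpad += f"{reply}\nObservation: {observation}\n"
    return "No final answer within the step budget."
```

In VIoTGPT, the analogous loop is driven by an instruction-tuned LLM, so the decision of which tool to invoke is learned from the VIoT-Tool training data rather than hand-coded.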
Related papers
- PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding [126.15907330726067]
We build a Perception Language Model (PLM) in a fully open and reproducible framework for transparent research in image and video understanding.
We analyze standard training pipelines without distillation from proprietary models and explore large-scale synthetic data to identify critical data gaps.
arXiv Detail & Related papers (2025-04-17T17:59:56Z)
- LAVID: An Agentic LVLM Framework for Diffusion-Generated Video Detection [14.687867348598035]
Large Vision Language Models (LVLMs) have become an emerging tool for AI-generated content detection.
We propose LAVID, a novel LVLM-based AI-generated video detection framework with explicit knowledge enhancement.
Our proposed pipeline automatically selects a set of explicit knowledge tools for detection, and then adaptively adjusts the structure prompt by self-rewriting.
arXiv Detail & Related papers (2025-02-20T19:34:58Z)
- VipAct: Visual-Perception Enhancement via Specialized VLM Agent Collaboration and Tool-use [74.39058448757645]
We present VipAct, an agent framework that enhances vision-language models (VLMs) through specialized agent collaboration and tool use.
VipAct consists of an orchestrator agent, which manages task requirement analysis, planning, and coordination, along with specialized agents that handle specific tasks.
We evaluate VipAct on benchmarks featuring a diverse set of visual perception tasks, with experimental results demonstrating significant performance improvements.
arXiv Detail & Related papers (2024-10-21T18:10:26Z)
- VidEgoThink: Assessing Egocentric Video Understanding Capabilities for Embodied AI [17.763461523794806]
VidEgoThink is a benchmark for evaluating egocentric video understanding capabilities in Embodied AI.
We design four key interrelated tasks: video question-answering, hierarchy planning, visual grounding and reward modeling.
We conduct extensive experiments with three types of models: API-based MLLMs, open-source image-based MLLMs, and open-source video-based MLLMs.
arXiv Detail & Related papers (2024-10-15T14:08:53Z)
- Vitron: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing [150.0380447353081]
We present VITRON, a universal pixel-level vision LLM designed for comprehensive understanding, generating, segmenting, and editing of both static images and dynamic videos.
Building on top of an LLM, VITRON incorporates encoders for images, videos, and pixel-level regional visuals within its modules, while employing state-of-the-art visual specialists as its backend.
arXiv Detail & Related papers (2024-10-08T08:39:04Z)
- Response Wide Shut: Surprising Observations in Basic Vision Language Model Capabilities [30.176918208200604]
Vision-Language Models (VLMs) have emerged as general purpose tools for addressing a variety of complex computer vision problems.
These models have been shown to be highly capable, but also lacking some basic visual understanding skills.
This paper sets out to understand the limitations of SoTA VLMs on fundamental visual tasks.
arXiv Detail & Related papers (2024-08-13T08:26:32Z)
- VideoCoT: A Video Chain-of-Thought Dataset with Active Annotation Tool [21.182745175241894]
We develop an automatic annotation tool that combines machine and human experts, under the active learning paradigm.
We propose a benchmark based on the collected datasets, which exploits CoT to maximize the complex reasoning capabilities of MLLMs.
arXiv Detail & Related papers (2024-07-07T13:10:23Z)
- Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [56.391404083287235]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach.
Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations.
We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z)
- AvaTaR: Optimizing LLM Agents for Tool Usage via Contrastive Reasoning [93.96463520716759]
Large language model (LLM) agents have demonstrated impressive capabilities in utilizing external tools and knowledge to boost accuracy and reduce hallucinations.
Here, we introduce AvaTaR, a novel and automated framework that optimizes an LLM agent to effectively leverage provided tools, improving performance on a given task.
arXiv Detail & Related papers (2024-06-17T04:20:02Z)
- The Embodied World Model Based on LLM with Visual Information and Prediction-Oriented Prompts [19.00518906047691]
VOYAGER is a well-known LLM-based embodied AI that enables autonomous exploration in the Minecraft world.
However, it has issues such as underutilization of visual data and insufficient functionality as a world model.
The study suggests that devised prompts can bring out the LLM's function as a world model.
arXiv Detail & Related papers (2024-06-02T14:50:01Z)
- DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent) [73.10899129264375]
This paper explores DoraemonGPT, a comprehensive and conceptually elegant system driven by LLMs to understand dynamic scenes.
Given a video with a question/task, DoraemonGPT begins by converting the input video into a symbolic memory that stores task-related attributes.
We extensively evaluate DoraemonGPT's effectiveness on three benchmarks and several in-the-wild scenarios.
arXiv Detail & Related papers (2024-01-16T14:33:09Z)
- Divert More Attention to Vision-Language Object Tracking [87.31882921111048]
We argue that the lack of large-scale vision-language annotated videos and ineffective vision-language interaction learning are key obstacles, which motivates us to design a more effective vision-language representation for tracking.
Particularly, in this paper, we first propose a general attribute annotation strategy to decorate videos in six popular tracking benchmarks, which contributes a large-scale vision-language tracking database with more than 23,000 videos.
We then introduce a novel framework to improve tracking by learning a unified-adaptive VL representation, whose cores are the proposed asymmetric architecture search and modality mixer (ModaMixer).
arXiv Detail & Related papers (2023-07-19T15:22:06Z)
- GPT4Image: Can Large Pre-trained Models Help Vision Models on Perception Tasks? [51.22096780511165]
We present a new learning paradigm in which the knowledge extracted from large pre-trained models is utilized to help models like CNNs and ViTs learn enhanced representations.
We feed detailed descriptions into a pre-trained encoder to extract text embeddings with rich semantic information that encodes the content of images.
arXiv Detail & Related papers (2023-06-01T14:02:45Z)
- Object Relational Graph with Teacher-Recommended Learning for Video Captioning [92.48299156867664]
We propose a complete video captioning system including both a novel model and an effective training strategy.
Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation.
Meanwhile, we design a teacher-recommended learning (TRL) method to make full use of the successful external language model (ELM) to integrate the abundant linguistic knowledge into the caption model.
arXiv Detail & Related papers (2020-02-26T15:34:52Z)