VIoTGPT: Learning to Schedule Vision Tools in LLMs towards Intelligent Video Internet of Things
- URL: http://arxiv.org/abs/2312.00401v2
- Date: Sun, 22 Dec 2024 17:32:54 GMT
- Title: VIoTGPT: Learning to Schedule Vision Tools in LLMs towards Intelligent Video Internet of Things
- Authors: Yaoyao Zhong, Mengshi Qi, Rui Wang, Yuhan Qiu, Yang Zhang, Huadong Ma
- Abstract summary: Video Internet of Things (VIoT) has shown full potential in collecting an unprecedented volume of video data.
How to schedule the domain-specific perceiving models and analyze the collected videos uniformly, efficiently, and especially intelligently is challenging.
We build VIoTGPT, the framework based on LLMs to correctly interact with humans, query knowledge videos, and invoke vision models to analyze multimedia data collaboratively.
- Score: 33.95453848194472
- Abstract: Video Internet of Things (VIoT) has shown full potential in collecting an unprecedented volume of video data. How to schedule the domain-specific perceiving models and analyze the collected videos uniformly, efficiently, and especially intelligently to accomplish complicated tasks is challenging. To address the challenge, we build VIoTGPT, the framework based on LLMs to correctly interact with humans, query knowledge videos, and invoke vision models to analyze multimedia data collaboratively. To support VIoTGPT and related future works, we meticulously crafted the VIoT-Tool dataset, including the training dataset and the benchmark involving 11 representative vision models across three categories based on semi-automatic annotations. To guide LLM to act as the intelligent agent towards intelligent VIoT, we resort to the ReAct instruction tuning method based on VIoT-Tool to learn the tool capability. Quantitative and qualitative experiments and analyses demonstrate the effectiveness of VIoTGPT. We believe VIoTGPT contributes to improving human-centered experiences in VIoT applications. The project website is https://github.com/zhongyy/VIoTGPT.
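The abstract describes an LLM that learns, through ReAct instruction tuning, to pick and invoke vision tools. The following is a minimal sketch of what such a ReAct-style scheduling loop might look like; it is not the authors' implementation. The tool names (`count_people`, `detect_fire`), the prompt format, and the `scripted_llm` stand-in are all hypothetical placeholders rather than anything taken from VIoT-Tool.

```python
import re
from typing import Callable, Dict

# Hypothetical tools: placeholders, not the 11 vision models in VIoT-Tool.
def count_people(video: str) -> str:
    return f"3 people detected in {video}"

def detect_fire(video: str) -> str:
    return f"no fire detected in {video}"

TOOLS: Dict[str, Callable[[str], str]] = {
    "count_people": count_people,
    "detect_fire": detect_fire,
}

# Assumed reply format: "Action: <tool>" then "Action Input: <argument>".
ACTION_RE = re.compile(r"Action:\s*(\w+)\s*\nAction Input:\s*(.+)")
FINAL_RE = re.compile(r"Final Answer:\s*(.+)")

def scripted_llm(prompt: str) -> str:
    """Stand-in for a tuned LLM: calls one tool, then reports its result."""
    if "Observation:" not in prompt:
        return ("Thought: I need a crowd count.\n"
                "Action: count_people\n"
                "Action Input: cam01.mp4")
    obs = prompt.rsplit("Observation:", 1)[1].strip()
    return f"Thought: I have the result.\nFinal Answer: {obs}"

def react_loop(question: str,
               llm: Callable[[str], str],
               tools: Dict[str, Callable[[str], str]],
               max_steps: int = 5) -> str:
    """Alternate model replies and tool observations until a final answer."""
    prompt = f"Question: {question}\n"
    for _ in range(max_steps):
        reply = llm(prompt)
        final = FINAL_RE.search(reply)
        if final:
            return final.group(1).strip()
        action = ACTION_RE.search(reply)
        if action is None:
            return "No valid action produced."
        name, arg = action.group(1), action.group(2).strip()
        tool = tools.get(name)
        observation = tool(arg) if tool else f"unknown tool: {name}"
        # Feed the observation back so the next reply can build on it.
        prompt += f"{reply}\nObservation: {observation}\n"
    return "Step limit reached."

if __name__ == "__main__":
    print(react_loop("How many people are in cam01.mp4?", scripted_llm, TOOLS))
```

In a real system the `llm` argument would wrap the instruction-tuned model, and each entry in `TOOLS` would wrap an actual vision model; the step limit guards against a model that never emits a final answer.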
Related papers
- VideoWorld: Exploring Knowledge Learning from Unlabeled Videos [119.35107657321902]
This work explores whether a deep generative model can learn complex knowledge solely from visual input.
We develop VideoWorld, an auto-regressive video generation model trained on unlabeled video data, and test its knowledge acquisition abilities in video-based Go and robotic control tasks.
arXiv Detail & Related papers (2025-01-16T18:59:10Z)
- VipAct: Visual-Perception Enhancement via Specialized VLM Agent Collaboration and Tool-use [74.39058448757645]
We present VipAct, an agent framework that enhances vision-language models (VLMs).
VipAct consists of an orchestrator agent, which manages task requirement analysis, planning, and coordination, along with specialized agents that handle specific tasks.
We evaluate VipAct on benchmarks featuring a diverse set of visual perception tasks, with experimental results demonstrating significant performance improvements.
arXiv Detail & Related papers (2024-10-21T18:10:26Z)
- VidEgoThink: Assessing Egocentric Video Understanding Capabilities for Embodied AI [17.763461523794806]
VidEgoThink is a benchmark for evaluating egocentric video understanding capabilities in Embodied AI.
We design four key interrelated tasks: video question-answering, hierarchy planning, visual grounding and reward modeling.
We conduct extensive experiments with three types of models: API-based MLLMs, open-source image-based MLLMs, and open-source video-based MLLMs.
arXiv Detail & Related papers (2024-10-15T14:08:53Z)
- FLAME: Learning to Navigate with Multimodal LLM in Urban Environments [12.428873051106702]
Large Language Models (LLMs) have demonstrated potential in Vision-and-Language Navigation (VLN) tasks.
LLMs struggle with specialized navigation tasks, yielding suboptimal performance compared to specialized VLN models.
We introduce FLAME, a novel Multimodal LLM-based agent and architecture designed for urban VLN tasks.
arXiv Detail & Related papers (2024-08-20T17:57:46Z)
- VideoCoT: A Video Chain-of-Thought Dataset with Active Annotation Tool [21.182745175241894]
We develop an automatic annotation tool that combines machine and human experts, under the active learning paradigm.
We propose a benchmark based on the collected datasets, which exploits CoT to maximize the complex reasoning capabilities of MLLMs.
arXiv Detail & Related papers (2024-07-07T13:10:23Z)
- Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [61.143381152739046]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach.
Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations.
We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z)
- AvaTaR: Optimizing LLM Agents for Tool Usage via Contrastive Reasoning [93.96463520716759]
Large language model (LLM) agents have demonstrated impressive capabilities in utilizing external tools and knowledge to boost accuracy and reduce hallucinations.
Here, we introduce AvaTaR, a novel and automated framework that optimizes an LLM agent to effectively leverage provided tools, improving performance on a given task.
arXiv Detail & Related papers (2024-06-17T04:20:02Z)
- DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent) [73.10899129264375]
This paper explores DoraemonGPT, a comprehensive and conceptually elegant system driven by LLMs to understand dynamic scenes.
Given a video with a question/task, DoraemonGPT begins by converting the input video into a symbolic memory that stores task-related attributes.
We extensively evaluate DoraemonGPT's effectiveness on three benchmarks and several in-the-wild scenarios.
arXiv Detail & Related papers (2024-01-16T14:33:09Z)
- GPT4Image: Can Large Pre-trained Models Help Vision Models on Perception Tasks? [51.22096780511165]
We present a new learning paradigm in which the knowledge extracted from large pre-trained models is utilized to help models like CNN and ViT learn enhanced representations.
We feed detailed descriptions into a pre-trained encoder to extract text embeddings with rich semantic information that encodes the content of images.
arXiv Detail & Related papers (2023-06-01T14:02:45Z)