VIoTGPT: Learning to Schedule Vision Tools towards Intelligent Video
Internet of Things
- URL: http://arxiv.org/abs/2312.00401v1
- Date: Fri, 1 Dec 2023 07:50:53 GMT
- Title: VIoTGPT: Learning to Schedule Vision Tools towards Intelligent Video
Internet of Things
- Authors: Yaoyao Zhong, Mengshi Qi, Rui Wang, Yuhan Qiu, Yang Zhang, Huadong Ma
- Abstract summary: The Video Internet of Things (VIoT) has shown its full potential in collecting an unprecedented volume of video data.
To address the challenges posed by the fine-grained and interrelated vision tool usage of VIoT, we build VIoTGPT.
- Score: 35.97876618109385
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Video Internet of Things (VIoT) has shown its full potential in
collecting an unprecedented volume of video data. Learning to schedule
perception models and to analyze the collected videos intelligently could be
the spark for an intelligent VIoT. In this paper, to address the challenges
posed by the fine-grained and interrelated vision tool usage in VIoT, we build
VIoTGPT, an LLM-based framework that correctly interacts with humans, queries
knowledge videos, and invokes vision models to accomplish complicated tasks. To
support VIoTGPT and related future work, we meticulously crafted a training
dataset and established benchmarks involving 11 representative vision models
across three categories, based on semi-automatic annotations. To guide the LLM
to act as an intelligent agent for VIoT, we apply ReAct instruction tuning on
the collected VIoT dataset to learn tool-usage capability. Quantitative and
qualitative experimental results and analyses demonstrate the effectiveness of
VIoTGPT.
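
The abstract describes an LLM agent that schedules vision tools in a ReAct style, interleaving reasoning steps with tool invocations. The sketch below is a minimal, generic ReAct scheduling loop in that spirit; the tool names (person_reid, vehicle_reid, crowd_counting), the prompt format, and the call_llm() stub are illustrative assumptions and do not reproduce the paper's actual prompts, tools, or instruction-tuned model.

```python
# Illustrative sketch only: a minimal ReAct-style scheduling loop in the spirit
# of VIoTGPT. Tool names, prompt format, and call_llm() are assumptions for
# illustration, not the paper's actual agent, prompts, or models.
import re
from typing import Callable, Dict

# Hypothetical vision tools standing in for the 11 models used in the paper.
TOOLS: Dict[str, Callable[[str], str]] = {
    "person_reid":    lambda arg: f"[person re-ID result for '{arg}']",
    "vehicle_reid":   lambda arg: f"[vehicle re-ID result for '{arg}']",
    "crowd_counting": lambda arg: f"[crowd count for '{arg}']",
}

REACT_PROMPT = (
    "You can use tools: {tools}.\n"
    "Answer with either:\n"
    "Thought: ...\nAction: <tool>[<input>]\n"
    "or\nFinal Answer: ...\n\n"
    "Question: {question}\n{scratchpad}"
)

def call_llm(prompt: str) -> str:
    """Placeholder for the instruction-tuned LLM; replace with a real model call."""
    raise NotImplementedError

def run_agent(question: str, max_steps: int = 5) -> str:
    scratchpad = ""
    for _ in range(max_steps):
        reply = call_llm(REACT_PROMPT.format(
            tools=", ".join(TOOLS), question=question, scratchpad=scratchpad))
        if "Final Answer:" in reply:
            return reply.split("Final Answer:", 1)[1].strip()
        match = re.search(r"Action:\s*(\w+)\[(.*)\]", reply)
        if not match or match.group(1) not in TOOLS:
            scratchpad += "\nObservation: unknown tool, try again."
            continue
        # Invoke the selected vision tool and append the observation,
        # so the next LLM call can reason over the result.
        observation = TOOLS[match.group(1)](match.group(2))
        scratchpad += f"\n{reply}\nObservation: {observation}"
    return "No answer within the step budget."
```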
Related papers
- VideoWorld: Exploring Knowledge Learning from Unlabeled Videos [119.35107657321902]
This work explores whether a deep generative model can learn complex knowledge solely from visual input.
We develop VideoWorld, an auto-regressive video generation model trained on unlabeled video data, and test its knowledge acquisition abilities in video-based Go and robotic control tasks.
arXiv Detail & Related papers (2025-01-16T18:59:10Z) - VipAct: Visual-Perception Enhancement via Specialized VLM Agent Collaboration and Tool-use [74.39058448757645]
We present VipAct, an agent framework that enhances vision-language models (VLMs) through specialized agent collaboration and tool use.
VipAct consists of an orchestrator agent, which manages task requirement analysis, planning, and coordination, along with specialized agents that handle specific tasks.
We evaluate VipAct on benchmarks featuring a diverse set of visual perception tasks, with experimental results demonstrating significant performance improvements.
arXiv Detail & Related papers (2024-10-21T18:10:26Z) - VidEgoThink: Assessing Egocentric Video Understanding Capabilities for Embodied AI [17.763461523794806]
VidEgoThink is a benchmark for evaluating egocentric video understanding capabilities in Embodied AI.
We design four key interrelated tasks: video question-answering, hierarchy planning, visual grounding and reward modeling.
We conduct extensive experiments with three types of models: API-based MLLMs, open-source image-based MLLMs, and open-source video-based MLLMs.
arXiv Detail & Related papers (2024-10-15T14:08:53Z) - FLAME: Learning to Navigate with Multimodal LLM in Urban Environments [12.428873051106702]
Large Language Models (LLMs) have demonstrated potential in Vision-and-Language Navigation (VLN) tasks.
LLMs struggle with specialized navigation tasks, yielding suboptimal performance compared to specialized VLN models.
We introduce FLAME, a novel Multimodal LLM-based agent and architecture designed for urban VLN tasks.
arXiv Detail & Related papers (2024-08-20T17:57:46Z) - VideoCoT: A Video Chain-of-Thought Dataset with Active Annotation Tool [21.182745175241894]
We develop an automatic annotation tool that combines machine and human experts, under the active learning paradigm.
We propose a benchmark based on the collected datasets, which exploits CoT to maximize the complex reasoning capabilities of MLLMs.
arXiv Detail & Related papers (2024-07-07T13:10:23Z) - Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [61.143381152739046]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach.
Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations.
We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z) - AvaTaR: Optimizing LLM Agents for Tool Usage via Contrastive Reasoning [93.96463520716759]
Large language model (LLM) agents have demonstrated impressive capabilities in utilizing external tools and knowledge to boost accuracy and mitigate hallucinations.
Here, we introduce AvaTaR, a novel and automated framework that optimizes an LLM agent to effectively leverage provided tools, improving performance on a given task.
arXiv Detail & Related papers (2024-06-17T04:20:02Z) - DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent) [73.10899129264375]
This paper explores DoraemonGPT, a comprehensive and conceptually elegant system driven by LLMs to understand dynamic scenes.
Given a video with a question/task, DoraemonGPT begins by converting the input video into a symbolic memory that stores task-related attributes.
We extensively evaluate DoraemonGPT's effectiveness on three benchmarks and several in-the-wild scenarios.
arXiv Detail & Related papers (2024-01-16T14:33:09Z) - GPT4Image: Can Large Pre-trained Models Help Vision Models on Perception
Tasks? [51.22096780511165]
We present a new learning paradigm in which the knowledge extracted from large pre-trained models is utilized to help models such as CNNs and ViTs learn enhanced representations.
We feed detailed descriptions into a pre-trained encoder to extract text embeddings with rich semantic information that encodes the content of images.
arXiv Detail & Related papers (2023-06-01T14:02:45Z)