VIoTGPT: Learning to Schedule Vision Tools towards Intelligent Video
Internet of Things
- URL: http://arxiv.org/abs/2312.00401v1
- Date: Fri, 1 Dec 2023 07:50:53 GMT
- Title: VIoTGPT: Learning to Schedule Vision Tools towards Intelligent Video
Internet of Things
- Authors: Yaoyao Zhong, Mengshi Qi, Rui Wang, Yuhan Qiu, Yang Zhang, Huadong Ma
- Abstract summary: The Video Internet of Things (VIoT) has shown great potential in collecting an unprecedented volume of video data.
To address the challenges posed by the fine-grained and interrelated vision tool usage in VIoT, we build VIoTGPT.
- Score: 35.97876618109385
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Video Internet of Things (VIoT) has shown great potential in collecting an
unprecedented volume of video data. Learning to schedule perception models and to analyze
the collected videos intelligently is a promising spark for VIoT. In this paper, to address
the challenges posed by the fine-grained and interrelated vision tool usage in VIoT, we build
VIoTGPT, an LLM-based framework that correctly interacts with humans, queries knowledge
videos, and invokes vision models to accomplish complicated tasks. To support VIoTGPT and
related future work, we meticulously crafted the training dataset and established benchmarks
involving 11 representative vision models across three categories, based on semi-automatic
annotations. To guide the LLM to act as an intelligent agent towards intelligent VIoT, we
perform ReAct instruction tuning on the collected VIoT dataset to learn tool capabilities.
Quantitative and qualitative experimental results and analyses demonstrate the effectiveness
of VIoTGPT.
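The core mechanism described in the abstract is an LLM agent that decides, step by step, which vision tool to invoke. Below is a minimal, hedged sketch of such a ReAct-style tool-scheduling loop; the tool registry, the prompt format, and the scripted call_llm() stub are hypothetical placeholders and do not reproduce VIoTGPT's actual prompts, vision models, or fine-tuned LLM.
```python
# Minimal ReAct-style tool-scheduling loop (illustrative sketch only).
import re

# Toy vision "tools"; real VIoT tools would wrap perception models.
TOOLS = {
    "crowd_counting": lambda arg: f"[estimated crowd size for {arg}]",
    "vehicle_reid":   lambda arg: f"[matching vehicles for {arg}]",
}

def call_llm(trace: str) -> str:
    """Stand-in for an instruction-tuned LLM; scripted for illustration."""
    if "Observation:" not in trace:
        return " I should count people first. Action: crowd_counting[gate 3 camera]"
    return " Final Answer: the gate 3 camera shows a large crowd."

def react_agent(question: str, max_steps: int = 5) -> str:
    trace = f"Question: {question}\n"
    for _ in range(max_steps):
        step = call_llm(trace + "Thought:")
        trace += f"Thought:{step}\n"
        # The LLM either emits "Action: tool[input]" or a final answer.
        action = re.search(r"Action:\s*(\w+)\[(.*?)\]", step)
        if action is None:
            answer = re.search(r"Final Answer:\s*(.*)", step)
            if answer:
                return answer.group(1).strip()
            break
        tool, arg = action.group(1), action.group(2)
        observation = TOOLS.get(tool, lambda a: "unknown tool")(arg)
        trace += f"Observation: {observation}\n"
    return "No answer within the step budget."

print(react_agent("How crowded is the area covered by the gate 3 camera?"))
```
In VIoTGPT the decision of which tool to call is learned via ReAct instruction tuning on the collected VIoT dataset rather than scripted as above.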
Related papers
- VipAct: Visual-Perception Enhancement via Specialized VLM Agent Collaboration and Tool-use [74.39058448757645]
We present VipAct, an agent framework that enhances vision-language models (VLMs) through specialized agent collaboration and tool use.
VipAct consists of an orchestrator agent, which manages task requirement analysis, planning, and coordination, along with specialized agents that handle specific tasks.
We evaluate VipAct on benchmarks featuring a diverse set of visual perception tasks, with experimental results demonstrating significant performance improvements.
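As a rough illustration of the orchestrator-plus-specialists structure summarized above, the sketch below routes planned subtasks to specialist agents; the agent names, the planning heuristic, and the routing rule are invented for this sketch and are not VipAct's actual interfaces.
```python
# Illustrative orchestrator/specialist dispatch pattern (hypothetical names).
from typing import Callable, Dict, List

class Orchestrator:
    """Analyzes a task, plans subtasks, and routes them to specialist agents."""

    def __init__(self, specialists: Dict[str, Callable[[str], str]]):
        self.specialists = specialists

    def plan(self, task: str) -> List[str]:
        # In VipAct the orchestrator agent handles requirement analysis and
        # planning; a trivial split on ';' stands in here.
        return [step.strip() for step in task.split(";") if step.strip()]

    def run(self, task: str) -> List[str]:
        results = []
        for step in self.plan(task):
            # Route each step to the first specialist whose name appears in it.
            agent = next(
                (fn for name, fn in self.specialists.items() if name in step),
                lambda s: f"no specialist available for: {s}",
            )
            results.append(agent(step))
        return results

specialists = {
    "caption": lambda s: f"[caption produced for '{s}']",
    "detect":  lambda s: f"[objects detected for '{s}']",
}
print(Orchestrator(specialists).run("caption the scene; detect pedestrians"))
```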
arXiv Detail & Related papers (2024-10-21T18:10:26Z)
- VidEgoThink: Assessing Egocentric Video Understanding Capabilities for Embodied AI [17.763461523794806]
VidEgoThink is a benchmark for evaluating egocentric video understanding capabilities in Embodied AI.
We design four key interrelated tasks: video question-answering, hierarchy planning, visual grounding and reward modeling.
We conduct extensive experiments with three types of models: API-based MLLMs, open-source image-based MLLMs, and open-source video-based MLLMs.
arXiv Detail & Related papers (2024-10-15T14:08:53Z)
- Response Wide Shut: Surprising Observations in Basic Vision Language Model Capabilities [30.176918208200604]
Vision-Language Models (VLMs) have emerged as general purpose tools for addressing a variety of complex computer vision problems.
These models have been shown to be highly capable, yet they lack some basic visual understanding skills.
This paper sets out to understand the limitations of SoTA VLMs on fundamental visual tasks.
arXiv Detail & Related papers (2024-08-13T08:26:32Z)
- VideoCoT: A Video Chain-of-Thought Dataset with Active Annotation Tool [21.182745175241894]
We develop an automatic annotation tool that combines machine and human experts under the active learning paradigm.
We propose a benchmark based on the collected datasets, which exploits CoT to maximize the complex reasoning capabilities of MLLMs.
arXiv Detail & Related papers (2024-07-07T13:10:23Z)
- Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [56.391404083287235]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach.
Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations.
We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z)
- The Embodied World Model Based on LLM with Visual Information and Prediction-Oriented Prompts [19.00518906047691]
VOYAGER is a well-known LLM-based embodied AI that enables autonomous exploration in the Minecraft world.
It has issues such as underutilization of visual data and insufficient functionality as a world model.
The paper suggests that carefully devised prompts can bring out the LLM's function as a world model.
arXiv Detail & Related papers (2024-06-02T14:50:01Z)
- Divert More Attention to Vision-Language Object Tracking [87.31882921111048]
We argue that the lack of large-scale vision-language annotated videos and ineffective vision-language interaction learning motivate us to design a more effective vision-language representation for tracking.
Particularly, in this paper, we first propose a general attribute annotation strategy to decorate videos in six popular tracking benchmarks, which contributes a large-scale vision-language tracking database with more than 23,000 videos.
We then introduce a novel framework to improve tracking by learning a unified-adaptive VL representation, whose core components are the proposed asymmetric architecture search and modality mixer (ModaMixer).
arXiv Detail & Related papers (2023-07-19T15:22:06Z)
- GPT4Image: Can Large Pre-trained Models Help Vision Models on Perception Tasks? [51.22096780511165]
We present a new learning paradigm in which the knowledge extracted from large pre-trained models is utilized to help models like CNNs and ViTs learn enhanced representations.
We feed detailed descriptions into a pre-trained encoder to extract text embeddings with rich semantic information that encodes the content of images.
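A minimal sketch of this idea follows, assuming an auxiliary cosine-alignment loss between projected visual features and precomputed description embeddings; the projection head, the loss weight, and the training-step structure are assumptions for illustration, not the paper's exact objective.
```python
# Sketch: guide an image backbone with text embeddings from a frozen,
# pre-trained text encoder (alignment loss is an assumption, not the
# paper's exact training objective).
import torch
import torch.nn.functional as F

def training_step(backbone, proj, classifier, images, labels, text_emb):
    """One step: supervised loss plus alignment to per-image text embeddings.

    text_emb: precomputed embeddings of detailed image descriptions,
              obtained offline from a frozen pre-trained text encoder.
    """
    feats = backbone(images)            # (B, D_img) visual features
    logits = classifier(feats)          # standard classification head
    cls_loss = F.cross_entropy(logits, labels)

    # Project visual features into the text-embedding space and pull them
    # toward the description embeddings (cosine alignment).
    aligned = F.normalize(proj(feats), dim=-1)
    targets = F.normalize(text_emb, dim=-1)
    align_loss = 1.0 - (aligned * targets).sum(dim=-1).mean()

    return cls_loss + 0.5 * align_loss  # 0.5 is an arbitrary mixing weight
```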
arXiv Detail & Related papers (2023-06-01T14:02:45Z)
- Object Relational Graph with Teacher-Recommended Learning for Video Captioning [92.48299156867664]
We propose a complete video captioning system including both a novel model and an effective training strategy.
Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation.
Meanwhile, we design a teacher-recommended learning (TRL) method to make full use of the successful external language model (ELM) to integrate the abundant linguistic knowledge into the caption model.
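To make the teacher-recommended idea concrete, the sketch below combines a ground-truth cross-entropy term with a soft-target distillation term from an external language model; the temperature, the mixing weight, and the exact formulation are assumptions, not the paper's actual TRL objective.
```python
# Hedged sketch of a teacher-recommended style loss: ground-truth supervision
# plus soft targets from a frozen external language model (ELM).
import torch
import torch.nn.functional as F

def trl_loss(caption_logits, gt_tokens, elm_logits, alpha=0.7, tau=2.0):
    """caption_logits: (B, T, V) from the video captioning model.
    gt_tokens: (B, T) ground-truth word indices.
    elm_logits: (B, T, V) logits from a frozen external language model."""
    # Standard cross-entropy against the ground-truth caption.
    ce = F.cross_entropy(caption_logits.flatten(0, 1), gt_tokens.flatten())

    # Soft-target term: match the ELM's word distribution (KL divergence).
    student = F.log_softmax(caption_logits / tau, dim=-1)
    teacher = F.softmax(elm_logits / tau, dim=-1)
    kl = F.kl_div(student, teacher, reduction="batchmean") * tau * tau

    return alpha * ce + (1.0 - alpha) * kl
```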
arXiv Detail & Related papers (2020-02-26T15:34:52Z)