VIoTGPT: Learning to Schedule Vision Tools towards Intelligent Video
Internet of Things
- URL: http://arxiv.org/abs/2312.00401v1
- Date: Fri, 1 Dec 2023 07:50:53 GMT
- Authors: Yaoyao Zhong, Mengshi Qi, Rui Wang, Yuhan Qiu, Yang Zhang, Huadong Ma
- Abstract summary: The Video Internet of Things (VIoT) has shown great potential in collecting an unprecedented volume of video data.
To address the challenges posed by the fine-grained and interrelated vision tool usage in VIoT, we build VIoTGPT.
- Score: 35.97876618109385
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Video Internet of Things (VIoT) has shown great potential in
collecting an unprecedented volume of video data. Learning to schedule
perception models and analyze the collected videos intelligently is a potential
spark for VIoT. In this paper, to address the challenges posed by the
fine-grained and interrelated vision tool usage in VIoT, we build VIoTGPT, an
LLM-based framework that correctly interacts with humans, queries knowledge
videos, and invokes vision models to accomplish complicated tasks. To support
VIoTGPT and related future work, we meticulously crafted a training dataset and
established benchmarks involving 11 representative vision models across three
categories, based on semi-automatic annotations. To guide the LLM to act as an
intelligent agent for intelligent VIoT, we resort to ReAct instruction tuning
on the collected VIoT dataset to learn tool capability. Quantitative and
qualitative experimental results and analyses demonstrate the effectiveness of
VIoTGPT.
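The abstract describes an agent loop in which an LLM interacts with a user and invokes vision tools in a ReAct (Thought, Action, Observation) fashion. A minimal sketch of such a scheduling loop is below; note that the tool names, the registry, and the keyword-based `choose_tool` function standing in for the instruction-tuned LLM are all hypothetical illustrations, not the paper's actual implementation.

```python
# Hypothetical stand-ins for VIoT vision tools; the real system would wrap
# trained perception models (detection, recognition, re-identification, ...).
def detect_pedestrians(query: str) -> str:
    return "pedestrian boxes for: " + query

def recognize_plates(query: str) -> str:
    return "license plates for: " + query

def reidentify_person(query: str) -> str:
    return "matching person tracks for: " + query

# Registry mapping tool names to callables, as an agent framework might hold.
TOOLS = {
    "pedestrian_detection": detect_pedestrians,
    "plate_recognition": recognize_plates,
    "person_reid": reidentify_person,
}

def choose_tool(query: str) -> str:
    """Keyword-rule stand-in for the Action step; VIoTGPT instead uses an
    instruction-tuned LLM to pick the tool."""
    if "plate" in query or "vehicle" in query:
        return "plate_recognition"
    if "re-identify" in query or "find this person" in query:
        return "person_reid"
    return "pedestrian_detection"

def react_step(user_query: str) -> dict:
    """One Thought -> Action -> Observation cycle of a ReAct-style loop."""
    thought = f"I should pick a vision tool for: {user_query}"
    action = choose_tool(user_query)
    observation = TOOLS[action](user_query)
    return {"thought": thought, "action": action, "observation": observation}

result = react_step("find this person across the campus cameras")
print(result["action"])  # person_reid
```

In a full agent, the Observation would be fed back to the LLM, which either emits another Thought/Action pair or a final answer; the sketch shows only a single cycle.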
Related papers
- Video In-context Learning [46.40277880351059]
In this paper, we study video in-context learning, where the model starts from an existing video clip and generates diverse potential future sequences.
To achieve this, we provide a clear definition of the task, and train an autoregressive Transformer on video datasets.
We design various evaluation metrics, including both objective and subjective measures, to demonstrate the visual quality and semantic accuracy of generation results.
arXiv Detail & Related papers (2024-07-10T04:27:06Z)
- VideoCoT: A Video Chain-of-Thought Dataset with Active Annotation Tool [21.182745175241894]
We develop an automatic annotation tool that combines machine and human experts, under the active learning paradigm.
We propose a benchmark based on the collected datasets, which exploits CoT to maximize the complex reasoning capabilities of MLLMs.
arXiv Detail & Related papers (2024-07-07T13:10:23Z)
- Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs [56.391404083287235]
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach.
Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations.
We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes.
arXiv Detail & Related papers (2024-06-24T17:59:42Z)
- The Embodied World Model Based on LLM with Visual Information and Prediction-Oriented Prompts [19.00518906047691]
VOYAGER is a well-known LLM-based embodied AI that enables autonomous exploration in the Minecraft world.
However, it has issues such as underutilization of visual data and insufficient functionality as a world model.
The authors suggest that carefully devised prompts can bring out the LLM's capability as a world model.
arXiv Detail & Related papers (2024-06-02T14:50:01Z)
- DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent) [73.10899129264375]
This paper explores DoraemonGPT, a comprehensive and conceptually elegant system driven by LLMs to understand dynamic scenes.
Given a video with a question/task, DoraemonGPT begins by converting the input video into a symbolic memory that stores task-related attributes.
We extensively evaluate DoraemonGPT's effectiveness on three benchmarks and several in-the-wild scenarios.
arXiv Detail & Related papers (2024-01-16T14:33:09Z)
- Divert More Attention to Vision-Language Object Tracking [87.31882921111048]
We argue that the lack of large-scale vision-language annotated videos and ineffective vision-language interaction learning motivate the design of a more effective vision-language representation for tracking.
Particularly, in this paper, we first propose a general attribute annotation strategy to decorate videos in six popular tracking benchmarks, which contributes a large-scale vision-language tracking database with more than 23,000 videos.
We then introduce a novel framework to improve tracking by learning a unified-adaptive VL representation, whose cores are the proposed asymmetric architecture search and modality mixer (ModaMixer).
arXiv Detail & Related papers (2023-07-19T15:22:06Z)
- GPT4Image: Can Large Pre-trained Models Help Vision Models on Perception Tasks? [51.22096780511165]
We present a new learning paradigm in which the knowledge extracted from large pre-trained models is utilized to help models like CNNs and ViTs learn enhanced representations.
We feed detailed descriptions into a pre-trained encoder to extract text embeddings with rich semantic information that encodes the content of images.
arXiv Detail & Related papers (2023-06-01T14:02:45Z)
- Object Relational Graph with Teacher-Recommended Learning for Video Captioning [92.48299156867664]
We propose a complete video captioning system including both a novel model and an effective training strategy.
Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation.
Meanwhile, we design a teacher-recommended learning (TRL) method to make full use of the successful external language model (ELM) to integrate the abundant linguistic knowledge into the caption model.
arXiv Detail & Related papers (2020-02-26T15:34:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.