ViSpeak: Visual Instruction Feedback in Streaming Videos
- URL: http://arxiv.org/abs/2503.12769v1
- Date: Mon, 17 Mar 2025 03:05:31 GMT
- Title: ViSpeak: Visual Instruction Feedback in Streaming Videos
- Authors: Shenghao Fu, Qize Yang, Yuan-Ming Li, Yi-Xing Peng, Kun-Yu Lin, Xihan Wei, Jian-Fang Hu, Xiaohua Xie, Wei-Shi Zheng,
- Abstract summary: We propose a novel task named Visual Instruction Feedback, in which models should be aware of visual contents and learn to extract instructions from them. We also propose the ViSpeak model, a SOTA streaming video understanding LMM with GPT-4o-level performance on various streaming video understanding benchmarks.
- Score: 50.99067964073338
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in Large Multi-modal Models (LMMs) have primarily focused on offline video understanding. In contrast, streaming video understanding poses great challenges to recent models due to its time-sensitive, omni-modal, and interactive characteristics. In this work, we aim to extend streaming video understanding from a new perspective and propose a novel task named Visual Instruction Feedback, in which models should be aware of visual contents and learn to extract instructions from them. For example, when users wave their hands at agents, agents should recognize the gesture and start a conversation with a welcome message. Thus, following instructions in the visual modality greatly enhances user-agent interactions. To facilitate research, we define seven key subtasks highly relevant to the visual modality and collect the ViSpeak-Instruct dataset for training and the ViSpeak-Bench benchmark for evaluation. Further, we propose the ViSpeak model, a SOTA streaming video understanding LMM with GPT-4o-level performance on various streaming video understanding benchmarks. After finetuning on our ViSpeak-Instruct dataset, ViSpeak is equipped with a basic visual instruction feedback ability, serving as a solid baseline for future research.
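To make the task setup concrete, here is a minimal, hypothetical sketch of the interaction loop the abstract describes: an agent watches a frame stream, detects a visual instruction such as a wave, and proactively starts the conversation. The `GestureDetector` class and `respond` function are illustrative placeholders and are not part of the ViSpeak model or its released code.

```python
# Hypothetical sketch of the Visual Instruction Feedback loop (not ViSpeak code).
import time
from collections import deque

class GestureDetector:
    """Stand-in for a streaming LMM that watches recent frames for visual instructions."""
    def detect(self, frames):
        # A real model would classify gestures (wave, nod, point, ...) from the
        # frame buffer; this placeholder reports that no instruction was seen.
        return None

def respond(instruction):
    # Per the task definition, the agent initiates the reply itself,
    # e.g. greeting a user who waves at the camera.
    print(f"[agent] saw '{instruction}', starting conversation: Hello! How can I help?")

def streaming_loop(frame_source, window=16, poll_hz=4):
    """Buffer incoming frames and react as soon as a visual instruction appears."""
    buffer = deque(maxlen=window)
    detector = GestureDetector()
    for frame in frame_source:            # frames arrive over time (webcam, video stream, ...)
        buffer.append(frame)
        instruction = detector.detect(list(buffer))
        if instruction is not None:       # e.g. "wave" -> proactive greeting
            respond(instruction)
        time.sleep(1.0 / poll_hz)         # time-sensitive: keep pace with the stream
```

The loop only illustrates the streaming, agent-initiated nature of the task; the paper's seven subtasks cover a broader range of visual instructions than gestures alone.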
Related papers
- PVChat: Personalized Video Chat with One-Shot Learning [15.328085576102106]
PVChat is a one-shot learning framework that enables subject-aware question answering from a single video for each subject.
Our approach optimizes a Mixture-of-Heads (MoH) enhanced ViLLM on a synthetically augmented video-QA dataset.
We evaluate PVChat on diverse datasets covering medical scenarios, TV series, anime, and real-world footage.
arXiv Detail & Related papers (2025-03-21T11:50:06Z) - OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction [95.6266030753644]
Vision-Language-Action (VLA) models aim to predict robotic actions based on visual observations and language instructions.
Existing approaches require fine-tuning pre-trained vision-language models (VLMs) as visual and language features are independently fed into downstream policies.
We propose OTTER, a novel VLA architecture that leverages existing alignments through explicit, text-aware visual feature extraction.
arXiv Detail & Related papers (2025-03-05T18:44:48Z) - PreMind: Multi-Agent Video Understanding for Advanced Indexing of Presentation-style Videos [22.39414772037232]
PreMind is a novel multi-agent multimodal framework for understanding and indexing lecture videos.
It generates multimodal indexes through three key steps: extracting slide visual content, transcribing speech narratives, and consolidating these visual and speech contents into an integrated understanding.
Three innovative mechanisms are also proposed to improve performance: leveraging prior lecture knowledge to refine visual understanding, detecting and correcting speech transcription errors using a VLM, and utilizing a critic agent for dynamic iterative self-reflection in vision analysis.
arXiv Detail & Related papers (2025-02-28T20:17:48Z) - 3rd Place Solution for MeViS Track in CVPR 2024 PVUW workshop: Motion Expression guided Video Segmentation [13.622700558266658]
We propose using frozen pre-trained vision-language models (VLM) as backbones, with a specific emphasis on enhancing cross-modal feature interaction.
Firstly, we use frozen convolutional CLIP backbone to generate feature-aligned vision and text features, alleviating the issue of domain gap.
Secondly, we add more cross-modal feature fusion in the pipeline to enhance the utilization of multi-modal information.
arXiv Detail & Related papers (2024-06-07T11:15:03Z) - InternVideo2: Scaling Foundation Models for Multimodal Video Understanding [51.129913789991924]
InternVideo2 is a new family of video foundation models (FM) that achieve state-of-the-art results in video recognition, video-speech tasks, and video-centric tasks.
Our core design is a progressive training approach that unifies masked video modeling, cross-modal contrastive learning, and next-token prediction, scaling the video encoder up to 6B parameters.
arXiv Detail & Related papers (2024-03-22T17:57:42Z) - VidCoM: Fast Video Comprehension through Large Language Models with Multimodal Tools [44.78291853329394]
VidCoM is a fast adaptive framework that leverages Large Language Models (LLMs) to reason about videos using lightweight visual tools.
The InsOVER algorithm locates the corresponding video events via an efficient Hungarian matching between decompositions of linguistic instructions and video events (a small matching sketch follows this list).
arXiv Detail & Related papers (2023-10-16T17:05:56Z) - Valley: Video Assistant with Large Language model Enhanced abilitY [41.79449203718827]
We introduce Valley, a Video Assistant with Large Language model Enhanced abilitY.
To empower Valley with video comprehension and instruction-following capabilities, we construct a video instruction dataset.
We employ ChatGPT to facilitate the construction of task-oriented conversation data.
arXiv Detail & Related papers (2023-06-12T16:11:10Z) - Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models [59.525108086957296]
Video-ChatGPT is a multimodal model that merges a video-adapted visual encoder with an LLM.
It is capable of understanding and generating detailed conversations about videos.
We introduce a new dataset of 100,000 video-instruction pairs used to train Video-ChatGPT.
arXiv Detail & Related papers (2023-06-08T17:59:56Z) - End-to-End Multimodal Representation Learning for Video Dialog [5.661732643450332]
This study proposes a new framework that combines a 3D-CNN network and transformer-based networks into a single visual encoder.
The visual encoder is jointly trained end-to-end with other input modalities such as text and audio.
Experiments on the AVSD task show significant improvement over baselines in both generative and retrieval tasks.
arXiv Detail & Related papers (2022-10-26T06:50:07Z) - Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models [67.31684040281465]
We present MOV, a simple yet effective method for Multimodal Open-Vocabulary video classification.
In MOV, we directly use the vision encoder from pre-trained VLMs with minimal modifications to encode video, optical flow and audio spectrogram.
arXiv Detail & Related papers (2022-07-15T17:59:11Z) - Look Before you Speak: Visually Contextualized Utterances [88.58909442073858]
We create a task for predicting utterances in a video using both visual frames and transcribed speech as context.
By exploiting the large number of instructional videos online, we train a model to solve this task at scale, without the need for manual annotations.
Our model achieves state-of-the-art performance on a number of downstream VideoQA benchmarks.
arXiv Detail & Related papers (2020-12-10T14:47:02Z) - Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, ApartmenTour, that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)
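As a side note on the VidCoM entry above, the Hungarian matching it mentions can be illustrated with a short, self-contained sketch. The embeddings below are random placeholders standing in for decomposed instruction phrases and candidate video events; InsOVER's actual decomposition and features are not reproduced here.

```python
# Illustrative Hungarian matching between instruction parts and video events
# (see the VidCoM / InsOVER entry above). Embeddings are random placeholders.
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
instruction_parts = rng.normal(size=(3, 512))   # e.g. phrases decomposed from the instruction
video_events = rng.normal(size=(5, 512))        # e.g. candidate event segments in the video

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Cost = 1 - cosine similarity for every (phrase, event) pair.
cost = 1.0 - l2_normalize(instruction_parts) @ l2_normalize(video_events).T

# Hungarian algorithm: one-to-one assignment minimising the total cost.
rows, cols = linear_sum_assignment(cost)
for r, c in zip(rows, cols):
    print(f"instruction part {r} -> video event {c} (cost {cost[r, c]:.3f})")
```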
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.