VURF: A General-purpose Reasoning and Self-refinement Framework for Video Understanding
- URL: http://arxiv.org/abs/2403.14743v2
- Date: Mon, 25 Mar 2024 01:18:37 GMT
- Title: VURF: A General-purpose Reasoning and Self-refinement Framework for Video Understanding
- Authors: Ahmad Mahmood, Ashmal Vayani, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan
- Abstract summary: This paper introduces a Video Understanding and Reasoning Framework (VURF) based on the reasoning power of Large Language Models (LLMs).
Ours is a novel approach to extend the utility of LLMs in the context of video tasks.
We harness their contextual learning capabilities to generate executable visual programs for video understanding.
- Score: 65.12464615430036
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent studies have demonstrated the effectiveness of Large Language Models (LLMs) as reasoning modules that can deconstruct complex tasks into more manageable sub-tasks, particularly when applied to visual reasoning tasks for images. In contrast, this paper introduces a Video Understanding and Reasoning Framework (VURF) based on the reasoning power of LLMs. Ours is a novel approach to extend the utility of LLMs in the context of video tasks, leveraging their capacity to generalize from minimal input and output demonstrations within a contextual framework. By presenting LLMs with pairs of instructions and their corresponding high-level programs, we harness their contextual learning capabilities to generate executable visual programs for video understanding. To enhance the programs' accuracy and robustness, we implement two important strategies. First, we employ a feedback-generation approach, powered by GPT-3.5, to rectify errors in programs that use unsupported functions. Second, taking motivation from recent works on self-refinement of LLM outputs, we introduce an iterative procedure for improving the quality of the in-context examples by aligning the initial outputs to the outputs that would have been generated had the LLM not been bound by the structure of the in-context examples. Our results on several video-specific tasks, including visual QA, video anticipation, pose estimation, and multi-video QA, illustrate the efficacy of these enhancements in improving the performance of visual programming approaches for video tasks.
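The two strategies described in the abstract (a GPT-3.5 feedback pass that rewrites programs calling unsupported functions, and iterative refinement of the in-context examples) sit on top of a simple few-shot program-generation loop. The sketch below illustrates that loop under stated assumptions: the prompt layout, the supported-function list, and helper names such as `generate_program` and `feedback_pass` are hypothetical, and an OpenAI-style chat API stands in for the paper's LLM backend; this is not the authors' released implementation.

```python
# Minimal sketch of a VURF-style pipeline: few-shot instruction-to-program
# prompting followed by a feedback pass for unsupported functions.
# All prompts, helper names, and the supported-function list are
# illustrative assumptions, not the authors' code.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

IN_CONTEXT_EXAMPLES = """\
Instruction: How many people enter the room in the video?
Program:
  frames = sample_frames(video)
  people = detect(frames, "person")
  answer = count_entering(people)
"""

# Hypothetical set of primitives the downstream video executor supports.
SUPPORTED_FUNCTIONS = {"sample_frames", "detect", "count_entering"}


def ask_llm(prompt: str) -> str:
    """Single chat-completion call; GPT-3.5 is used here as in the paper's feedback step."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


def generate_program(instruction: str) -> str:
    """Few-shot prompting: instruction/program pairs followed by the new instruction."""
    prompt = f"{IN_CONTEXT_EXAMPLES}\nInstruction: {instruction}\nProgram:"
    return ask_llm(prompt)


def feedback_pass(program: str) -> str:
    """If the program calls functions outside the supported set, ask the LLM to rewrite it."""
    # Crude extraction of called function names from lines like `x = f(...)`.
    called = {rhs.split("(")[0].strip() for rhs in program.split("=")[1:] if "(" in rhs}
    unsupported = called - SUPPORTED_FUNCTIONS
    if not unsupported:
        return program
    feedback = (
        f"The program below uses unsupported functions {sorted(unsupported)}. "
        f"Rewrite it using only {sorted(SUPPORTED_FUNCTIONS)}.\n\n{program}"
    )
    return ask_llm(feedback)


if __name__ == "__main__":
    program = generate_program("Does the person pick up the cup before sitting down?")
    print(feedback_pass(program))
```

The self-refinement step from the abstract would wrap this loop one level higher, iteratively improving the in-context programs by aligning them with what the LLM would generate when not bound by the examples' structure; that iteration is omitted here for brevity.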
Related papers
- Investigating Video Reasoning Capability of Large Language Models with Tropes in Movies [69.28082193942991]
This paper introduces a novel dataset, Tropes in Movies (TiM), designed as a testbed for exploring two critical yet previously overlooked video reasoning skills.
Utilizing tropes from movie storytelling, TiM evaluates the reasoning capabilities of state-of-the-art LLM-based approaches.
To address these deficiencies, we propose Face-Enhanced Viper of Role Interactions (FEVoRI) and Context Query Reduction (ConQueR).
arXiv Detail & Related papers (2024-06-16T12:58:31Z)
- From Image to Video, what do we need in multimodal LLMs? [19.85928004619801]
Multimodal Large Language Models (MLLMs) have demonstrated profound capabilities in understanding multimodal information.
We propose RED-VILLM, a Resource-Efficient Development pipeline for Video LLMs from Image LLMs.
Our approach highlights the potential for a more cost-effective and scalable advancement in multimodal models.
arXiv Detail & Related papers (2024-04-18T02:43:37Z)
- Self-Training Large Language Models for Improved Visual Program Synthesis With Visual Reinforcement [93.73648674743097]
Visual program synthesis is a promising approach to exploit the reasoning abilities of large language models for compositional computer vision tasks.
Previous work has used few-shot prompting with frozen LLMs to synthesize visual programs.
No dataset of visual programs for training exists, and acquisition of a visual program dataset cannot be easily crowdsourced.
arXiv Detail & Related papers (2024-04-06T13:25:00Z)
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
- If LLM Is the Wizard, Then Code Is the Wand: A Survey on How Code Empowers Large Language Models to Serve as Intelligent Agents [81.60906807941188]
Large language models (LLMs) are trained on a combination of natural language and formal language (code).
Code translates high-level goals into executable steps, featuring standard syntax, logical consistency, abstraction, and modularity.
arXiv Detail & Related papers (2024-01-01T16:51:20Z)
- Look, Remember and Reason: Grounded reasoning in videos with language models [5.3445140425713245]
Multi-temporal language models (LM) have recently shown promising performance in high-level reasoning tasks on videos.
We propose training an LM end-to-end on low-level surrogate tasks, including object detection, re-identification, and tracking, to endow the model with the required low-level visual capabilities.
We demonstrate the effectiveness of our framework on diverse visual reasoning tasks from the ACRE, CATER, Something-Else and STAR datasets.
arXiv Detail & Related papers (2023-06-30T16:31:14Z)
- VideoLLM: Modeling Video Sequence with Large Language Models [70.32832021713864]
Existing video understanding models are often task-specific and lack a comprehensive capability of handling diverse tasks.
We propose a novel framework called VideoLLM that leverages the sequence reasoning capabilities of pre-trained LLMs.
VideoLLM incorporates a carefully designed Modality and Semantic Translator, which converts inputs from various modalities into a unified token sequence.
arXiv Detail & Related papers (2023-05-22T17:51:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.