Look, Remember and Reason: Grounded reasoning in videos with language
models
- URL: http://arxiv.org/abs/2306.17778v3
- Date: Mon, 22 Jan 2024 00:54:30 GMT
- Title: Look, Remember and Reason: Grounded reasoning in videos with language
models
- Authors: Apratim Bhattacharyya, Sunny Panchal, Mingu Lee, Reza Pourreza, Pulkit
Madan, Roland Memisevic
- Abstract summary: Multi-modal language models (LM) have recently shown promising performance in high-level reasoning tasks on videos.
We propose training an LM end-to-end on low-level surrogate tasks, including object detection, re-identification, and tracking, to endow the model with the required low-level visual capabilities.
We demonstrate the effectiveness of our framework on diverse visual reasoning tasks from the ACRE, CATER, Something-Else and STAR datasets.
- Score: 5.3445140425713245
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-modal language models (LM) have recently shown promising performance in
high-level reasoning tasks on videos. However, existing methods still fall
short in tasks like causal or compositional spatiotemporal reasoning over
actions, in which model predictions need to be grounded in fine-grained
low-level details, such as object motions and object interactions. In this
work, we propose training an LM end-to-end on low-level surrogate tasks,
including object detection, re-identification, and tracking, to endow the model
with the required low-level visual capabilities. We show that a two-stream
video encoder with spatiotemporal attention is effective at capturing the
required static and motion-based cues in the video. By leveraging the LM's
ability to perform the low-level surrogate tasks, we can cast reasoning in
videos as the three-step process of Look, Remember, Reason wherein visual
information is extracted using low-level visual skills step-by-step and then
integrated to arrive at a final answer. We demonstrate the effectiveness of our
framework on diverse visual reasoning tasks from the ACRE, CATER,
Something-Else and STAR datasets. Our approach is trainable end-to-end and
surpasses state-of-the-art task-specific methods across these tasks by a large
margin.
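As a rough illustration of the architecture described above, here is a minimal, hypothetical PyTorch sketch of a two-stream video encoder with joint spatiotemporal attention. The patch size, layer dimensions, the use of frame differences as the motion stream, and the hand-off of attended visual tokens to a language model are assumptions made for illustration only; this is not the authors' implementation.

```python
# Minimal, hypothetical sketch of a two-stream video encoder with
# spatiotemporal attention. Layer sizes, the frame-difference motion
# stream, and the "visual tokens for an LM" framing are assumptions.
import torch
import torch.nn as nn


class TwoStreamVideoEncoder(nn.Module):
    def __init__(self, patch_dim=3 * 16 * 16, d_model=256, n_heads=8):
        super().__init__()
        # Static (appearance) stream: linear patch embedding per frame.
        self.static_proj = nn.Linear(patch_dim, d_model)
        # Motion stream: the same embedding applied to frame differences.
        self.motion_proj = nn.Linear(patch_dim, d_model)
        # Joint spatiotemporal attention over patch tokens from all frames.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, video):
        # video: (B, T, C, H, W), e.g. 8 RGB frames of 128x128 pixels.
        B, T, C, H, W = video.shape
        # Split each frame into non-overlapping 16x16 patches.
        patches = video.unfold(3, 16, 16).unfold(4, 16, 16)
        patches = patches.permute(0, 1, 3, 4, 2, 5, 6).reshape(B, T, -1, C * 16 * 16)
        # Motion stream input: differences between consecutive frames.
        motion = patches[:, 1:] - patches[:, :-1]
        static_tok = self.static_proj(patches).flatten(1, 2)   # (B, T*P, d)
        motion_tok = self.motion_proj(motion).flatten(1, 2)    # (B, (T-1)*P, d)
        tokens = torch.cat([static_tok, motion_tok], dim=1)
        # Self-attention mixes static (appearance) and motion-based cues.
        attended, _ = self.attn(tokens, tokens, tokens)
        return self.norm(attended)


if __name__ == "__main__":
    clip = torch.randn(2, 8, 3, 128, 128)   # toy batch of 8-frame clips
    print(TwoStreamVideoEncoder()(clip).shape)
```

In the Look, Remember, Reason framing, visual tokens of this kind would be interleaved with text so the LM can emit surrogate-task outputs (detections, re-identifications, tracks) step by step before producing the final answer.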
Related papers
- Response Wide Shut: Surprising Observations in Basic Vision Language Model Capabilities [30.176918208200604]
Vision-Language Models (VLMs) have emerged as general purpose tools for addressing a variety of complex computer vision problems.
These models have been shown to be highly capable, but also to lack some basic visual understanding skills.
This paper sets out to understand the limitations of SoTA VLMs on fundamental visual tasks.
arXiv Detail & Related papers (2024-08-13T08:26:32Z)
- The Surprising Effectiveness of Multimodal Large Language Models for Video Moment Retrieval [36.516226519328015]
Video-language tasks necessitate spatial and temporal comprehension and require significant compute.
This work demonstrates the surprising effectiveness of leveraging image-text pretrained MLLMs for moment retrieval.
We achieve a new state-of-the-art in moment retrieval on the widely used benchmarks Charades-STA, QVHighlights, and ActivityNet Captions.
arXiv Detail & Related papers (2024-06-26T06:59:09Z)
- VURF: A General-purpose Reasoning and Self-refinement Framework for Video Understanding [65.12464615430036]
This paper introduces a Video Understanding and Reasoning Framework (VURF) based on the reasoning power of Large Language Models (LLMs).
Ours is a novel approach that extends the utility of LLMs to video tasks.
We harness their contextual learning capabilities to generate executable visual programs for video understanding.
arXiv Detail & Related papers (2024-03-21T18:00:00Z)
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content; a minimal sketch of this focus-then-answer pattern is included after this list.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
- Momentor: Advancing Video Large Language Model with Fine-Grained Temporal Reasoning [102.54669633984278]
We propose Momentor, a Video-LLM capable of accomplishing fine-grained temporal understanding tasks.
We train Momentor on Moment-10M, enabling it to perform segment-level reasoning and localization.
arXiv Detail & Related papers (2024-02-18T03:04:38Z)
- PALM: Predicting Actions through Language Models [74.10147822693791]
We introduce PALM, an approach that tackles the task of long-term action anticipation.
Our method incorporates an action recognition model to track previous action sequences and a vision-language model to articulate relevant environmental details.
Our experimental results demonstrate that PALM surpasses the state-of-the-art methods in the task of long-term action anticipation.
arXiv Detail & Related papers (2023-11-29T02:17:27Z)
- Dense Video Object Captioning from Disjoint Supervision [77.47084982558101]
We propose a new task and model for dense video object captioning.
This task unifies spatial and temporal localization in video.
We show how our model improves upon a number of strong baselines for this new task.
arXiv Detail & Related papers (2023-06-20T17:57:23Z)
- Unifying Tracking and Image-Video Object Detection [54.91658924277527]
TrIVD (Tracking and Image-Video Detection) is the first framework that unifies image OD, video OD, and MOT within one end-to-end model.
To handle the discrepancies and semantic overlaps of category labels, TrIVD formulates detection/tracking as grounding and reasons about object categories.
arXiv Detail & Related papers (2022-11-20T20:30:28Z)
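As a companion to the Chain-of-Spot entry above, the following is a minimal, hypothetical sketch of the focus-then-answer pattern it describes: ask a vision-language model which region of the image matters for the question, crop that region at the original resolution, and answer conditioned on both the full image and the crop. The query_lvlm helper, its signature, the prompt wording, and the x0,y0,x1,y1 box format are all illustrative assumptions, not the paper's API.

```python
# Hypothetical sketch of a focus-then-answer loop in the spirit of
# Chain-of-Spot. `query_lvlm` is a placeholder, NOT a real API: plug in
# any large vision-language model that accepts images plus a text prompt.
from PIL import Image


def query_lvlm(images, prompt):
    """Placeholder for a vision-language model call returning text."""
    raise NotImplementedError("swap in your own LVLM inference call")


def answer_with_region_focus(image_path, question):
    image = Image.open(image_path)
    # Step 1: ask which region of interest is needed to answer the question.
    box_text = query_lvlm(
        [image],
        f"To answer the question '{question}', which image region matters most? "
        "Reply only with pixel coordinates x0,y0,x1,y1.",
    )
    x0, y0, x1, y1 = (int(v.strip()) for v in box_text.split(","))
    # Step 2: crop that region at full resolution so fine detail is preserved,
    # then answer conditioned on both the whole image and the zoomed-in crop.
    crop = image.crop((x0, y0, x1, y1))
    return query_lvlm([image, crop], question)
```

The point of this pattern, as the entry notes, is that the crop keeps its native resolution, so fine-grained detail becomes visible to the model without resizing the original image.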