MINOTAUR: Multi-task Video Grounding From Multimodal Queries
- URL: http://arxiv.org/abs/2302.08063v1
- Date: Thu, 16 Feb 2023 04:00:03 GMT
- Title: MINOTAUR: Multi-task Video Grounding From Multimodal Queries
- Authors: Raghav Goyal, Effrosyni Mavroudi, Xitong Yang, Sainbayar Sukhbaatar,
Leonid Sigal, Matt Feiszli, Lorenzo Torresani, Du Tran
- Abstract summary: We present a single, unified model for tackling query-based video understanding in long-form videos.
In particular, our model can address all three tasks of the Ego4D Episodic Memory benchmark.
- Score: 70.08973664126873
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video understanding tasks take many forms, from action detection to visual
query localization and spatio-temporal grounding of sentences. These tasks
differ in the type of inputs (only video, or a video-query pair where the query is
an image region or a sentence) and outputs (temporal segments or spatio-temporal
tubes). However, at their core they require the same fundamental understanding
of the video, i.e., the actors and objects in it, their actions and
interactions. So far these tasks have been tackled in isolation with
individual, highly specialized architectures, which do not exploit the
interplay between tasks. In contrast, in this paper, we present a single,
unified model for tackling query-based video understanding in long-form videos.
In particular, our model can address all three tasks of the Ego4D Episodic
Memory benchmark which entail queries of three different forms: given an
egocentric video and a visual, textual or activity query, the goal is to
determine when and where the answer can be seen within the video. Our model
design is inspired by recent query-based approaches to spatio-temporal
grounding, and contains modality-specific query encoders and task-specific
sliding window inference that allow multi-task training with diverse input
modalities and different structured outputs. We exhaustively analyze
relationships among the tasks and illustrate that cross-task learning leads to
improved performance on each individual task, as well as the ability to
generalize to unseen tasks, such as zero-shot spatial localization of language
queries.
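The following is a minimal, illustrative sketch (PyTorch, not the authors' code) of the architectural pattern the abstract describes: modality-specific query encoders for a visual crop, a sentence, or an activity label feed a shared fusion module over clip-level video features, and a long video is scored with sliding-window inference. All feature dimensions, module choices (linear encoders, a small transformer), the number of activity classes, and the window/stride values are assumptions made for illustration only.
```python
import torch
import torch.nn as nn

class MultimodalQueryGrounder(nn.Module):
    """Toy stand-in for a unified, query-based video grounding model."""
    def __init__(self, dim=256):
        super().__init__()
        self.video_proj = nn.Linear(512, dim)          # projects precomputed clip features
        # Modality-specific query encoders (dimensions are illustrative assumptions).
        self.visual_query_enc = nn.Linear(512, dim)    # image-region feature
        self.text_query_enc = nn.Linear(768, dim)      # pooled sentence embedding
        self.activity_query_enc = nn.Embedding(110, dim)  # activity-label query
        # Shared fusion transformer over [query token; clip tokens].
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.frame_score = nn.Linear(dim, 1)           # per-clip relevance head

    def encode_query(self, query, modality):
        if modality == "visual":
            return self.visual_query_enc(query)
        if modality == "text":
            return self.text_query_enc(query)
        return self.activity_query_enc(query)          # modality == "activity"

    def forward(self, clip_feats, query, modality):
        # clip_feats: (B, T, 512) precomputed clip features for one window.
        v = self.video_proj(clip_feats)                      # (B, T, dim)
        q = self.encode_query(query, modality).unsqueeze(1)  # (B, 1, dim)
        fused = self.fusion(torch.cat([q, v], dim=1))        # (B, 1+T, dim)
        return self.frame_score(fused[:, 1:]).squeeze(-1)    # (B, T) scores

def sliding_window_inference(model, video_feats, query, modality, window=64, stride=32):
    """Score a long video with overlapping windows, keeping the max score per clip."""
    T = video_feats.shape[1]
    starts = list(range(0, max(T - window, 0) + 1, stride))
    if starts[-1] + window < T:                # make sure the tail is covered
        starts.append(max(T - window, 0))
    scores = torch.full((video_feats.shape[0], T), float("-inf"))
    for start in starts:
        end = min(start + window, T)
        with torch.no_grad():
            s = model(video_feats[:, start:end], query, modality)
        scores[:, start:end] = torch.maximum(scores[:, start:end], s)
    return scores

# Usage with random tensors: a text query over 300 clips of a long video.
model = MultimodalQueryGrounder()
video = torch.randn(1, 300, 512)
text_q = torch.randn(1, 768)
scores = sliding_window_inference(model, video, text_q, "text")  # (1, 300)
```
In this sketch, overlapping windows are merged with an element-wise maximum over per-clip scores; in practice each task would decode its own structured output (a temporal segment or a spatio-temporal tube) from such scores with task-specific post-processing.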
Related papers
- Temporal2Seq: A Unified Framework for Temporal Video Understanding Tasks [26.007846170517055]
We propose a single unified framework, coined as Temporal2Seq, to formulate the output of temporal video understanding tasks as a sequence of discrete tokens.
With this unified token representation, Temporal2Seq can train a generalist model within a single architecture on different video understanding tasks.
We evaluate our Temporal2Seq generalist model on the corresponding test sets of three tasks, demonstrating that Temporal2Seq can produce reasonable results on various tasks.
arXiv Detail & Related papers (2024-09-27T06:37:47Z)
- Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos [35.974750867072345]
This paper considers the problem of Multi-Hop Video Question Answering (MH-VidQA) in long-form egocentric videos.
We develop an automated pipeline to create multi-hop question-answering pairs with associated temporal evidence.
We then propose a novel architecture, termed Grounding Scattered Evidence with Large Language Model (GeLM), that enhances multi-modal large language models.
arXiv Detail & Related papers (2024-08-26T17:58:47Z)
- UniAV: Unified Audio-Visual Perception for Multi-Task Video Event Localization [83.89550658314741]
Video localization tasks aim to temporally locate specific instances in videos, including temporal action localization (TAL), sound event detection (SED) and audio-visual event localization (AVEL).
We present UniAV, a Unified Audio-Visual perception network, to achieve joint learning of TAL, SED and AVEL tasks for the first time.
arXiv Detail & Related papers (2024-04-04T03:28:57Z)
- OmniVid: A Generative Framework for Universal Video Understanding [133.73878582161387]
We seek to unify the output space of video understanding tasks by using languages as labels and additionally introducing time and box tokens.
This enables us to address various types of video tasks, including classification, captioning, and localization.
We demonstrate such a simple and straightforward idea is quite effective and can achieve state-of-the-art or competitive results.
arXiv Detail & Related papers (2024-03-26T17:59:24Z)
- Multi-Task Learning of Object State Changes from Uncurated Videos [55.60442251060871]
We learn to temporally localize object state changes by observing people interacting with objects in long uncurated web videos.
We show that our multi-task model achieves a relative improvement of 40% over the prior single-task methods.
We also test our method on long egocentric videos of the EPIC-KITCHENS and the Ego4D datasets in a zero-shot setup.
arXiv Detail & Related papers (2022-11-24T09:42:46Z)
- Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering [43.07139534653485]
We present Answer-Me, a task-aware multi-task framework.
We pre-train a vision-language joint model, which is multi-task as well.
Results show state-of-the-art performance, zero-shot generalization, robustness to forgetting, and competitive single-task results.
arXiv Detail & Related papers (2022-05-02T14:53:13Z)
- Exploring Relational Context for Multi-Task Dense Prediction [76.86090370115]
We consider a multi-task environment for dense prediction tasks, represented by a common backbone and independent task-specific heads.
We explore various attention-based contexts, such as global and local, in the multi-task setting.
We propose an Adaptive Task-Relational Context module, which samples the pool of all available contexts for each task pair.
arXiv Detail & Related papers (2021-04-28T16:45:56Z)
- Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos [76.21297023629589]
We propose a novel method for learning pairwise modality interactions in order to better exploit complementary information for each pair of modalities in videos.
Our method achieves state-of-the-art performance on four standard benchmark datasets.
arXiv Detail & Related papers (2020-07-28T12:40:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.