MINOTAUR: Multi-task Video Grounding From Multimodal Queries
- URL: http://arxiv.org/abs/2302.08063v1
- Date: Thu, 16 Feb 2023 04:00:03 GMT
- Title: MINOTAUR: Multi-task Video Grounding From Multimodal Queries
- Authors: Raghav Goyal, Effrosyni Mavroudi, Xitong Yang, Sainbayar Sukhbaatar,
Leonid Sigal, Matt Feiszli, Lorenzo Torresani, Du Tran
- Abstract summary: We present a single, unified model for tackling query-based video understanding in long-form videos.
In particular, our model can address all three tasks of the Ego4D Episodic Memory benchmark.
- Score: 70.08973664126873
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video understanding tasks take many forms, from action detection to visual
query localization and spatio-temporal grounding of sentences. These tasks
differ in the type of inputs (only video, or video-query pair where query is an
image region or sentence) and outputs (temporal segments or spatio-temporal
tubes). However, at their core they require the same fundamental understanding
of the video, i.e., the actors and objects in it, their actions and
interactions. So far these tasks have been tackled in isolation with
individual, highly specialized architectures, which do not exploit the
interplay between tasks. In contrast, in this paper, we present a single,
unified model for tackling query-based video understanding in long-form videos.
In particular, our model can address all three tasks of the Ego4D Episodic
Memory benchmark which entail queries of three different forms: given an
egocentric video and a visual, textual or activity query, the goal is to
determine when and where the answer can be seen within the video. Our model
design is inspired by recent query-based approaches to spatio-temporal
grounding, and contains modality-specific query encoders and task-specific
sliding window inference that allow multi-task training with diverse input
modalities and different structured outputs. We exhaustively analyze
relationships among the tasks and illustrate that cross-task learning leads to
improved performance on each individual task, as well as the ability to
generalize to unseen tasks, such as zero-shot spatial localization of language
queries.
Related papers
- Localizing Events in Videos with Multimodal Queries [71.40602125623668]
We introduce a new benchmark, ICQ, for localizing events in videos with multimodal queries.
We include 4 styles of reference images and 5 types of refinement texts, allowing us to explore model performance across different domains.
arXiv Detail & Related papers (2024-06-14T14:35:58Z) - UniAV: Unified Audio-Visual Perception for Multi-Task Video Localization [48.519331944795304]
Video localization tasks aim to temporally locate specific instances in videos, including temporal action localization (TAL), sound event detection (SED) and audio-visual event localization (AVEL)
We present UniAV, a Unified Audio-Visual perception network, to achieve joint learning of TAL, SED and AVEL tasks for the first time.
arXiv Detail & Related papers (2024-04-04T03:28:57Z) - OmniVid: A Generative Framework for Universal Video Understanding [133.73878582161387]
We seek to unify the output space of video understanding tasks by using languages as labels and additionally introducing time and box tokens.
This enables us to address various types of video tasks, including classification, captioning, and localization.
We demonstrate such a simple and straightforward idea is quite effective and can achieve state-of-the-art or competitive results.
arXiv Detail & Related papers (2024-03-26T17:59:24Z) - Multi-Task Learning of Object State Changes from Uncurated Videos [55.60442251060871]
We learn to temporally localize object state changes by observing people interacting with objects in long uncurated web videos.
We show that our multi-task model achieves a relative improvement of 40% over the prior single-task methods.
We also test our method on long egocentric videos of the EPIC-KITCHENS and the Ego4D datasets in a zero-shot setup.
arXiv Detail & Related papers (2022-11-24T09:42:46Z) - Answer-Me: Multi-Task Open-Vocabulary Visual Question Answering [43.07139534653485]
We present Answer-Me, a task-aware multi-task framework.
We pre-train a vision-language joint model, which is multi-task as well.
Results show state-of-the-art performance, zero-shot generalization, robustness to forgetting, and competitive single-task results.
arXiv Detail & Related papers (2022-05-02T14:53:13Z) - Exploring Relational Context for Multi-Task Dense Prediction [76.86090370115]
We consider a multi-task environment for dense prediction tasks, represented by a common backbone and independent task-specific heads.
We explore various attention-based contexts, such as global and local, in the multi-task setting.
We propose an Adaptive Task-Relational Context module, which samples the pool of all available contexts for each task pair.
arXiv Detail & Related papers (2021-04-28T16:45:56Z) - Learning Modality Interaction for Temporal Sentence Localization and
Event Captioning in Videos [76.21297023629589]
We propose a novel method for learning pairwise modality interactions in order to better exploit complementary information for each pair of modalities in videos.
Our method turns out to achieve state-of-the-art performances on four standard benchmark datasets.
arXiv Detail & Related papers (2020-07-28T12:40:59Z) - Deep Multimodal Feature Encoding for Video Ordering [34.27175264084648]
We present a way to learn a compact multimodal feature representation that encodes all these modalities.
Our model parameters are learned through a proxy task of inferring the temporal ordering of a set of unordered videos in a timeline.
We analyze and evaluate the individual and joint modalities on three challenging tasks: (i) inferring the temporal ordering of a set of videos; and (ii) action recognition.
arXiv Detail & Related papers (2020-04-05T14:02:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.