PromptonomyViT: Multi-Task Prompt Learning Improves Video Transformers
using Synthetic Scene Data
- URL: http://arxiv.org/abs/2212.04821v3
- Date: Tue, 5 Dec 2023 20:40:59 GMT
- Title: PromptonomyViT: Multi-Task Prompt Learning Improves Video Transformers
using Synthetic Scene Data
- Authors: Roei Herzig, Ofir Abramovich, Elad Ben-Avraham, Assaf Arbelle, Leonid
Karlinsky, Ariel Shamir, Trevor Darrell, Amir Globerson
- Abstract summary: We propose an approach to leverage synthetic scene data for improving video understanding.
We present a multi-task prompt learning approach for video transformers.
We show strong performance improvements on multiple video understanding tasks and datasets.
- Score: 85.48684148629634
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Action recognition models have achieved impressive results by incorporating
scene-level annotations, such as objects, their relations, 3D structure, and
more. However, gathering and annotating scene structure for videos requires
significant effort, making these methods expensive to train. In contrast,
synthetic datasets generated by graphics
engines provide powerful alternatives for generating scene-level annotations
across multiple tasks. In this work, we propose an approach to leverage
synthetic scene data for improving video understanding. We present a multi-task
prompt learning approach for video transformers, where a shared video
transformer backbone is enhanced by a small set of specialized parameters for
each task. Specifically, we add a set of "task prompts", each corresponding to
a different task, and let each prompt predict task-related annotations. This
design allows the model to capture information shared among synthetic scene
tasks as well as information shared between synthetic scene tasks and a real
video downstream task throughout the entire network. We refer to this approach
as "Promptonomy", since the prompts model task-related structure. We propose
the PromptonomyViT model (PViT), a video transformer that incorporates various
types of scene-level information from synthetic data using the "Promptonomy"
approach. PViT shows strong performance improvements on multiple video
understanding tasks and datasets. Project page:
https://ofir1080.github.io/PromptonomyViT
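The abstract describes attaching a small set of learnable "task prompts" to a shared video transformer, where each prompt predicts the annotations of one synthetic scene task while a classification token drives the downstream real-video task. The PyTorch sketch below illustrates one way such a design could be wired up; the class name, token layout, task heads, and dimensions are illustrative assumptions, not the released PViT implementation.

# Hedged sketch of a multi-task "task prompt" video transformer in the spirit
# of PromptonomyViT (PViT). All names, dimensions, and task heads here are
# illustrative assumptions, not the authors' released code.
import torch
import torch.nn as nn

class MultiTaskPromptViT(nn.Module):
    def __init__(self, embed_dim=768, depth=12, num_heads=12,
                 task_out_dims=None, num_classes=174):
        super().__init__()
        # Hypothetical synthetic-scene tasks, e.g. per-patch depth and segmentation.
        task_out_dims = task_out_dims or {"depth": 196, "seg": 196}

        # Shared backbone: a stack of standard transformer encoder layers.
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=depth)

        # One learnable prompt token per synthetic-scene task, plus a CLS token
        # for the downstream (real-video) classification task.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.task_prompts = nn.ParameterDict({
            name: nn.Parameter(torch.zeros(1, 1, embed_dim))
            for name in task_out_dims
        })

        # Per-task heads decode each prompt into its task annotation; the CLS
        # token feeds the downstream video classification head.
        self.task_heads = nn.ModuleDict({
            name: nn.Linear(embed_dim, out_dim)
            for name, out_dim in task_out_dims.items()
        })
        self.cls_head = nn.Linear(embed_dim, num_classes)

    def forward(self, patch_tokens):
        # patch_tokens: (B, N, embed_dim) spatio-temporal patch embeddings,
        # assumed to already include positional information.
        B = patch_tokens.size(0)
        prompts = [p.expand(B, -1, -1) for p in self.task_prompts.values()]
        tokens = torch.cat(
            [self.cls_token.expand(B, -1, -1), *prompts, patch_tokens], dim=1)
        tokens = self.backbone(tokens)

        # Slot 0 is CLS; slots 1..K are the task prompts, in insertion order.
        cls_out = tokens[:, 0]
        task_preds = {
            name: head(tokens[:, 1 + i])
            for i, (name, head) in enumerate(self.task_heads.items())
        }
        return self.cls_head(cls_out), task_preds

In this sketch all tokens are processed jointly by the shared backbone, so the prompts can exchange information with the patch tokens and with each other throughout the network, which is the property the abstract highlights.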
Related papers
- TAM-VT: Transformation-Aware Multi-scale Video Transformer for Segmentation and Tracking [33.75267864844047]
Video Object Segmentation (VOS) has emerged as an increasingly important problem with the availability of larger datasets and more complex and realistic settings.
We propose a novel, clip-based DETR-style encoder-decoder architecture, which focuses on systematically analyzing and addressing aforementioned challenges.
Specifically, we propose a novel transformation-aware loss that focuses learning on portions of the video where an object undergoes significant deformations.
arXiv Detail & Related papers (2023-12-13T21:02:03Z)
- Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation [69.20173154096]
We develop a framework comprised of two functional modules, Motion Structure Retrieval and Structure-Guided Text-to-Video Synthesis.
For the first module, we leverage an off-the-shelf video retrieval system and extract video depths as motion structure.
For the second module, we propose a controllable video generation model that offers flexible controls over structure and characters.
arXiv Detail & Related papers (2023-07-13T17:57:13Z) - Dense Video Object Captioning from Disjoint Supervision [77.47084982558101]
We propose a new task and model for dense video object captioning.
This task unifies spatial and temporal localization in video.
We show how our model improves upon a number of strong baselines for this new task.
arXiv Detail & Related papers (2023-06-20T17:57:23Z)
- TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency [133.75876535332003]
We focus on summarizing instructional videos, an under-explored area of video summarization.
Existing video summarization datasets rely on manual frame-level annotations.
We propose an instructional video summarization network that combines a context-aware temporal video encoder and a segment scoring transformer.
arXiv Detail & Related papers (2022-08-14T04:07:40Z)
- Bringing Image Scene Structure to Video via Frame-Clip Consistency of Object Tokens [93.98605636451806]
StructureViT shows how utilizing the structure of a small number of images only available during training can improve a video model.
SViT shows strong performance improvements on multiple video understanding tasks and datasets.
arXiv Detail & Related papers (2022-06-13T17:45:05Z)
- Object-aware Video-language Pre-training for Retrieval [24.543719616308945]
We present Object-aware Transformers, an object-centric approach that extends video-language transformers to incorporate object representations.
We show clear improvement in performance across all tasks and datasets considered, demonstrating the value of a model that incorporates object representations into a video-language architecture.
arXiv Detail & Related papers (2021-12-01T17:06:39Z)
- Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers [89.00926092864368]
We present a semantics-controlled multi-modal shuffled Transformer reasoning framework for the audio-visual scene-aware dialog task.
We also present a novel dynamic scene graph representation learning pipeline that consists of an intra-frame reasoning layer producing semantic graph representations for every frame.
Our results demonstrate state-of-the-art performances on all evaluation metrics.
arXiv Detail & Related papers (2020-07-08T02:00:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.