UnLoc: A Unified Framework for Video Localization Tasks
- URL: http://arxiv.org/abs/2308.11062v1
- Date: Mon, 21 Aug 2023 22:15:20 GMT
- Title: UnLoc: A Unified Framework for Video Localization Tasks
- Authors: Shen Yan, Xuehan Xiong, Arsha Nagrani, Anurag Arnab, Zhonghao Wang,
Weina Ge, David Ross, Cordelia Schmid
- Abstract summary: UnLoc is a new approach for temporal localization in untrimmed videos.
It uses pretrained image and text towers, and feeds tokens to a video-text fusion model.
It achieves state-of-the-art results on all three localization tasks with a unified approach.
- Score: 82.59118972890262
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While large-scale image-text pretrained models such as CLIP have been used
for multiple video-level tasks on trimmed videos, their use for temporal
localization in untrimmed videos remains relatively unexplored. We
design a new approach for this called UnLoc, which uses pretrained image and
text towers, and feeds tokens to a video-text fusion model. The outputs of the
fusion module are then used to construct a feature pyramid in which each level
connects to a head to predict a per-frame relevancy score and start/end time
displacements. Unlike previous works, our architecture enables Moment
Retrieval, Temporal Localization, and Action Segmentation with a single-stage
model, without the need for action proposals, motion-based pretrained features,
or representation masking. Unlike specialized models, our unified approach
achieves state-of-the-art results on all three localization tasks.
Code will be available at: https://github.com/google-research/scenic
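The sketch below is a minimal, hypothetical PyTorch rendering of the single-stage design described in the abstract: frozen image/text towers supply frame and text tokens, a video-text fusion transformer combines them, a temporal feature pyramid is built from the fused frame features, and each pyramid level feeds a head that predicts a per-frame relevancy score and start/end displacements. The official implementation is in Scenic (JAX); all module names, sizes, and the pooling scheme here are illustrative assumptions, not the authors' code.

```python
# Minimal, hypothetical PyTorch sketch of the single-stage design described above.
# The official UnLoc code is in Scenic (JAX); sizes and names here are illustrative.
import torch
import torch.nn as nn

class PyramidHead(nn.Module):
    """Per-level head: a per-frame relevancy score and start/end displacements."""
    def __init__(self, dim: int):
        super().__init__()
        self.relevancy = nn.Linear(dim, 1)   # relevancy logit per frame
        self.boundary = nn.Linear(dim, 2)    # non-negative distances to segment start/end

    def forward(self, x):                    # x: (B, T_level, dim)
        return self.relevancy(x), self.boundary(x).relu()

class UnLocStyleModel(nn.Module):
    """Frame/text tokens (e.g. from frozen CLIP towers) -> fusion -> feature pyramid -> heads."""
    def __init__(self, dim: int = 512, num_levels: int = 3, fusion_layers: int = 4, nhead: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, fusion_layers)   # video-text fusion module
        self.downsample = nn.MaxPool1d(kernel_size=2, stride=2)     # halves temporal resolution
        self.heads = nn.ModuleList([PyramidHead(dim) for _ in range(num_levels)])

    def forward(self, frame_tokens, text_tokens):
        # frame_tokens: (B, T, dim) from the image tower; text_tokens: (B, L, dim) from the text tower.
        fused = self.fusion(torch.cat([frame_tokens, text_tokens], dim=1))
        video = fused[:, :frame_tokens.size(1)]                     # keep the frame positions only
        outputs = []
        for head in self.heads:                                     # level 0 = finest temporal scale
            outputs.append(head(video))
            video = self.downsample(video.transpose(1, 2)).transpose(1, 2)
        return outputs  # list of (relevancy, start/end displacements), one pair per pyramid level

# Usage with random stand-in features: 2 clips, 64 frames, 16 text tokens, 512-dim tokens.
model = UnLocStyleModel()
predictions = model(torch.randn(2, 64, 512), torch.randn(2, 16, 512))
```

Pooling the fused frame features between levels yields progressively coarser temporal resolutions, so moments of different durations can be predicted at different pyramid levels without action proposals.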
Related papers
- VidLA: Video-Language Alignment at Scale [48.665918882615195]
We propose VidLA, an approach for video-language alignment at scale.
Our proposed approach surpasses state-of-the-art methods on multiple retrieval benchmarks.
arXiv Detail & Related papers (2024-03-21T22:36:24Z)
- Multi-entity Video Transformers for Fine-Grained Video Representation Learning [36.31020249963468]
We re-examine the design of transformer architectures for video representation learning.
A salient aspect of our self-supervised method is the improved integration of spatial information in the temporal pipeline.
Our Multi-entity Video Transformer (MV-Former) architecture achieves state-of-the-art results on multiple fine-grained video benchmarks.
arXiv Detail & Related papers (2023-11-17T21:23:12Z)
- Temporal Perceiving Video-Language Pre-training [112.1790287726804]
This work introduces a novel text-video localization pre-text task to enable fine-grained temporal and semantic alignment.
Specifically, text-video localization consists of moment retrieval, which predicts start and end boundaries in videos given the text description.
Our method connects the fine-grained frame representations with the word representations and implicitly distinguishes representations of different instances in the single modality.
arXiv Detail & Related papers (2023-01-18T12:15:47Z)
- Fine-tuned CLIP Models are Efficient Video Learners [54.96069171726668]
Large-scale multi-modal training with image-text pairs imparts strong generalization to the CLIP model.
A Video Fine-tuned CLIP (ViFi-CLIP) baseline is generally sufficient to bridge the domain gap from images to videos.
arXiv Detail & Related papers (2022-12-06T18:59:58Z)
- Bringing Image Scene Structure to Video via Frame-Clip Consistency of Object Tokens [93.98605636451806]
StructureViT shows how exploiting the structure of a small number of images, available only during training, can improve a video model.
SViT shows strong performance improvements on multiple video understanding tasks and datasets.
arXiv Detail & Related papers (2022-06-13T17:45:05Z)
- Few-Shot Temporal Action Localization with Query Adaptive Transformer [105.84328176530303]
Existing TAL works rely on a large number of training videos with exhaustive segment-level annotation.
Few-shot TAL aims to adapt a model to a new class represented by as few as a single video.
arXiv Detail & Related papers (2021-10-20T13:18:01Z)
- STEm-Seg: Spatio-temporal Embeddings for Instance Segmentation in Videos [17.232631075144592]
Methods for instance segmentation in videos typically follow the tracking-by-detection paradigm.
We propose a novel approach that segments and tracks instances across space and time in a single stage.
Our method achieves state-of-the-art results across multiple datasets and tasks.
arXiv Detail & Related papers (2020-03-18T18:40:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.