Structured Video Tokens @ Ego4D PNR Temporal Localization Challenge 2022
- URL: http://arxiv.org/abs/2206.07689v1
- Date: Wed, 15 Jun 2022 17:36:38 GMT
- Title: Structured Video Tokens @ Ego4D PNR Temporal Localization Challenge 2022
- Authors: Elad Ben-Avraham, Roei Herzig, Karttikeya Mangalam, Amir Bar, Anna
Rohrbach, Leonid Karlinsky, Trevor Darrell, Amir Globerson
- Abstract summary: This report describes the SViT approach for the Ego4D Point of No Return (PNR) Temporal Localization Challenge.
We propose a learning framework which demonstrates how utilizing the structure of a small number of images, available only during training, can improve a video model.
SViT obtains strong performance on the challenge test set, with an absolute temporal localization error of 0.656.
- Score: 93.98605636451806
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This technical report describes the SViT approach for the Ego4D Point of No
Return (PNR) Temporal Localization Challenge. We propose a learning framework
StructureViT (SViT for short), which demonstrates how utilizing the structure
of a small number of images, available only during training, can improve a video
model. SViT relies on two key insights. First, as both images and videos
contain structured information, we enrich a transformer model with a set of
\emph{object tokens} that can be used across images and videos. Second, the
scene representations of individual frames in a video should "align" with those
of still images. This is achieved via a "Frame-Clip Consistency" loss, which
ensures the flow of structured information between images and videos. SViT
obtains strong performance on the challenge test set, with an absolute temporal
localization error of 0.656.
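As a concrete illustration of the second insight, the sketch below shows one way a "Frame-Clip Consistency" term could be written: per-frame object-token representations from the video (clip) branch are pulled toward the representations the same frames receive when processed as still images. This is a minimal sketch; the tensor shapes, the L1 distance, and the names (frame_clip_consistency_loss, clip_tokens, image_tokens, lambda_consistency) are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def frame_clip_consistency_loss(clip_tokens: torch.Tensor,
                                image_tokens: torch.Tensor) -> torch.Tensor:
    """Penalize disagreement between the two views of the same frames.

    clip_tokens:  (B, T, N, D) object-token features from the video (clip) branch.
    image_tokens: (B, T, N, D) object-token features from the per-frame
                  (still-image) branch. The L1 distance is an assumption.
    """
    return F.l1_loss(clip_tokens, image_tokens)

# Usage sketch: add the consistency term to the task loss (e.g., the PNR
# temporal localization objective), weighted by a hypothetical coefficient.
# total_loss = pnr_loss + lambda_consistency * frame_clip_consistency_loss(c, s)
```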
Related papers
- CoDeF: Content Deformation Fields for Temporally Consistent Video Processing [89.49585127724941]
CoDeF is a new type of video representation, which consists of a canonical content field and a temporal deformation field.
We experimentally show that CoDeF is able to lift image-to-image translation to video-to-video translation and lift keypoint detection to keypoint tracking without any training.
arXiv Detail & Related papers (2023-08-15T17:59:56Z)
- Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation [93.18163456287164]
This paper proposes a novel text-guided video-to-video translation framework to adapt image models to videos.
Our framework achieves global style and local texture temporal consistency at a low cost.
arXiv Detail & Related papers (2023-06-13T17:52:23Z)
- VicTR: Video-conditioned Text Representations for Activity Recognition [73.09929391614266]
We argue that better video-VLMs can be designed by focusing more on augmenting text, rather than visual information.
We introduce Video-conditioned Text Representations (VicTR), a form of text embeddings optimized w.r.t. visual embeddings.
Our model can further make use of freely-available semantic information, in the form of visually-grounded auxiliary text.
arXiv Detail & Related papers (2023-04-05T16:30:36Z)
- Bringing Image Scene Structure to Video via Frame-Clip Consistency of Object Tokens [93.98605636451806]
StructureViT shows how utilizing the structure of a small number of images, available only during training, can improve a video model.
SViT shows strong performance improvements on multiple video understanding tasks and datasets.
arXiv Detail & Related papers (2022-06-13T17:45:05Z)
- Leveraging Local Temporal Information for Multimodal Scene Classification [9.548744259567837]
Video scene classification models should capture the spatial (pixel-wise) and temporal (frame-wise) characteristics of a video effectively.
Transformer models with self-attention, which are designed to produce contextualized representations for individual tokens given a sequence of tokens, are becoming increasingly popular in many computer vision tasks.
We propose a novel self-attention block that leverages both local and global temporal relationships between the video frames to obtain better contextualized representations for the individual frames (see the illustrative sketch after this list).
arXiv Detail & Related papers (2021-10-26T19:58:32Z)
- Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval [80.7397409377659]
We propose an end-to-end trainable model that is designed to take advantage of both large-scale image and video captioning datasets.
Our model is flexible and can be trained on both image and video text datasets, either independently or in conjunction.
We show that this approach yields state-of-the-art results on standard downstream video-retrieval benchmarks.
arXiv Detail & Related papers (2021-04-01T17:48:27Z)
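Following the forward reference above, here is a rough sketch of the local-plus-global temporal self-attention idea summarized in the "Leveraging Local Temporal Information" entry. The window size, the masking scheme, and the class name LocalGlobalTemporalAttention are assumptions for illustration, not that paper's implementation.

```python
import torch
import torch.nn as nn

class LocalGlobalTemporalAttention(nn.Module):
    """Illustrative block: each frame attends to all frames (global) and,
    separately, to a small temporal neighborhood (local); the two views
    are fused with a residual connection."""

    def __init__(self, dim: int, num_heads: int = 4, window: int = 3):
        super().__init__()
        self.window = window
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, D) per-frame features
        T = x.shape[1]
        g, _ = self.global_attn(x, x, x)  # global: attend over all frames
        idx = torch.arange(T, device=x.device)
        # Local mask: True entries are blocked, so frame i only attends
        # to frames within +/- self.window.
        blocked = (idx[None, :] - idx[:, None]).abs() > self.window
        loc, _ = self.local_attn(x, x, x, attn_mask=blocked)
        return self.norm(x + g + loc)  # residual fusion of both views
```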