Video + CLIP Baseline for Ego4D Long-term Action Anticipation
- URL: http://arxiv.org/abs/2207.00579v1
- Date: Fri, 1 Jul 2022 17:57:28 GMT
- Title: Video + CLIP Baseline for Ego4D Long-term Action Anticipation
- Authors: Srijan Das and Michael S. Ryoo
- Abstract summary: The Video + CLIP framework makes use of a large-scale pre-trained image-text model (CLIP) together with a video encoder (the SlowFast network).
We show that the features obtained from both encoders are complementary to each other, thus outperforming the baseline on Ego4D for the task of long-term action anticipation.
- Score: 50.544635516455116
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: In this report, we introduce our adaptation of image-text models for
long-term action anticipation. Our Video + CLIP framework makes use of a
large-scale pre-trained image-text model (CLIP) and a video encoder (the
SlowFast network). The CLIP embedding provides a fine-grained understanding of
the objects relevant to an action, whereas the SlowFast network is responsible
for modeling temporal information within a video clip of a few frames. We show that
the features obtained from both encoders are complementary to each other, thus
outperforming the baseline on Ego4D for the task of long-term action
anticipation. Our code is available at
github.com/srijandas07/clip_baseline_LTA_Ego4d.
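The abstract describes two complementary streams: CLIP for object-level cues and SlowFast for short-range temporal cues. Below is a minimal PyTorch sketch of that fusion idea, assuming simple concatenation of the two features followed by per-step anticipation heads; the dimensions, number of future steps, and label-space size are illustrative placeholders, not the paper's exact configuration (see the linked repository for the actual code).

```python
# Minimal sketch: fuse a CLIP frame embedding with a SlowFast clip feature and
# predict a sequence of future actions. Fusion by concatenation and the head
# design below are assumptions for illustration, not the paper's architecture.
import torch
import torch.nn as nn

class VideoPlusCLIPFusion(nn.Module):
    def __init__(self, clip_dim=512, slowfast_dim=2304, hidden_dim=512,
                 num_actions=115, num_future_steps=20):  # placeholder sizes
        super().__init__()
        # Project the concatenated, complementary features into a shared space.
        self.fuse = nn.Sequential(
            nn.Linear(clip_dim + slowfast_dim, hidden_dim),
            nn.ReLU(),
        )
        # One classifier per anticipated future step.
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_dim, num_actions) for _ in range(num_future_steps)]
        )

    def forward(self, clip_feat, slowfast_feat):
        # clip_feat:     (B, clip_dim)     frame-level CLIP embedding
        # slowfast_feat: (B, slowfast_dim) clip-level SlowFast feature
        fused = self.fuse(torch.cat([clip_feat, slowfast_feat], dim=-1))
        return torch.stack([head(fused) for head in self.heads], dim=1)  # (B, steps, actions)

if __name__ == "__main__":
    model = VideoPlusCLIPFusion()
    logits = model(torch.randn(2, 512), torch.randn(2, 2304))
    print(logits.shape)  # torch.Size([2, 20, 115])
```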
Related papers
- PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance [44.08446730529495]
We propose a novel pooling strategy that simultaneously achieves token compression and instruction-aware visual feature aggregation (a generic sketch of this kind of prompt-guided pooling appears after this list).
Our model is termed Prompt-guided Pooling LLaVA, or PPLLaVA for short.
arXiv Detail & Related papers (2024-11-04T17:50:36Z)
- Efficient Long-Short Temporal Attention Network for Unsupervised Video Object Segmentation [23.645412918420906]
Unsupervised Video Object Segmentation (VOS) aims to identify the contours of primary foreground objects in videos without any prior knowledge.
Previous methods do not fully exploit spatial-temporal context and fail to tackle this challenging task in real time.
This motivates us to develop an efficient Long-Short Temporal Attention network (termed LSTA) for unsupervised VOS task from a holistic view.
arXiv Detail & Related papers (2023-09-21T01:09:46Z)
- Towards Smooth Video Composition [59.134911550142455]
Video generation requires consistent and persistent frames with dynamic content over time.
This work investigates modeling temporal relations for composing videos of arbitrary length, from a few frames to even infinitely long, using generative adversarial networks (GANs).
We show that the alias-free operation for single image generation, together with adequately pre-learned knowledge, brings a smooth frame transition without compromising the per-frame quality.
arXiv Detail & Related papers (2022-12-14T18:54:13Z)
- Fine-tuned CLIP Models are Efficient Video Learners [54.96069171726668]
Large-scale multi-modal training with image-text pairs imparts strong generalization to the CLIP model.
A Video Fine-tuned CLIP (ViFi-CLIP) baseline is generally sufficient to bridge the domain gap from images to videos (a minimal sketch of the frame-level recipe appears after this list).
arXiv Detail & Related papers (2022-12-06T18:59:58Z)
- Frozen CLIP Models are Efficient Video Learners [86.73871814176795]
Video recognition has been dominated by the end-to-end learning paradigm.
Recent advances in Contrastive Vision-Language Pre-training pave the way for a new route for visual recognition tasks.
We present Efficient Video Learning -- an efficient framework for directly training high-quality video recognition models.
arXiv Detail & Related papers (2022-08-06T17:38:25Z)
- A CLIP-Hitchhiker's Guide to Long Video Retrieval [84.36155238161462]
We study the adaptation of image-text models for long video retrieval.
Recent works have demonstrated state-of-the-art performance in video retrieval by adopting CLIP.
We find that the simple yet effective baseline of a weighted mean of frame embeddings via query scoring is a significant improvement (a sketch of this weighting appears after this list).
arXiv Detail & Related papers (2022-05-17T17:26:23Z)
- Frame-wise Action Representations for Long Videos via Sequence Contrastive Learning [44.412145665354736]
We introduce a novel contrastive action representation learning framework to learn frame-wise action representations.
Inspired by the recent progress of self-supervised learning, we present a novel sequence contrastive loss (SCL) applied on two correlated views.
Our approach also shows outstanding performance on video alignment and fine-grained frame retrieval tasks.
arXiv Detail & Related papers (2022-03-28T17:59:54Z)
- Beyond Short Clips: End-to-End Video-Level Learning with Collaborative Memories [56.91664227337115]
We introduce a collaborative memory mechanism that encodes information across multiple sampled clips of a video at each training iteration.
This enables the learning of long-range dependencies beyond a single clip.
Our proposed framework is end-to-end trainable and significantly improves the accuracy of video classification with negligible computational overhead.
arXiv Detail & Related papers (2021-04-02T18:59:09Z)
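For the PPLLaVA entry above, the sketch below illustrates the general idea of instruction-aware pooling: visual tokens are weighted by their similarity to a prompt embedding and then compressed to a small number of slots. The shapes, the scaled dot-product scoring, and the adaptive-pooling compression are assumptions chosen for illustration, not PPLLaVA's actual design.

```python
# Generic illustration of prompt-guided pooling: compress N visual tokens into
# k tokens, weighted by their relevance to a text-prompt embedding.
import torch
import torch.nn.functional as F

def prompt_guided_pool(visual_tokens, prompt_embed, k=16):
    # visual_tokens: (B, N, D), prompt_embed: (B, D)
    B, N, D = visual_tokens.shape
    # Score each visual token by similarity to the prompt (scaled dot product).
    scores = torch.einsum("bnd,bd->bn", visual_tokens, prompt_embed) / D ** 0.5
    weights = F.softmax(scores, dim=-1)                           # (B, N)
    # Compress the relevance-weighted tokens down to k slots.
    weighted = visual_tokens * weights.unsqueeze(-1)              # (B, N, D)
    pooled = F.adaptive_avg_pool1d(weighted.transpose(1, 2), k)   # (B, D, k)
    return pooled.transpose(1, 2)                                 # (B, k, D)

pooled = prompt_guided_pool(torch.randn(2, 256, 1024), torch.randn(2, 1024))
print(pooled.shape)  # torch.Size([2, 16, 1024])
```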
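For the ViFi-CLIP entry, the sketch below assumes the commonly used frame-level recipe for adapting CLIP to video: encode frames independently with the CLIP image encoder, average-pool the frame embeddings into a video embedding, and classify by cosine similarity to text embeddings of class prompts. Treat the details (pooling choice, logit scale) as assumptions rather than the paper's exact code.

```python
# Frame-level CLIP-to-video sketch: temporal average pooling of per-frame
# embeddings, then cosine-similarity classification against text prompts.
import torch
import torch.nn.functional as F

def video_logits(frame_embeds, text_embeds, logit_scale=100.0):
    # frame_embeds: (B, T, D) per-frame CLIP image embeddings
    # text_embeds:  (C, D)    CLIP text embeddings of class prompts
    video_embed = frame_embeds.mean(dim=1)              # temporal average pooling
    video_embed = F.normalize(video_embed, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    return logit_scale * video_embed @ text_embeds.t()  # (B, C) class logits

logits = video_logits(torch.randn(2, 16, 512), torch.randn(400, 512))
print(logits.shape)  # torch.Size([2, 400])
```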
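For the CLIP-Hitchhiker entry, the baseline it reports is a weighted mean of frame embeddings scored by the text query. The sketch below implements that idea as softmax-weighted averaging of frame embeddings by their similarity to the query embedding; the temperature and normalization choices are assumptions for illustration.

```python
# Query-scored weighted mean of frame embeddings for long-video retrieval.
import torch
import torch.nn.functional as F

def query_scored_video_embedding(frame_embeds, query_embed, temperature=0.07):
    # frame_embeds: (T, D) CLIP image embeddings of sampled frames
    # query_embed:  (D,)   CLIP text embedding of the retrieval query
    frame_embeds = F.normalize(frame_embeds, dim=-1)
    query_embed = F.normalize(query_embed, dim=-1)
    scores = frame_embeds @ query_embed                       # (T,) frame-query similarity
    weights = F.softmax(scores / temperature, dim=0)          # query-conditioned frame weights
    return (weights.unsqueeze(-1) * frame_embeds).sum(dim=0)  # (D,) video embedding

video_embed = query_scored_video_embedding(torch.randn(64, 512), torch.randn(512))
print(video_embed.shape)  # torch.Size([512])
```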