Helping Hands: An Object-Aware Ego-Centric Video Recognition Model
- URL: http://arxiv.org/abs/2308.07918v1
- Date: Tue, 15 Aug 2023 17:58:11 GMT
- Title: Helping Hands: An Object-Aware Ego-Centric Video Recognition Model
- Authors: Chuhan Zhang, Ankush Gupta, Andrew Zisserman
- Abstract summary: We introduce an object-aware decoder for improving the performance of spatio-temporal representations on ego-centric videos.
We show that the model can act as a drop-in replacement for an ego-centric video model to improve performance through visual-text grounding.
- Score: 60.350851196619296
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce an object-aware decoder for improving the performance of
spatio-temporal representations on ego-centric videos. The key idea is to
enhance object-awareness during training by tasking the model to predict hand
positions, object positions, and the semantic label of the objects using paired
captions when available. At inference time the model only requires RGB frames
as inputs, and is able to track and ground objects (although it has not been
trained explicitly for this). We demonstrate the performance of the
object-aware representations learnt by our model, by: (i) evaluating it for
strong transfer, i.e. through zero-shot testing, on a number of downstream
video-text retrieval and classification benchmarks; and (ii) by using the
representations learned as input for long-term video understanding tasks (e.g.
Episodic Memory in Ego4D). In all cases the performance improves over the state
of the art -- even compared to networks trained with far larger batch sizes. We
also show that by using noisy image-level detection as pseudo-labels in
training, the model learns to provide better bounding boxes using video
consistency, as well as grounding the words in the associated text
descriptions. Overall, we show that the model can act as a drop-in replacement
for an ego-centric video model to improve performance through visual-text
grounding.
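As a rough illustration of the training setup described in the abstract, the sketch below attaches a small query-based decoder to generic video-encoder tokens and supervises it with box and word-label targets. It is a minimal sketch under assumed module names, dimensions, and losses (ObjectAwareDecoder, plain L1 and cross-entropy are illustrative choices), not the paper's actual architecture, which also handles caption grounding and prediction-to-target matching.

```python
# Minimal sketch (not the authors' implementation) of the training-time idea
# described above: a small object-aware decoder is attached to a spatio-temporal
# video encoder and is asked to predict hand/object boxes and object word labels;
# at inference only the RGB-derived video tokens are used. Module names, query
# count, dimensions, and the loss form are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ObjectAwareDecoder(nn.Module):
    def __init__(self, dim=512, n_queries=8, vocab=1000):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.box_head = nn.Linear(dim, 4)       # (cx, cy, w, h) per query
        self.word_head = nn.Linear(dim, vocab)  # noun label per query

    def forward(self, video_tokens):            # video_tokens: (B, N, dim)
        q = self.queries.unsqueeze(0).expand(video_tokens.size(0), -1, -1)
        q = self.decoder(q, video_tokens)
        return self.box_head(q).sigmoid(), self.word_head(q)


def auxiliary_loss(boxes, word_logits, gt_boxes, gt_words):
    # Box regression plus word classification; the real method additionally
    # grounds caption nouns and matches predictions to targets, omitted here.
    return (F.l1_loss(boxes, gt_boxes)
            + F.cross_entropy(word_logits.flatten(0, 1), gt_words.flatten()))


# Usage with dummy tensors standing in for backbone tokens and pseudo-labels.
tokens = torch.randn(2, 196, 512)               # (batch, tokens, dim)
boxes, word_logits = ObjectAwareDecoder()(tokens)
loss = auxiliary_loss(boxes, word_logits,
                      torch.rand(2, 8, 4), torch.randint(0, 1000, (2, 8)))
```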
Related papers
- Rethinking Image-to-Video Adaptation: An Object-centric Perspective [61.833533295978484]
We propose a novel and efficient image-to-video adaptation strategy from the object-centric perspective.
Inspired by human perception, we integrate a proxy task of object discovery into image-to-video transfer learning.
arXiv Detail & Related papers (2024-07-09T13:58:10Z)
- Vamos: Versatile Action Models for Video Understanding [23.631145570126268]
We propose versatile action models (Vamos), a learning framework powered by a large language model as the "reasoner".
We evaluate Vamos on five benchmarks (Ego4D, NeXT-QA, IntentQA, Spacewalk-18, and EgoSchema) for its capability to model temporal dynamics, encode visual history, and perform reasoning.
arXiv Detail & Related papers (2023-11-22T17:44:24Z)
- Frozen CLIP Models are Efficient Video Learners [86.73871814176795]
Video recognition has been dominated by the end-to-end learning paradigm.
Recent advances in Contrastive Vision-Language Pre-training pave a new route for visual recognition tasks.
We present Efficient Video Learning -- an efficient framework for directly training high-quality video recognition models.
arXiv Detail & Related papers (2022-08-06T17:38:25Z)
- Revisiting Classifier: Transferring Vision-Language Models for Video Recognition [102.93524173258487]
Transferring knowledge from task-agnostic pre-trained deep models for downstream tasks is an important topic in computer vision research.
In this study, we focus on transferring knowledge for video classification tasks.
We utilize a well-pretrained language model to generate good semantic targets for efficient transfer learning.
arXiv Detail & Related papers (2022-07-04T10:00:47Z)
- Align and Prompt: Video-and-Language Pre-training with Entity Prompts [111.23364631136339]
Video-and-language pre-training has shown promising improvements on various downstream tasks.
We propose Align and Prompt: an efficient and effective video-and-language pre-training framework with better cross-modal alignment.
Our code and pre-trained models will be released.
arXiv Detail & Related papers (2021-12-17T15:55:53Z)
- Prompting Visual-Language Models for Efficient Video Understanding [28.754997650215486]
This paper presents a simple method to efficiently adapt one pre-trained visual-language model to novel tasks with minimal training.
To bridge the gap between static images and videos, temporal information is encoded with lightweight Transformers stacked on top of frame-wise visual features.
arXiv Detail & Related papers (2021-12-08T18:58:16Z)
- Spatio-Temporal Graph for Video Captioning with Knowledge Distillation [50.034189314258356]
We propose a graph model for video captioning that exploits object interactions in space and time.
Our model builds interpretable links and is able to provide explicit visual grounding.
To avoid unstable performance caused by the variable number of objects, we propose an object-aware knowledge distillation mechanism.
arXiv Detail & Related papers (2020-03-31T03:58:11Z)
- Learning Spatiotemporal Features via Video and Text Pair Discrimination [30.64670449131973]
The Cross-Modal Pair Discrimination (CPD) framework captures the correlation between a video and its associated text; a minimal contrastive sketch follows this list.
We train our CPD models on both a standard video dataset (Kinetics-210k) and an uncurated web video dataset (300k clips) to demonstrate its effectiveness.
arXiv Detail & Related papers (2020-01-16T08:28:57Z)
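The CPD entry above describes capturing video-text correlation through pair discrimination. The snippet below is a generic, minimal sketch of such a symmetric contrastive (InfoNCE-style) objective; the temperature value and symmetric form are assumptions about the general recipe, not the paper's exact loss.

```python
# Generic sketch of cross-modal pair discrimination: embeddings of matching
# video-text pairs sit on the diagonal of the similarity matrix and are
# classified against in-batch negatives. Illustrative only.
import torch
import torch.nn.functional as F


def pair_discrimination_loss(video_emb, text_emb, temperature=0.07):
    v = F.normalize(video_emb, dim=-1)   # (B, D) video embeddings
    t = F.normalize(text_emb, dim=-1)    # (B, D) text embeddings
    logits = v @ t.t() / temperature     # (B, B) pairwise similarities
    targets = torch.arange(v.size(0))    # matching pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))


loss = pair_discrimination_loss(torch.randn(4, 256), torch.randn(4, 256))
```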