SViTT-Ego: A Sparse Video-Text Transformer for Egocentric Video
- URL: http://arxiv.org/abs/2406.09462v1
- Date: Thu, 13 Jun 2024 03:57:38 GMT
- Title: SViTT-Ego: A Sparse Video-Text Transformer for Egocentric Video
- Authors: Hector A. Valdez, Kyle Min, Subarna Tripathi
- Abstract summary: We pretrain SViTT-Ego, the first sparse egocentric video-text transformer model integrating edge and node sparsification.
Most notably, SViTT-Ego obtains a +2.8% gain on EgoMCQ (intra-video) accuracy compared to LAVILA large.
- Score: 11.198924693073353
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Pretraining egocentric vision-language models has become essential to improving downstream egocentric video-text tasks. These egocentric foundation models commonly use the transformer architecture. The memory footprint of these models during pretraining can be substantial. Therefore, we pretrain SViTT-Ego, the first sparse egocentric video-text transformer model integrating edge and node sparsification. We pretrain on the EgoClip dataset and incorporate the egocentric-friendly objective EgoNCE, instead of the frequently used InfoNCE. Most notably, SViTT-Ego obtains a +2.8% gain on EgoMCQ (intra-video) accuracy compared to LAVILA large, with no additional data augmentation techniques other than standard image augmentations, yet pretrainable on memory-limited devices.
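To make the memory argument concrete, the sketch below illustrates one common form of node (token) sparsification in a video-text transformer: patch tokens are ranked by their attention weight from the [CLS] token and only the top-k are kept for subsequent layers, shrinking the sequence length that later attention blocks must process. This is a minimal, hypothetical illustration of the general technique under assumed shapes and a made-up keep ratio, not the exact SViTT-Ego procedure.

```python
# Minimal sketch of node (token) sparsification for a video-text transformer.
# Illustrative only: the function name, keep_ratio, and use of [CLS] attention
# as the saliency score are assumptions, not the SViTT-Ego implementation.
import torch

def sparsify_nodes(tokens: torch.Tensor,    # (B, N, D) video patch tokens
                   cls_attn: torch.Tensor,  # (B, N) [CLS]-to-patch attention
                   keep_ratio: float = 0.25) -> torch.Tensor:
    """Keep only the most salient patch tokens for subsequent layers."""
    B, N, D = tokens.shape
    k = max(1, int(N * keep_ratio))
    # Indices of the k most-attended tokens per sample.
    topk = cls_attn.topk(k, dim=1).indices                # (B, k)
    gather_idx = topk.unsqueeze(-1).expand(-1, -1, D)     # (B, k, D)
    return tokens.gather(dim=1, index=gather_idx)         # (B, k, D)

if __name__ == "__main__":
    # Toy example: prune 3/4 of 196 patch tokens in a batch of 2 clips.
    x = torch.randn(2, 196, 768)
    attn = torch.rand(2, 196)
    print(sparsify_nodes(x, attn).shape)  # torch.Size([2, 49, 768])
```

Because every later attention layer then operates on k rather than N tokens, activation memory drops roughly with the keep ratio, which is what makes pretraining on memory-limited devices plausible.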
Related papers
- Fine-grained Spatiotemporal Grounding on Egocentric Videos [13.319346673043286]
We introduce EgoMask, the first pixel-level benchmark for fine-grained spatiotemporal grounding in egocentric videos. EgoMask is constructed by our proposed automatic annotation pipeline, which annotates referring expressions and object masks. We also create EgoMask-Train, a large-scale training dataset to facilitate model development.
arXiv Detail & Related papers (2025-08-01T10:53:27Z)
- EgoExo-Gen: Ego-centric Video Prediction by Watching Exo-centric Videos [49.24266108952835]
Given an exo-centric video, the first frame of the corresponding ego-centric video, and textual instructions, the goal is to generate future frames of the ego-centric video.
EgoExo-Gen explicitly models the hand-object dynamics for cross-view video prediction.
arXiv Detail & Related papers (2025-04-16T03:12:39Z)
- EgoNCE++: Do Egocentric Video-Language Models Really Understand Hand-Object Interactions? [48.702973928321946]
We introduce a novel asymmetric contrastive objective for EgoHOI named EgoNCE++.
Our experiments demonstrate that EgoNCE++ significantly boosts open-vocabulary HOI recognition, multi-instance retrieval, and action recognition tasks.
arXiv Detail & Related papers (2024-05-28T00:27:29Z)
- X-MIC: Cross-Modal Instance Conditioning for Egocentric Action Generalization [56.75782714530429]
We propose a cross-modal adaptation framework, which we call X-MIC.
Our pipeline learns to align frozen text embeddings to each egocentric video directly in the shared embedding space.
This results in an enhanced alignment of text embeddings to each egocentric video, leading to a significant improvement in cross-dataset generalization.
arXiv Detail & Related papers (2024-03-28T19:45:35Z)
- Retrieval-Augmented Egocentric Video Captioning [53.2951243928289]
EgoInstructor is a retrieval-augmented multimodal captioning model that automatically retrieves semantically relevant third-person instructional videos.
We train the cross-view retrieval module with a novel EgoExoNCE loss that pulls egocentric and exocentric video features closer by aligning them to shared text features that describe similar actions.
arXiv Detail & Related papers (2024-01-01T15:31:06Z)
- EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone [67.13773226242242]
Video-language pre-training learns representations that generalize to a variety of vision and language tasks.
Prior video-language pre-training frameworks use separate video and language encoders and learn task-specific cross-modal information only during fine-tuning.
A new generation of egocentric video-language pre-training incorporates cross-modal fusion directly into the video and language backbones.
arXiv Detail & Related papers (2023-07-11T17:50:15Z)
- EgoViT: Pyramid Video Transformer for Egocentric Action Recognition [18.05706639179499]
Capturing the interaction of hands with objects is important for autonomously detecting human actions from egocentric videos.
We present a pyramid video transformer with a dynamic class token generator for egocentric action recognition.
arXiv Detail & Related papers (2023-03-15T20:33:50Z)
- Egocentric Video-Language Pretraining [74.04740069230692]
Video-Language Pretraining aims to learn transferable representations that advance a wide range of video-text downstream tasks.
We exploit the recently released Ego4D dataset to pioneer egocentric video-language pretraining along three directions.
We demonstrate strong performance on five egocentric downstream tasks across three datasets.
arXiv Detail & Related papers (2022-06-03T16:28:58Z)
- Ego-Exo: Transferring Visual Representations from Third-person to First-person Videos [92.38049744463149]
We introduce an approach for pre-training egocentric video models using large-scale third-person video datasets.
Our idea is to discover latent signals in third-person video that are predictive of key egocentric-specific properties.
Our experiments show that our Ego-Exo framework can be seamlessly integrated into standard video models.
arXiv Detail & Related papers (2021-04-16T06:10:10Z)