Exo2EgoDVC: Dense Video Captioning of Egocentric Procedural Activities
Using Web Instructional Videos
- URL: http://arxiv.org/abs/2311.16444v2
- Date: Wed, 29 Nov 2023 06:01:34 GMT
- Title: Exo2EgoDVC: Dense Video Captioning of Egocentric Procedural Activities
Using Web Instructional Videos
- Authors: Takehiko Ohkawa, Takuma Yagi, Taichi Nishimura, Ryosuke Furuta,
Atsushi Hashimoto, Yoshitaka Ushiku, Yoichi Sato
- Abstract summary: We propose a novel benchmark for cross-view knowledge transfer of dense video captioning.
We adapt models from web instructional videos with exocentric views to an egocentric view.
- Score: 27.209391862016574
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a novel benchmark for cross-view knowledge transfer of dense video
captioning, adapting models from web instructional videos with exocentric views
to an egocentric view. While dense video captioning (predicting time segments
and their captions) is primarily studied with exocentric videos (e.g.,
YouCook2), benchmarks with egocentric videos are restricted due to data
scarcity. To overcome this limited video availability, transferring knowledge
from abundant exocentric web videos is a practical approach.
However, learning the correspondence between exocentric and egocentric views is
difficult due to their dynamic view changes. The web videos contain mixed views
focusing on either human body actions or close-up hand-object interactions,
while the egocentric view is constantly shifting as the camera wearer moves.
This necessitates an in-depth study of cross-view transfer under complex view
changes. In this work, we first create a real-life egocentric dataset (EgoYC2)
whose captions are shared with YouCook2, enabling transfer learning between
these datasets, assuming their ground truth is accessible. To bridge the view
gaps, we propose a view-invariant learning method using adversarial training in
both the pre-training and fine-tuning stages. While the pre-training is
designed to learn features invariant to the mixed views in the web videos,
the view-invariant fine-tuning further mitigates the view gap between the two
datasets. We validate our proposed method by studying how effectively it
overcomes the view change problem and how efficiently it transfers knowledge
to the egocentric domain. Our benchmark extends the study of cross-view
transfer to the new task domain of dense video captioning and points toward
methodologies for describing egocentric videos in natural language.
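
The abstract describes view-invariant learning via adversarial training in both
the pre-training and fine-tuning stages, but it does not spell out the
formulation here. The following is a minimal, hypothetical PyTorch sketch of one
common realization, a gradient-reversal view discriminator trained jointly with
the captioning objective; the module names, dimensions, and the choice of
gradient reversal are assumptions, not the paper's actual architecture.

```python
# Hypothetical sketch of adversarial view-invariant training (assumed
# gradient-reversal formulation; module names and shapes are illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; reverses and scales gradients backward."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


class ViewInvariantCaptioner(nn.Module):
    """Shared clip encoder + captioning head + adversarial view discriminator."""
    def __init__(self, feat_dim=512, num_views=2, vocab_size=10000):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU())
        self.caption_head = nn.Linear(feat_dim, vocab_size)  # stand-in for a DVC decoder
        self.view_disc = nn.Linear(feat_dim, num_views)      # e.g., exo vs. ego

    def forward(self, clip_feats, lambd=1.0):
        z = self.encoder(clip_feats)                  # features shared across views
        caption_logits = self.caption_head(z)         # task branch (captioning)
        view_logits = self.view_disc(GradReverse.apply(z, lambd))  # adversarial branch
        return caption_logits, view_logits


def training_step(model, clip_feats, caption_targets, view_labels, lambd=0.5):
    caption_logits, view_logits = model(clip_feats, lambd)
    task_loss = F.cross_entropy(caption_logits, caption_targets)
    # The discriminator tries to classify the camera view; gradient reversal
    # pushes the encoder toward features the discriminator cannot separate,
    # i.e., view-invariant features.
    adv_loss = F.cross_entropy(view_logits, view_labels)
    return task_loss + adv_loss
```

In the paper's setting, the same idea would presumably use mixed-view labels for
the web videos during pre-training and ego/exo labels during fine-tuning; that
mapping is an inference from the abstract, not a confirmed detail.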
Related papers
- Unlocking Exocentric Video-Language Data for Egocentric Video Representation Learning [80.37314291927889]
We present EMBED, a method designed to transform exocentric video-language data for egocentric video representation learning.
Egocentric videos predominantly feature close-up hand-object interactions, whereas exocentric videos offer a broader perspective on human activities.
By applying both vision and language style transfer, our framework creates a new egocentric dataset.
arXiv Detail & Related papers (2024-08-07T06:10:45Z)
- EgoNCE++: Do Egocentric Video-Language Models Really Understand Hand-Object Interactions? [48.702973928321946]
We introduce a novel asymmetric contrastive objective for EgoHOI named EgoNCE++.
Our experiments demonstrate that EgoNCE++ significantly boosts open-vocabulary HOI recognition, multi-instance retrieval, and action recognition tasks.
arXiv Detail & Related papers (2024-05-28T00:27:29Z)
- POV: Prompt-Oriented View-Agnostic Learning for Egocentric Hand-Object Interaction in the Multi-View World [59.545114016224254]
Humans are good at translating third-person observations of hand-object interactions into an egocentric view.
We propose a Prompt-Oriented View-agnostic learning framework, which enables this view adaptation with few egocentric videos.
arXiv Detail & Related papers (2024-03-09T09:54:44Z)
- Retrieval-Augmented Egocentric Video Captioning [53.2951243928289]
EgoInstructor is a retrieval-augmented multimodal captioning model that automatically retrieves semantically relevant third-person instructional videos.
We train the cross-view retrieval module with a novel EgoExoNCE loss that pulls egocentric and exocentric video features closer by aligning them to shared text features that describe similar actions. A minimal sketch of this kind of loss is given at the end of this page.
arXiv Detail & Related papers (2024-01-01T15:31:06Z)
- Learning Fine-grained View-Invariant Representations from Unpaired Ego-Exo Videos via Temporal Alignment [71.16699226211504]
We propose to learn fine-grained action features that are invariant to the viewpoints by aligning egocentric and exocentric videos in time.
To this end, we propose AE2, a self-supervised embedding approach with two key designs.
For evaluation, we establish a benchmark for fine-grained video understanding in the ego-exo context.
arXiv Detail & Related papers (2023-06-08T19:54:08Z)
- Cross-view Action Recognition Understanding From Exocentric to Egocentric Perspective [13.776455033015216]
We introduce a novel cross-view learning approach to action recognition.
First, we introduce a novel geometry-based constraint into the self-attention mechanism of the Transformer.
Then, we propose a new cross-view self-attention loss learned on unpaired cross-view data that forces the self-attention mechanism to transfer knowledge across views.
arXiv Detail & Related papers (2023-05-25T04:14:49Z)
- Self-Supervised Learning for Videos: A Survey [70.37277191524755]
Self-supervised learning has shown promise in both the image and video domains.
In this survey, we provide a review of existing approaches to self-supervised learning, focusing on the video domain.
arXiv Detail & Related papers (2022-06-18T00:26:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.
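
The EgoInstructor entry above summarizes the EgoExoNCE loss only as pulling
egocentric and exocentric video features toward shared text features that
describe similar actions. The sketch below is a hypothetical reconstruction of
such a symmetric, text-anchored InfoNCE objective; the function names, equal
weighting, and temperature value are assumptions rather than the authors'
implementation.

```python
# Hypothetical sketch of an EgoExoNCE-style objective: both ego and exo video
# features are contrasted against the same text anchors, which implicitly
# pulls the two views toward each other. Details are assumed, not confirmed.
import torch
import torch.nn.functional as F


def info_nce(video_feats, text_feats, temperature=0.07):
    """Standard InfoNCE over L2-normalized (B x D) video and text features."""
    v = F.normalize(video_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    logits = v @ t.T / temperature                       # B x B similarities
    targets = torch.arange(v.size(0), device=v.device)   # matched pairs on diagonal
    return F.cross_entropy(logits, targets)


def ego_exo_nce(ego_feats, exo_feats, text_feats, temperature=0.07):
    # Align both views to shared text features describing the same action.
    return 0.5 * (info_nce(ego_feats, text_feats, temperature)
                  + info_nce(exo_feats, text_feats, temperature))
```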