Learning Activity View-invariance Under Extreme Viewpoint Changes via Curriculum Knowledge Distillation
- URL: http://arxiv.org/abs/2504.05451v1
- Date: Mon, 07 Apr 2025 19:30:30 GMT
- Title: Learning Activity View-invariance Under Extreme Viewpoint Changes via Curriculum Knowledge Distillation
- Authors: Arjun Somayazulu, Efi Mavroudi, Changan Chen, Lorenzo Torresani, Kristen Grauman
- Abstract summary: Methods for view-invariant learning from video rely on controlled multi-view settings with minimal scene clutter. We introduce a method for learning rich video representations in the presence of such severe view-occlusions. We evaluate our approach on two tasks, outperforming SOTA models on both temporal keystep grounding and fine-grained keystep recognition benchmarks.
- Score: 67.7916602652644
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Traditional methods for view-invariant learning from video rely on controlled multi-view settings with minimal scene clutter. However, they struggle with in-the-wild videos that exhibit extreme viewpoint differences and share little visual content. We introduce a method for learning rich video representations in the presence of such severe view-occlusions. We first define a geometry-based metric that ranks views at a fine-grained temporal scale by their likely occlusion level. Then, using those rankings, we formulate a knowledge distillation objective that preserves action-centric semantics with a novel curriculum learning procedure that pairs incrementally more challenging views over time, thereby allowing smooth adaptation to extreme viewpoint differences. We evaluate our approach on two tasks, outperforming SOTA models on both temporal keystep grounding and fine-grained keystep recognition benchmarks - particularly on views that exhibit severe occlusion.
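To make the curriculum idea above concrete, here is a minimal sketch of how views ranked by an occlusion score could be paired for distillation, with the student view growing harder as training progresses. The occlusion scores, feature extractor, and schedule below are illustrative placeholders, not the paper's implementation (only the abstract is available here).

```python
# Minimal sketch (assumptions, not the authors' code) of curriculum-paired
# knowledge distillation: views are ranked per time step by an occlusion
# score, and the gap between teacher (least-occluded) and student views
# widens as training progresses.
import torch
import torch.nn.functional as F

def curriculum_pairs(occlusion_scores: torch.Tensor, progress: float):
    """occlusion_scores: (T, V) score per time step and view.
    progress: training progress in [0, 1]; larger => harder student views."""
    order = occlusion_scores.argsort(dim=1)            # least- to most-occluded
    teacher_idx = order[:, 0]                          # cleanest view per step
    num_views = occlusion_scores.shape[1]
    hardest = int(round(progress * (num_views - 1)))   # curriculum position
    student_idx = order[:, hardest]
    return teacher_idx, student_idx

def distillation_loss(features: torch.Tensor, teacher_idx, student_idx):
    """features: (T, V, D) per-view clip features; teacher branch is frozen."""
    t = torch.arange(features.shape[0])
    teacher = features[t, teacher_idx].detach()        # no gradient to teacher
    student = features[t, student_idx]
    return 1.0 - F.cosine_similarity(student, teacher, dim=-1).mean()

# toy usage: 8 time steps, 4 camera views, 16-dim features, mid-training
scores = torch.rand(8, 4)                              # stand-in for the geometry-based metric
feats = torch.randn(8, 4, 16, requires_grad=True)      # stand-in for backbone features
ti, si = curriculum_pairs(scores, progress=0.5)
loss = distillation_loss(feats, ti, si)
loss.backward()
```

In the paper the occlusion scores would come from its geometry-based metric and the features from a video backbone; random tensors are used here only to keep the sketch runnable.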
Related papers
- Moment Quantization for Video Temporal Grounding [29.081100914208974]
We propose a novel Moment-Quantization based Video Temporal Grounding method (MQVTG).
MQVTG quantizes the input video into various discrete vectors to enhance the discrimination between relevant and irrelevant moments.
Our method effectively groups relevant features and separates irrelevant ones, aligning with our goal of enhancing discrimination.
arXiv Detail & Related papers (2025-04-03T05:21:14Z)
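As a rough illustration of the moment-quantization idea above, the following sketch snaps each moment feature to its nearest codebook entry with a straight-through gradient. The codebook size, feature dimensions, and usage are assumptions, not MQVTG's actual design.

```python
# Toy moment quantization (assumed details, not the MQVTG implementation):
# each moment feature is replaced by its nearest codebook entry, so relevant
# and irrelevant moments end up clustered around different discrete codes.
import torch

def quantize_moments(moment_feats: torch.Tensor, codebook: torch.Tensor):
    """moment_feats: (T, D) per-moment features; codebook: (K, D) learned codes."""
    dists = torch.cdist(moment_feats, codebook)     # (T, K) pairwise distances
    codes = dists.argmin(dim=1)                     # nearest code per moment
    quantized = codebook[codes]
    # straight-through estimator so gradients still reach the encoder
    quantized = moment_feats + (quantized - moment_feats).detach()
    return quantized, codes

feats = torch.randn(20, 64)          # 20 moments, 64-dim features (placeholder)
codebook = torch.randn(128, 64)      # 128 codes (placeholder size)
quantized, codes = quantize_moments(feats, codebook)
```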
- Knowledge-enhanced Multi-perspective Video Representation Learning for Scene Recognition [33.800842679024164]
We address the problem of video scene recognition, whose goal is to learn a high-level video representation to classify scenes in videos.
Most existing works identify scenes in videos only from visual or textual information from a temporal perspective.
We propose a novel two-stream framework to model video representations from multiple perspectives.
arXiv Detail & Related papers (2024-01-09T04:37:10Z)
- No More Shortcuts: Realizing the Potential of Temporal Self-Supervision [69.59938105887538]
We propose a more challenging reformulation of temporal self-supervision as frame-level (rather than clip-level) recognition tasks.
We demonstrate experimentally that our more challenging frame-level task formulations and the removal of shortcuts drastically improve the quality of features learned through temporal self-supervision.
arXiv Detail & Related papers (2023-12-20T13:20:31Z)
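The frame-level reformulation above can be pictured with a toy example: shuffle the frames of a clip and ask a per-frame head to recover each frame's original position, so every frame carries its own supervision target. The encoder and head below are stand-ins, not the paper's architecture.

```python
# Toy frame-level temporal self-supervision (my own construction, not the
# paper's code): the clip is shuffled and a per-frame head must predict each
# frame's original temporal index, yielding one label per frame.
import torch
import torch.nn as nn

num_frames, feat_dim = 16, 128
frame_encoder = nn.Linear(feat_dim, feat_dim)          # stand-in for a video backbone
position_head = nn.Linear(feat_dim, num_frames)        # predicts original frame index

frames = torch.randn(num_frames, feat_dim)             # pretend per-frame features
perm = torch.randperm(num_frames)                      # shuffle the clip
logits = position_head(frame_encoder(frames[perm]))    # (num_frames, num_frames)
loss = nn.functional.cross_entropy(logits, perm)       # frame-level targets
loss.backward()
```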
- Exo2EgoDVC: Dense Video Captioning of Egocentric Procedural Activities Using Web Instructional Videos [25.910110689486952]
We propose a novel benchmark for cross-view knowledge transfer of dense video captioning. We adapt models from web instructional videos with exocentric views to an egocentric view. Our experiments confirm the effectiveness of overcoming the view change problem and knowledge transfer to egocentric views.
arXiv Detail & Related papers (2023-11-28T02:51:13Z)
- Learning from Untrimmed Videos: Self-Supervised Video Representation Learning with Hierarchical Consistency [60.756222188023635]
We propose to learn representations by leveraging the more abundant information in untrimmed videos.
HiCo generates stronger representations on untrimmed videos and also improves representation quality when applied to trimmed videos.
arXiv Detail & Related papers (2022-04-06T18:04:54Z)
- Efficient Modelling Across Time of Human Actions and Interactions [92.39082696657874]
We argue that current fixed-size temporal kernels in 3D convolutional neural networks (CNNs) can be improved to better deal with temporal variations in the input.
We study how to better handle variations between classes of actions by enhancing their feature differences over different layers of the architecture.
The proposed approaches are evaluated on several benchmark action recognition datasets and show competitive results.
arXiv Detail & Related papers (2021-10-05T15:39:11Z)
- ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use contrastive loss with video clips as the instances and learn visual representation by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is the key to learning robust video representations.
arXiv Detail & Related papers (2021-06-04T08:44:50Z)
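A generic way to encode "consistency between positive samples", as described above, is an InfoNCE loss that pulls two augmented views of the same clip together while pushing other clips in the batch apart. The sketch below uses this standard formulation as an assumption; it is not ASCNet's specific appearance-speed objective.

```python
# Standard InfoNCE consistency between positive pairs (assumptions mine, not
# the ASCNet code): two differently augmented versions of the same clip should
# map to similar embeddings, with other clips in the batch acting as negatives.
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1):
    """z1, z2: (B, D) embeddings of two positive views of the same B clips."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(z1.shape[0])         # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

a = torch.randn(32, 128, requires_grad=True)    # e.g. appearance-augmented clips
b = torch.randn(32, 128, requires_grad=True)    # e.g. speed-augmented clips
loss = info_nce(a, b)
loss.backward()
```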
- Few-shot Partial Multi-view Learning [103.33865779721458]
We propose a new task called few-shot partial multi-view learning.
It focuses on overcoming the negative impact of the view-missing issue in the low-data regime.
We conduct extensive experiments to evaluate our method.
arXiv Detail & Related papers (2021-05-05T13:34:43Z)
- Generalized Few-Shot Video Classification with Video Retrieval and Feature Generation [132.82884193921535]
We argue that previous methods underestimate the importance of video feature learning and propose a two-stage approach.
We show that this simple baseline approach outperforms prior few-shot video classification methods by over 20 points on existing benchmarks.
We present two novel approaches that yield further improvement.
arXiv Detail & Related papers (2020-07-09T13:05:32Z)
- Generalized Multi-view Shared Subspace Learning using View Bootstrapping [43.027427742165095]
A key objective in multi-view learning is to model the information common to multiple parallel views of a class of objects/events in order to improve downstream learning tasks.
We present a neural method based on multi-view correlation to capture the information shared across a large number of views by subsampling them in a view-agnostic manner during training.
Experiments on spoken word recognition, 3D object classification and pose-invariant face recognition demonstrate the robustness of view bootstrapping to model a large number of views.
arXiv Detail & Related papers (2020-05-12T20:35:14Z)
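To illustrate the view-bootstrapping idea above, the sketch below subsamples a random subset of views at each step and pulls the sampled views of each instance toward their mean embedding. The paper's multi-view correlation objective is more involved, so this is only an assumed, simplified stand-in.

```python
# Simplified view bootstrapping (not the authors' implementation): at every
# training step a small, view-agnostic subset of the available views is
# sampled, and embeddings of the sampled views of the same instance are
# encouraged to agree (here via a mean-embedding consistency term).
import torch

def bootstrap_views(view_feats: torch.Tensor, num_sampled: int):
    """view_feats: (V, B, D) features for V views of B instances."""
    idx = torch.randperm(view_feats.shape[0])[:num_sampled]   # view-agnostic subsample
    return view_feats[idx]

def shared_subspace_loss(sampled: torch.Tensor):
    """Pull each sampled view toward the mean embedding of its instance."""
    mean_view = sampled.mean(dim=0, keepdim=True)              # (1, B, D)
    return ((sampled - mean_view) ** 2).mean()

feats = torch.randn(24, 16, 64, requires_grad=True)   # 24 views, 16 instances, 64-dim
loss = shared_subspace_loss(bootstrap_views(feats, num_sampled=4))
loss.backward()
```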