Task Graph Maximum Likelihood Estimation for Procedural Activity Understanding in Egocentric Videos
- URL: http://arxiv.org/abs/2502.17753v2
- Date: Wed, 26 Feb 2025 11:45:01 GMT
- Title: Task Graph Maximum Likelihood Estimation for Procedural Activity Understanding in Egocentric Videos
- Authors: Luigi Seminara, Giovanni Maria Farinella, Antonino Furnari
- Abstract summary: A gradient-based approach for learning task graphs from procedural activities. The approach is validated on CaptainCook4D, EgoPER, and EgoProceL.
- Score: 13.99137623722021
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a gradient-based approach for learning task graphs from procedural activities, improving over hand-crafted methods. Our method directly optimizes edge weights via maximum likelihood, enabling integration into neural architectures. We validate our approach on CaptainCook4D, EgoPER, and EgoProceL, achieving +14.5%, +10.2%, and +13.6% F1-score improvements. Our feature-based approach for predicting task graphs from textual/video embeddings demonstrates emerging video understanding abilities. We also achieve top performance on the procedure understanding benchmark on Ego-Exo4D and significantly improve online mistake detection (+19.8% on Assembly101-O, +6.4% on EPIC-Tent-O). Code: https://github.com/fpv-iplab/Differentiable-Task-Graph-Learning.
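To make the core idea concrete, here is a minimal, hypothetical sketch of maximum-likelihood task-graph learning: a matrix of edge logits is optimized by gradient descent so that the observed key-step orderings become likely under the graph. The surrogate objective and all names below are illustrative, not the authors' implementation (see the linked repository for that).

```python
import torch

# Hypothetical sketch of gradient-based task-graph learning (illustrative
# surrogate, not the paper's exact objective). sigmoid(logits)[i, j] ~
# probability that key-step i is a precondition of key-step j.
K = 4  # number of key-steps
logits = torch.zeros(K, K, requires_grad=True)
sequences = [[0, 1, 2, 3], [0, 2, 1, 3]]  # observed key-step orderings

opt = torch.optim.Adam([logits], lr=0.1)
for _ in range(300):
    A = torch.sigmoid(logits)
    loss = -0.1 * A.sum()  # mild pressure to keep genuine precedence edges
    for seq in sequences:
        seen = torch.zeros(K)
        for step in seq:
            # Penalize claimed preconditions of `step` that have not yet
            # been observed: such edges are inconsistent with this sequence.
            loss = loss + (A[:, step] * (1 - seen)).sum()
            seen = seen.clone()
            seen[step] = 1.0
    opt.zero_grad()
    loss.backward()
    opt.step()

print(torch.sigmoid(logits).round())  # surviving precedence edges
```

Because the whole objective is differentiable, a loss of this kind can sit on top of text or video embeddings, which is what makes a feature-based task-graph predictor possible.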
Related papers
- When Does Visual Prompting Outperform Linear Probing for Vision-Language Models? A Likelihood Perspective [57.05315507519704]
We propose a log-likelihood ratio (LLR) approach to analyze the comparative benefits of visual prompting and linear probing.
Our measure attains up to a 100-fold reduction in run time compared to full training, while achieving prediction accuracies up to 91%.
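For readers unfamiliar with the statistic, a generic log-likelihood ratio simply compares how well two candidate models explain the same features; this refresher is not the paper's specific measure for visual prompting versus linear probing.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Toy log-likelihood ratio: which of two Gaussian models better explains
# a batch of (hypothetical) image features? Generic illustration only.
rng = np.random.default_rng(0)
feats = rng.normal(loc=0.5, scale=1.0, size=(100, 8))

model_a = multivariate_normal(mean=np.full(8, 0.5), cov=np.eye(8))
model_b = multivariate_normal(mean=np.zeros(8), cov=np.eye(8))

llr = model_a.logpdf(feats).sum() - model_b.logpdf(feats).sum()
print(f"LLR = {llr:.1f}  (> 0 favors model A)")
```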
arXiv Detail & Related papers (2024-09-03T12:03:45Z) - No Train, all Gain: Self-Supervised Gradients Improve Deep Frozen Representations [30.9134119244757]
FUNGI is a method to enhance the features of transformer encoders by leveraging self-supervised gradients.
Our method is simple: given any pretrained model, we first compute gradients from various self-supervised objectives for each input.
The resulting features are evaluated on k-nearest neighbor classification over 11 datasets from vision, 5 from natural language processing, and 2 from audio.
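A rough sketch of the gradients-as-features idea follows, with made-up modules and a toy SimCLR-style objective standing in for the paper's actual self-supervised losses.

```python
import torch
import torch.nn.functional as F

# Hypothetical "gradients as features" sketch: per input, take the gradient
# of a simple self-supervised loss w.r.t. a small head and append the
# flattened gradient to the embedding. Modules and loss are stand-ins.
encoder = torch.nn.Linear(32, 16)  # stand-in for a frozen pretrained model
head = torch.nn.Linear(16, 16)

def gradient_feature(x):
    head.zero_grad()
    z = head(encoder(x))
    # Toy contrastive objective: pull two noisy views of x together.
    z2 = head(encoder(x + 0.05 * torch.randn_like(x)))
    loss = 1 - F.cosine_similarity(z, z2, dim=-1).mean()
    loss.backward()
    g = head.weight.grad.flatten()
    return torch.cat([encoder(x).detach().flatten(), g])  # embedding + gradient

feat = gradient_feature(torch.randn(1, 32))
print(feat.shape)  # combined feature, e.g. for k-NN classification
```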
arXiv Detail & Related papers (2024-07-15T17:58:42Z) - Differentiable Task Graph Learning: Procedural Activity Representation and Online Mistake Detection from Egocentric Videos [13.99137623722021]
Procedural activities are sequences of key-steps aimed at achieving specific goals. Task graphs have emerged as a human-understandable representation of procedural activities.
arXiv Detail & Related papers (2024-06-03T16:11:39Z) - Early Action Recognition with Action Prototypes [62.826125870298306]
We propose a novel model that learns a prototypical representation of the full action for each class.
We decompose the video into short clips, where a visual encoder extracts features from each clip independently.
Later, a decoder aggregates together in an online fashion features from all the clips for the final class prediction.
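A minimal sketch of that clip-by-clip pipeline is below; the module sizes are hypothetical, and matching a running summary against learned per-class prototypes is one simple way to realize the prototype comparison.

```python
import torch

# Illustrative sketch: clips are encoded independently, an online decoder
# keeps a running summary, and the partial video is scored against
# per-class action prototypes. All sizes are made up.
D, C = 64, 10
clip_encoder = torch.nn.Linear(128, D)              # stand-in visual encoder
decoder = torch.nn.GRU(D, D, batch_first=True)      # online aggregator
prototypes = torch.nn.Parameter(torch.randn(C, D))  # one prototype per class

clips = torch.randn(1, 6, 128)   # six short clips from one video
feats = clip_encoder(clips)      # per-clip features, computed independently

h = None
for t in range(feats.size(1)):   # online: one clip at a time
    out, h = decoder(feats[:, t:t + 1], h)
    scores = out.squeeze(1) @ prototypes.t()  # similarity to each prototype
    print(f"after clip {t}: predicted class {scores.argmax(-1).item()}")
```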
arXiv Detail & Related papers (2023-12-11T18:31:13Z) - Local Masking Meets Progressive Freezing: Crafting Efficient Vision Transformers for Self-Supervised Learning [0.0]
We present an innovative approach to self-supervised learning for Vision Transformers (ViTs).
This method focuses on enhancing the efficiency and speed of initial layer training in ViTs.
Our approach employs a novel multi-scale reconstruction process that fosters efficient learning in initial layers.
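Progressive freezing itself is easy to illustrate; below is a toy schedule (not the paper's exact recipe) that removes the earliest blocks from optimization as training advances.

```python
import torch

# Toy progressive-freezing schedule (illustrative only): as training
# advances, the earliest blocks stop receiving gradient updates.
blocks = torch.nn.ModuleList(torch.nn.Linear(8, 8) for _ in range(6))

def freeze_schedule(epoch, freeze_every=10):
    n_frozen = epoch // freeze_every  # freeze one more block per phase
    for i, block in enumerate(blocks):
        for p in block.parameters():
            p.requires_grad = i >= n_frozen

for epoch in (0, 10, 20):
    freeze_schedule(epoch)
    trainable = sum(p.requires_grad for b in blocks for p in b.parameters())
    print(f"epoch {epoch}: {trainable} trainable parameter tensors")
```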
arXiv Detail & Related papers (2023-12-02T11:10:09Z) - HomE: Homography-Equivariant Video Representation Learning [62.89516761473129]
We propose a novel method for representation learning of multi-view videos.
Our method learns an implicit mapping between different views, culminating in a representation space that maintains the homography relationship between neighboring views.
On action classification, our method obtains 96.4% 3-fold accuracy on the UCF101 dataset, better than most state-of-the-art self-supervised learning methods.
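One way to picture an equivariance constraint of this kind is a loss asking the known homography between two views to also map between their representations; the sketch below is purely illustrative of that idea, not the paper's architecture.

```python
import torch

# Illustrative homography-equivariance objective: the encoder is trained so
# that the known 3x3 homography H between two camera views also relates
# their (3x3-shaped) representations. Sizes and modules are stand-ins.
enc = torch.nn.Linear(27, 9)   # toy frame encoder, output reshaped to 3x3

view_i = torch.randn(4, 27)    # frames seen from view i
view_j = torch.randn(4, 27)    # corresponding frames from view j
H = torch.randn(3, 3)          # known homography between the two views

zi = enc(view_i).reshape(-1, 3, 3)
zj = enc(view_j).reshape(-1, 3, 3)
loss = ((H @ zi - zj) ** 2).mean()  # representations follow the homography
loss.backward()
```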
arXiv Detail & Related papers (2023-06-02T15:37:43Z) - EgoDistill: Egocentric Head Motion Distillation for Efficient Video Understanding [90.9111678470214]
We propose EgoDistill, a distillation-based approach that learns to reconstruct heavy egocentric video clip features.
Our method leads to significant improvements in efficiency, requiring 200x fewer GFLOPs than equivalent video models.
We demonstrate its effectiveness on the Ego4D and EPICKitchens datasets, where our method outperforms state-of-the-art efficient video understanding methods.
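Schematically, the distillation setup can be pictured as a light student, fed a single frame plus head motion, regressing the frozen heavy model's clip feature; all modules below are stand-ins, not the paper's networks.

```python
import torch

# Illustrative distillation sketch: a cheap student sees one frame feature
# plus head-motion (IMU) signals and learns to reconstruct the feature a
# heavy video model produces from the full clip.
heavy_video_model = torch.nn.Linear(1024, 256)  # frozen, expensive teacher
student = torch.nn.Linear(128 + 12, 256)        # frame feature + IMU -> clip feature

clip = torch.randn(8, 1024)   # full-clip input (expensive to process)
frame = torch.randn(8, 128)   # single-frame feature (cheap)
imu = torch.randn(8, 12)      # head-motion signal

with torch.no_grad():
    target = heavy_video_model(clip)            # teacher features

pred = student(torch.cat([frame, imu], dim=-1))
loss = torch.nn.functional.mse_loss(pred, target)
loss.backward()                                 # trains only the student
```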
arXiv Detail & Related papers (2023-01-05T18:39:23Z) - Using Graph Algorithms to Pretrain Graph Completion Transformers [8.327657957422833]
Self-supervised pretraining can enhance performance on downstream graph, link, and node classification tasks.
We investigate five different pretraining signals, constructed using several graph algorithms and no external data, as well as their combination.
We propose a new path-finding algorithm guided by information gain and find that it is the best-performing pretraining task.
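As a toy illustration of information-gain-guided path finding (the paper's actual algorithm and scoring differ), a greedy walker can pick the neighbor whose expansion most reduces the entropy of the candidate set.

```python
import math

# Toy greedy path-finding scored by a crude "information gain": prefer the
# neighbor whose own neighborhood is most constrained. Purely illustrative.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d", "e"], "d": [], "e": []}

def entropy(nodes):
    # entropy of a uniform distribution over a candidate set
    return math.log(len(nodes)) if nodes else 0.0

def greedy_path(start, steps=2):
    path, frontier = [start], set(graph)
    node = start
    for _ in range(steps):
        if not graph[node]:
            break
        # gain = entropy before expansion - entropy after expansion
        node = max(graph[node],
                   key=lambda n: entropy(frontier) - entropy(set(graph[n])))
        frontier = set(graph[node])
        path.append(node)
    return path

print(greedy_path("a"))  # e.g. ['a', 'b', 'd']
```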
arXiv Detail & Related papers (2022-10-14T01:41:10Z) - GraphCoCo: Graph Complementary Contrastive Learning [65.89743197355722]
Graph Contrastive Learning (GCL) has shown promising performance in graph representation learning (GRL) without the supervision of manual annotations.
This paper proposes an effective graph complementary contrastive learning approach named GraphCoCo to tackle the issue that contrasted views tend to capture redundant, highly similar information.
arXiv Detail & Related papers (2022-03-24T02:58:36Z) - KGE-CL: Contrastive Learning of Knowledge Graph Embeddings [64.67579344758214]
We propose a simple yet efficient contrastive learning framework for knowledge graph embeddings.
It can shorten the semantic distance between related entities and entity-relation couples in different triples.
It can yield some new state-of-the-art results, achieving 51.2% MRR, 46.8% Hits@1 on the WN18RR dataset, and 59.1% MRR, 51.8% Hits@1 on the YAGO3-10 dataset.
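A minimal InfoNCE-style sketch of contrasting entity-relation couples follows; it is illustrative only, and KGE-CL's actual scoring functions and sampling strategy are not reproduced here.

```python
import torch
import torch.nn.functional as F

# Toy contrastive objective for KG embeddings: two views of the same
# (head, relation) couple form a positive pair; other couples in the
# batch serve as negatives. All sizes are made up.
ent = torch.nn.Embedding(100, 32)
rel = torch.nn.Embedding(10, 32)

h = torch.randint(0, 100, (16,))
r = torch.randint(0, 10, (16,))
couple_a = ent(h) + rel(r)                                # view 1
couple_b = ent(h) + rel(r) + 0.01 * torch.randn(16, 32)   # perturbed view 2

sim = F.normalize(couple_a) @ F.normalize(couple_b).t() / 0.1
loss = F.cross_entropy(sim, torch.arange(16))  # InfoNCE over the batch
loss.backward()
```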
arXiv Detail & Related papers (2021-12-09T12:45:33Z) - Self-supervised Semi-supervised Learning for Data Labeling and Quality Evaluation [10.483508279350195]
We tackle the problems of efficient data labeling and annotation verification under the human-in-the-loop setting.
We propose a unifying framework by leveraging self-supervised semi-supervised learning and use it for data labeling and verification tasks.
arXiv Detail & Related papers (2021-11-22T00:59:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the information shown and is not responsible for any consequences of its use.