Graph-Based Multimodal and Multi-view Alignment for Keystep Recognition
- URL: http://arxiv.org/abs/2501.04121v1
- Date: Tue, 07 Jan 2025 20:02:55 GMT
- Title: Graph-Based Multimodal and Multi-view Alignment for Keystep Recognition
- Authors: Julia Lee Romero, Kyle Min, Subarna Tripathi, Morteza Karimzadeh
- Abstract summary: We propose a flexible graph-learning framework for fine-grained keystep recognition in egocentric videos.
We show that our proposed framework notably outperforms existing methods by more than 12 points in accuracy.
We also present a study on harnessing several multimodal features, including narrations, depth, and object class labels, on a heterogeneous graph.
- Score: 11.421362760480527
- License:
- Abstract: Egocentric videos capture scenes from a wearer's viewpoint, resulting in dynamic backgrounds, frequent motion, and occlusions that challenge accurate keystep recognition. We propose a flexible graph-learning framework for fine-grained keystep recognition that effectively leverages long-term dependencies in egocentric videos and exploits alignment between egocentric and exocentric videos during training for improved inference on egocentric videos. Our approach constructs a graph in which each clip of the egocentric video corresponds to a node. During training, we treat each clip of each exocentric video (if available) as an additional node. We examine several strategies to define connections across these nodes and pose keystep recognition as a node classification task on the constructed graphs. We perform extensive experiments on the Ego-Exo4D dataset and show that our flexible graph-based framework notably outperforms existing methods by more than 12 points in accuracy. Furthermore, the constructed graphs are sparse and compute-efficient. We also present a study on harnessing several multimodal features, including narrations, depth, and object class labels, on a heterogeneous graph and discuss their corresponding contributions to keystep recognition performance.
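The graph construction described in the abstract can be sketched in a few lines. This is a minimal illustrative sketch, not the authors' implementation: function names, the temporal-window edge rule, and the mean-neighbor aggregation step are all assumptions chosen to make the idea concrete (ego clips as nodes, exo clips as extra training-time nodes linked by time alignment, followed by GCN-style propagation for node classification).

```python
import numpy as np

def build_keystep_graph(num_ego, num_exo_views=0, window=2):
    """Hypothetical sketch: nodes 0..num_ego-1 are egocentric clips;
    during training, each exocentric view contributes num_ego extra
    nodes. Edges connect ego clips within a temporal window, and each
    ego clip to its time-aligned exo counterparts."""
    n = num_ego * (1 + num_exo_views)
    A = np.zeros((n, n), dtype=np.float32)
    # Temporal edges among ego clips (includes a self-loop at j == i).
    for i in range(num_ego):
        for j in range(max(0, i - window), min(num_ego, i + window + 1)):
            A[i, j] = 1.0
    # Ego-exo alignment edges (available at training time only).
    for v in range(num_exo_views):
        off = num_ego * (v + 1)
        for t in range(num_ego):
            A[t, off + t] = A[off + t, t] = 1.0
            A[off + t, off + t] = 1.0  # self-loop for exo nodes
    return A

def propagate(A, X):
    """One step of mean-neighbor feature aggregation (GCN-style)."""
    deg = A.sum(axis=1, keepdims=True)
    return (A @ X) / np.maximum(deg, 1.0)

# Toy example: 6 ego clips, 2 exo views, temporal window of 1.
A = build_keystep_graph(num_ego=6, num_exo_views=2, window=1)
X = np.random.default_rng(0).normal(size=(A.shape[0], 16))
H = propagate(A, X)  # smoothed node features, input to a node classifier
```

Note that the adjacency stays sparse by design: each ego node touches only a few temporal neighbors plus one aligned node per exo view, which is consistent with the compute-efficiency claim in the abstract.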
Related papers
- VideoSAGE: Video Summarization with Graph Representation Learning [9.21019970479227]
We propose a graph-based representation learning framework for video summarization.
A graph constructed this way aims to capture long-range interactions among video frames, and the sparsity ensures the model trains without hitting the memory and compute bottleneck.
arXiv Detail & Related papers (2024-04-14T15:49:02Z)
- Video-Mined Task Graphs for Keystep Recognition in Instructional Videos [71.16703750980143]
Procedural activity understanding requires perceiving human actions in terms of a broader task.
We propose discovering a task graph automatically from how-to videos to represent probabilistically how people tend to execute keysteps.
We show the impact: more reliable zero-shot keystep localization and improved video representation learning.
arXiv Detail & Related papers (2023-07-17T18:19:36Z)
- Pose-Aided Video-based Person Re-Identification via Recurrent Graph Convolutional Network [41.861537712563816]
We propose to learn the discriminative pose feature beyond the appearance feature for video retrieval.
To learn the pose feature, we first detect the pedestrian pose in each frame through an off-the-shelf pose detector.
We then exploit a recurrent graph convolutional network (RGCN) to learn the node embeddings of the temporal pose graph.
arXiv Detail & Related papers (2022-09-23T13:20:33Z)
- Representing Videos as Discriminative Sub-graphs for Action Recognition [165.54738402505194]
We introduce a new design of sub-graphs to represent and encode the discriminative patterns of each action in the videos.
We present a MUlti-scale Sub-graph LEarning (MUSLE) framework that novelly builds space-time graphs and clusters them into compact sub-graphs on each scale.
arXiv Detail & Related papers (2022-01-11T16:15:25Z)
- HighlightMe: Detecting Highlights from Human-Centric Videos [52.84233165201391]
We present a domain- and user-preference-agnostic approach to detect highlightable excerpts from human-centric videos.
We use an autoencoder network equipped with spatial-temporal graph convolutions to detect human activities and interactions.
We observe a 4-12% improvement in the mean average precision of matching the human-annotated highlights over state-of-the-art methods.
arXiv Detail & Related papers (2021-10-05T01:18:15Z)
- Group Contrastive Self-Supervised Learning on Graphs [101.45974132613293]
We study self-supervised learning on graphs using contrastive methods.
We argue that contrasting graphs in multiple subspaces enables graph encoders to capture more abundant characteristics.
arXiv Detail & Related papers (2021-07-20T22:09:21Z)
- Learning Multi-Granular Hypergraphs for Video-Based Person Re-Identification [110.52328716130022]
Video-based person re-identification (re-ID) is an important research topic in computer vision.
We propose a novel graph-based framework, namely Multi-Granular Hypergraph (MGH), to attain better representational capabilities.
MGH achieves 90.0% top-1 accuracy on MARS, outperforming state-of-the-art schemes.
arXiv Detail & Related papers (2021-04-30T11:20:02Z)
- SumGraph: Video Summarization via Recursive Graph Modeling [59.01856443537622]
We propose graph modeling networks for video summarization, termed SumGraph, to represent a relation graph.
We achieve state-of-the-art performance on several video summarization benchmarks in both supervised and unsupervised settings.
arXiv Detail & Related papers (2020-07-17T08:11:30Z)
- Cut-Based Graph Learning Networks to Discover Compositional Structure of Sequential Video Data [29.841574293529796]
We propose Cut-Based Graph Learning Networks (CB-GLNs) for learning video data by discovering complex structures of the video.
CB-GLNs represent video data as a graph, with nodes and edges corresponding to frames of the video and their dependencies respectively.
We evaluate the proposed method on two video understanding tasks: video theme classification (YouTube-8M dataset) and video question answering (TVQA dataset).
arXiv Detail & Related papers (2020-01-17T10:09:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.