Learning from Semantic Alignment between Unpaired Multiviews for
Egocentric Video Recognition
- URL: http://arxiv.org/abs/2308.11489v2
- Date: Wed, 23 Aug 2023 16:16:44 GMT
- Title: Learning from Semantic Alignment between Unpaired Multiviews for
Egocentric Video Recognition
- Authors: Qitong Wang, Long Zhao, Liangzhe Yuan, Ting Liu, Xi Peng
- Abstract summary: We propose Semantics-based Unpaired Multiview Learning (SUM-L) to tackle this unpaired multiview learning problem.
The key idea is to build cross-view pseudo-pairs and perform view-invariant alignment by leveraging the semantic information of videos.
Our method also outperforms multiple existing view-alignment methods under this more challenging scenario.
- Score: 23.031934558964473
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We are concerned with a challenging scenario in unpaired multiview video
learning. In this case, the model aims to learn comprehensive multiview
representations while the cross-view semantic information exhibits variations.
We propose Semantics-based Unpaired Multiview Learning (SUM-L) to tackle this
unpaired multiview learning problem. The key idea is to build cross-view
pseudo-pairs and do view-invariant alignment by leveraging the semantic
information of videos. To facilitate the data efficiency of multiview learning,
we further perform video-text alignment for first-person and third-person
videos, to fully leverage the semantic knowledge to improve video
representations. Extensive experiments on multiple benchmark datasets verify
the effectiveness of our framework. Our method also outperforms multiple
existing view-alignment methods under a scenario more challenging than
typical paired or unpaired multimodal or multiview learning. Our code is
available at https://github.com/wqtwjt1996/SUM-L.
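To make the key idea above concrete, the following is a minimal, illustrative sketch rather than the authors' implementation (the actual SUM-L code is in the linked repository). It assumes precomputed first-person (ego) and third-person (exo) video and text embeddings; the greedy cosine pairing rule, the InfoNCE-style loss, the temperature tau, and helper names such as build_pseudo_pairs are illustrative assumptions.

    # Illustrative sketch only: semantic pseudo-pairing of unpaired ego/exo clips,
    # followed by contrastive view-invariant and video-text alignment.
    import torch
    import torch.nn.functional as F

    def build_pseudo_pairs(ego_sem, exo_sem):
        """Match each first-person clip to the third-person clip whose semantic
        embedding (e.g., from narrations or labels) is closest in cosine similarity."""
        ego_sem = F.normalize(ego_sem, dim=-1)   # (N_ego, D)
        exo_sem = F.normalize(exo_sem, dim=-1)   # (N_exo, D)
        sim = ego_sem @ exo_sem.t()              # (N_ego, N_exo) similarity matrix
        return sim.argmax(dim=1)                 # pseudo-partner index per ego clip

    def info_nce(a, b, tau=0.07):
        """Symmetric InfoNCE-style loss pulling matched rows of a and b together."""
        a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
        logits = a @ b.t() / tau
        targets = torch.arange(a.size(0), device=a.device)
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    # Toy usage with random features standing in for encoder outputs.
    ego_vid, exo_vid = torch.randn(8, 256), torch.randn(32, 256)   # video features
    ego_txt, exo_txt = torch.randn(8, 256), torch.randn(32, 256)   # text features
    pairs = build_pseudo_pairs(ego_txt, exo_txt)                   # cross-view pseudo-pairs
    loss = (info_nce(ego_vid, exo_vid[pairs])                      # view-invariant alignment
            + info_nce(ego_vid, ego_txt)                           # video-text alignment (ego)
            + info_nce(exo_vid, exo_txt))                          # video-text alignment (exo)

Here the pseudo-pairs are built purely from semantic embeddings, so the two video streams never need to be recorded in sync; the paper's actual pairing criterion and loss terms may differ from this sketch.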
Related papers
- Cross-view Graph Contrastive Representation Learning on Partially
Aligned Multi-view Data [52.491074276133325]
Multi-view representation learning has developed rapidly over the past decades and has been applied in many fields.
We propose a new cross-view graph contrastive learning framework, which integrates multi-view information to align data and learn latent representations.
Experiments conducted on several real datasets demonstrate the effectiveness of the proposed method on the clustering and classification tasks.
arXiv Detail & Related papers (2022-11-08T09:19:32Z)
- Boosting Video Representation Learning with Multi-Faceted Integration [112.66127428372089]
Video content is multifaceted, consisting of objects, scenes, interactions or actions.
Existing datasets mostly label only one of the facets for model training, resulting in video representations that are biased toward a single facet depending on the training dataset.
We propose a new learning framework, MUlti-Faceted Integration (MUFI), to aggregate facets from different datasets for learning a representation that could reflect the full spectrum of video content.
arXiv Detail & Related papers (2022-01-11T16:14:23Z)
- VALUE: A Multi-Task Benchmark for Video-and-Language Understanding
Evaluation [124.02278735049235]
The VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels.
We evaluate various baseline methods with and without large-scale VidL pre-training.
The significant gap between our best model and human performance calls for future study of advanced VidL models.
arXiv Detail & Related papers (2021-06-08T18:34:21Z)
- Seeing All the Angles: Learning Multiview Manipulation Policies for
Contact-Rich Tasks from Demonstrations [7.51557557629519]
A successful multiview policy could be deployed on a mobile manipulation platform.
We demonstrate that a multiview policy can be found through imitation learning by collecting data from a variety of viewpoints.
We show that learning from multiview data has little, if any, penalty to performance for a fixed-view task compared to learning with an equivalent amount of fixed-view data.
arXiv Detail & Related papers (2021-04-28T17:43:29Z)
- Multiview Pseudo-Labeling for Semi-supervised Learning from Video [102.36355560553402]
We present a novel framework that uses complementary views in the form of appearance and motion information for semi-supervised learning in video.
Our method capitalizes on multiple views, but it nonetheless trains a model that is shared across appearance and motion input.
On multiple video recognition datasets, our method substantially outperforms its supervised counterpart, and compares favorably to previous work on standard benchmarks in self-supervised video representation learning.
arXiv Detail & Related papers (2021-04-01T17:59:48Z)
- Embedded Deep Bilinear Interactive Information and Selective Fusion for
Multi-view Learning [70.67092105994598]
We propose a novel multi-view learning framework that improves multi-view classification with respect to both intra-view representation learning and inter-view fusion.
In particular, we train different deep neural networks to learn various intra-view representations.
Experiments on six publicly available datasets demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2020-07-13T01:13:23Z)
- Multi-view Low-rank Preserving Embedding: A Novel Method for Multi-view
Representation [11.91574721055601]
This paper proposes a novel multi-view learning method, named Multi-view Low-rank Preserving Embedding (MvLPE).
It integrates different views into one centroid view by minimizing a disagreement term based on the distance or similarity matrix among instances.
Experiments on six benchmark datasets demonstrate that the proposed method outperforms its counterparts.
arXiv Detail & Related papers (2020-06-14T12:47:25Z)
- Video Understanding as Machine Translation [53.59298393079866]
We tackle a wide variety of downstream video understanding tasks by means of a single unified framework.
We report performance gains over the state-of-the-art on several downstream tasks, including video classification (EPIC-Kitchens), question answering (TVQA), and captioning (TVC, YouCook2, and MSR-VTT).
arXiv Detail & Related papers (2020-06-12T14:07:04Z)
- Generalized Multi-view Shared Subspace Learning using View Bootstrapping [43.027427742165095]
A key objective in multi-view learning is to model the information common to multiple parallel views of a class of objects/events to improve downstream learning tasks.
We present a neural method based on multi-view correlation that captures the information shared across a large number of views by subsampling them in a view-agnostic manner during training (see the sketch after this list).
Experiments on spoken word recognition, 3D object classification and pose-invariant face recognition demonstrate the robustness of view bootstrapping to model a large number of views.
arXiv Detail & Related papers (2020-05-12T20:35:14Z)
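As noted in the last entry above, here is a hedged, minimal sketch of the view-bootstrapping idea: each training step samples a random subset of views without regard to view identity and encourages a shared encoder to embed the same instance consistently across the sampled views. The original paper optimizes a multi-view correlation objective; the simple cosine-agreement loss, the linear encoder, and the tensor shapes below are illustrative assumptions, not the paper's method.

    # Illustrative sketch of view-agnostic view subsampling ("view bootstrapping").
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def bootstrap_views(all_views, k):
        """all_views: (V, N, D_in) tensor holding V parallel views of N instances.
        Returns k randomly chosen views, ignoring which view is which."""
        idx = torch.randperm(all_views.size(0))[:k]
        return all_views[idx]                          # (k, N, D_in)

    encoder = nn.Linear(64, 32)                        # one encoder shared across views
    views = torch.randn(10, 16, 64)                    # V=10 views, N=16 instances, D_in=64
    sampled = bootstrap_views(views, k=4)
    z = F.normalize(encoder(sampled), dim=-1)          # (4, 16, 32) embeddings
    # Agreement loss: every pair of sampled views should embed each instance nearby.
    loss = 0.0
    for i in range(z.size(0)):
        for j in range(i + 1, z.size(0)):
            loss = loss + (1.0 - (z[i] * z[j]).sum(dim=-1)).mean()

Because the subset is redrawn at every step and the encoder is shared, no view-specific parameters are learned, which is what lets such an approach scale to a large number of views.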