Few-Shot Action Recognition with Compromised Metric via Optimal
Transport
- URL: http://arxiv.org/abs/2104.03737v1
- Date: Thu, 8 Apr 2021 12:42:05 GMT
- Title: Few-Shot Action Recognition with Compromised Metric via Optimal
Transport
- Authors: Su Lu, Han-Jia Ye, De-Chuan Zhan
- Abstract summary: Few-shot action recognition is still not mature despite extensive research on few-shot image classification.
One main obstacle to applying these algorithms in action recognition is the complex structure of videos.
We propose Compromised Metric via Optimal Transport (CMOT) to combine the advantages of these two solutions.
- Score: 31.834843714684343
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Although vital to computer vision systems, few-shot action recognition is
still not mature despite extensive research on few-shot image classification.
Popular few-shot learning algorithms extract a transferable embedding from seen
classes and reuse it on unseen classes by constructing a metric-based
classifier. One main obstacle to applying these algorithms in action
recognition is the complex structure of videos. Some existing solutions sample
frames from a video and aggregate their embeddings to form a video-level
representation, neglecting important temporal relations. Others perform an
explicit sequence matching between two videos and define their distance as
matching cost, imposing too strong restrictions on sequence ordering. In this
paper, we propose Compromised Metric via Optimal Transport (CMOT) to combine
the advantages of these two solutions. CMOT simultaneously considers semantic
and temporal information in videos under the Optimal Transport framework, and is
discriminative for both content-sensitive and ordering-sensitive tasks. In
detail, given two videos, we sample segments from them and cast the calculation
of their distance as an optimal transport problem between two segment
sequences. To preserve the inherent temporal ordering information, we
additionally amend the ground cost matrix by penalizing it with the positional
distance between each pair of segments. Empirical results on benchmark datasets
demonstrate the superiority of CMOT.
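The following is a minimal sketch (not the authors' implementation) of how such a compromised metric could be computed: cosine distance supplies the semantic ground cost, a positional term penalizes order-violating matches, and a plain Sinkhorn loop solves the entropically regularized transport problem. The penalty weight `lam`, the regularization strength `reg`, and the uniform segment marginals are illustrative assumptions, not values from the paper.

```python
import numpy as np

def video_ot_distance(seg_a, seg_b, lam=0.1, reg=0.05, n_iters=100):
    """Hypothetical sketch of an OT-based video distance in the spirit of CMOT.

    seg_a: (n, d) array of segment embeddings from video A
    seg_b: (m, d) array of segment embeddings from video B
    lam:   weight of the positional (temporal-ordering) penalty -- assumed, not from the paper
    reg:   entropic regularization strength for the Sinkhorn iterations -- assumed
    """
    n, m = len(seg_a), len(seg_b)

    # Semantic ground cost: cosine distance between segment embeddings.
    a_norm = seg_a / np.linalg.norm(seg_a, axis=1, keepdims=True)
    b_norm = seg_b / np.linalg.norm(seg_b, axis=1, keepdims=True)
    semantic_cost = 1.0 - a_norm @ b_norm.T                      # (n, m)

    # Positional penalty: distance between normalized temporal positions,
    # discouraging matches that scramble the segment ordering.
    pos_a = np.arange(n)[:, None] / max(n - 1, 1)
    pos_b = np.arange(m)[None, :] / max(m - 1, 1)
    positional_cost = np.abs(pos_a - pos_b)                      # (n, m)

    # Amended ground cost combining semantics and temporal ordering.
    cost = semantic_cost + lam * positional_cost

    # Uniform marginals over segments and a plain Sinkhorn fixed-point loop.
    p, q = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-cost / reg)
    u = np.ones(n)
    for _ in range(n_iters):
        v = q / (K.T @ u)
        u = p / (K @ v)
    plan = u[:, None] * K * v[None, :]                           # transport plan

    # The video distance is the total transport cost under this plan.
    return float(np.sum(plan * cost))
```

For example, `video_ot_distance(np.random.randn(8, 512), np.random.randn(10, 512))` would compare an 8-segment video with a 10-segment one; with `lam=0` the metric reduces to a purely semantic (ordering-agnostic) transport distance.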
Related papers
- MEGA: Multimodal Alignment Aggregation and Distillation For Cinematic Video Segmentation [10.82074185158027]
We introduce Multimodal alignmEnt aGgregation and distillAtion (MEGA) for cinematic long-video segmentation.
The method coarsely aligns inputs of variable lengths and different modalities with alignment positional encoding.
MEGA employs a novel contrastive loss to synchronize and transfer labels across modalities, enabling act segmentation from labeled synopsis sentences on video shots.
arXiv Detail & Related papers (2023-08-22T04:23:59Z)
- Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation is that the temporal boundary of the query-guided activity should be predicted consistently.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z)
- A Closer Look at Temporal Ordering in the Segmentation of Instructional Videos [17.712793578388126]
We take a closer look at Procedure Segmentation and Summarization (PSS) and propose three fundamental improvements over current methods.
We propose a new segmentation metric based on dynamic programming that takes into account the order of segments.
We propose a matching algorithm that constrains the temporal order of segment mapping, and is also differentiable.
arXiv Detail & Related papers (2022-09-30T14:44:19Z)
- Correspondence Matters for Video Referring Expression Comprehension [64.60046797561455]
Video Referring Expression Comprehension (REC) aims to localize the referent objects described in the sentence to visual regions in the video frames.
Existing methods suffer from two problems: 1) inconsistent localization results across video frames; 2) confusion between the referent and contextual objects.
We propose a novel Dual Correspondence Network (dubbed as DCNet) which explicitly enhances the dense associations in both the inter-frame and cross-modal manners.
arXiv Detail & Related papers (2022-07-21T10:31:39Z)
- Fine-grained Temporal Contrastive Learning for Weakly-supervised Temporal Action Localization [87.47977407022492]
This paper argues that learning by contextually comparing sequence-to-sequence distinctions offers an essential inductive bias in weakly-supervised action localization.
Under a differentiable dynamic programming formulation, two complementary contrastive objectives are designed, including Fine-grained Sequence Distance (FSD) contrasting and Longest Common Subsequence (LCS) contrasting.
Our method achieves state-of-the-art performance on two popular benchmarks.
arXiv Detail & Related papers (2022-03-31T05:13:50Z)
- Joint Inductive and Transductive Learning for Video Object Segmentation [107.32760625159301]
Semi-supervised video object segmentation is the task of segmenting the target object in a video sequence given only a mask in the first frame.
Most previous best-performing methods adopt matching-based transductive reasoning or online inductive learning.
We propose to integrate transductive and inductive learning into a unified framework to exploit the complementarity between them for accurate and robust video object segmentation.
arXiv Detail & Related papers (2021-08-08T16:25:48Z)
- ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use contrastive loss with video clips as the instances and learn visual representation by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is the key to learning robust video representations.
arXiv Detail & Related papers (2021-06-04T08:44:50Z)
- Video Corpus Moment Retrieval with Contrastive Learning [56.249924768243375]
Video corpus moment retrieval (VCMR) aims to retrieve a temporal moment that semantically corresponds to a given text query.
We propose a Retrieval and Localization Network with Contrastive Learning (ReLoCLNet) for VCMR.
ReLoCLNet encodes text and video separately for efficiency; experimental results show that its retrieval accuracy is comparable with baselines that adopt cross-modal interaction learning.
arXiv Detail & Related papers (2021-05-13T12:54:39Z)