Localizing the Common Action Among a Few Videos
- URL: http://arxiv.org/abs/2008.05826v2
- Date: Tue, 25 Aug 2020 12:30:01 GMT
- Title: Localizing the Common Action Among a Few Videos
- Authors: Pengwan Yang, Vincent Tao Hu, Pascal Mettes, Cees G. M. Snoek
- Abstract summary: This paper strives to localize the temporal extent of an action in a long untrimmed video.
We introduce a new 3D convolutional network architecture able to align representations from the support videos with the relevant query video segments.
- Score: 51.09824165433561
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper strives to localize the temporal extent of an action in a long
untrimmed video. Where existing work leverages many examples with their start,
their ending, and/or the class of the action during training time, we propose
few-shot common action localization. The start and end of an action in a long
untrimmed video are determined based on just a handful of trimmed video
examples containing the same action, without knowing their common class label.
To address this task, we introduce a new 3D convolutional network architecture
able to align representations from the support videos with the relevant query
video segments. The network contains: (i) a mutual enhancement module
to simultaneously complement the representation of the few trimmed support
videos and the untrimmed query video; (ii) a progressive alignment
module that iteratively fuses the support videos into the query branch; and
(iii) a pairwise matching module to weigh the importance of different
support videos. Evaluation of few-shot common action localization in untrimmed
videos containing a single or multiple action instances demonstrates the
effectiveness and general applicability of our proposal.
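As a concrete illustration, a minimal PyTorch sketch of the three modules might look as follows; all module names, dimensions, and fusion choices are assumptions for exposition, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MutualEnhancement(nn.Module):
    """(i) Let support and query representations complement each other."""
    def __init__(self, dim):
        super().__init__()
        self.s2q = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.q2s = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, support, query):
        # support: (B, S, D) features of the few trimmed support videos
        # query:   (B, T, D) segment features of the untrimmed query video
        support = support + self.q2s(support, query, query)[0]
        query = query + self.s2q(query, support, support)[0]
        return support, query

class ProgressiveAlignment(nn.Module):
    """(ii) Iteratively fuse support evidence into the query branch."""
    def __init__(self, dim, steps=3):
        super().__init__()
        self.steps = steps
        self.fuse = nn.GRUCell(dim, dim)

    def forward(self, support, query):
        s = support.mean(dim=1)            # (B, D) support summary
        B, T, D = query.shape
        q = query.reshape(B * T, D)
        for _ in range(self.steps):        # repeated fusion iterations
            q = self.fuse(s.repeat_interleave(T, dim=0), q)
        return q.reshape(B, T, D)

class PairwiseMatching(nn.Module):
    """(iii) Weigh each support video by its similarity to the query."""
    def forward(self, support, query):
        # cosine similarity of each support clip to the query summary
        sim = F.cosine_similarity(support, query.mean(1, keepdim=True), dim=-1)
        w = sim.softmax(dim=1)                     # (B, S) support weights
        return (w.unsqueeze(-1) * support).sum(1)  # (B, D) weighted support
```

Under this reading, query segments that score highest against the weighted support feature would delimit the common action's start and end.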
Related papers
- Improving Video Corpus Moment Retrieval with Partial Relevance Enhancement [72.7576395034068]
Video Corpus Moment Retrieval (VCMR) is a new video retrieval task aimed at retrieving a relevant moment from a large corpus of untrimmed videos using a text query.
We argue that effectively capturing the partial relevance between the query and video is essential for the VCMR task.
For video retrieval, we introduce a multi-modal collaborative video retriever, generating different query representations for the two modalities.
For moment localization, we propose the focus-then-fuse moment localizer, utilizing modality-specific gates to capture essential content (a gating sketch follows this entry).
arXiv Detail & Related papers (2024-02-21T07:16:06Z)
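The modality-specific gates above could, for instance, be realized as query-conditioned sigmoid gates over the video and subtitle streams; this is a hypothetical sketch, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ModalityGate(nn.Module):
    """Query-conditioned gates over visual and subtitle streams (hypothetical)."""
    def __init__(self, dim):
        super().__init__()
        self.gate_v = nn.Linear(2 * dim, dim)  # gate for the visual stream
        self.gate_t = nn.Linear(2 * dim, dim)  # gate for the subtitle stream

    def forward(self, video, text, query):
        # video, text: (B, T, D) per-clip features; query: (B, D) text query
        q = query.unsqueeze(1).expand_as(video)
        gv = torch.sigmoid(self.gate_v(torch.cat([video, q], dim=-1)))
        gt = torch.sigmoid(self.gate_t(torch.cat([text, q], dim=-1)))
        return gv * video + gt * text  # fused features for moment localization
```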
- UnLoc: A Unified Framework for Video Localization Tasks [82.59118972890262]
UnLoc is a new approach for temporal localization in untrimmed videos.
It uses pretrained image and text towers, and feeds tokens to a video-text fusion model (a fusion sketch follows this entry).
It achieves state-of-the-art results on three different localization tasks with a unified approach.
arXiv Detail & Related papers (2023-08-21T22:15:20Z)
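A rough sketch of the two-tower-plus-fusion idea, assuming frozen CLIP-style towers that produce frame and text tokens; the layer sizes and scoring head are illustrative assumptions, not UnLoc's code.

```python
import torch
import torch.nn as nn

class TwoTowerFusion(nn.Module):
    """Fuse frame tokens and text tokens, then score each frame (illustrative)."""
    def __init__(self, dim=512, layers=2):
        super().__init__()
        enc = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(enc, num_layers=layers)
        self.head = nn.Linear(dim, 1)  # per-frame relevance / boundary score

    def forward(self, frame_tokens, text_tokens):
        # frame_tokens: (B, T, D) from a pretrained image tower
        # text_tokens:  (B, L, D) from the paired text tower
        fused = self.fusion(torch.cat([frame_tokens, text_tokens], dim=1))
        T = frame_tokens.shape[1]
        return self.head(fused[:, :T]).squeeze(-1)  # (B, T) frame scores
```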
- Video-Specific Query-Key Attention Modeling for Weakly-Supervised Temporal Action Localization [14.43055117008746]
Weakly-supervised temporal action localization aims to identify and localize action instances in untrimmed videos using only video-level action labels.
We propose VQK-Net, a network with video-specific query-key attention modeling that learns a unique query for each action category of each input video (sketched below).
arXiv Detail & Related papers (2023-05-07T04:18:22Z)
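One way to read "a unique query for each action category of each input video" is to condition a set of learnable class queries on a video summary; the following is a hedged sketch under that assumption, not VQK-Net's code.

```python
import torch
import torch.nn as nn

class VideoSpecificQueries(nn.Module):
    """Per-class queries conditioned on the input video (assumed reading)."""
    def __init__(self, num_classes, dim):
        super().__init__()
        self.base_queries = nn.Parameter(torch.randn(num_classes, dim))
        self.adapt = nn.Linear(dim, dim)  # conditions the queries on the video

    def forward(self, video):
        # video: (B, T, D) snippet features, video-level labels only
        ctx = self.adapt(video.mean(dim=1))                    # (B, D) summary
        q = self.base_queries.unsqueeze(0) + ctx.unsqueeze(1)  # (B, C, D)
        scores = torch.einsum('bcd,btd->bct', q, video)        # query-key attention
        return scores.softmax(dim=-1)  # (B, C, T) class activation over time
```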
- Hierarchical Video-Moment Retrieval and Step-Captioning [68.4859260853096]
HiREST consists of 3.4K text-video pairs from an instructional video dataset.
Our hierarchical benchmark consists of video retrieval, moment retrieval, and two novel tasks: moment segmentation and step captioning.
arXiv Detail & Related papers (2023-03-29T02:33:54Z)
- TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency [133.75876535332003]
We focus on summarizing instructional videos, an under-explored area of video summarization.
Existing video summarization datasets rely on manual frame-level annotations.
We propose an instructional video summarization network that combines a context-aware temporal video encoder and a segment scoring transformer (a scoring sketch follows this entry).
arXiv Detail & Related papers (2022-08-14T04:07:40Z)
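In its simplest form, a segment scoring transformer could rank candidate segments by task relevance; this sketch assumes pre-pooled segment features and is not the paper's implementation.

```python
import torch
import torch.nn as nn

class SegmentScorer(nn.Module):
    """Transformer that scores candidate segments for task relevance."""
    def __init__(self, dim=256, layers=2):
        super().__init__()
        enc = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=layers)
        self.score = nn.Linear(dim, 1)

    def forward(self, segments):
        # segments: (B, N, D) pooled features of N candidate segments,
        # e.g. produced by a context-aware temporal video encoder
        return self.score(self.encoder(segments)).squeeze(-1)  # (B, N)

# a summary keeps the k highest-scoring segments
scores = SegmentScorer()(torch.randn(1, 20, 256))
summary_idx = scores.topk(k=5, dim=-1).indices
```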
- Part-level Action Parsing via a Pose-guided Coarse-to-Fine Framework [108.70949305791201]
Part-level Action Parsing (PAP) aims to not only predict the video-level action but also recognize the frame-level fine-grained actions or interactions of body parts for each person in the video.
In particular, our framework first predicts the video-level class of the input video, then localizes the body parts and predicts the part-level action (see the sketch below).
Our framework achieves state-of-the-art performance, outperforming existing methods with a 31.10% ROC score.
arXiv Detail & Related papers (2022-03-09T01:30:57Z)
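The coarse-to-fine flow described above might be wired as follows; the heads and the conditioning of the fine stage on the coarse prediction are assumptions for illustration, not the paper's code.

```python
import torch
import torch.nn as nn

class CoarseToFineParser(nn.Module):
    """Video-level action first, then part locations and part-level actions."""
    def __init__(self, dim, num_actions, num_parts, num_part_actions):
        super().__init__()
        self.video_head = nn.Linear(dim, num_actions)  # coarse stage
        self.part_loc = nn.Linear(dim, num_parts * 4)  # per-part boxes (pose-guided)
        self.part_head = nn.Linear(dim + num_actions, num_part_actions)

    def forward(self, frames):
        # frames: (B, T, D) per-frame features
        video_logits = self.video_head(frames.mean(dim=1))  # video-level action
        boxes = self.part_loc(frames)                       # (B, T, P*4) part locations
        # the fine stage is conditioned on the coarse prediction
        cond = video_logits.unsqueeze(1).expand(-1, frames.shape[1], -1)
        part_logits = self.part_head(torch.cat([frames, cond], dim=-1))
        return video_logits, boxes, part_logits
```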
- Deep Multimodal Feature Encoding for Video Ordering [34.27175264084648]
We present a way to learn a compact multimodal feature representation that encodes multiple modalities.
Our model parameters are learned through a proxy task of inferring the temporal ordering of a set of unordered videos in a timeline (sketched after this entry).
We analyze and evaluate the individual and joint modalities on challenging tasks, including (i) inferring the temporal ordering of a set of videos and (ii) action recognition.
arXiv Detail & Related papers (2020-04-05T14:02:23Z)
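The ordering proxy task can be posed as classifying which permutation of shuffled clips restores the timeline; the sketch below makes that concrete under assumed names, not the paper's code.

```python
import itertools
import torch
import torch.nn as nn

class OrderingHead(nn.Module):
    """Classify which permutation of shuffled clips restores the timeline."""
    def __init__(self, dim, n_clips=3):
        super().__init__()
        self.perms = list(itertools.permutations(range(n_clips)))
        self.classify = nn.Linear(n_clips * dim, len(self.perms))

    def forward(self, clip_feats):
        # clip_feats: (B, n_clips, D) multimodal features of shuffled clips
        return self.classify(clip_feats.flatten(1))  # logits over permutations

# training signal: cross-entropy against the index of the true permutation
head = OrderingHead(dim=128)
logits = head(torch.randn(4, 3, 128))
loss = nn.CrossEntropyLoss()(logits, torch.zeros(4, dtype=torch.long))
```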
- SCT: Set Constrained Temporal Transformer for Set Supervised Action Segmentation [22.887397951846353]
Weakly supervised approaches aim at learning temporal action segmentation from videos that are only weakly labeled.
We propose an approach that can be trained end-to-end on such data.
We evaluate our approach on three datasets where the approach achieves state-of-the-art results.
arXiv Detail & Related papers (2020-03-31T14:51:41Z)