Cross-modal Consensus Network for Weakly Supervised Temporal Action
Localization
- URL: http://arxiv.org/abs/2107.12589v1
- Date: Tue, 27 Jul 2021 04:21:01 GMT
- Title: Cross-modal Consensus Network for Weakly Supervised Temporal Action
Localization
- Authors: Fa-Ting Hong, Jia-Chang Feng, Dan Xu, Ying Shan, Wei-Shi Zheng
- Abstract summary: Weakly supervised temporal action localization (WS-TAL) is a challenging task that aims to localize action instances in the given video with video-level categorical supervision.
We propose a cross-modal consensus network (CO2-Net) to tackle this problem.
- Score: 74.34699679568818
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Weakly supervised temporal action localization (WS-TAL) is a challenging task
that aims to localize action instances in the given video with video-level
categorical supervision. Previous works use both appearance and motion features,
but they do not exploit the two modalities properly, applying only simple
concatenation or score-level fusion. In this work, we argue that the features
extracted from a pretrained extractor, e.g., I3D, are not specific to the
WS-TAL task, so feature re-calibration is needed to reduce the task-irrelevant
information redundancy. Therefore, we propose a
cross-modal consensus network (CO2-Net) to tackle this problem. In CO2-Net, we
mainly introduce two identical cross-modal consensus modules (CCMs), each of
which applies a cross-modal attention mechanism to filter out task-irrelevant
information redundancy using the global information from the main modality and
the local information of the auxiliary modality. Moreover, we treat the
attention weights derived from each CCM as pseudo targets for the attention
weights derived from the other CCM, keeping the predictions of the two CCMs
consistent and forming a mutual learning scheme.
Finally, we conduct extensive experiments on two commonly used temporal action
localization datasets, THUMOS14 and ActivityNet1.2, to verify our method,
achieving state-of-the-art results. The experimental results show that our
proposed cross-modal consensus module can produce more representative features
for temporal action localization.
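As a rough illustration of the mechanism described in the abstract, the following PyTorch-style sketch shows how one CCM might re-calibrate RGB features using optical flow as the auxiliary modality, and how the attention weights of two CCMs could serve as pseudo targets for each other. The layer choices, tensor shapes, and the consistency loss are assumptions made for illustration only, not the authors' reference implementation.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalConsensusModule(nn.Module):
    """Sketch of a CCM: re-calibrates main-modality features (e.g., RGB) using
    the main modality's global context and the auxiliary modality's (e.g.,
    optical flow) local, snippet-level information."""

    def __init__(self, dim=1024):
        super().__init__()
        self.global_proj = nn.Linear(dim, dim)                 # embeds global context of the main modality
        self.local_proj = nn.Conv1d(dim, dim, kernel_size=1)   # embeds local features of the auxiliary modality
        self.fuse = nn.Conv1d(dim, dim, kernel_size=1)

    def forward(self, main_feat, aux_feat):
        # main_feat, aux_feat: (B, T, D) snippet-level features
        global_ctx = self.global_proj(main_feat.mean(dim=1))        # (B, D)
        local_aux = self.local_proj(aux_feat.transpose(1, 2))       # (B, D, T)
        # channel-wise attention intended to suppress task-irrelevant redundancy
        attn = torch.sigmoid(self.fuse(global_ctx.unsqueeze(-1) * local_aux))  # (B, D, T)
        recalibrated = main_feat * attn.transpose(1, 2)              # (B, T, D)
        return recalibrated, attn


def mutual_learning_loss(attn_a, attn_b):
    """Each CCM's attention acts as a detached pseudo target for the other,
    encouraging consistent predictions from the two branches (assumed MSE form)."""
    return (F.mse_loss(attn_a, attn_b.detach())
            + F.mse_loss(attn_b, attn_a.detach()))


# Usage sketch: two identical CCMs, one per modality direction.
rgb = torch.randn(2, 100, 1024)    # (batch, snippets, channels); shapes are hypothetical
flow = torch.randn(2, 100, 1024)
ccm_rgb, ccm_flow = CrossModalConsensusModule(), CrossModalConsensusModule()
rgb_recal, attn_rgb = ccm_rgb(rgb, flow)      # RGB as the main modality
flow_recal, attn_flow = ccm_flow(flow, rgb)   # flow as the main modality
loss_consistency = mutual_learning_loss(attn_rgb, attn_flow)
```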
Related papers
- Interactive incremental learning of generalizable skills with local trajectory modulation [14.416251854298409]
We propose an interactive imitation learning framework that simultaneously leverages local and global modulations of trajectory distributions.
Our approach exploits the concept of via-points to incrementally and interactively 1) improve the model accuracy locally, 2) add new objects to the task during execution and 3) extend the skill into regions where demonstrations were not provided.
arXiv Detail & Related papers (2024-09-09T14:22:19Z) - S$^3$M-Net: Joint Learning of Semantic Segmentation and Stereo Matching
for Autonomous Driving [40.305452898732774]
S$^3$M-Net is a novel joint learning framework developed to perform semantic segmentation and stereo matching simultaneously.
S$^3$M-Net shares the features extracted from RGB images between both tasks, resulting in an improved overall scene understanding capability.
arXiv Detail & Related papers (2024-01-21T06:47:33Z) - Towards Lightweight Cross-domain Sequential Recommendation via External
Attention-enhanced Graph Convolution Network [7.1102362215550725]
Cross-domain Sequential Recommendation (CSR) depicts the evolution of behavior patterns for overlapped users by modeling their interactions from multiple domains.
We introduce a lightweight external attention-enhanced GCN-based framework to solve the above challenges, namely LEA-GCN.
To further lighten the framework structure and aggregate user-specific sequential patterns, we devise a novel dual-channel External Attention (EA) component.
arXiv Detail & Related papers (2023-02-07T03:06:29Z) - USER: Unified Semantic Enhancement with Momentum Contrast for Image-Text
Retrieval [115.28586222748478]
Image-Text Retrieval (ITR) aims at searching for the target instances that are semantically relevant to the given query from the other modality.
Existing approaches typically suffer from two major limitations.
arXiv Detail & Related papers (2023-01-17T12:42:58Z) - Learning from Temporal Spatial Cubism for Cross-Dataset Skeleton-based
Action Recognition [88.34182299496074]
Action labels are only available on a source dataset, but unavailable on a target dataset in the training stage.
We utilize a self-supervision scheme to reduce the domain shift between two skeleton-based action datasets.
By segmenting and permuting temporal segments or human body parts, we design two self-supervised learning classification tasks.
arXiv Detail & Related papers (2022-07-17T07:05:39Z) - SpatioTemporal Focus for Skeleton-based Action Recognition [66.8571926307011]
Graph convolutional networks (GCNs) are widely adopted in skeleton-based action recognition.
We argue that the performance of recently proposed skeleton-based action recognition methods is limited by the following factors.
Inspired by the recent attention mechanism, we propose a multi-grain contextual focus module, termed MCF, to capture the action associated relation information.
arXiv Detail & Related papers (2022-03-31T02:45:24Z) - Joint-bone Fusion Graph Convolutional Network for Semi-supervised
Skeleton Action Recognition [65.78703941973183]
We propose a novel correlation-driven joint-bone fusion graph convolutional network (CD-JBF-GCN) as an encoder and use a pose prediction head as a decoder.
Specifically, the CD-JBF-GCN can explore the motion transmission between the joint stream and the bone stream.
The pose prediction based auto-encoder in the self-supervised training stage allows the network to learn motion representation from unlabeled data.
arXiv Detail & Related papers (2022-02-08T16:03:15Z) - CTRN: Class-Temporal Relational Network for Action Detection [7.616556723260849]
We introduce an end-to-end network: the Class-Temporal Relational Network (CTRN).
CTRN contains three key components: The Transform Representation Module, the Class-Temporal Module and the G-classifier.
We evaluate CTRN on three densely labelled datasets and achieve state-of-the-art performance.
arXiv Detail & Related papers (2021-10-26T08:15:47Z) - Learning to Combine the Modalities of Language and Video for Temporal
Moment Localization [4.203274985072923]
Temporal moment localization aims to retrieve the best video segment matching a moment specified by a query.
We introduce a novel recurrent unit, cross-modal long short-term memory (CM-LSTM), by mimicking the human cognitive process of localizing temporal moments.
We also devise a two-stream attention mechanism over both the query-attended and unattended video features to prevent necessary visual information from being neglected.
arXiv Detail & Related papers (2021-09-07T08:25:45Z) - CoADNet: Collaborative Aggregation-and-Distribution Networks for
Co-Salient Object Detection [91.91911418421086]
Co-Salient Object Detection (CoSOD) aims at discovering salient objects that repeatedly appear in a given query group containing two or more relevant images.
One challenging issue is how to effectively capture co-saliency cues by modeling and exploiting inter-image relationships.
We present an end-to-end collaborative aggregation-and-distribution network (CoADNet) to capture both salient and repetitive visual patterns from multiple images.
arXiv Detail & Related papers (2020-11-10T04:28:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.