Query by Activity Video in the Wild
- URL: http://arxiv.org/abs/2311.13895v1
- Date: Thu, 23 Nov 2023 10:26:36 GMT
- Title: Query by Activity Video in the Wild
- Authors: Tao Hu, William Thong, Pascal Mettes, Cees G.M. Snoek
- Abstract summary: In current query-by-activity-video literature, a common assumption is that all activities have sufficient labelled examples when learning an embedding.
We propose a visual-semantic embedding network that explicitly deals with the imbalanced scenario for activity retrieval.
- Score: 52.42177539947216
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: This paper focuses on activity retrieval from a video query in an imbalanced
scenario. In current query-by-activity-video literature, a common assumption is
that all activities have sufficient labelled examples when learning an
embedding. In practice, however, this assumption does not hold: only a portion
of activities have many examples, while the remaining activities are described
by only a few. In this paper, we propose a visual-semantic embedding network
that explicitly deals with the imbalanced scenario for activity retrieval. Our
network contains two novel modules. The visual alignment module performs a
global alignment between the input video and fixed-sized visual bank
representations for all activities. The semantic module performs an alignment
between the input video and fixed-sized semantic activity representations. By
matching videos with both visual and semantic activity representations that are
of equal size over all activities, we no longer ignore infrequent activities
during retrieval. Experiments on a new imbalanced activity retrieval benchmark
show the effectiveness of our approach for all types of activities.
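The matching idea lends itself to a short sketch. Below is a minimal NumPy illustration of scoring one query video against equally sized per-activity visual banks and semantic representations; the function names, the max-pooled bank alignment, and the weighting parameter alpha are assumptions for illustration, not the paper's actual implementation.

    import numpy as np

    def l2_normalize(x, axis=-1, eps=1e-8):
        # Normalize embeddings so dot products become cosine similarities.
        return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

    def retrieval_scores(video_emb, visual_bank, semantic_bank, alpha=0.5):
        """Score every activity for one query video.

        video_emb:     (d,)      embedding of the query video
        visual_bank:   (C, k, d) k fixed-size visual prototypes per activity
        semantic_bank: (C, d)    one fixed-size semantic vector per activity
        alpha:         weight between visual and semantic similarity

        NOTE: the max-pooled bank alignment and the alpha weighting are
        illustrative assumptions, not the paper's exact formulation.
        """
        v = l2_normalize(video_emb)
        # Visual alignment: best match against each activity's fixed-size
        # bank, so rare activities get as many prototypes as frequent ones.
        vis_sim = (l2_normalize(visual_bank) @ v).max(axis=1)  # (C,)
        # Semantic alignment: similarity to each activity's semantic vector.
        sem_sim = l2_normalize(semantic_bank) @ v              # (C,)
        return alpha * vis_sim + (1 - alpha) * sem_sim

    # Toy usage: 10 activities, 4 visual prototypes each, 128-d embeddings.
    rng = np.random.default_rng(0)
    scores = retrieval_scores(rng.normal(size=128),
                              rng.normal(size=(10, 4, 128)),
                              rng.normal(size=(10, 128)))
    print(np.argsort(-scores)[:3])  # top-3 retrieved activities

Because every activity contributes the same number of bank entries regardless of how many training examples it has, infrequent activities are not drowned out at retrieval time, which is the core of the imbalance argument above.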
Related papers
- Skim then Focus: Integrating Contextual and Fine-grained Views for Repetitive Action Counting [87.11995635760108]
The key to action counting is accurately locating each video's repetitive actions.
We propose a dual-branch network, SkimFocusNet, which works in two steps.
arXiv Detail & Related papers (2024-06-13T05:15:52Z)
- Distribution Matching for Multi-Task Learning of Classification Tasks: a Large-Scale Study on Faces & Beyond [62.406687088097605]
Multi-Task Learning (MTL) is a framework, where multiple related tasks are learned jointly and benefit from a shared representation space.
We show that MTL can be successful on classification tasks with little or even non-overlapping annotation.
We propose a novel approach, where knowledge exchange is enabled between the tasks via distribution matching.
arXiv Detail & Related papers (2024-01-02T14:18:11Z)
- Video-Specific Query-Key Attention Modeling for Weakly-Supervised Temporal Action Localization [14.43055117008746]
Weakly-supervised temporal action localization aims to identify and localize action instances in untrimmed videos using only video-level action labels.
We propose a network named VQK-Net with a video-specific query-key attention modeling that learns a unique query for each action category of each input video.
arXiv Detail & Related papers (2023-05-07T04:18:22Z)
- Audio-Adaptive Activity Recognition Across Video Domains [112.46638682143065]
We leverage activity sounds for domain adaptation as they have less variance across domains and can reliably indicate which activities are not happening.
We propose an audio-adaptive encoder and associated learning methods that discriminatively adjust the visual feature representation.
We also introduce the new task of actor shift, with a corresponding audio-visual dataset, to challenge our method with situations where the activity appearance changes dramatically.
arXiv Detail & Related papers (2022-03-27T08:15:20Z)
- Temporal Action Segmentation with High-level Complex Activity Labels [29.17792724210746]
We learn action segments using only high-level activity labels as input.
We propose a novel action discovery framework that automatically discovers constituent actions in videos.
arXiv Detail & Related papers (2021-08-15T09:50:42Z)
- Multi-Label Activity Recognition using Activity-specific Features and Activity Correlations [15.356959177480965]
We introduce an approach to multi-label activity recognition that extracts independent feature descriptors for each activity and learns activity correlations.
Our method outperforms state-of-the-art approaches on four multi-label activity recognition datasets.
arXiv Detail & Related papers (2020-09-16T01:57:34Z)
- Revisiting Few-shot Activity Detection with Class Similarity Control [107.79338380065286]
We present a framework for few-shot temporal activity detection based on proposal regression.
Our model is end-to-end trainable, takes into account the frame rate differences between few-shot activities and untrimmed test videos, and can benefit from additional few-shot examples.
arXiv Detail & Related papers (2020-03-31T22:02:38Z)
- ZSTAD: Zero-Shot Temporal Activity Detection [107.63759089583382]
We propose a novel task setting called zero-shot temporal activity detection (ZSTAD), where activities that have never been seen in training can still be detected.
We design an end-to-end deep network based on R-C3D as the architecture for this solution.
Experiments on both the THUMOS14 and the Charades datasets show promising performance in terms of detecting unseen activities.
arXiv Detail & Related papers (2020-03-12T02:40:36Z)