Tell me what you see: A zero-shot action recognition method based on
natural language descriptions
- URL: http://arxiv.org/abs/2112.09976v2
- Date: Mon, 11 Sep 2023 17:57:15 GMT
- Title: Tell me what you see: A zero-shot action recognition method based on
natural language descriptions
- Authors: Valter Estevam and Rayson Laroca and David Menotti and Helio Pedrini
- Abstract summary: We propose using video captioning methods to extract semantic information from videos.
To the best of our knowledge, this is the first work to represent both videos and labels with descriptive sentences.
We build a shared semantic space employing BERT-based embedders pre-trained on the paraphrasing task using multiple text datasets.
- Score: 3.136605193634262
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a novel approach to Zero-Shot Action Recognition. Recent
works have explored the detection and classification of objects to obtain
semantic information from videos with remarkable performance. Inspired by them,
we propose using video captioning methods to extract semantic information about
objects, scenes, humans, and their relationships. To the best of our knowledge,
this is the first work to represent both videos and labels with descriptive
sentences. More specifically, we represent videos using sentences generated via
video captioning methods and classes using sentences extracted from documents
acquired through search engines on the Internet. Using these representations,
we build a shared semantic space employing BERT-based embedders pre-trained on
the paraphrasing task on multiple text datasets. The projection of both visual
and semantic information onto this space is straightforward, as they are
sentences, enabling classification using the nearest neighbor rule. We
demonstrate that representing videos and labels with sentences alleviates the
domain adaptation problem. Additionally, we show that word vectors are
unsuitable for building the semantic embedding space of our descriptions. Our
method outperforms the state of the art on the UCF101 dataset by
3.3 p.p. in accuracy under the TruZe protocol and achieves competitive results
on both the UCF101 and HMDB51 datasets under the conventional protocol (0/50%
- training/testing split). Our code is available at
https://github.com/valterlej/zsarcap.
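At inference time, the approach described above reduces to embedding two sets of sentences and applying the nearest-neighbor rule in the resulting space. Below is a minimal sketch of that step, assuming a sentence-transformers paraphrase model as the BERT-based embedder; the model name, captions, and class descriptions are illustrative placeholders rather than the authors' exact setup (see the repository above for the actual implementation).

```python
# Minimal sketch: embed captioning-generated video sentences and web-derived
# class descriptions with a paraphrase sentence encoder, then classify each
# video by nearest neighbor (cosine similarity). All inputs are placeholders.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("paraphrase-distilroberta-base-v1")  # assumed paraphrase model

# Sentences a video captioning model might produce (one per test video).
video_captions = [
    "a man is dribbling a basketball on an outdoor court",
    "a woman plays an acoustic guitar on a stage",
]

# Class descriptions harvested from web documents (one per action label).
class_descriptions = {
    "Basketball": "Basketball is a team sport in which players dribble a ball and shoot it through a hoop.",
    "PlayingGuitar": "Playing the guitar involves strumming or plucking strings to produce music.",
}

video_emb = embedder.encode(video_captions, convert_to_tensor=True)
class_emb = embedder.encode(list(class_descriptions.values()), convert_to_tensor=True)

# Nearest-neighbor rule in the shared semantic space.
scores = util.cos_sim(video_emb, class_emb)  # shape: (num_videos, num_classes)
labels = list(class_descriptions.keys())
for caption, idx in zip(video_captions, scores.argmax(dim=1)):
    print(f"{caption!r} -> {labels[int(idx)]}")
```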
Related papers
- An Evaluation of Large Pre-Trained Models for Gesture Recognition using Synthetic Videos [32.257816070522885]
We explore the possibility of using synthetically generated data for video-based gesture recognition with large pre-trained models.
We use various state-of-the-art video encoders to extract features for use in k-nearest neighbors classification.
We find that using synthetic training videos yields significantly lower classification accuracy on real test videos compared to using a relatively small number of real training videos.
arXiv Detail & Related papers (2024-10-03T02:31:14Z) - Reasoning over the Behaviour of Objects in Video-Clips for Adverb-Type Recognition [54.938128496934695]
- Reasoning over the Behaviour of Objects in Video-Clips for Adverb-Type Recognition [54.938128496934695]
We propose a new framework that reasons over object behaviours extracted from raw video clips to recognize the clip's corresponding adverb types.
Specifically, we propose a novel pipeline that extracts human-interpretable object-behaviour-facts from raw video clips.
We release two new datasets of object-behaviour-facts extracted from raw video clips.
arXiv Detail & Related papers (2023-07-09T09:04:26Z) - Exploiting Auxiliary Caption for Video Grounding [66.77519356911051]
Video grounding aims to locate a moment of interest matching a given query sentence from an untrimmed video.
Previous works ignore the sparsity of video annotations, which fail to provide context linking potential events and query sentences in the dataset.
We propose an Auxiliary Caption Network (ACNet) for video grounding. Specifically, we first introduce dense video captioning to generate dense captions and then obtain auxiliary captions by Non-Auxiliary Caption Suppression (NACS).
To capture the potential information in auxiliary captions, we propose Caption Guided Attention (CGA) to project the semantic relations between auxiliary captions and …
arXiv Detail & Related papers (2023-01-15T02:04:02Z) - Simplifying Open-Set Video Domain Adaptation with Contrastive Learning [16.72734794723157]
Unsupervised video domain adaptation methods have been proposed to adapt a predictive model from a labelled dataset to an unlabelled dataset.
We address a more realistic scenario, called open-set unsupervised video domain adaptation (OUVDA), where the target dataset contains "unknown" semantic categories that are not shared with the source.
We propose a video-oriented temporal contrastive loss that enables our method to better cluster the feature space by exploiting the freely available temporal information in video data.
arXiv Detail & Related papers (2023-01-09T13:16:50Z) - Learning Using Privileged Information for Zero-Shot Action Recognition [15.9032110752123]
- Learning Using Privileged Information for Zero-Shot Action Recognition [15.9032110752123]
This paper presents a novel method that uses object semantics as privileged information to narrow the semantic gap.
Experiments on the Olympic Sports, HMDB51 and UCF101 datasets have shown that the proposed method outperforms the state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2022-06-17T08:46:09Z) - Contrastive Graph Multimodal Model for Text Classification in Videos [9.218562155255233]
We are the first to address this new task of video text classification by fusing multimodal information.
We tailor a specific module called CorrelationNet to reinforce feature representation by explicitly extracting layout information.
We construct a new well-defined industrial dataset from the news domain, called TI-News, which is dedicated to building and evaluating video text recognition and classification applications.
arXiv Detail & Related papers (2022-06-06T04:06:21Z) - Video-Text Pre-training with Learned Regions [59.30893505895156]
Video-Text pre-training aims at learning transferable representations from large-scale video-text pairs.
We propose a module for video-text learning, RegionLearner, which can take into account the structure of objects during pre-training on large-scale video-text pairs.
arXiv Detail & Related papers (2021-12-02T13:06:53Z) - Watch and Learn: Mapping Language and Noisy Real-world Videos with
Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, ApartmenTour, that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z) - Learning Video Representations from Textual Web Supervision [97.78883761035557]
We propose to use text as a method for learning video representations.
We collect 70M video clips shared publicly on the Internet and train a model to pair each video with its associated text.
We find that this approach is an effective method of pre-training video representations.
arXiv Detail & Related papers (2020-07-29T16:19:50Z) - Transferring Cross-domain Knowledge for Video Sign Language Recognition [103.9216648495958]
Word-level sign language recognition (WSLR) is a fundamental task in sign language interpretation.
We propose a novel method that learns domain-invariant visual concepts and fertilizes WSLR models by transferring knowledge of subtitled news sign language to them.
arXiv Detail & Related papers (2020-03-08T03:05:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.