Speech2Action: Cross-modal Supervision for Action Recognition
- URL: http://arxiv.org/abs/2003.13594v1
- Date: Mon, 30 Mar 2020 16:22:39 GMT
- Title: Speech2Action: Cross-modal Supervision for Action Recognition
- Authors: Arsha Nagrani, Chen Sun, David Ross, Rahul Sukthankar, Cordelia
Schmid, Andrew Zisserman
- Abstract summary: We train a BERT-based Speech2Action classifier on over a thousand movie screenplays.
We then apply this model to the speech segments of a large unlabelled movie corpus.
Using the predictions of this model, we obtain weak action labels for over 800K video clips.
- Score: 127.10071447772407
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Is it possible to guess human action from dialogue alone? In this work we
investigate the link between spoken words and actions in movies. We note that
movie screenplays describe actions, as well as contain the speech of characters
and hence can be used to learn this correlation with no additional supervision.
We train a BERT-based Speech2Action classifier on over a thousand movie
screenplays, to predict action labels from transcribed speech segments. We then
apply this model to the speech segments of a large unlabelled movie corpus
(188M speech segments from 288K movies). Using the predictions of this model,
we obtain weak action labels for over 800K video clips. By training on these
video clips, we demonstrate superior action recognition performance on standard
action recognition benchmarks, without using a single manually labelled action
example.
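A minimal sketch of the weak-labelling step described above, assuming a BERT sequence classifier already fine-tuned on screenplay (speech, action-verb) pairs; the checkpoint path, label set, and confidence threshold are illustrative placeholders, not the authors' released artifacts:

```python
# Illustrative Speech2Action-style weak labelling (not the authors' code).
# Assumes a BERT classifier fine-tuned on screenplay (speech, verb) pairs;
# the checkpoint path, label set, and threshold are hypothetical.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

ACTION_LABELS = ["drive", "phone", "run", "open", "eat"]  # example verb classes

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "path/to/speech2action-checkpoint",  # hypothetical fine-tuned weights
    num_labels=len(ACTION_LABELS),
)
model.eval()

def weak_label(speech_segment: str, threshold: float = 0.9):
    """Return a weak action label for a transcribed speech segment,
    or None when the classifier is not confident enough."""
    inputs = tokenizer(speech_segment, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(dim=-1).squeeze(0)
    conf, idx = probs.max(dim=0)
    return ACTION_LABELS[idx] if conf >= threshold else None

print(weak_label("Keep your hands on the wheel and your eyes on the road."))
```

Clips whose speech receives a confident prediction inherit that verb as a weak label; in the paper, applying the classifier to 188M speech segments yields weak labels for over 800K clips, which then serve as ordinary training data for a video model.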
Related papers
- Beyond Label Semantics: Language-Guided Action Anatomy for Few-shot Action Recognition [16.07037171149096]
Few-shot action recognition (FSAR) aims to classify human actions in videos with only a small number of labelled samples per category. We propose Language-Guided Action Anatomy (LGA), a novel framework that goes beyond label semantics. For text, we prompt an off-the-shelf LLM to anatomize labels into sequences of atomic action descriptions (a sketch of this step follows below). For videos, a Visual Anatomy Module segments actions into atomic video phases to capture the sequential structure of actions.
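For the text branch, a hedged sketch of the label-anatomization step might look as follows; it uses the OpenAI chat-completions client for concreteness, and the model name and prompt wording are assumptions rather than the paper's:

```python
# Illustrative LLM-based label anatomization (not the authors' code).
# The OpenAI model name and the prompt wording are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def anatomize(action_label: str) -> list[str]:
    """Ask an LLM to break an action label into ordered atomic sub-actions."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical choice of off-the-shelf LLM
        messages=[{
            "role": "user",
            "content": (
                f"Decompose the human action '{action_label}' into a short, "
                "ordered list of atomic sub-actions, one per line."
            ),
        }],
    )
    return [line.strip("- ").strip()
            for line in response.choices[0].message.content.splitlines()
            if line.strip()]

print(anatomize("making coffee"))
```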
arXiv Detail & Related papers (2025-07-22T07:16:25Z)
- MoCha: Towards Movie-Grade Talking Character Synthesis [62.007000023747445]
We introduce Talking Characters, a more realistic task: generating talking character animations directly from speech and text.
Unlike talking head generation, Talking Characters aims to generate the full portrait of one or more characters, beyond the facial region alone.
We propose MoCha, the first model of its kind to generate talking characters.
arXiv Detail & Related papers (2025-03-30T04:22:09Z)
- Pseudo-labeling with Keyword Refining for Few-Supervised Video Captioning [42.0725330677271]
We propose a few-supervised video captioning framework that consists of a lexically constrained pseudo-labeling module and a keyword-refined captioning module.
Experiments on several benchmarks demonstrate the advantages of the proposed approach in both few-supervised and fully-supervised scenarios.
arXiv Detail & Related papers (2024-11-06T17:11:44Z)
- Character-aware audio-visual subtitling in context [58.95580154761008]
This paper presents an improved framework for character-aware audio-visual subtitling in TV shows.
Our approach integrates speech recognition, speaker diarisation, and character recognition, utilising both audio and visual cues.
We validate the method on a dataset with 12 TV shows, demonstrating superior performance in speaker diarisation and character recognition accuracy compared to existing approaches.
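As a rough illustration of how such cues can be fused, the sketch below assigns each transcribed segment the speaker cluster with the largest temporal overlap and maps clusters to character names; the data structures and the cluster-to-character mapping are hypothetical, not the paper's pipeline:

```python
# Illustrative fusion of ASR output with speaker diarisation (assumed inputs).
from dataclasses import dataclass

@dataclass
class Segment:
    start: float
    end: float
    payload: str  # transcript text, or a speaker-cluster id

def overlap(a: Segment, b: Segment) -> float:
    """Temporal overlap in seconds between two segments."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))

def attribute(asr, diar, cluster_to_character):
    """Label each transcript segment with the best-overlapping character."""
    out = []
    for seg in asr:
        best = max(diar, key=lambda d: overlap(seg, d), default=None)
        name = cluster_to_character.get(best.payload, "UNKNOWN") if best else "UNKNOWN"
        out.append((name, seg.payload))
    return out

subs = attribute(
    asr=[Segment(0.0, 2.1, "Where were you last night?")],
    diar=[Segment(0.0, 2.3, "spk0")],
    cluster_to_character={"spk0": "Morse"},  # e.g. from face/voice recognition
)
print(subs)  # [('Morse', 'Where were you last night?')]
```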
arXiv Detail & Related papers (2024-10-14T20:27:34Z)
- TokenSplit: Using Discrete Speech Representations for Direct, Refined, and Transcript-Conditioned Speech Separation and Recognition [51.565319173790314]
TokenSplit is a sequence-to-sequence encoder-decoder model that uses the Transformer architecture.
We show that our model achieves excellent separation performance, both with and without transcript conditioning.
We also measure the automatic speech recognition (ASR) performance and provide audio samples of speech synthesis to demonstrate the additional utility of our model.
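A toy sketch of the general shape of such a model, a Transformer encoder-decoder over discrete speech tokens, follows; the vocabulary size, depth, and width are illustrative and far smaller than any real configuration:

```python
# Toy TokenSplit-style encoder-decoder over discrete speech tokens
# (dimensions and vocabulary are illustrative, not the paper's).
import torch
import torch.nn as nn

VOCAB, D_MODEL = 1024, 256  # assumed token vocabulary and model width

class Seq2SeqSeparator(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        self.transformer = nn.Transformer(
            d_model=D_MODEL, nhead=4,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True,
        )
        self.head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, mixture_tokens, target_tokens):
        # Encode tokens of the mixed audio; decode one separated source.
        mask = nn.Transformer.generate_square_subsequent_mask(target_tokens.size(1))
        out = self.transformer(self.embed(mixture_tokens),
                               self.embed(target_tokens), tgt_mask=mask)
        return self.head(out)  # per-step logits over the token vocabulary

model = Seq2SeqSeparator()
logits = model(torch.randint(0, VOCAB, (2, 50)), torch.randint(0, VOCAB, (2, 40)))
print(logits.shape)  # torch.Size([2, 40, 1024])
```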
arXiv Detail & Related papers (2023-08-21T01:52:01Z)
- Movie101: A New Movie Understanding Benchmark [47.24519006577205]
We construct a large-scale Chinese movie benchmark, named Movie101.
We propose a new metric called Movie Narration Score (MNScore) for movie narrating evaluation.
For both tasks, our proposed methods leverage external knowledge well and outperform carefully designed baselines.
arXiv Detail & Related papers (2023-05-20T08:43:51Z)
- An Action Is Worth Multiple Words: Handling Ambiguity in Action Recognition [18.937012620464465]
We address the challenge of training multi-label action recognition models from only single positive training labels.
We propose two approaches based on generating pseudo training examples sampled from similar instances within the training set, as sketched below.
We create a new evaluation benchmark by manually annotating a subset of EPIC-Kitchens-100's validation set with multiple verb labels.
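One simple way to realize "pseudo examples from similar instances" is nearest-neighbour label propagation in feature space; the sketch below is a hedged illustration of that idea, with the feature source and the neighbour count k left as assumptions:

```python
# Illustrative nearest-neighbour pseudo-labelling (assumed features and k).
import numpy as np

def pseudo_multilabels(features: np.ndarray, single_labels: np.ndarray,
                       k: int = 5) -> list[set[int]]:
    """Union each clip's single verb label with those of its k nearest
    neighbours (cosine similarity) to form a pseudo multi-label set."""
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = normed @ normed.T
    np.fill_diagonal(sims, -np.inf)  # never count a clip as its own neighbour
    out = []
    for i in range(len(features)):
        nbrs = np.argsort(sims[i])[-k:]  # indices of the k most similar clips
        out.append({int(single_labels[i]), *map(int, single_labels[nbrs])})
    return out

rng = np.random.default_rng(0)
labels = pseudo_multilabels(rng.normal(size=(100, 64)),
                            rng.integers(0, 10, size=100))
print(labels[0])  # e.g. a small set of candidate verb ids
```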
arXiv Detail & Related papers (2022-10-10T18:06:43Z)
- Bridge-Prompt: Towards Ordinal Action Understanding in Instructional Videos [92.18898962396042]
We propose a prompt-based framework, Bridge-Prompt, to model the semantics across adjacent actions.
We reformulate the individual action labels as integrated text prompts for supervision, which bridge the gap between individual action semantics.
Bridge-Prompt achieves state-of-the-art performance on multiple benchmarks.
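As a hedged illustration of turning an ordered list of action labels into integrated text prompts, consider the following; the templates are invented for illustration and are not the paper's exact prompt set:

```python
# Illustrative ordinal prompt construction (templates are assumptions).
def ordinal_prompts(actions: list[str]) -> list[str]:
    """Turn an ordered action list into per-step and whole-sequence prompts."""
    prompts = [f"This is step {i + 1}: the person is {act}."
               for i, act in enumerate(actions)]
    prompts.append("The person performs " + ", then ".join(actions) + ".")
    return prompts

for p in ordinal_prompts(["cracking the egg", "whisking", "pouring the batter"]):
    print(p)
```

Prompts like these would then be encoded by a text encoder and matched against video features to provide supervision.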
arXiv Detail & Related papers (2022-03-26T15:52:27Z)
- BABEL: Bodies, Action and Behavior with English Labels [53.83774092560076]
We present BABEL, a large dataset with language labels describing the actions being performed in mocap sequences.
BABEL contains over 28k sequence labels and 63k frame labels, belonging to over 250 unique action categories.
We demonstrate the value of BABEL as a benchmark, and evaluate the performance of models on 3D action recognition.
arXiv Detail & Related papers (2021-06-17T17:51:14Z)
- Fine-grained Emotion and Intent Learning in Movie Dialogues [1.2891210250935146]
We propose a novel large-scale emotional dialogue dataset, consisting of 1M dialogues retrieved from the OpenSubtitles corpus.
This work explains the complex pipeline used to preprocess movie subtitles and select good movie dialogues to annotate.
Emotional dialogue classification has never been attempted at this scale before, in terms of both dataset size and the granularity of emotion and intent categories.
arXiv Detail & Related papers (2020-12-25T20:29:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.