FunnyNet-W: Multimodal Learning of Funny Moments in Videos in the Wild
- URL: http://arxiv.org/abs/2401.04210v1
- Date: Mon, 8 Jan 2024 19:39:36 GMT
- Title: FunnyNet-W: Multimodal Learning of Funny Moments in Videos in the Wild
- Authors: Zhi-Song Liu, Robin Courant, Vicky Kalogeiton
- Abstract summary: We propose FunnyNet-W, a model that relies on cross- and self-attention for visual, audio and text data to predict funny moments in videos.
We provide experiments on five datasets: the sitcoms TBBT, MHD, MUStARD, Friends, and the TED talk UR-Funny.
- Score: 12.530540250653633
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatically understanding funny moments (i.e., the moments that make people
laugh) when watching comedy is challenging, as they relate to various features,
such as body language, dialogues and culture. In this paper, we propose
FunnyNet-W, a model that relies on cross- and self-attention for visual, audio
and text data to predict funny moments in videos. Unlike most methods that rely
on ground truth data in the form of subtitles, in this work we exploit
modalities that come naturally with videos: (a) video frames, as they contain
visual information indispensable for scene understanding; (b) audio, as it
contains higher-level cues associated with funny moments, such as intonation,
pitch and pauses; and (c) text automatically extracted with a speech-to-text
model, as it can provide rich information when processed by a Large Language
Model. To acquire labels for training, we propose an unsupervised approach that
spots and labels funny audio moments. We provide experiments on five datasets:
the sitcoms TBBT, MHD, MUStARD, Friends, and the TED talk UR-Funny. Extensive
experiments and analysis show that FunnyNet-W successfully exploits visual,
auditory and textual cues to identify funny moments, while our findings reveal
FunnyNet-W's ability to predict funny moments in the wild. FunnyNet-W sets the
new state of the art for funny moment detection with multimodal cues on all
datasets with and without using ground truth information.
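The abstract describes the fusion mechanism only at a high level (cross- and self-attention over visual, audio and text features). The PyTorch-style sketch below illustrates the general idea of such a fusion head; the module name CrossModalFusion, the token and embedding dimensions, and the binary funny/not-funny classifier are illustrative assumptions, not the authors' exact FunnyNet-W architecture.

# Minimal sketch of cross- and self-attention fusion over visual, audio and
# text embeddings, in the spirit of the FunnyNet-W description above.
# All names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.classifier = nn.Linear(dim, 1)  # funny vs. not funny

    def forward(self, visual, audio, text):
        # visual, audio, text: (batch, tokens, dim) embeddings from
        # per-modality encoders (e.g. a video backbone, an audio encoder,
        # and a speech-to-text + language model pipeline).
        tokens = torch.cat([visual, audio, text], dim=1)

        # Cross-attention: audio tokens query the joint visual+text context.
        fused, _ = self.cross_attn(query=audio, key=tokens, value=tokens)
        fused = self.norm1(fused + audio)

        # Self-attention refines the fused representation.
        refined, _ = self.self_attn(fused, fused, fused)
        refined = self.norm2(refined + fused)

        # Pool over tokens and predict a per-clip funniness score.
        return self.classifier(refined.mean(dim=1)).squeeze(-1)

if __name__ == "__main__":
    batch, dim = 2, 512
    v = torch.randn(batch, 16, dim)   # 16 frame tokens
    a = torch.randn(batch, 8, dim)    # 8 audio tokens
    t = torch.randn(batch, 32, dim)   # 32 text tokens
    print(CrossModalFusion(dim)(v, a, t).shape)  # torch.Size([2])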
Related papers
- SMILE: Multimodal Dataset for Understanding Laughter in Video with Language Models [32.60274453610208]
We tackle a new challenge for machines to understand the rationale behind laughter in video.
Our proposed dataset, SMILE, comprises video clips and language descriptions of why people laugh.
arXiv Detail & Related papers (2023-12-15T14:17:45Z)
- Can Language Models Laugh at YouTube Short-form Videos? [40.47384055149102]
We curate a user-generated dataset of 10K multimodal funny videos from YouTube, called ExFunTube.
Using a video filtering pipeline with GPT-3.5, we verify both verbal and visual elements contributing to humor.
After filtering, we annotate each video with timestamps and text explanations for funny moments.
arXiv Detail & Related papers (2023-10-22T03:01:38Z)
- What You Say Is What You Show: Visual Narration Detection in Instructional Videos [108.77600799637172]
We introduce the novel task of visual narration detection, which entails determining whether a narration is visually depicted by the actions in the video.
We propose What You Say is What You Show (WYS2), a method that leverages multi-modal cues and pseudo-labeling to learn to detect visual narrations with only weakly labeled data.
Our model successfully detects visual narrations in in-the-wild videos, outperforming strong baselines, and we demonstrate its impact for state-of-the-art summarization and temporal alignment of instructional videos.
arXiv Detail & Related papers (2023-01-05T21:43:19Z)
- Language-free Training for Zero-shot Video Grounding [50.701372436100684]
Video grounding aims to localize the time interval that corresponds to a language query by understanding the text and video simultaneously.
One of the most challenging issues is the extremely time- and cost-consuming collection of annotations.
We present a simple yet novel training framework for video grounding in the zero-shot setting.
arXiv Detail & Related papers (2022-10-24T06:55:29Z)
- 3MASSIV: Multilingual, Multimodal and Multi-Aspect dataset of Social Media Short Videos [72.69052180249598]
We present 3MASSIV, a multilingual, multimodal and multi-aspect, expertly-annotated dataset of diverse short videos extracted from the short-video social media platform Moj.
3MASSIV comprises 50K short videos (20 seconds average duration) and 100K unlabeled videos in 11 different languages.
We show how the social media content in 3MASSIV is dynamic and temporal in nature, which can be used for semantic understanding tasks and cross-lingual analysis.
arXiv Detail & Related papers (2022-03-28T02:47:01Z)
- M2H2: A Multimodal Multiparty Hindi Dataset For Humor Recognition in Conversations [72.81164101048181]
We propose a dataset for Multimodal Multiparty Hindi Humor (M2H2) recognition in conversations, containing 6,191 utterances from 13 episodes of the very popular TV series "Shrimaan Shrimati Phir Se".
Each utterance is annotated with humor/non-humor labels and encompasses acoustic, visual, and textual modalities.
The empirical results on M2H2 dataset demonstrate that multimodal information complements unimodal information for humor recognition.
arXiv Detail & Related papers (2021-08-03T02:54:09Z)
- DeHumor: Visual Analytics for Decomposing Humor [36.300283476950796]
We develop DeHumor, a visual system for analyzing humorous behaviors in public speaking.
To intuitively reveal the building blocks of each concrete example, DeHumor decomposes each humorous video into multimodal features.
We show that DeHumor is able to highlight various building blocks of humor examples.
arXiv Detail & Related papers (2021-07-18T04:01:07Z)
- Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, ApartmenTour, that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)
- AVLnet: Learning Audio-Visual Language Representations from Instructional Videos [69.56522471911396]
We introduce the Audio-Video Language Network (AVLnet), a self-supervised network that learns a shared audio-visual embedding space directly from raw video inputs.
We train AVLnet on HowTo100M, a large corpus of publicly available instructional videos, and evaluate on image retrieval and video retrieval tasks.
Our code, data, and trained models will be released at avlnet.csail.mit.edu.
arXiv Detail & Related papers (2020-06-16T14:38:03Z)