SMILE: Multimodal Dataset for Understanding Laughter in Video with Language Models
- URL: http://arxiv.org/abs/2312.09818v3
- Date: Fri, 24 May 2024 09:45:09 GMT
- Title: SMILE: Multimodal Dataset for Understanding Laughter in Video with Language Models
- Authors: Lee Hyun, Kim Sung-Bin, Seungju Han, Youngjae Yu, Tae-Hyun Oh
- Abstract summary: We tackle a new challenge for machines to understand the rationale behind laughter in video.
Our proposed dataset, SMILE, comprises video clips and language descriptions of why people laugh.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite recent advances in artificial intelligence, building social intelligence remains a challenge. Among social signals, laughter is one of the distinctive expressions that occur during social interactions between humans. In this work, we tackle a new challenge for machines: understanding the rationale behind laughter in video, which we call Video Laugh Reasoning. We introduce this new task, which asks why people laugh in a particular video, along with a dataset for it. Our proposed dataset, SMILE, comprises video clips and language descriptions of why people laugh. We propose a baseline that leverages the reasoning capacity of large language models (LLMs) over a textual video representation. Experiments show that our baseline can generate plausible explanations for laughter. We further investigate the scalability of our baseline by probing other video understanding tasks and in-the-wild videos. We release our dataset, code, and model checkpoints at https://github.com/postech-ami/SMILE-Dataset.
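The baseline described in the abstract rests on a simple idea: flatten a clip into text (transcript, visual context, laughter cues) and ask an LLM to explain the laugh. The sketch below is only a minimal illustration of that idea; the field names, prompt wording, and `llm` callable are hypothetical placeholders, not the SMILE authors' released pipeline (their repository contains the actual prompts and models).

```python
# Hypothetical sketch of a "textual video representation + LLM" baseline.
# Everything named here (fields, helpers, prompt text) is illustrative.

from dataclasses import dataclass


@dataclass
class ClipAsText:
    """A video clip flattened into text an LLM can reason over."""
    transcript: str          # ASR / subtitle text of what is said
    visual_description: str  # captions of salient visual events
    laughter_timestamp: str  # where in the clip the audience laughs


def build_prompt(clip: ClipAsText) -> str:
    """Assemble one textual representation of the clip plus the question."""
    return (
        "You are given a textual description of a video clip.\n"
        f"Transcript: {clip.transcript}\n"
        f"Visual context: {clip.visual_description}\n"
        f"Laughter occurs at: {clip.laughter_timestamp}\n"
        "Question: Why do people laugh at this moment? "
        "Explain the rationale in one or two sentences."
    )


def explain_laughter(clip: ClipAsText, llm) -> str:
    """`llm` is any callable mapping a prompt string to a completion string,
    e.g. a thin wrapper around a hosted or local language model."""
    return llm(build_prompt(clip))


if __name__ == "__main__":
    clip = ClipAsText(
        transcript="Host: 'I fixed the bug.' Guest: 'Which one of the forty?'",
        visual_description="Two people at a desk; the guest raises an eyebrow.",
        laughter_timestamp="00:12",
    )
    # Plug in a real model here; a canned reply keeps the sketch self-contained.
    print(explain_laughter(clip, llm=lambda prompt: "(LLM explanation here)"))
```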
Related papers
- Getting Serious about Humor: Crafting Humor Datasets with Unfunny Large Language Models [27.936545041302377]
Large language models (LLMs) can generate synthetic data for humor detection via editing texts.
We benchmark LLMs on an existing human dataset and show that current LLMs display an impressive ability to 'unfun' jokes.
We extend our approach to a code-mixed English-Hindi humor dataset, where we find that GPT-4's synthetic data is highly rated by bilingual annotators.
arXiv Detail & Related papers (2024-02-23T02:58:12Z) - LaughTalk: Expressive 3D Talking Head Generation with Laughter [15.60843963655039]
We introduce a novel task to generate 3D talking heads capable of both articulate speech and authentic laughter.
Our newly curated dataset comprises 2D laughing videos paired with pseudo-annotated and human-validated 3D FLAME parameters.
Our method performs favorably compared to existing approaches in both talking head generation and expressing laughter signals.
arXiv Detail & Related papers (2023-11-02T05:04:33Z) - Can Language Models Laugh at YouTube Short-form Videos? [40.47384055149102]
We curate a user-generated dataset of 10K multimodal funny videos from YouTube, called ExFunTube.
Using a video filtering pipeline with GPT-3.5, we verify both verbal and visual elements contributing to humor.
After filtering, we annotate each video with timestamps and text explanations for funny moments.
arXiv Detail & Related papers (2023-10-22T03:01:38Z) - VideoLLM: Modeling Video Sequence with Large Language Models [70.32832021713864]
Existing video understanding models are often task-specific and lack a comprehensive capability of handling diverse tasks.
We propose a novel framework called VideoLLM that leverages the sequence reasoning capabilities of pre-trained LLMs.
VideoLLM incorporates a carefully designed Modality and Semantic Translator, which converts inputs from various modalities into a unified token sequence.
arXiv Detail & Related papers (2023-05-22T17:51:22Z) - A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot [67.00455874279383]
We propose verbalizing long videos to generate descriptions in natural language, then performing video-understanding tasks on the generated story as opposed to the original video.
Our method, despite being zero-shot, achieves significantly better results than supervised baselines for video understanding.
To alleviate the lack of story-understanding benchmarks, we publicly release the first dataset on a crucial task in computational social science: persuasion strategy identification.
arXiv Detail & Related papers (2023-05-16T19:13:11Z) - Laughing Matters: Introducing Laughing-Face Generation using Diffusion Models [35.688696422879175]
We propose a novel model capable of generating realistic laughter sequences, given a still portrait and an audio clip containing laughter.
We train our model on a diverse set of laughter datasets and introduce an evaluation metric specifically designed for laughter.
Our model achieves state-of-the-art performance across all metrics, even when competing methods are re-trained for laughter generation.
arXiv Detail & Related papers (2023-05-15T17:59:57Z) - Video Generation from Text Employing Latent Path Construction for Temporal Modeling [70.06508219998778]
Video generation is one of the most challenging tasks in the fields of Machine Learning and Computer Vision.
In this paper, we tackle the text-to-video generation problem, a conditional form of video generation.
We believe that video generation from natural language sentences will have an important impact on Artificial Intelligence.
arXiv Detail & Related papers (2021-07-29T06:28:20Z) - Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, 'ApartmenTour', that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z) - VIOLIN: A Large-Scale Dataset for Video-and-Language Inference [103.7457132841367]
We introduce a new task, Video-and-Language Inference, for joint multimodal understanding of video and text.
Given a video clip with aligned subtitles as the premise, paired with a natural language hypothesis based on the video content, a model must infer whether the hypothesis is entailed or contradicted by the clip.
A new large-scale dataset, named Violin (VIdeO-and-Language INference), is introduced for this task, which consists of 95,322 video-hypothesis pairs from 15,887 video clips.
arXiv Detail & Related papers (2020-03-25T20:39:05Z)