LiveChat: Video Comment Generation from Audio-Visual Multimodal Contexts
- URL: http://arxiv.org/abs/2311.12826v1
- Date: Sun, 1 Oct 2023 02:35:58 GMT
- Title: LiveChat: Video Comment Generation from Audio-Visual Multimodal Contexts
- Authors: Julien Lalanne, Raphael Bournet, Yi Yu
- Abstract summary: We create a large-scale audio-visual multimodal dialogue dataset to facilitate the development of live commenting technologies.
The data is collected from Twitch, with 11 different categories and 575 streamers for a total of 438 hours of video and 3.2 million comments.
We propose a novel multimodal generation model capable of generating live comments that align with the temporal and spatial events within the video.
- Score: 8.070778830276275
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Live commenting on video, a popular feature of live streaming platforms,
enables viewers to engage with the content and share their comments, reactions,
opinions, or questions with the streamer or other viewers while watching the
video or live stream. It presents a challenging testbed for AI agents, as it
requires the simultaneous understanding of audio-visual multimodal contexts
from live streams and the ability to interact with human viewers through
dialogue. Because existing live-streaming comment datasets cover only a
limited number of categories and lack diversity, we create a large-scale audio-visual
multimodal dialogue dataset to facilitate the development of live commenting
technologies. The data is collected from Twitch, with 11 different categories
and 575 streamers for a total of 438 hours of video and 3.2 million comments.
Moreover, we propose a novel multimodal generation model capable of generating
live comments that align with the temporal and spatial events within the video,
as well as with the ongoing multimodal dialogue context. Our initial results
have demonstrated the effectiveness of the proposed model, providing a robust
foundation for further research and practical applications in the field of live
video interaction.
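The abstract describes the corpus only at a high level (Twitch clips from 11 categories and 575 streamers, 438 hours of video, 3.2 million comments) and does not publish a data schema. The sketch below is a minimal, hypothetical Python layout for one training example in such a corpus, plus a helper that assembles the multimodal conditioning context (video, audio, and recent chat) a comment generator would consume; all field and function names are assumptions for illustration, not the released dataset format.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical record layout for one training example in a LiveChat-style
# corpus: a short audio-visual clip from a stream plus the time-stamped chat
# around it. Field names are illustrative assumptions, not the released schema.

@dataclass
class ChatMessage:
    timestamp: float   # seconds from the start of the clip
    author: str        # anonymised viewer or streamer id
    text: str

@dataclass
class LiveChatSample:
    category: str      # one of the 11 Twitch categories
    streamer_id: str   # one of the 575 streamers
    video_path: str    # path to the video clip
    audio_path: str    # path to the corresponding audio track
    context: List[ChatMessage] = field(default_factory=list)  # preceding comments
    target: Optional[ChatMessage] = None                      # comment to generate

def build_generation_input(sample: LiveChatSample, max_context: int = 10) -> dict:
    """Assemble the multimodal conditioning context for a comment generator:
    the clip's visual and audio streams plus the most recent chat messages."""
    recent = sorted(sample.context, key=lambda m: m.timestamp)[-max_context:]
    return {
        "video": sample.video_path,
        "audio": sample.audio_path,
        "dialogue": [f"{m.author}: {m.text}" for m in recent],
    }
```

A generator trained on such samples would be expected to produce a target comment consistent with both the clip's current audio-visual events and the preceding dialogue, which is the alignment property the abstract emphasises.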
Related papers
- VideoLLM-online: Online Video Large Language Model for Streaming Video [27.073238234038826]
We propose a novel Learning-In-Video-Stream framework, which enables temporally aligned, long-context, and real-time conversation within a continuous video stream.
Our framework supports streaming dialogue in a 5-minute video clip at over 10 FPS on an A100 GPU.
It also showcases state-of-the-art performance on public offline video benchmarks, such as recognition, captioning, and forecasting.
arXiv Detail & Related papers (2024-06-17T17:55:32Z) - Towards Multi-Task Multi-Modal Models: A Video Generative Perspective [5.495245220300184]
This thesis chronicles our endeavor to build multi-task models for generating videos and other modalities under diverse conditions.
We unveil a novel approach to mapping bidirectionally between visual observation and interpretable lexical terms.
Our scalable visual token representation proves beneficial across generation, compression, and understanding tasks.
arXiv Detail & Related papers (2024-05-26T23:56:45Z) - InternVideo2: Scaling Foundation Models for Multimodal Video Understanding [51.129913789991924]
InternVideo2 is a new family of video foundation models (FM) that achieve state-of-the-art results in video recognition, video-speech tasks, and video-centric tasks.
Our core design is a progressive training approach that unifies masked video modeling, cross-modal contrastive learning, and next-token prediction, scaling the video encoder up to 6B parameters.
arXiv Detail & Related papers (2024-03-22T17:57:42Z) - InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding
and Generation [90.71796406228265]
InternVid is a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations.
The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions totaling 4.1B words.
arXiv Detail & Related papers (2023-07-13T17:58:32Z) - Affective Faces for Goal-Driven Dyadic Communication [16.72177738101024]
We introduce a video framework for modeling the association between verbal and non-verbal communication during dyadic conversation.
Our approach retrieves a video of a listener whose facial expressions would be socially appropriate given the context.
arXiv Detail & Related papers (2023-01-26T05:00:09Z) - TikTalk: A Video-Based Dialogue Dataset for Multi-Modal Chitchat in Real
World [97.58623810402563]
We introduce a new video-based multi-modal dialogue dataset, called TikTalk.
We collect 38K videos from a popular video-sharing platform, along with 367K conversations posted by users beneath them.
Users engage in spontaneous conversations based on their multi-modal experiences from watching videos, which helps recreate real-world chitchat context.
arXiv Detail & Related papers (2023-01-14T10:18:22Z) - Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion [89.01668641930206]
We present a framework for modeling interactional communication in dyadic conversations.
We autoregressively output multiple possibilities of corresponding listener motion.
Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions.
arXiv Detail & Related papers (2022-04-18T17:58:04Z) - Self-Supervised MultiModal Versatile Networks [76.19886740072808]
We learn representations using self-supervision by leveraging three modalities naturally present in videos: visual, audio and language streams.
We demonstrate how such networks trained on large collections of unlabelled video data can be applied on video, video-text, image and audio tasks.
arXiv Detail & Related papers (2020-06-29T17:50:23Z) - Multimodal Matching Transformer for Live Commenting [97.06576354830736]
Automatic live commenting aims to provide real-time comments on videos for viewers.
Recent work on this task adopts encoder-decoder models to generate comments.
We propose a multimodal matching transformer to capture the relationships among comments, vision, and audio.
arXiv Detail & Related papers (2020-02-07T07:19:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.