TalkCuts: A Large-Scale Dataset for Multi-Shot Human Speech Video Generation
- URL: http://arxiv.org/abs/2510.07249v2
- Date: Mon, 13 Oct 2025 02:46:39 GMT
- Title: TalkCuts: A Large-Scale Dataset for Multi-Shot Human Speech Video Generation
- Authors: Jiaben Chen, Zixin Wang, Ailing Zeng, Yang Fu, Xueyang Yu, Siyuan Cen, Julian Tanke, Yihang Chen, Koichi Saito, Yuki Mitsufuji, Chuang Gan
- Abstract summary: We present TalkCuts, a large-scale dataset designed to facilitate the study of multi-shot human speech video generation. TalkCuts offers 164k clips totaling over 500 hours of high-quality human speech videos with diverse camera shots, including close-up, half-body, and full-body views. The dataset includes detailed textual descriptions, 2D keypoints and 3D SMPL-X motion annotations, covering over 10k identities, enabling multimodal learning and evaluation.
- Score: 76.48551690189406
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this work, we present TalkCuts, a large-scale dataset designed to facilitate the study of multi-shot human speech video generation. Unlike existing datasets that focus on single-shot, static viewpoints, TalkCuts offers 164k clips totaling over 500 hours of high-quality human speech videos with diverse camera shots, including close-up, half-body, and full-body views. The dataset includes detailed textual descriptions, 2D keypoints and 3D SMPL-X motion annotations, covering over 10k identities, enabling multimodal learning and evaluation. As a first attempt to showcase the value of the dataset, we present Orator, an LLM-guided multi-modal generation framework as a simple baseline, where the language model functions as a multi-faceted director, orchestrating detailed specifications for camera transitions, speaker gesticulations, and vocal modulation. This architecture enables the synthesis of coherent long-form videos through our integrated multi-modal video generation module. Extensive experiments in both pose-guided and audio-driven settings show that training on TalkCuts significantly enhances the cinematographic coherence and visual appeal of generated multi-shot speech videos. We believe TalkCuts provides a strong foundation for future work in controllable, multi-shot speech video generation and broader multimodal learning.
Related papers
- Multi-human Interactive Talking Dataset [20.920129008402718]
We introduce MIT, a large-scale dataset specifically designed for multi-human talking video generation. The resulting dataset comprises 12 hours of high-resolution footage, each featuring two to four speakers. It captures natural conversational dynamics in multi-speaker scenarios, offering a rich resource for studying interactive visual behaviors.
arXiv Detail & Related papers (2025-08-05T03:54:18Z)
- Let Them Talk: Audio-Driven Multi-Person Conversational Video Generation [34.15566431966277]
We propose a novel task: Multi-Person Conversational Video Generation. We introduce a new framework, MultiTalk, to address the challenges during multi-person generation.
arXiv Detail & Related papers (2025-05-28T17:57:06Z)
- MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions [69.9122231800796]
We present MMTrail, a large-scale multi-modality video-language dataset incorporating more than 20M trailer clips with visual captions. We propose a systemic captioning framework, achieving various modality annotations with more than 27.1k hours of trailer videos. Our dataset potentially paves the path for fine-grained large multimodal-language model training.
arXiv Detail & Related papers (2024-07-30T16:43:24Z)
- InternVideo2: Scaling Foundation Models for Multimodal Video Understanding [51.129913789991924]
InternVideo2 is a new family of video foundation models (FM) that achieve state-of-the-art results in video recognition, video-speech tasks, and video-centric tasks.
Our core design is a progressive training approach that unifies masked video modeling, cross-modal contrastive learning, and next token prediction, scaling the video encoder up to 6B parameters.
arXiv Detail & Related papers (2024-03-22T17:57:42Z)
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling video dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is capable of both comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z)
- VideoPoet: A Large Language Model for Zero-Shot Video Generation [78.57171527944774]
VideoPoet is a language model capable of synthesizing high-quality video with matching audio.
VideoPoet employs a decoder-only transformer architecture that processes multimodal inputs.
arXiv Detail & Related papers (2023-12-21T18:46:41Z)
- InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation [90.71796406228265]
InternVid is a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations.
The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions totaling 4.1B words.
arXiv Detail & Related papers (2023-07-13T17:58:32Z)
- Robust One Shot Audio to Video Generation [10.957973845883162]
OneShotA2V is a novel approach to synthesize a talking person video of arbitrary length using as input: an audio signal and a single unseen image of a person.
OneShotA2V leverages curriculum learning to learn movements of expressive facial components and hence generates a high-quality talking-head video of the given person.
arXiv Detail & Related papers (2020-12-14T10:50:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.