ShoulderShot: Generating Over-the-Shoulder Dialogue Videos
- URL: http://arxiv.org/abs/2508.07597v2
- Date: Fri, 15 Aug 2025 09:27:47 GMT
- Title: ShoulderShot: Generating Over-the-Shoulder Dialogue Videos
- Authors: Yuang Zhang, Junqi Cheng, Haoyu Zhao, Jiaxi Gu, Fangyuan Zou, Zenghui Lu, Peng Shu,
- Abstract summary: ShoulderShot is a framework that combines dual-shot generation with looping video, enabling extended dialogues while preserving character consistency. Our results demonstrate capabilities that surpass existing methods in terms of shot-reverse-shot layout, spatial continuity, and flexibility in dialogue length.
- Score: 10.699509921258564
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Over-the-shoulder dialogue videos are essential in films, short dramas, and advertisements, providing visual variety and enhancing viewers' emotional connection. Despite their importance, such dialogue scenes remain largely underexplored in video generation research. The main challenges include maintaining character consistency across different shots, creating a sense of spatial continuity, and generating long, multi-turn dialogues within limited computational budgets. Here, we present ShoulderShot, a framework that combines dual-shot generation with looping video, enabling extended dialogues while preserving character consistency. Our results demonstrate capabilities that surpass existing methods in terms of shot-reverse-shot layout, spatial continuity, and flexibility in dialogue length, thereby opening up new possibilities for practical dialogue video generation. Videos and comparisons are available at https://shouldershot.github.io.
Related papers
- The Script is All You Need: An Agentic Framework for Long-Horizon Dialogue-to-Cinematic Video Generation [95.18045807704284]
We introduce an end-to-end agentic framework for dialogue-to-cinematic-video generation. ScripterAgent is trained to translate coarse dialogue into a fine-grained, executable cinematic script. Our framework significantly improves script faithfulness and temporal fidelity across all tested video models.
arXiv Detail & Related papers (2026-01-25T08:10:28Z)
- HoloCine: Holistic Generation of Cinematic Multi-Shot Long Video Narratives [97.61653035827919]
HoloCine is a model that generates entire scenes holistically to ensure global consistency from the first shot to the last. Our architecture achieves precise directorial control through a Window Cross-Attention mechanism that localizes text prompts to specific shots. Our work marks a pivotal shift from clip synthesis towards automated filmmaking, making end-to-end cinematic creation a tangible future.
arXiv Detail & Related papers (2025-10-23T17:59:59Z)
- Action2Dialogue: Generating Character-Centric Narratives from Scene-Level Prompts [20.281732318265483]
We present a modular pipeline that transforms action-level prompts into visually and auditorily grounded narrative dialogue. Our method takes as input a pair of prompts per scene, where the first defines the setting and the second specifies a character's behavior. We render each utterance as expressive, character-conditioned speech, resulting in fully-voiced, multimodal video narratives.
arXiv Detail & Related papers (2025-05-22T15:54:42Z)
- MoCha: Towards Movie-Grade Talking Character Synthesis [62.007000023747445]
We introduce Talking Characters, a more realistic task to generate talking character animations directly from speech and text. Unlike talking head synthesis, Talking Characters aims at generating the full portrait of one or more characters beyond the facial region. We propose MoCha, the first of its kind to generate talking characters.
arXiv Detail & Related papers (2025-03-30T04:22:09Z)
- KwaiChat: A Large-Scale Video-Driven Multilingual Mixed-Type Dialogue Corpus [69.46707346122113]
We propose a novel task and create a human-to-human video-driven multilingual mixed-type dialogue corpus. The KwaiChat corpus contains a total of 93,209 videos and 246,080 dialogues, across 4 dialogue types, 30 domains, 4 languages, and 13 topics. An analysis of 7 distinct LLMs on KwaiChat reveals that GPT-4o achieves the best performance but still cannot perform well in this situation.
arXiv Detail & Related papers (2025-03-10T04:05:38Z)
- TV-Dialogue: Crafting Theme-Aware Video Dialogues with Immersive Interaction [25.851857218815415]
We introduce Theme-aware Video Dialogue Crafting (TVDC), a novel task aimed at generating new dialogues that align with video content and adhere to user-specified themes. TV-Dialogue is a novel multi-modal agent framework that ensures both theme alignment and visual consistency. Our findings underscore the potential of TV-Dialogue for various applications, such as video re-creation, film dubbing, and its use in downstream multimodal tasks.
arXiv Detail & Related papers (2025-01-31T08:04:32Z)
- Dialogue Director: Bridging the Gap in Dialogue Visualization for Multimodal Storytelling [15.410503589735699]
We propose Dialogue Visualization, a novel task that transforms dialogue scripts into dynamic, multi-view storyboards. We introduce Dialogue Director, a training-free multimodal framework comprising a Script Director, Cinematographer, and Storyboard Maker. Experimental results demonstrate that Dialogue Director outperforms state-of-the-art methods in script interpretation, physical world understanding, and cinematic principle application.
arXiv Detail & Related papers (2024-12-30T05:54:23Z)
- VideoGen-of-Thought: Step-by-step generating multi-shot video with minimal manual intervention [70.61101071902596]
Current video generation models excel at short clips but fail to produce cohesive multi-shot narratives due to disjointed visual dynamics and fractured storylines. We introduce VideoGen-of-Thought (VGoT), a step-by-step framework that automates multi-shot video synthesis from a single sentence. VGoT generates multi-shot videos that outperform state-of-the-art baselines by 20.4% in within-shot face consistency and 17.4% in style consistency.
arXiv Detail & Related papers (2024-12-03T08:33:50Z)
- Grounding is All You Need? Dual Temporal Grounding for Video Dialog [48.3411605700214]
This paper introduces the Dual Temporal Grounding-enhanced Video Dialog model (DTGVD).
It emphasizes dual temporal relationships by predicting dialog turn-specific temporal regions.
It also filters video content accordingly and grounds responses in both video and dialog contexts.
arXiv Detail & Related papers (2024-10-08T07:48:34Z)
- OpenViDial: A Large-Scale, Open-Domain Dialogue Dataset with Visual Contexts [35.57757367869986]
We release OpenViDial, a large-scale multi-module dialogue dataset.
OpenViDial contains a total number of 1.1 million dialogue turns.
We propose a family of encoder-decoder models leveraging both textual and visual contexts.
arXiv Detail & Related papers (2020-12-30T03:02:50Z) - Video-Grounded Dialogues with Pretrained Generation Language Models [88.15419265622748]
We leverage the power of pre-trained language models for improving video-grounded dialogue.
We propose a framework that formulates video-grounded dialogue tasks as a sequence-to-sequence task.
Our framework allows fine-tuning language models to capture dependencies across multiple modalities.
arXiv Detail & Related papers (2020-06-27T08:24:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.