Self-Paced and Self-Corrective Masked Prediction for Movie Trailer Generation
- URL: http://arxiv.org/abs/2512.04426v2
- Date: Tue, 09 Dec 2025 19:02:37 GMT
- Title: Self-Paced and Self-Corrective Masked Prediction for Movie Trailer Generation
- Authors: Sidan Zhu, Hongteng Xu, Dixin Luo
- Abstract summary: Currently, most existing automatic trailer generation methods employ a "selection-then-ranking" paradigm. We propose SSMP, which achieves state-of-the-art results in automatic trailer generation via bi-directional contextual modeling and progressive self-correction. Both quantitative results and user studies demonstrate the superiority of SSMP in comparison to existing automatic movie trailer generation methods.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As a challenging video editing task, movie trailer generation involves selecting and reorganizing movie shots to create engaging trailers. Currently, most existing automatic trailer generation methods employ a "selection-then-ranking" paradigm (i.e., first selecting key shots and then ranking them), which suffers from inevitable error propagation and limits the quality of the generated trailers. Beyond this paradigm, we propose a new self-paced and self-corrective masked prediction method called SSMP, which achieves state-of-the-art results in automatic trailer generation via bi-directional contextual modeling and progressive self-correction. In particular, SSMP trains a Transformer encoder that takes the movie shot sequences as prompts and generates corresponding trailer shot sequences accordingly. The model is trained via masked prediction, reconstructing each trailer shot sequence from its randomly masked counterpart. The mask ratio is self-paced, allowing the task difficulty to adapt to the model and thereby improving model performance. When generating a movie trailer, the model fills the shot positions with high confidence at each step and re-masks the remaining positions for the next prediction, forming a progressive self-correction mechanism that is analogous to how human editors work. Both quantitative results and user studies demonstrate the superiority of SSMP in comparison to existing automatic movie trailer generation methods. Demo is available at: https://github.com/Dixin-Lab/SSMP.
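The abstract describes a progressive self-correction loop at inference time: at each step the model fills the highest-confidence shot positions and re-masks the rest for the next prediction. A minimal sketch of that confidence-based iterative decoding pattern is shown below, using a toy stand-in for the model; the function names, the linear keep-schedule, and the toy predictor are illustrative assumptions, not SSMP's actual implementation.

```python
import numpy as np

MASK = -1  # sentinel for a masked (undecided) position

def iterative_masked_decoding(predict_fn, seq_len, num_steps=8):
    """Sketch of confidence-based progressive decoding: keep the most
    confident predictions each step, re-mask the rest for the next pass.
    (The real SSMP scheduling and model details may differ.)"""
    tokens = np.full(seq_len, MASK, dtype=int)  # start fully masked
    for step in range(num_steps):
        probs = predict_fn(tokens)              # (seq_len, vocab_size)
        preds = probs.argmax(axis=1)
        conf = probs.max(axis=1)
        # number of positions to leave fixed after this step (linear schedule)
        keep = int(np.ceil(seq_len * (step + 1) / num_steps))
        conf[tokens != MASK] = np.inf           # already-fixed slots stay fixed
        order = np.argsort(-conf)               # most confident first
        new_tokens = np.full(seq_len, MASK, dtype=int)
        for i in order[:keep]:
            new_tokens[i] = preds[i] if tokens[i] == MASK else tokens[i]
        tokens = new_tokens                     # rest is re-masked for next step
    return tokens

# Toy "model": at every position i it predicts token i % 5 with high confidence.
def toy_predict(tokens, vocab_size=5):
    probs = np.full((len(tokens), vocab_size), 0.01)
    for i in range(len(tokens)):
        probs[i, i % vocab_size] = 0.9
    return probs

out = iterative_masked_decoding(toy_predict, seq_len=6)
print(out)  # → [0 1 2 3 4 0]
```

With a real bidirectional encoder, each re-prediction step conditions on all positions fixed so far, which is what lets early low-confidence choices be revised, analogous to an editor refining a rough cut.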
Related papers
- Trailer Reimagined: An Innovative, Llm-DRiven, Expressive Automated Movie Summary framework (TRAILDREAMS) [0.41998444721319217]
TRAILDREAMS is a framework that uses a large language model (LLM) to automate the production of movie trailers. In comparative evaluations, TRAILDREAMS surpasses current state-of-the-art trailer generation methods in viewer ratings. However, it still falls short when compared to real, human-crafted trailers.
arXiv Detail & Related papers (2026-02-02T17:53:25Z) - Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion [67.94300151774085]
We introduce Self Forcing, a novel training paradigm for autoregressive video diffusion models. It addresses the longstanding issue of exposure bias, where models trained on ground-truth context must generate sequences conditioned on their own imperfect outputs.
arXiv Detail & Related papers (2025-06-09T17:59:55Z) - Movie Recommendation with Poster Attention via Multi-modal Transformer Feature Fusion [4.228539709089597]
This study proposes a multi-modal movie recommendation system that extracts features from the well-designed poster of each movie.
The efficiency of the proof-of-concept model is verified on the standard MovieLens 100K and 1M benchmark datasets.
arXiv Detail & Related papers (2024-07-12T10:44:51Z) - Towards Automated Movie Trailer Generation [98.9854474456265]
We introduce Trailer Generation Transformer (TGT), a deep-learning framework utilizing an encoder-decoder architecture.
The TGT movie encoder contextualizes each movie shot representation via self-attention, while the autoregressive trailer decoder predicts the feature representation of the next trailer shot.
Our TGT significantly outperforms previous methods on a comprehensive suite of metrics.
arXiv Detail & Related papers (2024-04-04T14:28:34Z) - Adversarial Pixel Restoration as a Pretext Task for Transferable Perturbations [54.1807206010136]
Transferable adversarial attacks optimize adversaries from a pretrained surrogate model and known label space to fool the unknown black-box models.
We propose Adversarial Pixel Restoration as a self-supervised alternative to train an effective surrogate model from scratch.
Our training approach is based on a min-max objective which reduces overfitting via an adversarial objective.
arXiv Detail & Related papers (2022-07-18T17:59:58Z) - Finding the Right Moment: Human-Assisted Trailer Creation via Task Composition [63.842627949509414]
We focus on finding trailer moments in a movie, i.e., shots that could potentially be included in a trailer. We model movies as graphs, where nodes are shots and edges denote semantic relations between them. An unsupervised algorithm then traverses the graph and selects trailer moments from the movie that human judges prefer to ones selected by competitive supervised approaches. Our tool allows users to select trailer shots in under 30 minutes that are superior to fully automatic methods and comparable to (exclusive) manual selection by experts.
arXiv Detail & Related papers (2021-11-16T20:50:52Z) - Conditional Temporal Variational AutoEncoder for Action Video Prediction [66.63038712306606]
ACT-VAE predicts pose sequences for an action clip from a single input image.
When connected with a plug-and-play Pose-to-Image (P2I) network, ACT-VAE can synthesize image sequences.
arXiv Detail & Related papers (2021-08-12T10:59:23Z) - Latent Variable Nested Set Transformers & AutoBots [25.194344543085005]
We propose a theoretical framework for this problem setting based on autoregressively modelling sequences of nested sets.
We present a new model architecture which employs multi-head self-attention blocks over sets of sets that serve as a form of social attention between the elements of the sets at every timestep.
We validate the Nested Set Transformer, which we refer to as "AutoBot", in autonomous driving settings, where we model the trajectory of an ego-agent based on sequential observations of key attributes of multiple agents in a scene.
arXiv Detail & Related papers (2021-02-19T18:53:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.