AdSum: Two-stream Audio-visual Summarization for Automated Video Advertisement Clipping
- URL: http://arxiv.org/abs/2510.26569v1
- Date: Thu, 30 Oct 2025 14:59:37 GMT
- Title: AdSum: Two-stream Audio-visual Summarization for Automated Video Advertisement Clipping
- Authors: Wen Xie, Yanjun Zhu, Gijs Overgoor, Yakov Bart, Agata Lapedriza Garcia, Sarah Ostadabbas,
- Abstract summary: We introduce a framework for automated video ad clipping using video summarization techniques. We are the first to frame video clipping as a shot selection problem, tailored specifically for advertising. To address the lack of ad-specific datasets, we present AdSum204, a novel dataset comprising 102 pairs of 30-second and 15-second ads.
- Score: 6.340098119165037
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Advertisers commonly need multiple versions of the same advertisement (ad) at varying durations for a single campaign. The traditional approach involves manually selecting and re-editing shots from longer video ads to create shorter versions, which is labor-intensive and time-consuming. In this paper, we introduce a framework for automated video ad clipping using video summarization techniques. We are the first to frame video clipping as a shot selection problem, tailored specifically for advertising. Unlike existing general video summarization methods that primarily focus on visual content, our approach emphasizes the critical role of audio in advertising. To achieve this, we develop a two-stream audio-visual fusion model that predicts the importance of video frames, where importance is defined as the likelihood of a frame being selected in the firm-produced short ad. To address the lack of ad-specific datasets, we present AdSum204, a novel dataset comprising 102 pairs of 30-second and 15-second ads from real advertising campaigns. Extensive experiments demonstrate that our model outperforms state-of-the-art methods across various metrics, including Average Precision, Area Under Curve, and Spearman's and Kendall's rank correlations.
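To make the abstract's idea concrete, here is a minimal NumPy sketch of a two-stream late-fusion scorer that assigns each frame an importance in (0, 1). This is not the authors' implementation: the feature dimensions, the tanh projections, the concatenation-based fusion, and the shot-budget selection at the end are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def two_stream_importance(visual_feats, audio_feats, w_v, w_a, w_out, b_out):
    """Score each frame by fusing a visual and an audio stream (late fusion).

    visual_feats: (T, d_v) per-frame visual features
    audio_feats:  (T, d_a) per-frame audio features
    Returns:      (T,) per-frame importance scores in (0, 1)
    """
    h_v = np.tanh(visual_feats @ w_v)           # project visual stream -> (T, d_h)
    h_a = np.tanh(audio_feats @ w_a)            # project audio stream  -> (T, d_h)
    fused = np.concatenate([h_v, h_a], axis=1)  # concatenate streams   -> (T, 2*d_h)
    logits = fused @ w_out + b_out              # linear head           -> (T,)
    return 1.0 / (1.0 + np.exp(-logits))        # sigmoid: likelihood of selection

# Illustrative sizes: 120 frames, assumed feature dimensions.
T, d_v, d_a, d_h = 120, 512, 128, 64
visual = rng.standard_normal((T, d_v))
audio = rng.standard_normal((T, d_a))
w_v = rng.standard_normal((d_v, d_h)) * 0.01
w_a = rng.standard_normal((d_a, d_h)) * 0.01
w_out = rng.standard_normal(2 * d_h) * 0.01
b_out = 0.0

scores = two_stream_importance(visual, audio, w_v, w_a, w_out, b_out)
# Clipping step (sketch): keep the highest-scoring frames until the short-ad
# duration budget is met, e.g. half the frames for a 30s -> 15s cut.
top_frames = np.argsort(scores)[::-1][: T // 2]
```

In practice the weights would be learned from frame-level selection labels (frames that appear in the firm-produced 15-second cut), and selection would operate on shots rather than individual frames.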
Related papers
- Decoding the Hook: A Multimodal LLM Framework for Analyzing the Hooking Period of Video Ads [9.34170961508317]
Video ads are a vital medium for brands to engage consumers, with social media platforms leveraging user data to optimize ad delivery and boost engagement. A crucial but under-explored aspect is the 'hooking period', the first three seconds that capture viewer attention and influence engagement metrics. This study presents a framework using transformer-based multimodal large language models (MLLMs) to analyze the hooking period of video ads.
arXiv Detail & Related papers (2026-02-25T18:24:06Z)
- SUMMA: A Multimodal Large Language Model for Advertisement Summarization [15.514886325064792]
We propose SUMMA, a model that processes video ads into summaries highlighting the content of highest commercial value. SUMMA is developed via a two-stage training strategy: multimodal supervised fine-tuning followed by reinforcement learning. Online experiments show a statistically significant 1.5% increase in advertising revenue.
arXiv Detail & Related papers (2025-08-28T09:19:53Z)
- Subject-driven Video Generation via Disentangled Identity and Motion [52.54835936914813]
We propose training a subject-driven customized video generation model by decoupling subject-specific learning from temporal dynamics, in a zero-shot setting without additional tuning. Our method achieves strong subject consistency and scalability, outperforming existing video customization models in zero-shot settings.
arXiv Detail & Related papers (2025-04-23T06:48:31Z)
- Shot-by-Shot: Film-Grammar-Aware Training-Free Audio Description Generation [110.79299467093006]
We propose a two-stage framework that leverages "shots" as the fundamental units of video understanding. This includes extending temporal context to neighbouring shots and incorporating film grammar devices, such as shot scales and thread structures. Our method is compatible with both open-source and proprietary Visual-Language Models.
arXiv Detail & Related papers (2025-04-01T17:59:57Z)
- CTR-Driven Advertising Image Generation with Multimodal Large Language Models [53.40005544344148]
We explore the use of Multimodal Large Language Models (MLLMs) for generating advertising images by optimizing for Click-Through Rate (CTR) as the primary objective. To further improve the CTR of generated images, we propose a novel reward model to fine-tune pre-trained MLLMs through Reinforcement Learning (RL). Our method achieves state-of-the-art performance in both online and offline metrics.
arXiv Detail & Related papers (2025-02-05T09:06:02Z)
- Multi-subject Open-set Personalization in Video Generation [110.02124633005516]
We present Video Alchemist, a video model with built-in multi-subject, open-set personalization capabilities. Our model is built on a new Diffusion Transformer module that fuses each conditional reference image and its corresponding subject-level text prompt. Our method significantly outperforms existing personalization methods in both quantitative and qualitative evaluations.
arXiv Detail & Related papers (2025-01-10T18:59:54Z)
- Long-Term Ad Memorability: Understanding & Generating Memorable Ads [54.23854539909078]
Despite the importance of long-term memory in marketing and brand building, until now, there has been no large-scale study on the memorability of ads. We release the first memorability dataset, LAMBDA, consisting of 1749 participants and 2205 ads covering 276 brands. Running statistical tests over different participant subpopulations and ad types, we find many interesting insights into what makes an ad memorable, e.g., fast-moving ads are more memorable than those with slower scenes. We present a scalable method to build a high-quality memorable ad generation model by leveraging automatically annotated data.
arXiv Detail & Related papers (2023-09-01T10:27:04Z)
- Multi-modal Representation Learning for Video Advertisement Content Structuring [10.45050088240847]
Video advertisement content structuring aims to segment a given video advertisement and label each segment on various dimensions.
Video advertisements contain rich and useful multi-modal content, such as captions and speech.
We propose a multi-modal encoder that learns multi-modal representations from video advertisements through interaction between the video-audio and text modalities.
arXiv Detail & Related papers (2021-09-04T09:08:29Z)
- A Multimodal Framework for Video Ads Understanding [64.70769354696019]
We develop a multimodal system to improve the structured analysis of advertising video content.
Our solution achieved a score of 0.2470, which accounts for both localization and prediction accuracy, ranking fourth on the 2021 TAAC final leaderboard.
arXiv Detail & Related papers (2021-08-29T16:06:00Z)
- Predicting Online Video Advertising Effects with Multimodal Deep Learning [33.20913249848369]
We propose a method for predicting the click-through rate (CTR) of video advertisements and analyzing the factors that determine the CTR.
In this paper, we demonstrate an optimized framework for accurately predicting these effects by taking advantage of the multimodal nature of online video advertisements.
arXiv Detail & Related papers (2020-12-22T06:24:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.