An Audio-centric Multi-task Learning Framework for Streaming Ads Targeting on Spotify
- URL: http://arxiv.org/abs/2506.18735v1
- Date: Mon, 23 Jun 2025 15:11:43 GMT
- Title: An Audio-centric Multi-task Learning Framework for Streaming Ads Targeting on Spotify
- Authors: Shivam Verma, Vivian Chen, Darren Mei
- Abstract summary: Spotify attracts over 675 million monthly active users who collectively consume millions of hours of music, podcasts, audiobooks, and video content. This diverse content consumption pattern introduces unique challenges for computational advertising. We introduce Cross-modal Adaptive Mixture-of-Experts (CAMoE), a novel framework for optimizing click-through rate (CTR) prediction in both audio-centric and multi-modal settings.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Spotify, a large-scale multimedia platform, attracts over 675 million monthly active users who collectively consume millions of hours of music, podcasts, audiobooks, and video content. This diverse content consumption pattern introduces unique challenges for computational advertising, which must effectively integrate a variety of ad modalities, including audio, video, and display, within a single user experience. Traditional ad recommendation models, primarily designed for foregrounded experiences, often struggle to reconcile the platform's inherent audio-centrality with the demands of optimizing ad performance across multiple formats and modalities. To overcome these challenges, we introduce Cross-modal Adaptive Mixture-of-Experts (CAMoE), a novel framework for optimizing click-through rate (CTR) prediction in both audio-centric and multi-modal settings. CAMoE enhances traditional mixture-of-experts models by incorporating modality-aware task grouping, adaptive loss masking, and deep-cross networks (DCN) to capture complex feature interactions within a multi-modal ad ecosystem. Through extensive ablation studies, we demonstrate that this approach achieves near Pareto-optimal performance across audio, video, and display ad formats, significantly improving AUC-PR compared to conventional single-task and content-based multi-task learning baselines. When deployed at scale on Spotify's ad serving platform, CAMoE delivered substantial gains, yielding a 14.5% increase in CTR for audio ads, a 1.3% increase for video ads, and a 4.8% reduction in expected cost-per-click (eCPC) for audio slots.
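The abstract names three architectural ingredients: shared experts with modality-aware task grouping, adaptive loss masking, and deep-cross network (DCN) layers. The paper's implementation is not reproduced here, so the following is a minimal sketch, assuming PyTorch and hypothetical names and dimensions (CAMoESketch, masked_ctr_loss, in_dim=128), of how such pieces might fit together: per-modality gates and towers over shared experts, DCN cross layers on the input features, and a loss in which each impression trains only the tower for its own ad modality.

```python
# Hypothetical sketch of a CAMoE-style multi-task CTR model; not the
# authors' implementation. Shared experts + per-modality gates/towers,
# DCN cross layers for feature interactions, adaptive loss masking.
import torch
import torch.nn as nn
import torch.nn.functional as F

MODALITIES = ["audio", "video", "display"]  # modality-aware task grouping

class CrossLayer(nn.Module):
    """One DCN cross layer: x_{l+1} = x_0 * (W x_l + b) + x_l."""
    def __init__(self, dim):
        super().__init__()
        self.w = nn.Linear(dim, dim)

    def forward(self, x0, xl):
        return x0 * self.w(xl) + xl

class CAMoESketch(nn.Module):
    def __init__(self, in_dim=128, expert_dim=64, n_experts=4):
        super().__init__()
        self.cross1 = CrossLayer(in_dim)
        self.cross2 = CrossLayer(in_dim)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(in_dim, expert_dim), nn.ReLU())
             for _ in range(n_experts)])
        # One softmax gate and one prediction tower per modality task.
        self.gates = nn.ModuleDict(
            {m: nn.Linear(in_dim, n_experts) for m in MODALITIES})
        self.towers = nn.ModuleDict(
            {m: nn.Linear(expert_dim, 1) for m in MODALITIES})

    def forward(self, x):
        x = self.cross2(x, self.cross1(x, x))  # explicit feature crosses
        e = torch.stack([ex(x) for ex in self.experts], dim=1)  # [B, E, D]
        logits = {}
        for m in MODALITIES:
            g = F.softmax(self.gates[m](x), dim=-1)    # [B, E] gate weights
            h = (g.unsqueeze(-1) * e).sum(dim=1)       # gated expert mixture
            logits[m] = self.towers[m](h).squeeze(-1)  # CTR logit per task
        return logits

def masked_ctr_loss(logits, clicks, modality_ids):
    """Adaptive loss masking: each impression trains only the tower
    for its own ad modality; the other towers receive no gradient."""
    loss = 0.0
    for i, m in enumerate(MODALITIES):
        mask = modality_ids == i
        if mask.any():
            loss = loss + F.binary_cross_entropy_with_logits(
                logits[m][mask], clicks[mask])
    return loss
```

In this reading, calling masked_ctr_loss(model(x), clicks, modality_ids) with modality_ids indexing MODALITIES keeps towers for formats absent from a batch out of the gradient, so a dominant format (such as audio on Spotify) does not drown out sparser ones.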
Related papers
- Cold-Starting Podcast Ads and Promotions with Multi-Task Learning on Spotify [2.204478225790133]
We present a unified multi-objective model for targeting both advertisements and promotions within the Spotify podcast ecosystem. Online A/B tests show up to a 22% reduction in effective Cost-Per-Stream. Our experience shows that a unified modeling strategy improves maintainability, cold-start performance, and coverage.
arXiv Detail & Related papers (2026-01-05T17:48:15Z) - Pushing the Frontier of Audiovisual Perception with Large-Scale Multimodal Correspondence Learning [44.518249924335045]
Perception Audiovisual (PE-AV) is a new family of encoders for audio and video understanding trained with scaled contrastive learning. Built on PE, PE-AV extends representations to audio and supports joint embeddings across audio-video, audio-text, and video-text modalities (a generic sketch of this style of pairwise contrastive objective appears after this list).
arXiv Detail & Related papers (2025-12-22T18:59:07Z) - AdSum: Two-stream Audio-visual Summarization for Automated Video Advertisement Clipping [6.340098119165037]
We introduce a framework for automated video ad clipping using video summarization techniques. We are the first to frame video clipping as a shot selection problem, tailored specifically for advertising. To address the lack of ad-specific datasets, we present AdSum204, a novel dataset comprising 102 pairs of 30-second and 15-second ads.
arXiv Detail & Related papers (2025-10-30T14:59:37Z) - SUMMA: A Multimodal Large Language Model for Advertisement Summarization [15.514886325064792]
We propose SUMMA, a model that processes video ads into summaries highlighting the content of highest commercial value. SUMMA is developed via a two-stage training strategy: multimodal supervised fine-tuning followed by reinforcement learning. Online experiments show a statistically significant 1.5% increase in advertising revenue.
arXiv Detail & Related papers (2025-08-28T09:19:53Z) - Exploring Audio Cues for Enhanced Test-Time Video Model Adaptation [46.29811604867483]
Test-time adaptation (TTA) aims to boost the generalization capability of a trained model by conducting self-/unsupervised learning during the testing phase. We propose a novel approach that incorporates audio information into video TTA. Our method capitalizes on the rich semantic content of audio to generate audio-assisted pseudo-labels.
arXiv Detail & Related papers (2025-06-14T12:44:58Z) - Semi-Supervised Audio-Visual Video Action Recognition with Audio Source Localization Guided Mixup [2.80888070977859]
We propose audio-visual SSL for video action recognition, which uses the visual and audio modalities together. In experiments on the UCF-51, Kinetics-400, and VGGSound datasets, the proposed framework demonstrates superior performance.
arXiv Detail & Related papers (2025-03-04T05:13:56Z) - Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners [69.70590867769408]
Video and audio content creation is a core technique for the movie industry and professional users.
Existing diffusion-based methods tackle video and audio generation separately, which hinders technique transfer from academia to industry.
In this work, we aim to fill this gap with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation.
arXiv Detail & Related papers (2024-02-27T17:57:04Z) - AdVerb: Visually Guided Audio Dereverberation [49.958724234969445]
We present AdVerb, a novel audio-visual dereverberation framework.
It uses visual cues in addition to the reverberant sound to estimate clean audio.
arXiv Detail & Related papers (2023-08-23T18:20:59Z) - AudioFormer: Audio Transformer learns audio feature representations from discrete acoustic codes [6.375996974877916]
We propose a method named AudioFormer, which learns audio feature representations through the acquisition of discrete acoustic codes.
Our results demonstrate that AudioFormer significantly outperforms prevailing monomodal audio classification models.
arXiv Detail & Related papers (2023-08-14T15:47:25Z) - Visually-Guided Sound Source Separation with Audio-Visual Predictive Coding [57.08832099075793]
Visually-guided sound source separation consists of three parts: visual feature extraction, multimodal feature fusion, and sound signal processing.
This paper presents audio-visual predictive coding (AVPC) to tackle this task in a parameter-harmonizing and more effective manner.
In addition, we develop a valid self-supervised learning strategy for AVPC via co-predicting two audio-visual representations of the same sound source.
arXiv Detail & Related papers (2023-06-19T03:10:57Z) - Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline [53.07236039168652]
We focus on the task of dense-localizing audio-visual events, which aims to jointly localize and recognize all audio-visual events occurring in an untrimmed video.
We introduce the first Untrimmed Audio-Visual dataset, which contains 10K untrimmed videos with over 30K audio-visual events.
Next, we formulate the task using a new learning-based framework that fully integrates audio and visual modalities to localize audio-visual events of various lengths and capture dependencies between them in a single pass.
arXiv Detail & Related papers (2023-03-22T22:00:17Z) - Accommodating Audio Modality in CLIP for Multimodal Processing [48.83906067348211]
We extend the Vision-Language model CLIP to accommodate the audio modality for Vision-Language-Audio multimodal processing.
Specifically, we apply inter-modal and intra-modal contrastive learning to explore the correlation between audio and other modalities.
Our proposed CLIP4VLA model is validated in different downstream tasks including video retrieval and video captioning.
arXiv Detail & Related papers (2023-03-12T06:57:01Z) - AVE-CLIP: AudioCLIP-based Multi-window Temporal Transformer for Audio Visual Event Localization [14.103742565510387]
We introduce AVE-CLIP, a novel framework that integrates the AudioCLIP pre-trained on large-scale audio-visual data with a multi-window temporal transformer.
Our method achieves state-of-the-art performance on the publicly available AVE dataset with a 5.9% mean accuracy improvement.
arXiv Detail & Related papers (2022-10-11T00:15:45Z) - Joint Learning of Visual-Audio Saliency Prediction and Sound Source Localization on Multi-face Videos [101.83513408195692]
We propose a multitask learning method for visual-audio saliency prediction and sound source localization on multi-face video.
The proposed method outperforms 12 state-of-the-art saliency prediction methods and achieves competitive results in sound source localization.
arXiv Detail & Related papers (2021-11-05T14:35:08Z) - A Multi-View Approach To Audio-Visual Speaker Verification [38.9710777250597]
In this study, we explore audio-visual approaches to speaker verification.
We report the lowest AV equal error rate (EER) of 0.7% on the VoxCeleb1 dataset.
This new approach achieves 28% EER on VoxCeleb1 in the challenging testing condition of cross-modal verification.
arXiv Detail & Related papers (2021-02-11T22:29:25Z)
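Two of the entries above (PE-AV and CLIP4VLA) rely on cross-modal contrastive learning. As a generic illustration, not either paper's implementation, the following sketch shows the symmetric InfoNCE objective commonly used for such pairwise modality alignment; the function name and temperature value are assumptions.

```python
# Generic CLIP-style symmetric contrastive (InfoNCE) loss between two
# modalities, e.g. audio and video; illustrative only.
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(z_a, z_b, temperature=0.07):
    """z_a, z_b: [B, D] embeddings of paired samples from two modalities;
    matching pairs share a row index."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature  # [B, B] similarity matrix
    targets = torch.arange(z_a.size(0), device=z_a.device)
    # Symmetric loss: each modality must retrieve its pair in the other.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```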
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality or accuracy of this information and is not responsible for any consequences of its use.