BrandFusion: A Multi-Agent Framework for Seamless Brand Integration in Text-to-Video Generation
- URL: http://arxiv.org/abs/2603.02816v1
- Date: Tue, 03 Mar 2026 10:10:41 GMT
- Title: BrandFusion: A Multi-Agent Framework for Seamless Brand Integration in Text-to-Video Generation
- Authors: Zihao Zhu, Ruotong Wang, Siwei Lyu, Min Zhang, Baoyuan Wu
- Abstract summary: We introduce the task of seamless brand integration in text-to-video (T2V) models. This task confronts three core challenges: maintaining prompt fidelity, ensuring brand recognizability, and achieving contextually natural integration. We propose BrandFusion, a novel multi-agent framework comprising two synergistic phases.
- Score: 64.5799743375449
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid advancement of text-to-video (T2V) models has revolutionized content creation, yet their commercial potential remains largely untapped. We introduce, for the first time, the task of seamless brand integration in T2V: automatically embedding advertiser brands into prompt-generated videos while preserving semantic fidelity to user intent. This task confronts three core challenges: maintaining prompt fidelity, ensuring brand recognizability, and achieving contextually natural integration. To address them, we propose BrandFusion, a novel multi-agent framework comprising two synergistic phases. In the offline phase (advertiser-facing), we construct a Brand Knowledge Base by probing model priors and adapting to novel brands via lightweight fine-tuning. In the online phase (user-facing), five agents jointly refine user prompts through iterative refinement, leveraging the shared knowledge base and real-time contextual tracking to ensure brand visibility and semantic alignment. Experiments on 18 established and 2 custom brands across multiple state-of-the-art T2V models demonstrate that BrandFusion significantly outperforms baselines in semantic preservation, brand recognizability, and integration naturalness. Human evaluations further confirm higher user satisfaction, establishing a practical pathway for sustainable T2V monetization.
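The online phase described in the abstract, in which cooperating agents iteratively refine a user prompt against a shared brand knowledge base, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the agent roles (`integration_agent`, `critic_agent`), the `BRAND_KB` contents, and the scoring heuristic are all hypothetical stand-ins.

```python
# Hypothetical sketch of an iterative prompt-refinement loop in the style of
# BrandFusion's online phase. Agent roles and scoring are illustrative only.

# Offline-phase stand-in: a tiny "Brand Knowledge Base" mapping each brand
# to a visual descriptor the T2V model is assumed to understand.
BRAND_KB = {
    "AcmeCola": "a red AcmeCola can with a white wave logo",
}

def integration_agent(prompt: str, brand: str) -> str:
    """Weave the brand descriptor into the prompt if the brand is missing."""
    descriptor = BRAND_KB[brand]
    if brand.lower() not in prompt.lower():
        return f"{prompt}, featuring {descriptor} placed naturally in the scene"
    return prompt

def critic_agent(original: str, refined: str, brand: str) -> float:
    """Score brand visibility plus lexical overlap with the user's intent."""
    visibility = 1.0 if brand.lower() in refined.lower() else 0.0
    orig_words = set(original.lower().split())
    kept = sum(1 for w in orig_words if w in refined.lower())
    fidelity = kept / max(len(orig_words), 1)
    return 0.5 * visibility + 0.5 * fidelity

def refine(prompt: str, brand: str, rounds: int = 3, threshold: float = 0.9) -> str:
    """Iterate until the critic is satisfied or the round budget runs out."""
    refined = prompt
    for _ in range(rounds):
        refined = integration_agent(refined, brand)
        if critic_agent(prompt, refined, brand) >= threshold:
            break
    return refined
```

For example, `refine("a surfer riding a wave at sunset", "AcmeCola")` yields a prompt that mentions the brand descriptor while keeping the original wording intact; a prompt that already names the brand passes through unchanged.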
Related papers
- Factorized Video Generation: Decoupling Scene Construction and Temporal Synthesis in Text-to-Video Diffusion Models [76.7535001311919]
State-of-the-art Text-to-Video (T2V) diffusion models can generate visually impressive results, yet they often fail to compose complex scenes or follow logical temporal instructions. We introduce Factorized Video Generation (FVG), a pipeline that decouples these tasks by decomposing Text-to-Video generation into three specialized stages. Our approach sets a new state-of-the-art on the T2V CompBench benchmark and significantly improves all tested models on VBench2.
arXiv Detail & Related papers (2025-12-18T10:10:45Z) - From Unlearning to UNBRANDING: A Benchmark for Trademark-Safe Text-to-Image Generation [0.7798283447125206]
Brand recognition is multi-dimensional, extending beyond explicit logos to encompass distinctive structural features. We introduce unbranding, a novel task for the fine-grained removal of both trademarks and subtle structural brand features. Our results, validated by our Vision-Language Model (VLM) metric, confirm unbranding is a distinct, practically relevant problem.
arXiv Detail & Related papers (2025-12-15T23:15:36Z) - CIDER: A Causal Cure for Brand-Obsessed Text-to-Image Models [8.256738887166089]
Text-to-image (T2I) models exhibit a significant yet under-explored "brand bias". We propose CIDER, a model-agnostic framework that mitigates this bias at inference time through prompt refinement, avoiding costly retraining.
arXiv Detail & Related papers (2025-09-19T09:30:37Z) - BiMark: Unbiased Multilayer Watermarking for Large Language Models [68.64050157343334]
We propose BiMark, a novel watermarking framework that balances text quality preservation and message embedding capacity. BiMark achieves up to 30% higher extraction rates for short texts while maintaining text quality indicated by lower perplexity.
arXiv Detail & Related papers (2025-06-19T11:08:59Z) - SkyReels-A2: Compose Anything in Video Diffusion Transformers [27.324119455991926]
This paper presents SkyReels-A2, a controllable video generation framework capable of assembling arbitrary visual elements into synthesized videos. We term this task elements-to-video (E2V); its primary challenges lie in preserving the fidelity of each reference element, ensuring coherent composition of the scene, and achieving natural outputs. We propose a novel image-text joint embedding model to inject multi-element representations into the generative process, balancing element-specific consistency with global coherence and text alignment.
arXiv Detail & Related papers (2025-04-03T09:50:50Z) - LogoSticker: Inserting Logos into Diffusion Models for Customized Generation [73.59571559978278]
We introduce the task of logo insertion into text-to-image models.
Our goal is to insert logo identities into diffusion models and enable their seamless synthesis in varied contexts.
We present a novel two-phase pipeline LogoSticker to tackle this task.
arXiv Detail & Related papers (2024-07-18T17:54:49Z) - WorldGPT: A Sora-Inspired Video AI Agent as Rich World Models from Text and Image Inputs [53.21307319844615]
We present an innovative video generation AI agent that leverages Sora-inspired multimodal learning to build a capable world-model framework.
The framework includes two parts: prompt enhancer and full video translation.
arXiv Detail & Related papers (2024-03-10T16:09:02Z) - Mini-DALLE3: Interactive Text to Image by Prompting Large Language Models [71.49054220807983]
A prevalent limitation persists in effectively communicating with T2I models, such as Stable Diffusion, using natural language descriptions.
Inspired by the recently released DALLE3, we revisit existing T2I systems' efforts to align with human intent and introduce a new task: interactive text to image (iT2I).
We present a simple approach that augments LLMs for iT2I with prompting techniques and off-the-shelf T2I models.
arXiv Detail & Related papers (2023-10-11T16:53:40Z) - The Open Brands Dataset: Unified brand detection and recognition at scale [33.624955564405425]
"Open Brands" is the largest dataset for brand detection and recognition with rich annotations.
"Brand Net" is a network called "Brand Net" to handle brand recognition.
arXiv Detail & Related papers (2020-12-14T09:06:42Z) - An Integrated Approach for Improving Brand Consistency of Web Content: Modeling, Analysis and Recommendation [27.312543486663536]
We collect around 300K web pages from around 650 companies.
We develop trait-specific classification models by considering the linguistic features of the content.
We then develop a sentence ranking system that outputs the top three sentences that need to be changed for making a web article more consistent with the company's brand personality.
arXiv Detail & Related papers (2020-11-19T10:18:47Z)
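The sentence-ranking step in the entry above can be approximated as follows. This is a toy sketch, not the paper's system: the keyword-based `trait_score` is a hypothetical stand-in for the trait-specific classifiers the authors train, and the trait keyword lists are invented for illustration.

```python
# Toy sketch of a brand-consistency sentence ranker: score each sentence
# against a target brand-personality trait and surface the three least
# consistent sentences as rewrite candidates.

TRAIT_KEYWORDS = {
    "sincerity": {"honest", "genuine", "friendly", "wholesome"},
    "excitement": {"bold", "daring", "innovative", "cutting-edge"},
}

def trait_score(sentence: str, trait: str) -> float:
    """Fraction of trait keywords in the sentence (stand-in classifier)."""
    words = set(sentence.lower().replace(".", "").split())
    keywords = TRAIT_KEYWORDS[trait]
    return len(words & keywords) / len(keywords)

def rank_rewrite_candidates(sentences, trait, k=3):
    """Return the k sentences least consistent with the target trait."""
    scored = sorted(sentences, key=lambda s: trait_score(s, trait))
    return scored[:k]
```

Given a page's sentences and a target trait such as `"sincerity"`, the lowest-scoring three sentences are returned as the ones most in need of rewriting.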
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.