AD-MIR: Bridging the Gap from Perception to Persuasion in Advertising Video Understanding via Structured Reasoning
- URL: http://arxiv.org/abs/2602.07625v1
- Date: Sat, 07 Feb 2026 17:14:06 GMT
- Title: AD-MIR: Bridging the Gap from Perception to Persuasion in Advertising Video Understanding via Structured Reasoning
- Authors: Binxiao Xu, Junyu Feng, Xiaopeng Lin, Haodong Li, Zhiyuan Feng, Bohan Zeng, Shaolin Lu, Ming Lu, Qi She, Wentao Zhang
- Abstract summary: We introduce AD-MIR, a framework designed to decode advertising intent via a two-stage architecture. The Structured Reasoning Agent mimics a marketing expert through an iterative inquiry loop, decomposing the narrative to deduce implicit persuasion tactics. It achieves state-of-the-art performance, surpassing the strongest general-purpose agent, DVD, by 1.8% in strict and 9.5% in relaxed accuracy.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal understanding of advertising videos is essential for interpreting the intricate relationship between visual storytelling and abstract persuasion strategies. However, despite excelling at general search, existing agents often struggle to bridge the cognitive gap between pixel-level perception and high-level marketing logic. To address this challenge, we introduce AD-MIR, a framework designed to decode advertising intent via a two-stage architecture. First, in the Structure-Aware Memory Construction phase, the system converts raw video into a structured database by integrating semantic retrieval with exact keyword matching. This approach prioritizes fine-grained brand details (e.g., logos, on-screen text) while dynamically filtering out irrelevant background noise to isolate key protagonists. Second, the Structured Reasoning Agent mimics a marketing expert through an iterative inquiry loop, decomposing the narrative to deduce implicit persuasion tactics. Crucially, it employs an evidence-based self-correction mechanism that rigorously validates these insights against specific video frames, automatically backtracking when visual support is lacking. Evaluation on the AdsQA benchmark demonstrates that AD-MIR achieves state-of-the-art performance, surpassing the strongest general-purpose agent, DVD, by 1.8% in strict and 9.5% in relaxed accuracy. These results underscore that effective advertising understanding demands explicitly grounding abstract marketing strategies in pixel-level evidence. The code is available at https://github.com/Little-Fridge/AD-MIR.
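The abstract describes the Structure-Aware Memory Construction phase only in prose. The sketch below illustrates its central idea, blending semantic retrieval with exact keyword matching so that fine-grained brand details are not drowned out by background content. Everything here (the bag-of-words embed, the record fields, hybrid_retrieve) is an illustrative assumption, not the released implementation linked above.

```python
from collections import Counter
from math import sqrt

def embed(text):
    # Toy bag-of-words stand-in for a neural text/vision encoder.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(v * b[t] for t, v in a.items())
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def hybrid_retrieve(memory, query, brand_terms, k=3, alpha=0.5):
    # Blend semantic similarity with an exact-keyword bonus so that
    # fine-grained brand details (logos, OCR text) rank above background.
    q = embed(query)
    def score(rec):
        semantic = cosine(q, embed(rec["caption"]))
        keyword = sum(t.lower() in rec["ocr_text"].lower() for t in brand_terms)
        return alpha * semantic + (1 - alpha) * keyword
    return sorted(memory, key=score, reverse=True)[:k]

memory = [
    {"frame": 12, "caption": "runner ties shoes at dawn", "ocr_text": "JUST DO IT"},
    {"frame": 48, "caption": "crowd cheers at the finish line", "ocr_text": ""},
]
print(hybrid_retrieve(memory, "which slogan appears on screen?", ["just do it"], k=1))
```

The design point the sketch preserves is the weighted blend: purely semantic search can miss an exact slogan string, while the keyword bonus pins retrieval to the on-screen text that the reasoning agent later cites as evidence.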
Related papers
- Decoding the Hook: A Multimodal LLM Framework for Analyzing the Hooking Period of Video Ads [9.34170961508317]
Video ads are a vital medium for brands to engage consumers, with social media platforms leveraging user data to optimize ad delivery and boost engagement. A crucial but under-explored aspect is the 'hooking period', the first three seconds that capture viewer attention and influence engagement metrics. This study presents a framework using transformer-based multimodal large language models (MLLMs) to analyze the hooking period of video ads.
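The study's preprocessing is not reproduced in this listing; below is a rough sketch of isolating the three-second hooking period before handing frames to an MLLM, assuming OpenCV for decoding. The helper name hooking_period_frames and the sampling stride are invented for illustration.

```python
import cv2  # pip install opencv-python

def hooking_period_frames(video_path, seconds=3.0, stride=5):
    # Decode only the opening seconds of the ad and subsample frames;
    # these are what would be fed to the MLLM for hook analysis.
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if FPS is unknown
    frames = []
    for i in range(int(fps * seconds)):
        ok, frame = cap.read()
        if not ok:
            break
        if i % stride == 0:
            frames.append(frame)
    cap.release()
    return frames

# frames = hooking_period_frames("ad.mp4")  # 3 s at 30 fps, every 5th frame
```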
arXiv Detail & Related papers (2026-02-25T18:24:06Z)
- Video-BrowseComp: Benchmarking Agentic Video Research on Open Web [64.53060049124961]
Video-BrowseComp is a benchmark comprising 210 questions tailored for open-web agentic video reasoning. It enforces a mandatory dependency on temporal visual evidence, ensuring answers cannot be derived solely through text search. As the first open-web video research benchmark, Video-BrowseComp advances the field beyond passive perception toward proactive video reasoning.
arXiv Detail & Related papers (2025-12-28T19:08:27Z)
- ImplicitQA: Going beyond frames towards Implicit Video Reasoning [39.63171940350552]
ImplicitQA is a novel benchmark designed to test VideoQA models on human-like implicit reasoning. ImplicitQA comprises 1K meticulously annotated QA pairs drawn from 1K high-quality creative video clips.
arXiv Detail & Related papers (2025-06-26T19:53:54Z)
- Logic-in-Frames: Dynamic Keyframe Search via Visual Semantic-Logical Verification for Long Video Understanding [23.022070084937603]
We introduce a semantics-driven search framework that reformulates keyframe selection under the paradigm of Visual Semantic-Logical Search. Our method establishes new state-of-the-art performance in keyframe-selection metrics on the manually annotated benchmark.
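As a rough illustration of the semantic-logical idea (not the paper's actual verifier), the toy below decomposes a query into atomic predicates and greedily keeps frames until each predicate is confirmed; keyword containment in satisfies stands in for a real vision-language check.

```python
def satisfies(caption, predicate):
    # Keyword containment stands in for a vision-language verifier.
    return predicate.lower() in caption.lower()

def select_keyframes(frames, predicates):
    # Greedy cover: keep taking frames until every atomic predicate
    # decomposed from the query is visually verified at least once.
    remaining, chosen = set(predicates), []
    for idx, caption in frames:
        hits = {p for p in remaining if satisfies(caption, p)}
        if hits:
            chosen.append(idx)
            remaining -= hits
        if not remaining:
            break
    return chosen, remaining  # leftover predicates would trigger more search

frames = [(0, "a chef chops onions"), (40, "the chef plates the dish"),
          (80, "diners applaud")]
print(select_keyframes(frames, ["chops onions", "plates the dish"]))
# -> ([0, 40], set())
```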
arXiv Detail & Related papers (2025-03-17T13:07:34Z)
- QuoTA: Query-oriented Token Assignment via CoT Query Decouple for Long Video Comprehension [86.0749609778104]
We propose QuoTA, an ante-hoc, training-free module that extends existing large video-language models. QuoTA strategically allocates frame-level importance scores based on query relevance. We decouple the query through Chain-of-Thought reasoning to facilitate more precise LVLM-based frame importance scoring.
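The proportional assignment can be pictured with the toy allocator below; the function name, the per-frame floor, and the example scores are assumptions, since QuoTA's real scores come from LVLM-based relevance after Chain-of-Thought query decoupling.

```python
def allocate_tokens(relevance, budget, floor=1):
    # Split the visual-token budget across frames in proportion to
    # query-relevance scores, guaranteeing each frame a minimum share.
    total = sum(relevance) or 1.0
    shares = [max(floor, round(budget * r / total)) for r in relevance]
    while sum(shares) > budget:  # trim rounding/floor overshoot
        shares[shares.index(max(shares))] -= 1
    return shares

scores = [0.05, 0.7, 0.2, 0.05]  # e.g., per-frame query relevance
print(allocate_tokens(scores, budget=64))  # -> [3, 45, 13, 3]
```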
arXiv Detail & Related papers (2025-03-11T17:59:57Z)
- From Objects to Events: Unlocking Complex Visual Understanding in Object Detectors via LLM-guided Symbolic Reasoning [71.41062111470414]
Current object detectors excel at entity localization and classification, yet exhibit inherent limitations in event recognition capabilities. We present a novel framework that expands the capability of standard object detectors beyond mere object recognition to complex event understanding. Our key innovation lies in bridging the semantic gap between object detection and event understanding without requiring expensive task-specific training.
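A minimal sketch of the detections-to-events idea follows, with hand-written rules standing in for the LLM-guided symbolic reasoning; the fact schema and the shot_on_goal rule are invented for illustration.

```python
# (label, frame_index) pairs from a standard object detector.
detections = [("person", 10), ("ball", 10), ("goal", 12), ("person", 12)]

def to_facts(dets, window=3):
    # Lift raw detections into symbolic co-occurrence facts.
    facts = set()
    for a, ta in dets:
        for b, tb in dets:
            if a != b and abs(ta - tb) <= window:
                facts.add(("near_in_time", a, b))
    return facts

# Hand-written rules stand in for the LLM-guided reasoning step.
RULES = {
    "shot_on_goal": {("near_in_time", "person", "ball"),
                     ("near_in_time", "ball", "goal")},
}

facts = to_facts(detections)
print([event for event, required in RULES.items() if required <= facts])
# -> ['shot_on_goal']
```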
arXiv Detail & Related papers (2025-02-09T10:30:54Z)
- HCMRM: A High-Consistency Multimodal Relevance Model for Search Ads [10.61722566941537]
This paper focuses on improving query-to-video relevance matching to enhance the effectiveness of ranking in ad systems. It utilizes a simple yet effective method to enhance the consistency between pre-training and relevance tasks. The proposed method has been deployed in the Kuaishou search advertising system for over a year, contributing to a 6.1% reduction in the proportion of irrelevant ads and a 1.4% increase in ad revenue.
arXiv Detail & Related papers (2025-02-09T09:07:11Z)
- DistinctAD: Distinctive Audio Description Generation in Contexts [62.58375366359421]
We propose DistinctAD, a framework for generating Audio Descriptions that emphasize distinctiveness to produce better narratives. To address the domain gap, we introduce a CLIP-AD adaptation strategy that does not require additional AD corpora. In Stage-II, DistinctAD incorporates two key innovations: (i) a Contextual Expectation-Maximization Attention (EMA) module that reduces redundancy by extracting common bases from consecutive video clips, and (ii) an explicit distinctive word prediction loss that filters out repeated words in the context.
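The paper defines the distinctive word prediction loss over model outputs during training; the toy below captures only its intuition, measuring how much of a candidate description repeats the vocabulary of neighbouring clips. All names and data are illustrative.

```python
def repetition_penalty(candidate, context_ads):
    # Fraction of candidate words already used in neighbouring clips'
    # descriptions; a training loss would down-weight such repeats.
    context_vocab = {w for ad in context_ads for w in ad.lower().split()}
    words = candidate.lower().split()
    return sum(w in context_vocab for w in words) / max(len(words), 1)

context = ["a man walks into a dark room", "the man looks around the room"]
print(repetition_penalty("the man opens a hidden door", context))  # -> 0.5
```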
arXiv Detail & Related papers (2024-11-27T09:54:59Z)
- Prompting Video-Language Foundation Models with Domain-specific Fine-grained Heuristics for Video Question Answering [71.62961521518731]
HeurVidQA is a framework that leverages domain-specific entity-actions to refine pre-trained video-language foundation models.
Our approach treats these models as implicit knowledge engines, employing domain-specific entity-action prompters to direct the model's focus toward precise cues that enhance reasoning.
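A hypothetical illustration of an entity-action prompter follows; the domains, word lists, and template are invented, not HeurVidQA's actual fine-grained heuristics.

```python
# Invented domain heuristics; the real prompt templates differ.
HEURISTICS = {
    "cooking": {"entities": ["pan", "knife", "chef"],
                "actions": ["chop", "stir", "plate"]},
    "sports":  {"entities": ["ball", "goal", "referee"],
                "actions": ["pass", "shoot", "foul"]},
}

def build_prompt(domain, question):
    # Steer the foundation model toward domain-relevant cues.
    h = HEURISTICS[domain]
    return (f"Focus on entities {h['entities']} and actions {h['actions']}. "
            f"Answer from visible evidence only: {question}")

print(build_prompt("cooking", "What dish is being prepared?"))
```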
arXiv Detail & Related papers (2024-10-12T06:22:23Z)
- Uncovering Hidden Connections: Iterative Search and Reasoning for Video-grounded Dialog [81.2968606962913]
Video-grounded dialog requires profound understanding of both dialog history and video content for accurate response generation. We present an iterative search and reasoning framework, which consists of a textual encoder, a visual encoder, and a generator.
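The encoder/generator loop can be caricatured as follows, with word overlap standing in for the learned textual and visual encoders; every name and detail here is an illustrative assumption.

```python
def iterative_search(dialog_history, segments, rounds=2):
    # Each round retrieves the unseen segment sharing the most words with
    # the evolving query, then folds that segment back into the query.
    query = set(" ".join(dialog_history).lower().split())
    evidence = []
    for _ in range(rounds):
        unseen = [s for s in segments if s not in evidence]
        if not unseen:
            break
        best = max(unseen, key=lambda s: len(query & set(s.lower().split())))
        evidence.append(best)
        query |= set(best.lower().split())
    return evidence  # the generator would condition its reply on this

history = ["who enters the kitchen", "what does she cook"]
segments = ["a woman enters the kitchen", "she fries eggs in a pan",
            "a cat sleeps on the sofa"]
print(iterative_search(history, segments))
# -> ['a woman enters the kitchen', 'she fries eggs in a pan']
```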
arXiv Detail & Related papers (2023-10-11T07:37:13Z)
- A Multimodal Framework for Video Ads Understanding [64.70769354696019]
We develop a multimodal system to improve the structured analysis of advertising video content.
Our solution achieved a score of 0.2470, which jointly measures localization and prediction accuracy, ranking fourth on the 2021 TAAC final leaderboard.
arXiv Detail & Related papers (2021-08-29T16:06:00Z)
- Unboxing Engagement in YouTube Influencer Videos: An Attention-Based Approach [0.3686808512438362]
"What is said" through words (text) is more important than "how it is said" through imagery (video images) or acoustics (audio) in predicting video engagement.<n>We analyze unstructured data from long-form YouTube influencer videos.
arXiv Detail & Related papers (2020-12-22T19:32:52Z)