Related papers: MealRec: Multi-granularity Sequential Modeling via Hierarchical Diffusion Models for Micro-Video Recommendation

URL: http://arxiv.org/abs/2603.01926v1
Date: Mon, 02 Mar 2026 14:39:18 GMT
Title: MealRec: Multi-granularity Sequential Modeling via Hierarchical Diffusion Models for Micro-Video Recommendation
Authors: Xinxin Dong, Haokai Ma, Yuze Zheng, Yongfu Zha, Yonghui Yang, Xiaodong Wang,
Abstract summary: Micro-video recommendation aims to capture user preferences from the collaborative and context information of the interacted micro-videos. We propose a Multi-granularity sequential modeling method via hierarchical diffusion models for micro-video Recommendation (MealRec).
Score: 9.268996272239034
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Micro-video recommendation aims to capture user preferences from the collaborative and context information of the interacted micro-videos, thereby predicting the appropriate videos. This goal is often hindered by the inherent noise within multimodal content and unreliable implicit feedback, which weakens the correspondence between behaviors and underlying interests. While conventional works have predominantly approached such scenarios through behavior-augmented modeling and content-centric multimodal analysis, these paradigms can inadvertently give rise to two non-trivial challenges: preference-irrelevant video representation extraction and inherent modality conflicts. To address these issues, we propose a Multi-granularity sequential modeling method via hierarchical diffusion models for micro-video Recommendation (MealRec), which simultaneously considers temporal correlations during preference modeling from intra- and inter-video perspectives. Specifically, we first propose Temporal-guided Content Diffusion (TCD) to refine video representations under intra-video temporal guidance and personalized collaborative signals, emphasizing salient content while suppressing redundancy. To achieve semantically coherent preference modeling, we further design Noise-unconditional Preference Denoising (NPD) to recover informative user preferences from corrupted states under blind denoising. Extensive experiments and analyses on four micro-video datasets from two platforms demonstrate the effectiveness, universality, and robustness of MealRec, further uncovering the effective mechanisms of the proposed TCD and NPD. The source code and corresponding dataset will be available upon acceptance.

Related papers

CounterVid: Counterfactual Video Generation for Mitigating Action and Temporal Hallucinations in Video-Language Models [66.56549019393042]
Video-language models (VLMs) achieve strong multimodal understanding but remain prone to hallucinations, especially when reasoning about actions and temporal order. We propose a scalable framework for counterfactual video generation that synthesizes videos differing only in actions or temporal structure while preserving scene context.
arXiv Detail & Related papers (2026-01-08T10:03:07Z)
Enhanced Partially Relevant Video Retrieval through Inter- and Intra-Sample Analysis with Coherence Prediction [18.24629930062925]
Partially Relevant Video Retrieval (PRVR) aims to retrieve the target video that is partially relevant to a text query. Existing methods coarsely align paired videos and text queries to construct the semantic space. We propose a novel PRVR framework to systematically exploit inter-sample correlation and intra-sample redundancy.
arXiv Detail & Related papers (2025-04-28T09:52:46Z)
Dual Conditional Diffusion Models for Sequential Recommendation [63.82152785755723]
We propose Dual Conditional Diffusion Models for Sequential Recommendation (DCRec). DCRec integrates implicit and explicit information by embedding dual conditions into both the forward and reverse diffusion processes. This allows the model to retain valuable sequential and contextual information while leveraging explicit user-item interactions to guide the recommendation process.
arXiv Detail & Related papers (2024-10-29T11:51:06Z)
A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition [53.800937914403654]
Advanced Audio-Visual Speech Recognition (AVSR) systems have been observed to be sensitive to missing video frames. While applying the dropout technique to the video modality enhances robustness to missing frames, it simultaneously results in a performance loss when dealing with complete data input. We propose a novel Multimodal Distribution Approximation with Knowledge Distillation (MDA-KD) framework to reduce over-reliance on the audio modality.
arXiv Detail & Related papers (2024-03-07T06:06:55Z)
Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals. Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars. Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z)
CMMD: Contrastive Multi-Modal Diffusion for Video-Audio Conditional Modeling [21.380988939240844]
We introduce a multi-modal diffusion model tailored for the bi-directional conditional generation of video and audio. We propose a joint contrastive training loss to improve the synchronization between visual and auditory occurrences.
arXiv Detail & Related papers (2023-12-08T23:55:19Z)
LD4MRec: Simplifying and Powering Diffusion Model for Multimedia Recommendation [6.914898966090197]
We propose a Light Diffusion model for Multimedia Recommendation (LD4MRec). LD4MRec employs a forward-free inference strategy, which directly predicts future behaviors from observed noisy behaviors. Experiments conducted on three real-world datasets demonstrate the effectiveness of LD4MRec.
arXiv Detail & Related papers (2023-09-27T02:12:41Z)
DiffusionVMR: Diffusion Model for Joint Video Moment Retrieval and Highlight Detection [38.12212015133935]
A novel framework, DiffusionVMR, is proposed to redefine the two tasks as a unified conditional denoising generation process. Experiments conducted on five widely-used benchmarks demonstrate the effectiveness and flexibility of the proposed DiffusionVMR.
arXiv Detail & Related papers (2023-08-29T08:20:23Z)
Diffusion Recommender Model [85.9640416600725]
We propose a novel Diffusion Recommender Model (named DiffRec) to learn the generative process in a denoising manner. To retain personalized information in user interactions, DiffRec reduces the added noise and avoids corrupting users' interactions into pure noise as in image synthesis.
arXiv Detail & Related papers (2023-04-11T04:31:00Z)
Deconfounded Video Moment Retrieval with Causal Intervention [80.90604360072831]
We tackle the task of video moment retrieval (VMR), which aims to localize a specific moment in a video according to a textual query. Existing methods primarily model the matching relationship between query and moment through complex cross-modal interactions. We propose a causality-inspired VMR framework that builds a structural causal model to capture the true effect of query and video content on the prediction.
arXiv Detail & Related papers (2021-06-03T01:33:26Z)
Speech Prediction in Silent Videos using Variational Autoencoders [29.423462898526605]
We present a model for generating speech in a silent video. The proposed model combines recurrent neural networks and variational deep generative models to learn the auditory's conditional distribution. We demonstrate the performance of our model on the GRID dataset based on standard benchmarks.
arXiv Detail & Related papers (2020-11-14T17:09:03Z)

This list is automatically generated from the titles and abstracts of the papers on this site.