ViLLA-MMBench: A Unified Benchmark Suite for LLM-Augmented Multimodal Movie Recommendation
- URL: http://arxiv.org/abs/2508.04206v1
- Date: Wed, 06 Aug 2025 08:39:07 GMT
- Title: ViLLA-MMBench: A Unified Benchmark Suite for LLM-Augmented Multimodal Movie Recommendation
- Authors: Fatemeh Nazary, Ali Tourani, Yashar Deldjoo, Tommaso Di Noia
- Abstract summary: ViLLA-MMBench is a benchmark for multimodal movie recommendation. It aligns dense item embeddings from three modalities: audio (block-level, i-vector), visual (CNN, AVF), and text. Missing or sparse metadata is automatically enriched using state-of-the-art LLMs.
- Score: 14.62192876151853
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recommending long-form video content demands joint modeling of visual, audio, and textual modalities, yet most benchmarks address only raw features or narrow fusion. We present ViLLA-MMBench, a reproducible, extensible benchmark for LLM-augmented multimodal movie recommendation. Built on MovieLens and MMTF-14K, it aligns dense item embeddings from three modalities: audio (block-level, i-vector), visual (CNN, AVF), and text. Missing or sparse metadata is automatically enriched using state-of-the-art LLMs (e.g., OpenAI Ada), generating high-quality synopses for thousands of movies. All text (raw or augmented) is embedded with configurable encoders (Ada, LLaMA-2, Sentence-T5), producing multiple ready-to-use sets. The pipeline supports interchangeable early-, mid-, and late-fusion (concatenation, PCA, CCA, rank-aggregation) and multiple backbones (MF, VAECF, VBPR, AMR, VMF) for ablation. Experiments are fully declarative via a single YAML file. Evaluation spans accuracy (Recall, nDCG) and beyond-accuracy metrics: cold-start rate, coverage, novelty, diversity, fairness. Results show LLM-based augmentation and strong text embeddings boost cold-start and coverage, especially when fused with audio-visual features. Systematic benchmarking reveals universal versus backbone- or metric-specific combinations. Open-source code, embeddings, and configs enable reproducible, fair multimodal RS research and advance principled generative AI integration in large-scale recommendation. Code: https://recsys-lab.github.io/ViLLA-MMBench
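The fusion step the abstract describes (early fusion by concatenating per-modality item embeddings, optionally followed by PCA reduction) can be sketched as follows. This is a minimal illustration, not the benchmark's actual API: the feature dimensions, item count, and variable names are assumptions chosen for clarity.

```python
# Sketch of early fusion over per-item audio, visual, and text embeddings,
# followed by PCA (via SVD) as one of the dimensionality-reduction options.
# All shapes below are illustrative, not taken from ViLLA-MMBench itself.
import numpy as np

rng = np.random.default_rng(0)
n_items = 100
audio = rng.normal(size=(n_items, 64))    # stand-in for block-level audio features
visual = rng.normal(size=(n_items, 128))  # stand-in for CNN/AVF visual features
text = rng.normal(size=(n_items, 256))    # stand-in for LLM text embeddings

# Early fusion: concatenate along the feature axis.
fused = np.concatenate([audio, visual, text], axis=1)

# PCA to 32 components, implemented directly with SVD on centered data.
fused_c = fused - fused.mean(axis=0)
_, _, Vt = np.linalg.svd(fused_c, full_matrices=False)
reduced = fused_c @ Vt[:32].T

print(fused.shape, reduced.shape)  # (100, 448) (100, 32)
```

The same fused matrix could instead feed a CCA or rank-aggregation step, or be passed directly to a recommendation backbone; which combination works best is exactly what the benchmark's ablation grid is designed to measure.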
Related papers
- MSJoE: Jointly Evolving MLLM and Sampler for Efficient Long-Form Video Understanding [25.20420111814606]
We present MLLM-Sampler Joint Evolution (MSJoE) for efficient long-form video understanding. MSJoE builds upon a key assumption that only a small subset of key-frames is truly informative for answering each question about a video. A new long-video QA dataset containing 2.8K videos with 7K question-answer pairs is collected to support the training process.
arXiv Detail & Related papers (2026-02-26T12:24:17Z)
- Binge Watch: Reproducible Multimodal Benchmarks Datasets for Large-Scale Movie Recommendation on MovieLens-10M and 20M [36.76326963560822]
We release M3L-10M and M3L-20M, two large-scale, reproducible, multimodal datasets for the movie domain. By following a fully documented pipeline, we collect movie plots, posters, and trailers, from which textual, visual, acoustic, and video features are extracted. We publicly release mappings to download the original raw data, the extracted features, and the complete datasets in multiple formats.
arXiv Detail & Related papers (2026-02-17T11:22:20Z)
- MMORE: Massive Multimodal Open RAG & Extraction [35.45122798365231]
MMORE is a pipeline to ingest, transform, and retrieve knowledge from heterogeneous document formats at scale. MMORE supports more than fifteen file types, including text, tables, images, emails, audio, and video, and processes them into a unified format. On processing benchmarks, MMORE demonstrates a 3.8-fold speedup over single-node baselines and 40% higher accuracy than Docling on scanned PDFs.
arXiv Detail & Related papers (2025-09-15T13:56:06Z)
- Free-MoRef: Instantly Multiplexing Context Perception Capabilities of Video-MLLMs within Single Inference [88.57742986765238]
Free-MoRef is a training-free approach to multiplex the context perception capabilities of Video-MLLMs. Experiments show that Free-MoRef achieves full perception of 2× to 8× longer input frames without compression on a single A100 GPU.
arXiv Detail & Related papers (2025-08-04T07:31:10Z)
- AdaVideoRAG: Omni-Contextual Adaptive Retrieval-Augmented Efficient Long Video Understanding [73.60257070465377]
AdaVideoRAG is a novel framework that adapts retrieval based on query complexity using a lightweight intent classifier. Our framework employs an Omni-Knowledge Indexing module to build hierarchical databases from text (captions, ASR, OCR), visual features, and semantic graphs. Experiments demonstrate improved efficiency and accuracy for long-video understanding, with seamless integration into existing MLLMs.
arXiv Detail & Related papers (2025-06-16T15:18:15Z)
- Chain-of-Frames: Advancing Video Understanding in Multimodal LLMs via Frame-Aware Reasoning [37.86612817818566]
We propose to obtain video LLMs whose reasoning steps are grounded in, and explicitly refer to, the relevant video frames. Our approach is simple and self-contained, and, unlike existing approaches for video CoT, does not require auxiliary networks to select or caption relevant frames. This, in turn, leads to improved performance across multiple video understanding benchmarks.
arXiv Detail & Related papers (2025-05-31T00:08:21Z)
- Adaptive Audio-Visual Speech Recognition via Matryoshka-Based Multimodal LLMs [33.12165044958361]
Recent advances in Large Language Models (LLMs) show strong performance in speech recognition, including Audio-Visual Speech Recognition (AVSR). To address this, we propose Llama-MTSK, the first Matryoshka-based Multimodal LLM for AVSR. Inspired by Matryoshka Representation Learning, our model encodes representations at multiple granularities with a single architecture. For efficient fine-tuning, we introduce three LoRA-based strategies using global and scale-specific modules.
arXiv Detail & Related papers (2025-03-09T00:02:10Z)
- TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models [52.590072198551944]
Recent advances in multimodal Large Language Models (LLMs) have shown great success in understanding multi-modal contents.
For video understanding tasks, training-based video LLMs are difficult to build due to the scarcity of high-quality, curated video-text paired data.
In this work, we explore the limitations of the existing compression strategies for building a training-free video LLM.
arXiv Detail & Related papers (2024-11-17T13:08:29Z)
- MIBench: Evaluating Multimodal Large Language Models over Multiple Images [70.44423964171088]
We propose a new benchmark MIBench, to comprehensively evaluate fine-grained abilities of MLLMs in multi-image scenarios.
Specifically, MIBench categorizes the multi-image abilities into three scenarios: multi-image instruction (MII), multimodal knowledge-seeking (MKS), and multimodal in-context learning (MIC).
The results reveal that although current models excel in single-image tasks, they exhibit significant shortcomings when faced with multi-image inputs.
arXiv Detail & Related papers (2024-07-21T21:22:58Z)
- MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs [88.28014831467503]
We introduce MMDU, a comprehensive benchmark, and MMDU-45k, a large-scale instruction tuning dataset.
MMDU has a maximum of 18k image+text tokens, 20 images, and 27 turns, which is at least 5x longer than previous benchmarks.
We demonstrate that fine-tuning open-source LVLMs on MMDU-45k significantly addresses this gap, generating longer and more accurate conversations.
arXiv Detail & Related papers (2024-06-17T17:59:47Z)
- OneLLM: One Framework to Align All Modalities with Language [86.8818857465443]
We present OneLLM, an MLLM that aligns eight modalities to language using a unified framework. OneLLM is evaluated on 25 diverse benchmarks, encompassing tasks such as multimodal captioning, question answering, and reasoning.
arXiv Detail & Related papers (2023-12-06T18:59:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.