ViLLA-MMBench: A Unified Benchmark Suite for LLM-Augmented Multimodal Movie Recommendation
- URL: http://arxiv.org/abs/2508.04206v1
- Date: Wed, 06 Aug 2025 08:39:07 GMT
- Title: ViLLA-MMBench: A Unified Benchmark Suite for LLM-Augmented Multimodal Movie Recommendation
- Authors: Fatemeh Nazary, Ali Tourani, Yashar Deldjoo, Tommaso Di Noia
- Abstract summary: ViLLA-MMBench is a benchmark for multimodal movie recommendation. It aligns dense item embeddings from three modalities: audio (block-level, i-vector), visual (CNN, AVF), and text. Missing or sparse metadata is automatically enriched using state-of-the-art LLMs.
- Score: 14.62192876151853
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recommending long-form video content demands joint modeling of visual, audio, and textual modalities, yet most benchmarks address only raw features or narrow fusion. We present ViLLA-MMBench, a reproducible, extensible benchmark for LLM-augmented multimodal movie recommendation. Built on MovieLens and MMTF-14K, it aligns dense item embeddings from three modalities: audio (block-level, i-vector), visual (CNN, AVF), and text. Missing or sparse metadata is automatically enriched using state-of-the-art LLMs (e.g., OpenAI Ada), generating high-quality synopses for thousands of movies. All text (raw or augmented) is embedded with configurable encoders (Ada, LLaMA-2, Sentence-T5), producing multiple ready-to-use sets. The pipeline supports interchangeable early-, mid-, and late-fusion (concatenation, PCA, CCA, rank-aggregation) and multiple backbones (MF, VAECF, VBPR, AMR, VMF) for ablation. Experiments are fully declarative via a single YAML file. Evaluation spans accuracy (Recall, nDCG) and beyond-accuracy metrics: cold-start rate, coverage, novelty, diversity, fairness. Results show LLM-based augmentation and strong text embeddings boost cold-start and coverage, especially when fused with audio-visual features. Systematic benchmarking reveals universal versus backbone- or metric-specific combinations. Open-source code, embeddings, and configs enable reproducible, fair multimodal RS research and advance principled generative AI integration in large-scale recommendation. Code: https://recsys-lab.github.io/ViLLA-MMBench
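The fusion step the abstract describes (early fusion by concatenating per-modality item embeddings, optionally followed by PCA reduction) can be sketched as follows. This is a minimal illustration, not the benchmark's actual API: the feature dimensions, item count, and variable names are assumptions chosen for clarity.

```python
# Sketch of early fusion over per-item audio, visual, and text embeddings,
# followed by PCA (via SVD) as one of the dimensionality-reduction options.
# All shapes below are illustrative, not taken from ViLLA-MMBench itself.
import numpy as np

rng = np.random.default_rng(0)
n_items = 100
audio = rng.normal(size=(n_items, 64))    # stand-in for block-level audio features
visual = rng.normal(size=(n_items, 128))  # stand-in for CNN/AVF visual features
text = rng.normal(size=(n_items, 256))    # stand-in for LLM text embeddings

# Early fusion: concatenate along the feature axis.
fused = np.concatenate([audio, visual, text], axis=1)

# PCA to 32 components, implemented directly with SVD on centered data.
fused_c = fused - fused.mean(axis=0)
_, _, Vt = np.linalg.svd(fused_c, full_matrices=False)
reduced = fused_c @ Vt[:32].T

print(fused.shape, reduced.shape)  # (100, 448) (100, 32)
```

The same fused matrix could instead feed a CCA or rank-aggregation step, or be passed directly to a recommendation backbone; which combination works best is exactly what the benchmark's ablation grid is designed to measure.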
Related papers
- MSJoE: Jointly Evolving MLLM and Sampler for Efficient Long-Form Video Understanding [25.20420111814606]
We present MLLM-Sampler Joint Evolution (MSJoE) for efficient long-form video understanding. MSJoE builds upon a key assumption that only a small subset of key-frames is truly informative for answering each question about a video. A new long-video QA dataset containing 2.8K videos with 7K question-answer pairs is collected to support the training process.
arXiv Detail & Related papers (2026-02-26T12:24:17Z)
- Binge Watch: Reproducible Multimodal Benchmarks Datasets for Large-Scale Movie Recommendation on MovieLens-10M and 20M [36.76326963560822]
We release M3L-10M and M3L-20M, two large-scale, reproducible, multimodal datasets for the movie domain. By following a fully documented pipeline, we collect movie plots, posters, and trailers, from which textual, visual, acoustic, and video features are extracted. We publicly release mappings to download the original raw data, the extracted features, and the complete datasets in multiple formats.
arXiv Detail & Related papers (2026-02-17T11:22:20Z)
- MMORE: Massive Multimodal Open RAG & Extraction [35.45122798365231]
MMORE is a pipeline to ingest, transform, and retrieve knowledge from heterogeneous document formats at scale. MMORE supports more than fifteen file types, including text, tables, images, emails, audio, and video, and processes them into a unified format. On processing benchmarks, MMORE demonstrates a 3.8-fold speedup over single-node baselines and 40% higher accuracy than Docling on scanned PDFs.
arXiv Detail & Related papers (2025-09-15T13:56:06Z)
- Free-MoRef: Instantly Multiplexing Context Perception Capabilities of Video-MLLMs within Single Inference [88.57742986765238]
Free-MoRef is a training-free approach to multiplex the context perception capabilities of Video-MLLMs. Experiments show that Free-MoRef achieves full perception of 2× to 8× longer input frames without compression on a single A100 GPU.
arXiv Detail & Related papers (2025-08-04T07:31:10Z)
- AdaVideoRAG: Omni-Contextual Adaptive Retrieval-Augmented Efficient Long Video Understanding [73.60257070465377]
AdaVideoRAG is a novel framework that adapts retrieval based on query complexity using a lightweight intent classifier. Our framework employs an Omni-Knowledge Indexing module to build hierarchical databases from text (captions, ASR, OCR), visual features, and semantic graphs. Experiments demonstrate improved efficiency and accuracy for long-video understanding, with seamless integration into existing MLLMs.
arXiv Detail & Related papers (2025-06-16T15:18:15Z)
- Chain-of-Frames: Advancing Video Understanding in Multimodal LLMs via Frame-Aware Reasoning [37.86612817818566]
We propose to obtain video LLMs whose reasoning steps are grounded in, and explicitly refer to, the relevant video frames. Our approach is simple and self-contained, and, unlike existing approaches for video CoT, does not require auxiliary networks to select or caption relevant frames. This, in turn, leads to improved performance across multiple video understanding benchmarks.
arXiv Detail & Related papers (2025-05-31T00:08:21Z)
- Adaptive Audio-Visual Speech Recognition via Matryoshka-Based Multimodal LLMs [33.12165044958361]
Recent advances in Large Language Models (LLMs) show strong performance in speech recognition, including Audio-Visual Speech Recognition (AVSR). To address this, we propose Llama-MTSK, the first Matryoshka-based Multimodal LLM for AVSR. Inspired by Matryoshka Representation Learning, our model encodes representations at multiple granularities with a single architecture. For efficient fine-tuning, we introduce three LoRA-based strategies using global and scale-specific modules.
arXiv Detail & Related papers (2025-03-09T00:02:10Z)
- TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models [52.590072198551944]
Recent advances in multimodal Large Language Models (LLMs) have shown great success in understanding multi-modal contents.
For video understanding tasks, training-based video LLMs are difficult to build due to the scarcity of high-quality, curated video-text paired data.
In this work, we explore the limitations of the existing compression strategies for building a training-free video LLM.
arXiv Detail & Related papers (2024-11-17T13:08:29Z)
- MIBench: Evaluating Multimodal Large Language Models over Multiple Images [70.44423964171088]
We propose a new benchmark MIBench, to comprehensively evaluate fine-grained abilities of MLLMs in multi-image scenarios.
Specifically, MIBench categorizes the multi-image abilities into three scenarios: multi-image instruction (MII), multimodal knowledge-seeking (MKS), and multimodal in-context learning (MIC).
The results reveal that although current models excel in single-image tasks, they exhibit significant shortcomings when faced with multi-image inputs.
arXiv Detail & Related papers (2024-07-21T21:22:58Z)
- MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs [88.28014831467503]
We introduce MMDU, a comprehensive benchmark, and MMDU-45k, a large-scale instruction tuning dataset.
MMDU has a maximum of 18k image+text tokens, 20 images, and 27 turns, which is at least 5x longer than previous benchmarks.
We demonstrate that fine-tuning open-source LVLMs on MMDU-45k significantly addresses this gap, generating longer and more accurate conversations.
arXiv Detail & Related papers (2024-06-17T17:59:47Z)
- OneLLM: One Framework to Align All Modalities with Language [86.8818857465443]
We present OneLLM, an MLLM that aligns eight modalities to language using a unified framework. OneLLM is evaluated on 25 diverse benchmarks, encompassing tasks such as multimodal captioning, question answering, and reasoning.
arXiv Detail & Related papers (2023-12-06T18:59:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.