Just Ask for Music (JAM): Multimodal and Personalized Natural Language Music Recommendation
- URL: http://arxiv.org/abs/2507.15826v1
- Date: Mon, 21 Jul 2025 17:36:03 GMT
- Title: Just Ask for Music (JAM): Multimodal and Personalized Natural Language Music Recommendation
- Authors: Alessandro B. Melchiorre, Elena V. Epure, Shahed Masoudian, Gustavo Escobedo, Anna Hausberger, Manuel Moussallam, Markus Schedl,
- Abstract summary: We present JAM (Just Ask for Music), a lightweight and intuitive framework for natural language music recommendation.<n>To capture the complexity of music and user intent, JAM aggregates multimodal item features via cross-attention and sparse mixture-of-experts.<n>Our results show that JAM provides accurate recommendations, produces intuitive representations suitable for practical use cases, and can be easily integrated with existing music recommendation stacks.
- Score: 47.05078668091976
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Natural language interfaces offer a compelling approach for music recommendation, enabling users to express complex preferences conversationally. While Large Language Models (LLMs) show promise in this direction, their scalability in recommender systems is limited by high costs and latency. Retrieval-based approaches using smaller language models mitigate these issues but often rely on single-modal item representations, overlook long-term user preferences, and require full model retraining, posing challenges for real-world deployment. In this paper, we present JAM (Just Ask for Music), a lightweight and intuitive framework for natural language music recommendation. JAM models user-query-item interactions as vector translations in a shared latent space, inspired by knowledge graph embedding methods like TransE. To capture the complexity of music and user intent, JAM aggregates multimodal item features via cross-attention and sparse mixture-of-experts. We also introduce JAMSessions, a new dataset of over 100k user-query-item triples with anonymized user/item embeddings, uniquely combining conversational queries and user long-term preferences. Our results show that JAM provides accurate recommendations, produces intuitive representations suitable for practical use cases, and can be easily integrated with existing music recommendation stacks.
Related papers
- Creating General User Models from Computer Use [62.91116265732001]
This paper presents an architecture for a general user model (GUM) that learns about you by observing any interaction you have with your computer.<n>The GUM takes as input any unstructured observation of a user (e.g., device screenshots) and constructs confidence-weighted propositions that capture user knowledge and preferences.
arXiv Detail & Related papers (2025-05-16T04:00:31Z) - HistLLM: A Unified Framework for LLM-Based Multimodal Recommendation with User History Encoding and Compression [33.34435467588446]
HistLLM is an innovative framework that integrates textual and visual features through a User History.<n> Module (UHEM), compressing user history interactions into a single token representation.<n>Extensive experiments demonstrate the effectiveness and efficiency of our proposed mechanism.
arXiv Detail & Related papers (2025-04-14T12:01:11Z) - TALKPLAY: Multimodal Music Recommendation with Large Language Models [6.830154140450626]
We present TALKPLAY, a novel multimodal music recommendation system that reformulates recommendation as a token generation problem using large language models (LLMs)<n>Our system effectively recommends music from diverse user queries while generating contextually relevant responses.<n>Our qualitative and quantitative evaluation demonstrates that TALKPLAY significantly outperforms unimodal approaches based solely on text or listening history in both recommendation performance and conversational naturalness.
arXiv Detail & Related papers (2025-02-19T13:28:20Z) - LLM-ESR: Large Language Models Enhancement for Long-tailed Sequential Recommendation [58.04939553630209]
In real-world systems, most users interact with only a handful of items, while the majority of items are seldom consumed.
These two issues, known as the long-tail user and long-tail item challenges, often pose difficulties for existing Sequential Recommendation systems.
We propose the Large Language Models Enhancement framework for Sequential Recommendation (LLM-ESR) to address these challenges.
arXiv Detail & Related papers (2024-05-31T07:24:42Z) - Parameter-Efficient Conversational Recommender System as a Language
Processing Task [52.47087212618396]
Conversational recommender systems (CRS) aim to recommend relevant items to users by eliciting user preference through natural language conversation.
Prior work often utilizes external knowledge graphs for items' semantic information, a language model for dialogue generation, and a recommendation module for ranking relevant items.
In this paper, we represent items in natural language and formulate CRS as a natural language processing task.
arXiv Detail & Related papers (2024-01-25T14:07:34Z) - Interpreting User Requests in the Context of Natural Language Standing
Instructions [89.12540932734476]
We develop NLSI, a language-to-program dataset consisting of over 2.4K dialogues spanning 17 domains.
A key challenge in NLSI is to identify which subset of the standing instructions is applicable to a given dialogue.
arXiv Detail & Related papers (2023-11-16T11:19:26Z) - Large Language Models are Competitive Near Cold-start Recommenders for
Language- and Item-based Preferences [33.81337282939615]
dialog interfaces that allow users to express language-based preferences offer a fundamentally different modality for preference input.
Inspired by recent successes of prompting paradigms for large language models (LLMs), we study their use for making recommendations.
arXiv Detail & Related papers (2023-07-26T14:47:15Z) - Beyond Single Items: Exploring User Preferences in Item Sets with the
Conversational Playlist Curation Dataset [20.42354123651454]
We call this task conversational item set curation.
We present a novel data collection methodology that efficiently collects realistic preferences about item sets in a conversational setting.
We show that it leads raters to express preferences that would not be otherwise expressed.
arXiv Detail & Related papers (2023-03-13T00:39:04Z) - Talk the Walk: Synthetic Data Generation for Conversational Music
Recommendation [62.019437228000776]
We present TalkWalk, which generates realistic high-quality conversational data by leveraging encoded expertise in widely available item collections.
We generate over one million diverse conversations in a human-collected dataset.
arXiv Detail & Related papers (2023-01-27T01:54:16Z) - COLA: Improving Conversational Recommender Systems by Collaborative
Augmentation [9.99763097964222]
We propose a collaborative augmentation (COLA) method to improve both item representation learning and user preference modeling.
We construct an interactive user-item graph from all conversations, which augments item representations with user-aware information.
To improve user preference modeling, we retrieve similar conversations from the training corpus, where the involved items and attributes that reflect the user's potential interests are used to augment the user representation.
arXiv Detail & Related papers (2022-12-15T12:37:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.