TALKPLAY: Multimodal Music Recommendation with Large Language Models
- URL: http://arxiv.org/abs/2502.13713v3
- Date: Wed, 26 Feb 2025 01:00:37 GMT
- Title: TALKPLAY: Multimodal Music Recommendation with Large Language Models
- Authors: Seungheon Doh, Keunwoo Choi, Juhan Nam
- Abstract summary: TalkPlay represents music through an expanded token vocabulary that encodes multiple modalities. The model learns to generate recommendations through next-token prediction on music recommendation conversations. Our approach eliminates traditional recommendation-dialogue pipeline complexity, enabling end-to-end learning of query-aware music recommendations.
- Score: 6.830154140450626
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We present TalkPlay, a multimodal music recommendation system that reformulates the recommendation task as large language model token generation. TalkPlay represents music through an expanded token vocabulary that encodes multiple modalities - audio, lyrics, metadata, semantic tags, and playlist co-occurrence. Using these rich representations, the model learns to generate recommendations through next-token prediction on music recommendation conversations, which requires learning the associations between natural language queries, responses, and music items. In other words, this formulation transforms music recommendation into a natural language understanding task, where the model's ability to predict conversation tokens directly optimizes query-item relevance. Our approach eliminates the complexity of traditional recommendation-dialogue pipelines, enabling end-to-end learning of query-aware music recommendations. In our experiments, TalkPlay is successfully trained and outperforms baseline methods across multiple aspects, demonstrating strong context understanding as a conversational music recommender.
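To make the formulation concrete, below is a minimal sketch (not the authors' implementation) of casting recommendation as next-token prediction with Hugging Face Transformers. The base model name, the `<track_i>` token format, and the toy conversation are illustrative assumptions; in TalkPlay each item token is additionally grounded in audio, lyric, metadata, tag, and co-occurrence features rather than being a bare vocabulary entry.

```python
# Illustrative sketch: extend a causal LM's vocabulary with music-item tokens and
# train it with the standard language-modeling loss on recommendation dialogues.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder base LM, not the model used in the paper
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# One hypothetical token per catalog item (assumed format).
item_tokens = [f"<track_{i}>" for i in range(3)]
tokenizer.add_tokens(item_tokens)
model.resize_token_embeddings(len(tokenizer))

# A toy recommendation conversation: the recommended item appears as a token in the
# assistant turn, so query-item relevance is optimized by the same next-token loss.
dialogue = "User: something mellow for a rainy evening\nAssistant: try <track_1>"
batch = tokenizer(dialogue, return_tensors="pt")
labels = batch["input_ids"].clone()

outputs = model(**batch, labels=labels)  # next-token prediction over words and item tokens
outputs.loss.backward()                  # end-to-end: dialogue and recommendation share one objective
```

At inference time, restricting decoding to the item-token subset of the vocabulary turns the same model into a ranker over the catalog, which is how a generated token can be read as a recommendation.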
Related papers
- Just Ask for Music (JAM): Multimodal and Personalized Natural Language Music Recommendation [47.05078668091976]
We present JAM (Just Ask for Music), a lightweight and intuitive framework for natural language music recommendation. To capture the complexity of music and user intent, JAM aggregates multimodal item features via cross-attention and sparse mixture-of-experts. Our results show that JAM provides accurate recommendations, produces intuitive representations suitable for practical use cases, and can be easily integrated with existing music recommendation stacks.
arXiv Detail & Related papers (2025-07-21T17:36:03Z) - NTPP: Generative Speech Language Modeling for Dual-Channel Spoken Dialogue via Next-Token-Pair Prediction [59.44357187878676]
We introduce a novel generative modeling paradigm, Next-Token-Pair Prediction (NTPP), to enable speaker-independent dual-channel spoken dialogue learning. We evaluate our approach on standard benchmarks, and empirical results show that our proposed method, NTPP, significantly improves the conversational abilities of SLMs in terms of turn-taking prediction, response coherence, and naturalness.
arXiv Detail & Related papers (2025-06-01T12:01:40Z) - System Message Generation for User Preferences using Open-Source Models [4.387048445855714]
System messages play a crucial role in interactions with large language models (LLMs). We introduce SysGen, a pipeline for generating system messages that better align assistant responses with user instructions. Training open-source models on SysGen data yields substantial improvements in both single-turn (Multifacet) and multi-turn (SysBench) conversation benchmarks.
arXiv Detail & Related papers (2025-02-17T01:05:31Z) - Music Discovery Dialogue Generation Using Human Intent Analysis and Large Language Models [10.022036983890091]
We present a data generation framework for rich music discovery dialogue using a large language model (LLM) and user intents, system actions, and musical attributes.
By applying this framework to the Million Song dataset, we create LP-MusicDialog, a Large Language Model based Pseudo Music Dialogue dataset.
Our evaluation shows that the synthetic dataset is competitive with an existing, small human dialogue dataset.
arXiv Detail & Related papers (2024-11-11T23:40:45Z) - OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation [53.7173034249361]
OmniFlatten is an end-to-end GPT-based model capable of effectively modeling the complex behaviors inherent in natural conversations with low latency. Our approach offers a simple modeling technique and a promising research direction for developing efficient and natural end-to-end full-duplex spoken dialogue systems.
arXiv Detail & Related papers (2024-10-23T11:58:58Z) - Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions [68.98811048970963]
We present a pioneering effort to investigate the capability of large language models (LLMs) in transcribing speech in multi-talker environments. We use WavLM and Whisper encoders to extract multi-faceted speech representations that are sensitive to speaker characteristics and semantic context. Experiments reveal the promising performance of our proposed system, MT-LLM, in cocktail party scenarios.
arXiv Detail & Related papers (2024-09-13T07:28:28Z) - Large Language Model Driven Recommendation [34.45328907249946]
The advent of language-driven recommendation has unlocked the use of natural language (NL) interactions for recommendation. This chapter discusses how LLMs' abilities for general NL reasoning present novel opportunities to build highly personalized RSs.
arXiv Detail & Related papers (2024-08-20T15:36:24Z) - MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models [11.834712543531756]
MuChoMusic is a benchmark for evaluating music understanding in multimodal language models focused on audio.
It comprises 1,187 multiple-choice questions, all validated by human annotators, on 644 music tracks sourced from two publicly available music datasets.
We evaluate five open-source models and identify several pitfalls, including an over-reliance on the language modality.
arXiv Detail & Related papers (2024-08-02T15:34:05Z) - Item-Language Model for Conversational Recommendation [24.00379652557269]
We propose an Item-Language Model (ILM) to produce text-aligned item representations that encode user interaction signals. We conduct extensive experiments which demonstrate the importance of both language alignment and user interaction knowledge in the item encoder.
arXiv Detail & Related papers (2024-06-05T01:35:50Z) - Parameter-Efficient Conversational Recommender System as a Language Processing Task [52.47087212618396]
Conversational recommender systems (CRS) aim to recommend relevant items to users by eliciting user preference through natural language conversation.
Prior work often utilizes external knowledge graphs for items' semantic information, a language model for dialogue generation, and a recommendation module for ranking relevant items.
In this paper, we represent items in natural language and formulate CRS as a natural language processing task.
arXiv Detail & Related papers (2024-01-25T14:07:34Z) - MuseChat: A Conversational Music Recommendation System for Videos [12.47508840909336]
MuseChat is a first-of-its-kind dialogue-based recommendation system that personalizes music suggestions for videos.
Our system consists of two key functionalities with associated modules: recommendation and reasoning.
Experiment results show that MuseChat achieves significant improvements over existing video-based music retrieval methods.
arXiv Detail & Related papers (2023-10-10T03:32:33Z) - Can Language Models Learn to Listen? [96.01685069483025]
We present a framework for generating appropriate facial responses from a listener in dyadic social interactions based on the speaker's words.
Our approach autoregressively predicts a response of a listener: a sequence of listener facial gestures, quantized using a VQ-VAE.
We show that our generated listener motion is fluent and reflective of language semantics through quantitative metrics and a qualitative user study.
arXiv Detail & Related papers (2023-08-21T17:59:02Z) - Large Language Models as Zero-Shot Conversational Recommenders [52.57230221644014]
We present empirical studies on conversational recommendation tasks using representative large language models in a zero-shot setting.
We construct a new dataset of recommendation-related conversations by scraping a popular discussion website.
We observe that even without fine-tuning, large language models can outperform existing fine-tuned conversational recommendation models.
arXiv Detail & Related papers (2023-08-19T15:29:45Z) - Language-Guided Music Recommendation for Video via Prompt Analogies [35.48998901411509]
We propose a method to recommend music for an input video while allowing a user to guide music selection with free-form natural language.
Existing music video datasets provide the needed (video, music) training pairs, but lack text descriptions of the music.
arXiv Detail & Related papers (2023-06-15T17:58:01Z) - Leveraging Large Language Models in Conversational Recommender Systems [9.751217336860924]
A Conversational Recommender System (CRS) offers increased transparency and control to users by enabling them to engage with the system through a real-time multi-turn dialogue.
Large Language Models (LLMs) have exhibited an unprecedented ability to converse naturally and incorporate world knowledge and common-sense reasoning into language understanding.
arXiv Detail & Related papers (2023-05-13T16:40:07Z) - Talk the Walk: Synthetic Data Generation for Conversational Music Recommendation [62.019437228000776]
We present TalkWalk, which generates realistic high-quality conversational data by leveraging encoded expertise in widely available item collections.
We generate over one million diverse conversations in a human-collected dataset.
arXiv Detail & Related papers (2023-01-27T01:54:16Z) - ALCAP: Alignment-Augmented Music Captioner [34.85003676798762]
We introduce a method to learn multimodal alignment between audio and lyrics through contrastive learning.
This not only recognizes and emphasizes the synergy between audio and lyrics but also paves the way for models to achieve deeper cross-modal coherence.
arXiv Detail & Related papers (2022-12-21T10:20:54Z) - AudioLM: a Language Modeling Approach to Audio Generation [59.19364975706805]
We introduce AudioLM, a framework for high-quality audio generation with long-term consistency.
We show how existing audio tokenizers provide different trade-offs between reconstruction quality and long-term structure.
We demonstrate how our approach extends beyond speech by generating coherent piano music continuations.
arXiv Detail & Related papers (2022-09-07T13:40:08Z) - Contrastive Audio-Language Learning for Music [13.699088044513562]
MusCALL is a framework for Music Contrastive Audio-Language Learning.
Our approach consists of a dual-encoder architecture that learns the alignment between pairs of music audio and descriptive sentences.
arXiv Detail & Related papers (2022-08-25T16:55:15Z) - Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion [89.01668641930206]
We present a framework for modeling interactional communication in dyadic conversations.
We autoregressively output multiple possibilities of corresponding listener motion.
Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions.
arXiv Detail & Related papers (2022-04-18T17:58:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.