Can MLLMs Generalize to Multi-Party dialog? Exploring Multilingual Response Generation in Complex Scenarios
- URL: http://arxiv.org/abs/2501.11269v2
- Date: Mon, 19 May 2025 07:25:44 GMT
- Title: Can MLLMs Generalize to Multi-Party dialog? Exploring Multilingual Response Generation in Complex Scenarios
- Authors: Zhongtian Hu, Yiwen Cui, Ronghan Li, Meng Zhao, Lifang Wang
- Abstract summary: We introduce XMP, a high-quality parallel multilingual dataset sourced from multi-party podcast dialogues. Most samples in the dataset feature three or more participants discussing a wide range of topics. We find that, R1: MLLMs fail to generalize to the multi-party setting; R2: fine-tuning on XMP improves only marginally, with the 70B model achieving at most a 1% absolute gain over its 8B counterpart.
- Score: 8.131774353504472
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current multilingual large language models (MLLMs) still focus on simple question-answering formats, often overlooking more complex dialogue scenarios; in other words, the capabilities of multilingual large models have yet to be validated on dialogue tasks with intricate structures. We therefore ask, Q1: How well do LLMs generalize to more complex dialogue scenarios? Q2: Can supervised fine-tuning on a high-quality parallel benchmark restore this ability? Q3: Does the "multilingual complementarity" effect survive in this setting? To answer these questions, we introduce XMP, a high-quality parallel multilingual dataset sourced from multi-party podcast dialogues, the first parallel dataset focusing on multi-party dialogue scenarios. Most samples in the dataset feature three or more participants discussing a wide range of topics. Through extensive experiments, we find that, R1: MLLMs fail to generalize to the multi-party setting; R2: fine-tuning on XMP improves only marginally, with the 70B model achieving at most a 1% absolute gain over its 8B counterpart; and R3: mixing languages during SFT is usually detrimental, with any benefits being marginal and limited to isolated cases in the 70B model.
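The abstract does not spell out how XMP samples are presented to the models during SFT. Purely as an illustration of what multi-party supervision can involve, the sketch below serializes a multi-speaker exchange into a prompt/target pair for causal-LM fine-tuning; the speaker tags, field names, and dialogue content are hypothetical and not taken from XMP.

```python
# Hypothetical serialization of one multi-party dialogue into a (prompt, target)
# pair for causal-LM SFT. Speaker tags, field names, and content are illustrative
# only; they are not taken from the XMP release.

def build_sft_example(turns, target_speaker):
    """Turn a multi-party history into (prompt, target) for supervised fine-tuning."""
    history = "\n".join(f"[{t['speaker']}]: {t['utterance']}" for t in turns[:-1])
    last = turns[-1]
    assert last["speaker"] == target_speaker, "last turn must be the target reply"
    prompt = (
        "The following is a multi-party podcast discussion.\n"
        f"{history}\n[{target_speaker}]:"
    )
    target = " " + last["utterance"]
    return prompt, target

turns = [
    {"speaker": "Host", "utterance": "Welcome back. Today we are talking about sleep."},
    {"speaker": "Guest A", "utterance": "Most people underestimate how much they need."},
    {"speaker": "Guest B", "utterance": "And caffeine timing matters more than people think."},
    {"speaker": "Host", "utterance": "Guest A, what does the research actually say?"},
    {"speaker": "Guest A", "utterance": "Roughly seven to nine hours for most adults."},
]
prompt, target = build_sft_example(turns, target_speaker="Guest A")
print(prompt + target)
```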
Related papers
- KwaiChat: A Large-Scale Video-Driven Multilingual Mixed-Type Dialogue Corpus [69.46707346122113]
We propose a novel task and create a human-to-human video-driven multilingual mixed-type dialogue corpus.
The KwaiChat corpus contains a total of 93,209 videos and 246,080 dialogues, across 4 dialogue types, 30 domains, 4 languages, and 13 topics.
An analysis of 7 distinct LLMs on KwaiChat reveals that GPT-4o achieves the best performance but still falls short in this setting.
arXiv Detail & Related papers (2025-03-10T04:05:38Z) - Open-Source Large Language Models as Multilingual Crowdworkers: Synthesizing Open-Domain Dialogues in Several Languages With No Examples in Targets and No Machine Translation [1.7436854281619139]
We introduce a pipeline for generating Open-Domain Dialogue data in multiple Target Languages using Large Language Models.
To enhance the openness of the generated dialogues and mimic real-life scenarios, we add the notion of speech events corresponding to the type of conversation the speakers are involved in.
arXiv Detail & Related papers (2025-03-05T12:52:14Z) - Multi-Party Supervised Fine-tuning of Language Models for Multi-Party Dialogue Generation [11.340007143339657]
Large Language Models (LLMs) are usually fine-tuned to participate in dyadic or two-party dialogues. In this work, we design a multi-party fine-tuning framework (MuPaS) for LLMs on multi-party dialogue datasets.
arXiv Detail & Related papers (2024-12-06T09:33:47Z) - VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning [64.56272011710735]
We propose a novel single-stage joint speech-text SFT approach based on low-rank adaptation (LoRA) of the large language model (LLM) backbone. Compared to previous SpeechLMs with 7B or 13B parameters, our 3B model demonstrates superior performance across various speech benchmarks.
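The summary only states that speech-text SFT is applied through LoRA on the LLM backbone. As a hedged illustration of that general recipe, attaching LoRA adapters with the Hugging Face peft library looks roughly as follows; the model name and hyperparameters are placeholders, and the speech encoder and feature fusion the paper describes are omitted.

```python
# Minimal LoRA-on-LLM-backbone sketch using Hugging Face transformers + peft.
# The model name and hyperparameters are placeholders, not the paper's setup,
# and the speech encoder / speech-text fusion from the abstract is omitted.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder backbone
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

lora_cfg = LoraConfig(
    r=16,                                 # adapter rank (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections in LLaMA-style blocks
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```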
arXiv Detail & Related papers (2024-10-23T00:36:06Z) - Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions [68.98811048970963]
We present a pioneering effort to investigate the capability of large language models (LLMs) in transcribing speech in multi-talker environments.
Our approach utilizes WavLM and Whisper encoders to extract multi-faceted speech representations that are sensitive to speaker characteristics and semantic context.
Comprehensive experiments reveal the promising performance of our proposed system, MT-LLM, in cocktail party scenarios.
arXiv Detail & Related papers (2024-09-13T07:28:28Z) - Implicit Multimodal Alignment: On the Generalization of Frozen LLMs to Multimodal Inputs [63.29737699997859]
Large Language Models (LLMs) have demonstrated impressive performance on multimodal tasks, without any multimodal finetuning.
In this work, we expose frozen LLMs to image, video, audio and text inputs and analyse their internal representation.
arXiv Detail & Related papers (2024-05-26T21:31:59Z) - A Survey on Large Language Models with Multilingualism: Recent Advances and New Frontiers [51.8203871494146]
The rapid development of Large Language Models (LLMs) demonstrates remarkable multilingual capabilities in natural language processing. Despite these breakthroughs, the investigation into multilingual scenarios remains insufficient. This survey aims to help the research community address multilingual problems and provide a comprehensive understanding of the core concepts, key techniques, and latest developments in multilingual natural language processing based on LLMs.
arXiv Detail & Related papers (2024-05-17T17:47:39Z) - Teaching a Multilingual Large Language Model to Understand Multilingual Speech via Multi-Instructional Training [29.47243668154796]
BLOOMZMMS is a novel model that integrates a multilingual LLM with a multilingual speech encoder.
We demonstrate the transferability of linguistic knowledge from the text to the speech modality.
Our zero-shot evaluation results confirm the robustness of our approach across multiple tasks.
arXiv Detail & Related papers (2024-04-16T21:45:59Z) - How do Large Language Models Handle Multilingualism? [81.15060972112563]
This study explores how large language models (LLMs) handle multilingualism.
LLMs initially understand the query, converting multilingual inputs into English for task-solving.
In the intermediate layers, they employ English for thinking and incorporate multilingual knowledge with self-attention and feed-forward structures.
arXiv Detail & Related papers (2024-02-29T02:55:26Z) - MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues [58.33076950775072]
MT-Bench-101 is designed to evaluate the fine-grained abilities of Large Language Models (LLMs) in multi-turn dialogues.
We construct a three-tier hierarchical ability taxonomy comprising 4208 turns across 1388 multi-turn dialogues in 13 distinct tasks.
We then evaluate 21 popular LLMs based on MT-Bench-101, conducting comprehensive analyses from both ability and task perspectives.
arXiv Detail & Related papers (2024-02-22T18:21:59Z) - Multi3WOZ: A Multilingual, Multi-Domain, Multi-Parallel Dataset for Training and Evaluating Culturally Adapted Task-Oriented Dialog Systems [64.40789703661987]
Multi3WOZ is a novel multilingual, multi-domain, multi-parallel ToD dataset.
It is large-scale and offers culturally adapted dialogs in 4 languages.
We describe a complex bottom-up data collection process that yielded the final dataset.
arXiv Detail & Related papers (2023-07-26T08:29:42Z) - Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models [69.59125732317972]
We propose a simple yet effective Retrieving-to-Answer (R2A) framework for VideoQA.
R2A first retrieves a set of semantically similar texts from a generic text corpus using a pre-trained multi-modal model.
With both the question and the retrieved texts, an LLM can be used directly to yield the desired answer.
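The retrieve-then-answer pattern described here can be sketched generically: embed the question, rank corpus texts by similarity, and prepend the top matches to the LLM prompt. The sentence-embedding model and prompt wording below are stand-ins; R2A itself retrieves with a pre-trained multi-modal model over video, which is not reproduced here.

```python
# Retrieve-then-answer sketch: embed, rank by cosine similarity, prompt an LLM.
# The embedding model and prompt wording are stand-ins; R2A itself retrieves
# with a pre-trained multi-modal model over video, which is not reproduced here.
import numpy as np
from sentence_transformers import SentenceTransformer

corpus = [
    "A dog catches a frisbee in a park.",
    "A chef chops onions in a kitchen.",
    "Two people play chess outdoors.",
]
question = "What is the animal doing?"

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder text encoder
corpus_emb = encoder.encode(corpus, normalize_embeddings=True)
query_emb = encoder.encode([question], normalize_embeddings=True)

scores = corpus_emb @ query_emb.T        # cosine similarity (vectors are normalized)
top_k = np.argsort(-scores.ravel())[:2]  # keep the 2 most similar texts
retrieved = [corpus[i] for i in top_k]

prompt = (
    "Context:\n" + "\n".join(retrieved) +
    f"\n\nQuestion: {question}\nAnswer:"
)
# `prompt` would now be passed to a frozen LLM to produce the answer.
print(prompt)
```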
arXiv Detail & Related papers (2023-06-15T20:56:20Z) - Cue-CoT: Chain-of-thought Prompting for Responding to In-depth Dialogue Questions with LLMs [59.74002011562726]
We propose a novel linguistic cue-based chain-of-thought approach (Cue-CoT) to provide more personalized and engaging responses.
We build a benchmark with in-depth dialogue questions, consisting of 6 datasets in both Chinese and English.
Empirical results demonstrate that our proposed Cue-CoT method outperforms standard prompting methods in terms of both helpfulness and acceptability on all datasets.
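As summarized, Cue-CoT first infers linguistic cues about the user from the dialogue and then conditions the response on them. The sketch below shows that generic two-step pattern; the prompt wording and the ask_llm stub are assumptions, not the paper's templates.

```python
# Two-step cue-then-respond prompting sketch. ask_llm is a stand-in for whatever
# chat-completion client is available, and the prompt wording is illustrative;
# neither is taken from the Cue-CoT paper.

def ask_llm(prompt: str) -> str:
    """Stand-in for a chat-completion call; replace with a real LLM client."""
    return "(LLM output for: " + prompt[:40].replace("\n", " ") + "...)"

def cue_cot_respond(dialogue: str) -> str:
    # Step 1: infer user-side cues (e.g., emotion, personality, psychological status).
    cue_prompt = (
        "Read the dialogue and briefly describe the user's emotional state "
        f"and personality cues.\n\nDialogue:\n{dialogue}\n\nCues:"
    )
    cues = ask_llm(cue_prompt)
    # Step 2: generate the reply conditioned on the inferred cues.
    response_prompt = (
        f"Dialogue:\n{dialogue}\n\nInferred cues about the user:\n{cues}\n\n"
        "Write a helpful, personalized reply to the last user message:"
    )
    return ask_llm(response_prompt)

print(cue_cot_respond("User: I bombed my exam and I can't stop replaying it.\nAssistant:"))
```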
arXiv Detail & Related papers (2023-05-19T16:27:43Z) - Efficiently Aligned Cross-Lingual Transfer Learning for Conversational Tasks using Prompt-Tuning [98.60739735409243]
Cross-lingual transfer of language models trained on high-resource languages like English has been widely studied for many NLP tasks.
We introduce XSGD, a parallel and large-scale multilingual conversation dataset, for cross-lingual alignment pretraining.
To facilitate aligned cross-lingual representations, we develop an efficient prompt-tuning-based method for learning alignment prompts.
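Prompt tuning in the generic sense keeps the backbone frozen and learns only a short sequence of virtual prompt tokens. A minimal sketch with the Hugging Face peft library follows, with a placeholder backbone and token count rather than the paper's alignment-prompt configuration.

```python
# Prompt-tuning sketch with Hugging Face peft: the backbone stays frozen and
# only a handful of virtual prompt tokens are learned. The model name and the
# number of virtual tokens are placeholders, not the paper's configuration.
from transformers import AutoModelForCausalLM
from peft import PromptTuningConfig, TaskType, get_peft_model

backbone = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder backbone
cfg = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20,   # learned soft-prompt length (illustrative)
)
model = get_peft_model(backbone, cfg)
model.print_trainable_parameters()  # only the soft-prompt embeddings train
```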
arXiv Detail & Related papers (2023-04-03T18:46:01Z) - MULTI3NLU++: A Multilingual, Multi-Intent, Multi-Domain Dataset for Natural Language Understanding in Task-Oriented Dialogue [115.32009638844059]
We extend the English-only NLU++ dataset to include manual translations into a range of high-, medium-, and low-resource languages.
Because of its multi-intent property, MULTI3NLU++ represents complex and natural user goals.
We use MULTI3NLU++ to benchmark state-of-the-art multilingual models for the Natural Language Understanding tasks of intent detection and slot labelling.
arXiv Detail & Related papers (2022-12-20T17:34:25Z) - MFAQ: a Multilingual FAQ Dataset [9.625301186732598]
We present the first multilingual FAQ dataset publicly available.
We collected around 6M FAQ pairs from the web, in 21 different languages.
We adopt a setup similar to Dense Passage Retrieval (DPR) and test various bi-encoders on this dataset.
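A DPR-style bi-encoder, as referenced here, encodes questions and candidate answers with separate encoders and ranks them by dot product. The sketch below is a minimal stand-in using English NQ-trained DPR checkpoints as placeholders, whereas MFAQ trains multilingual bi-encoders on its own data.

```python
# Bi-encoder FAQ retrieval sketch in the spirit of DPR: questions and answers
# are embedded by separate encoders and matched by dot product. The English
# NQ-trained DPR checkpoints below are stand-ins, not the MFAQ models.
import torch
from transformers import (
    DPRContextEncoder, DPRContextEncoderTokenizer,
    DPRQuestionEncoder, DPRQuestionEncoderTokenizer,
)

q_tok = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_enc = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
c_tok = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
c_enc = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

faq_answers = [
    "You can reset your password from the account settings page.",
    "Shipping usually takes three to five business days.",
    "Refunds are issued to the original payment method within a week.",
]
question = "How long does delivery take?"

with torch.no_grad():
    q_emb = q_enc(**q_tok(question, return_tensors="pt")).pooler_output
    a_emb = c_enc(**c_tok(faq_answers, return_tensors="pt", padding=True)).pooler_output
    scores = (q_emb @ a_emb.T).squeeze(0)   # dot-product relevance, as in DPR

best = int(scores.argmax())
print(faq_answers[best])
```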
arXiv Detail & Related papers (2021-09-27T08:43:25Z)