Related papers: CS-Sum: A Benchmark for Code-Switching Dialogue Summarization and the Limits of Large Language Models

URL: http://arxiv.org/abs/2505.13559v1
Date: Mon, 19 May 2025 09:18:14 GMT
Title: CS-Sum: A Benchmark for Code-Switching Dialogue Summarization and the Limits of Large Language Models
Authors: Sathya Krishnan Suresh, Tanmay Surana, Lim Zhi Hao, Eng Siong Chng
Abstract summary: Code-switching (CS) poses a significant challenge for Large Language Models (LLMs). We introduce CS-Sum to evaluate the comprehensibility of CS by LLMs through CS dialogue to English summarization. CS-Sum is the first benchmark for CS dialogue summarization across Mandarin-English, Tamil-English, and Malay-English.
Score: 18.378069426713
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Code-switching (CS) poses a significant challenge for Large Language Models (LLMs), yet its comprehensibility remains underexplored in LLMs. We introduce CS-Sum to evaluate the comprehensibility of CS by LLMs through CS dialogue to English summarization. CS-Sum is the first benchmark for CS dialogue summarization across Mandarin-English (EN-ZH), Tamil-English (EN-TA), and Malay-English (EN-MS), with 900-1300 human-annotated dialogues per language pair. Evaluating ten LLMs, including open and closed-source models, we analyze performance across few-shot, translate-summarize, and fine-tuning (LoRA, QLoRA on synthetic data) approaches. Our findings show that although the scores on automated metrics are high, LLMs make subtle mistakes that alter the complete meaning of the dialogue. To this end, we introduce the three most common types of errors that LLMs make when handling CS input. Error rates vary across CS pairs and LLMs, with some LLMs showing more frequent errors on certain language pairs, underscoring the need for specialized training on code-switched data.

Related papers

OLA: Output Language Alignment in Code-Switched LLM Interactions [31.119553472916234]
Code-switching, alternating between languages within a conversation, is natural for multilingual users, yet poses fundamental challenges for large language models. We introduce OLA, a benchmark to evaluate LLMs' Output Language Alignment in code-switched interactions. OLA focuses on Korean-English code-switching and spans simple intra-sentential mixing to instruction-content mismatches.
arXiv Detail & Related papers (2026-01-07T05:07:22Z)
Beyond Monolingual Assumptions: A Survey of Code-Switched NLP in the Era of Large Language Models [1.175067374181304]
Code-switching, the alternation of languages and scripts within a single utterance, remains a fundamental challenge for multilingual NLP. Most large language models (LLMs) struggle with mixed-language inputs, limited CSW datasets, and evaluation biases. This survey provides the first comprehensive analysis of CSW-aware LLM research.
arXiv Detail & Related papers (2025-10-08T14:04:14Z)
PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts [79.84059473102778]
PolyMath is a multilingual mathematical reasoning benchmark covering 18 languages and 4 easy-to-hard difficulty levels. Our benchmark ensures difficulty comprehensiveness, language diversity, and high-quality translation.
arXiv Detail & Related papers (2025-04-25T15:39:04Z)
Regional Tiny Stories: Using Small Models to Compare Language Learning and Tokenizer Performance [1.1784026260358966]
We focus on Hindi, Marathi, and Bengali, evaluating SLMs for regional language processing and understanding linguistic complexity. Our analysis shows that language-specific tokenizers outperform general-purpose ones for Indian languages. These findings advance both the practical application of SLMs to underserved languages and our theoretical understanding of neural language development.
arXiv Detail & Related papers (2025-04-07T10:33:14Z)
Probing LLMs for Multilingual Discourse Generalization Through a Unified Label Set [28.592959007943538]
This work investigates whether large language models (LLMs) capture discourse knowledge that generalizes across languages and frameworks. Using multilingual discourse relation classification as a testbed, we examine a comprehensive set of 23 LLMs of varying sizes and multilingual capabilities. Our results show that LLMs, especially those with multilingual training corpora, can generalize discourse information across languages and frameworks.
arXiv Detail & Related papers (2025-03-13T16:20:25Z)
Disparities in LLM Reasoning Accuracy and Explanations: A Case Study on African American English [66.97110551643722]
We investigate dialectal disparities in the reasoning tasks of Large Language Models (LLMs). We find that LLMs produce less accurate responses and simpler reasoning chains and explanations for AAE inputs. These findings highlight systematic differences in how LLMs process and reason about different language varieties.
arXiv Detail & Related papers (2025-03-06T05:15:34Z)
Think Carefully and Check Again! Meta-Generation Unlocking LLMs for Low-Resource Cross-Lingual Summarization [108.6908427615402]
Cross-lingual summarization (CLS) aims to generate a summary for the source text in a different target language. Currently, instruction-tuned large language models (LLMs) excel at various English tasks. Recent studies have shown that LLMs' performance on CLS tasks remains unsatisfactory even with few-shot settings.
arXiv Detail & Related papers (2024-10-26T00:39:44Z)
Understanding and Mitigating Language Confusion in LLMs [76.96033035093204]
We evaluate 15 typologically diverse languages with existing and newly-created English and multilingual prompts. We find that Llama Instruct and Mistral models exhibit high degrees of language confusion. We find that language confusion can be partially mitigated via few-shot prompting, multilingual SFT and preference tuning.
arXiv Detail & Related papers (2024-06-28T17:03:51Z)
Code-Switching Red-Teaming: LLM Evaluation for Safety and Multilingual Understanding [10.154013836043816]
Code-switching in red-teaming queries can effectively elicit undesirable behaviors of large language models (LLMs). We introduce a simple yet effective framework, CSRT, to synthesize code-switching red-teaming queries. We demonstrate that CSRT significantly outperforms existing multilingual red-teaming techniques.
arXiv Detail & Related papers (2024-06-17T06:08:18Z)
CIF-Bench: A Chinese Instruction-Following Benchmark for Evaluating the Generalizability of Large Language Models [53.9835961434552]
We introduce the Chinese Instruction-Following Benchmark (CIF-Bench) to evaluate the generalizability of large language models (LLMs) to the Chinese language. CIF-Bench comprises 150 tasks and 15,000 input-output pairs, developed by native speakers to test complex reasoning and Chinese cultural nuances. To mitigate data contamination, we release only half of the dataset publicly, with the remainder kept private, and introduce diversified instructions to minimize score variance.
arXiv Detail & Related papers (2024-02-20T16:02:12Z)
Machine Translation with Large Language Models: Prompt Engineering for Persian, English, and Russian Directions [0.0]
Generative large language models (LLMs) have demonstrated exceptional proficiency in various natural language processing (NLP) tasks. We conducted an investigation into two popular prompting methods and their combination, focusing on cross-language combinations of Persian, English, and Russian.
arXiv Detail & Related papers (2024-01-16T15:16:34Z)
Exploring the Factual Consistency in Dialogue Comprehension of Large Language Models [51.75805497456226]
This work focuses on the factual consistency issue with the help of the dialogue summarization task. Our evaluation shows that, on average, 26.8% of the summaries generated by LLMs contain factual inconsistency. To stimulate and enhance the dialogue comprehension ability of LLMs, we propose a fine-tuning paradigm with auto-constructed multi-task data.
arXiv Detail & Related papers (2023-11-13T09:32:12Z)

This list is automatically generated from the titles and abstracts of the papers on this site.