Lost in Speech: Benchmarking, Evaluation, and Parsing of Spoken Code-Switching Beyond Standard UD Assumptions
- URL: http://arxiv.org/abs/2602.06307v1
- Date: Fri, 06 Feb 2026 02:02:07 GMT
- Title: Lost in Speech: Benchmarking, Evaluation, and Parsing of Spoken Code-Switching Beyond Standard UD Assumptions
- Authors: Nemika Tyagi, Holly Hendrix, Nelvin Licona-Guevara, Justin Mackie, Phanos Kareen, Muhammad Imran, Megan Michelle Smith, Tatiana Gallego Hernande, Chitta Baral, Olga Kellert
- Abstract summary: Spoken code-switching (CSW) challenges syntactic parsing in ways not observed in written text. Disfluencies, repetition, ellipsis, and discourse-driven structure routinely violate standard Universal Dependencies (UD) assumptions.
- Score: 23.2725831877861
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Spoken code-switching (CSW) challenges syntactic parsing in ways not observed in written text. Disfluencies, repetition, ellipsis, and discourse-driven structure routinely violate standard Universal Dependencies (UD) assumptions, causing parsers and large language models (LLMs) to fail despite strong performance on written data. These failures are compounded by rigid evaluation metrics that conflate genuine structural errors with acceptable variation. In this work, we present a systems-oriented approach to spoken CSW parsing. We introduce a linguistically grounded taxonomy of spoken CSW phenomena and SpokeBench, an expert-annotated gold benchmark designed to test spoken-language structure beyond standard UD assumptions. We further propose FLEX-UD, an ambiguity-aware evaluation metric, which reveals that existing parsing techniques perform poorly on spoken CSW by penalizing linguistically plausible analyses as errors. We then propose DECAP, a decoupled agentic parsing framework that isolates spoken-phenomena handling from core syntactic analysis. Experiments show that DECAP produces more robust and interpretable parses without retraining and achieves up to 52.6% improvements over existing parsing techniques. FLEX-UD evaluations further reveal qualitative improvements that are masked by standard metrics.
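The abstract does not spell out how FLEX-UD scores ambiguous attachments, but the core idea it describes (accepting any of several linguistically plausible analyses rather than penalizing them as errors) can be illustrated with a minimal sketch. The annotation format, function names, and relation labels below are assumptions for illustration only, not the paper's actual FLEX-UD definition:

```python
from typing import Dict, List, Tuple

# Hypothetical annotation format (an assumption, not the paper's FLEX-UD spec):
# each token maps to a list of (head, deprel) arcs, where the first entry is the
# canonical analysis and any further entries are alternatives that annotators
# judged linguistically plausible for spoken code-switched material.
Arc = Tuple[int, str]
GoldArcs = Dict[int, List[Arc]]
PredArcs = Dict[int, Arc]

def strict_las(gold: GoldArcs, pred: PredArcs) -> float:
    """Standard LAS-style score: only the canonical arc counts as correct."""
    correct = sum(pred.get(tok) == arcs[0] for tok, arcs in gold.items())
    return correct / len(gold)

def flexible_las(gold: GoldArcs, pred: PredArcs) -> float:
    """Ambiguity-aware score: a predicted arc is correct if it matches any
    analysis listed as plausible for that token."""
    correct = sum(pred.get(tok) in arcs for tok, arcs in gold.items())
    return correct / len(gold)

if __name__ == "__main__":
    # Token 3 stands in for an element with two defensible attachments,
    # e.g. a discourse marker in a code-switched utterance; tokens 1-2 are unambiguous.
    gold = {
        1: [(2, "nsubj")],
        2: [(0, "root")],
        3: [(2, "discourse"), (2, "advmod")],
    }
    pred = {1: (2, "nsubj"), 2: (0, "root"), 3: (2, "advmod")}
    print(f"strict:   {strict_las(gold, pred):.2f}")   # 0.67: plausible arc counted as error
    print(f"flexible: {flexible_las(gold, pred):.2f}")  # 1.00: alternative accepted
```

Under the strict score, the parser's defensible choice for token 3 is penalized as an error; the ambiguity-aware variant accepts it. This is the kind of discrepancy the paper argues is masked by rigid metrics.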
Related papers
- AQAScore: Evaluating Semantic Alignment in Text-to-Audio Generation via Audio Question Answering [97.52852990265136]
We introduce AQAScore, a backbone-agnostic evaluation framework that leverages the reasoning capabilities of audio-aware large language models. We evaluate AQAScore across multiple benchmarks, including human-rated relevance, pairwise comparison, and compositional reasoning tasks.
arXiv Detail & Related papers (2026-01-21T07:35:36Z) - Do LLMs Know They Are Being Tested? Evaluation Awareness and Incentive-Sensitive Failures in GPT-OSS-20B [1.948261185683419]
We investigate whether "evaluation scent" inflates measured performance without commensurate capability gains. We run six paired A/B scenarios that hold task content and decoding fixed while varying framing. We provide a reproducible A/B framework (prompt banks, validators, per-run scores, scripts) and practical guidance.
arXiv Detail & Related papers (2025-10-08T09:49:05Z) - Test Case Generation from Bug Reports via Large Language Models: A Cognitive Layered Evaluation Framework [10.919459368597295]
We present a systematic evaluation of Large Language Model (LLM) reasoning in test case generation. We evaluate StarCoder and GPT-4o on Defects4J, GHRB, and mutated variants that introduce linguistic and semantic challenges.
arXiv Detail & Related papers (2025-10-06T20:47:12Z) - Speech-IFEval: Evaluating Instruction-Following and Quantifying Catastrophic Forgetting in Speech-Aware Language Models [49.1574468325115]
We introduce Speech-IFeval, an evaluation framework designed to assess instruction-following capabilities. Recent SLMs integrate speech perception with large language models (LLMs), often degrading textual capabilities due to speech-centric training. Our findings show that most SLMs struggle with even basic instructions, performing far worse than text-based LLMs.
arXiv Detail & Related papers (2025-05-25T08:37:55Z) - LLM-Based Evaluation of Low-Resource Machine Translation: A Reference-less Dialect Guided Approach with a Refined Sylheti-English Benchmark [1.3927943269211591]
We propose a comprehensive framework that enhances Large Language Model (LLM)-based machine translation evaluation. We extend the ONUBAD dataset by incorporating Sylheti-English sentence pairs, corresponding machine translations, and Direct Assessment (DA) scores annotated by native speakers. Our evaluation shows that the proposed pipeline consistently outperforms existing methods, achieving the highest gain of +0.1083 in Spearman correlation (a minimal sketch of this correlation computation appears after the related-papers list).
arXiv Detail & Related papers (2025-05-18T07:24:13Z) - Say It Another Way: Auditing LLMs with a User-Grounded Automated Paraphrasing Framework [17.91981142492207]
We introduce AUGMENT, a framework for generating controlled paraphrases grounded in user behaviors. AUGMENT leverages linguistically informed rules and enforces quality through checks on instruction adherence, semantic similarity, and realism. Case studies show that controlled paraphrases uncover systematic weaknesses that remain obscured under unconstrained variation.
arXiv Detail & Related papers (2025-05-06T14:17:30Z) - What is lost in Normalization? Exploring Pitfalls in Multilingual ASR Model Evaluations [0.0]
We investigate the text normalization routine employed by leading ASR models, including OpenAI Whisper, Meta's MMS, Seamless, and Assembly AI's Conformer.
Our research reveals that current text normalization practices, while aiming to standardize ASR outputs for fair comparison, are fundamentally flawed when applied to Indic scripts.
We propose a shift towards developing text normalization routines that leverage native linguistic expertise.
arXiv Detail & Related papers (2024-09-04T05:08:23Z) - Machine Translation Meta Evaluation through Translation Accuracy Challenge Sets [92.38654521870444]
We introduce ACES, a contrastive challenge set spanning 146 language pairs.
This dataset aims to discover whether metrics can identify 68 categories of translation accuracy errors.
We conduct a large-scale study by benchmarking ACES on 50 metrics submitted to the WMT 2022 and 2023 metrics shared tasks.
arXiv Detail & Related papers (2024-01-29T17:17:42Z) - HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction.
The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses.
With a reasonable prompt, an LLM's generative capability can even correct tokens that are missing from the N-best list.
arXiv Detail & Related papers (2023-09-27T14:44:10Z) - On the Blind Spots of Model-Based Evaluation Metrics for Text Generation [79.01422521024834]
We explore a useful but often neglected methodology for robustness analysis of text generation evaluation metrics.
We design and synthesize a wide range of potential errors and check whether they result in a commensurate drop in the metric scores.
Our experiments reveal interesting insensitivities, biases, or even loopholes in existing metrics.
arXiv Detail & Related papers (2022-12-20T06:24:25Z) - Not another Negation Benchmark: The NaN-NLI Test Suite for Sub-clausal Negation [59.307534363825816]
Negation is poorly captured by current language models, although the extent of this problem is not widely understood.
We introduce a natural language inference (NLI) test suite to enable probing the capabilities of NLP methods.
arXiv Detail & Related papers (2022-10-06T23:39:01Z) - Pareto Probing: Trading Off Accuracy for Complexity [87.09294772742737]
We argue for a probe metric that reflects the fundamental trade-off between probe complexity and performance.
Our experiments with dependency parsing reveal a wide gap in syntactic knowledge between contextual and non-contextual representations.
arXiv Detail & Related papers (2020-10-05T17:27:31Z)
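The Sylheti-English entry above reports its result as a gain in Spearman correlation against human Direct Assessment (DA) scores. As a point of reference, the sketch below shows how such a segment-level correlation is typically computed; the score lists are made-up placeholders, not data from that paper:

```python
# Correlating automatic metric scores with human Direct Assessment (DA) scores.
# The two score lists below are hypothetical placeholders for illustration only.
from scipy.stats import spearmanr

human_da_scores = [72.0, 45.5, 88.0, 63.0, 91.5, 30.0]  # native-speaker DA ratings
metric_scores   = [0.71, 0.40, 0.85, 0.58, 0.90, 0.35]  # evaluator outputs per segment

rho, p_value = spearmanr(human_da_scores, metric_scores)
print(f"Spearman rho = {rho:.4f} (p = {p_value:.4g})")
# A higher rho means the metric ranks translations more like the human judges;
# a reported "+0.1083 gain" is an increase in this rho over a baseline evaluator.
```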