Beyond Length: Context-Aware Expansion and Independence as Developmentally Sensitive Evaluation in Child Utterances
- URL: http://arxiv.org/abs/2602.05392v1
- Date: Thu, 05 Feb 2026 07:19:04 GMT
- Title: Beyond Length: Context-Aware Expansion and Independence as Developmentally Sensitive Evaluation in Child Utterances
- Authors: Jiyun Chun, Eric Fosler-Lussier, Michael White, Andrew Perrault,
- Abstract summary: evaluating the quality of children's utterances in adult-child dialogue remains challenging due to insufficient context-sensitive metrics.<n>Common proxies such as Mean Length of Utterance (MLU), lexical diversity (vocd-D), and readability indices (Flesch-Kincaid Grade Level, Gunning Fog Index) are dominated by length and ignore conversational context.<n>We introduce an LLM-as-a-judge framework that first classifies the Previous Adult Utterance Type and then scores the child's response along two axes: Expansion (contextual elaboration and inferential depth) and Independence (the child's
- Score: 17.781201178950973
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Evaluating the quality of children's utterances in adult-child dialogue remains challenging due to insufficient context-sensitive metrics. Common proxies such as Mean Length of Utterance (MLU), lexical diversity (vocd-D), and readability indices (Flesch-Kincaid Grade Level, Gunning Fog Index) are dominated by length and ignore conversational context, missing aspects of response quality such as reasoning depth, topic maintenance, and discourse planning. We introduce an LLM-as-a-judge framework that first classifies the Previous Adult Utterance Type and then scores the child's response along two axes: Expansion (contextual elaboration and inferential depth) and Independence (the child's contribution to advancing the discourse). These axes reflect fundamental dimensions in child language development, where Expansion captures elaboration, clause combining, and causal and contrastive connectives. Independence captures initiative, topic control, decreasing reliance on adult scaffolding through growing self-regulation, and audience design. We establish developmental validity by showing age-related patterns and demonstrate predictive value by improving age estimation over common baselines. We further confirm semantic sensitivity by detecting differences tied to discourse relations. Our metrics align with human judgments, enabling large-scale evaluation. This shifts child utterance assessment from simply measuring length to evaluating how meaningfully the child's speech contributes to and advances the conversation within its context.
Related papers
- Can large audio language models understand child stuttering speech? speech summarization, and source separation [3.2684800403907506]
Child speech differs from adult speech in acoustics, prosody, and language development, and disfluencies (repetitions, prolongations, blocks)<n>Recent large audio-language models (LALMs) demonstrate strong cross-modal audio understanding.<n>We evaluate several state-of-the-art LALMs in two settings: an interview (mixed speakers) and a reading task (single child)
arXiv Detail & Related papers (2025-10-21T18:53:34Z) - Evaluating the Sensitivity of LLMs to Harmful Contents in Long Input [53.19281984086319]
Large language models (LLMs) increasingly support applications that rely on extended context, from document processing to retrieval-augmented generation.<n>We evaluate LLMs' sensitivity to harmful content under extended context, varying type (explicit vs. implicit), position (beginning, middle, end), prevalence (0.01-0.50 of the prompt), and context length (600-6000 tokens).<n>We observe similar patterns: performance peaks at moderate harmful prevalence (0.25) but declines when content is very sparse or dominant; recall decreases with increasing context length; harmful sentences at the beginning are generally detected more reliably; and explicit content is more consistently recognized than implicit
arXiv Detail & Related papers (2025-10-07T12:33:21Z) - SpeechRole: A Large-Scale Dataset and Benchmark for Evaluating Speech Role-Playing Agents [72.79816494079833]
Role-playing agents have emerged as a promising paradigm for achieving personalized interaction and emotional resonance.<n>Existing research primarily focuses on the textual modality, neglecting the critical dimension of speech in realistic interactive scenarios.<n>We construct SpeechRole-Data, a large-scale, high-quality dataset that comprises 98 diverse roles and 112k speech-based single-turn and multi-turn conversations.
arXiv Detail & Related papers (2025-08-04T03:18:36Z) - What Makes a Good Natural Language Prompt? [72.3282960118995]
We conduct a meta-analysis surveying more than 150 prompting-related papers from leading NLP and AI conferences from 2022 to 2025.<n>We propose a property- and human-centric framework for evaluating prompt quality, encompassing 21 properties categorized into six dimensions.<n>We then empirically explore multi-property prompt enhancements in reasoning tasks, observing that single-property enhancements often have the greatest impact.
arXiv Detail & Related papers (2025-06-07T23:19:27Z) - Quantitative Assessment of Intersectional Empathetic Bias and Understanding [0.0]
A growing amount of literature critiques the current operationalizations of empathy based on loose definitions of the construct.
We propose an empathy evaluation framework that operationalizes empathy close to its psychological origins.
arXiv Detail & Related papers (2024-11-08T18:43:15Z) - Can LLMs Understand the Implication of Emphasized Sentences in Dialogue? [64.72966061510375]
Emphasis is a crucial component in human communication, which indicates the speaker's intention and implication beyond pure text in dialogue.
This paper introduces Emphasized-Talk, a benchmark with emphasis-annotated dialogue samples capturing the implications of emphasis.
We evaluate various Large Language Models (LLMs), both open-source and commercial, to measure their performance in understanding emphasis.
arXiv Detail & Related papers (2024-06-16T20:41:44Z) - Context Does Matter: Implications for Crowdsourced Evaluation Labels in Task-Oriented Dialogue Systems [57.16442740983528]
Crowdsourced labels play a crucial role in evaluating task-oriented dialogue systems.
Previous studies suggest using only a portion of the dialogue context in the annotation process.
This study investigates the influence of dialogue context on annotation quality.
arXiv Detail & Related papers (2024-04-15T17:56:39Z) - Natural Language Decompositions of Implicit Content Enable Better Text Representations [52.992875653864076]
We introduce a method for the analysis of text that takes implicitly communicated content explicitly into account.<n>We use a large language model to produce sets of propositions that are inferentially related to the text that has been observed.<n>Our results suggest that modeling the meanings behind observed language, rather than the literal text alone, is a valuable direction for NLP.
arXiv Detail & Related papers (2023-05-23T23:45:20Z) - How Adults Understand What Young Children Say [1.416276307599112]
Children's early speech often bears little resemblance to adult speech in form or content, and yet caregivers often find meaning in young children's utterances.
We propose that successful early communication relies not just on children's growing linguistic knowledge, but also on adults' sophisticated inferences.
arXiv Detail & Related papers (2022-06-15T20:37:32Z) - Deep Learning for Prominence Detection in Children's Read Speech [13.041607703862724]
We consider a labeled dataset of children's reading recordings for the speaker-independent detection of prominent words.
A previous well-tuned random forest ensemble predictor is replaced by an RNN sequence to exploit potential context dependency.
Deep learning is applied to obtain word-level features from low-level acoustic contours of fundamental frequency, intensity and spectral shape.
arXiv Detail & Related papers (2021-04-12T14:15:08Z) - Child-directed Listening: How Caregiver Inference Enables Children's
Early Verbal Communication [2.9331097393290837]
We employ a suite of Bayesian models of spoken word recognition to understand how adults overcome the noisiness of child language.
By evaluating competing models on phonetically-annotated corpora, we show that adults' recovered meanings are best predicted by prior expectations fitted specifically to the child language environment.
arXiv Detail & Related papers (2021-02-06T00:54:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.