Related papers: Evaluating LLMs on Generating Age-Appropriate Child-Like Conversations

Evaluating LLMs on Generating Age-Appropriate Child-Like Conversations

URL: http://arxiv.org/abs/2510.24250v1
Date: Tue, 28 Oct 2025 10:00:52 GMT
Title: Evaluating LLMs on Generating Age-Appropriate Child-Like Conversations
Authors: Syed Zohaib Hassan, Pål Halvorsen, Miriam S. Johnson, Pierre Lison,
Abstract summary: Large Language Models (LLMs) predominantly trained on adult conversational data, face challenges when generating authentic, child-like dialogue for specialized applications.<n>We present a comparative study evaluating five different LLMs to generate age-appropriate Norwegian conversations for children aged 5 and 9 years.
Score: 3.660458463669403
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Large Language Models (LLMs), predominantly trained on adult conversational data, face significant challenges when generating authentic, child-like dialogue for specialized applications. We present a comparative study evaluating five different LLMs (GPT-4, RUTER-LLAMA-2-13b, GPTSW, NorMistral-7b, and NorBloom-7b) to generate age-appropriate Norwegian conversations for children aged 5 and 9 years. Through a blind evaluation by eleven education professionals using both real child interview data and LLM-generated text samples, we assessed authenticity and developmental appropriateness. Our results show that evaluators achieved strong inter-rater reliability (ICC=0.75) and demonstrated higher accuracy in age prediction for younger children (5-year-olds) compared to older children (9-year-olds). While GPT-4 and NorBloom-7b performed relatively well, most models generated language perceived as more linguistically advanced than the target age groups. These findings highlight critical data-related challenges in developing LLM systems for specialized applications involving children, particularly in low-resource languages where comprehensive age-appropriate lexical resources are scarce.

Related papers

FEANEL: A Benchmark for Fine-Grained Error Analysis in K-12 English Writing [68.23874413455594]
We present the Fine-grained Error ANalysis for English learners (FEANEL) Benchmark.<n>The benchmark comprises 1,000 essays written by elementary and secondary school students.<n>Each error is annotated by language education experts and categorized by type, severity, and explanatory feedback, using a part-of-speech-based taxonomy they co-developed.
arXiv Detail & Related papers (2025-11-28T05:17:45Z)
Evaluating & Reducing Deceptive Dialogue From Language Models with Multi-turn RL [64.3268313484078]
Large Language Models (LLMs) interact with millions of people worldwide in applications such as customer support, education and healthcare.<n>Their ability to produce deceptive outputs, whether intentionally or inadvertently, poses significant safety concerns.<n>We investigate the extent to which LLMs engage in deception within dialogue, and propose the belief misalignment metric to quantify deception.
arXiv Detail & Related papers (2025-10-16T05:29:36Z)
MENLO: From Preferences to Proficiency - Evaluating and Modeling Native-like Quality Across 47 Languages [18.278876042011383]
We introduce MENLO, a framework that operationalizes the evaluation of native-like response quality based on audience-inspired mechanisms.<n>We create a dataset of 6,423 human-annotated prompt-response preference pairs covering four quality dimensions with high inter-annotator agreement in 47 language varieties.
arXiv Detail & Related papers (2025-09-30T17:48:58Z)
Benchmarking LLMs for Mimicking Child-Caregiver Language in Interaction [6.152274140650429]
LLMs can generate human-like dialogues, yet their ability to simulate early child-adult interactions remains largely unexplored.<n>We found that state-of-the-art LLMs can approximate child- caregiver dialogues at the word and utterance level, but they struggle to reproduce the child and caregiver's discursive patterns, exaggerate alignment, and fail to reach the level of diversity shown by humans.
arXiv Detail & Related papers (2024-12-12T14:43:03Z)
Can Large Language Models Make the Grade? An Empirical Study Evaluating LLMs Ability to Mark Short Answer Questions in K-12 Education [0.0]
This paper presents reports on a series of experiments with a novel dataset evaluating how well Large Language Models can mark (i.e. grade) open text responses to short answer questions. We found that GPT-4, with basic few-shot prompting performed well (Kappa, 0.70) and, importantly, very close to human-level performance (0.75) This research builds on prior findings that GPT-4 could reliably score short answer reading comprehension questions at a performance-level very close to that of expert human raters.
arXiv Detail & Related papers (2024-05-05T16:11:06Z)
LLaMA Beyond English: An Empirical Study on Language Capability Transfer [49.298360366468934]
We focus on how to effectively transfer the capabilities of language generation and following instructions to a non-English language. We analyze the impact of key factors such as vocabulary extension, further pretraining, and instruction tuning on transfer. We employ four widely used standardized testing benchmarks: C-Eval, MMLU, AGI-Eval, and GAOKAO-Bench.
arXiv Detail & Related papers (2024-01-02T06:29:02Z)
"Knowing When You Don't Know": A Multilingual Relevance Assessment Dataset for Robust Retrieval-Augmented Generation [90.09260023184932]
Retrieval-Augmented Generation (RAG) grounds Large Language Model (LLM) output by leveraging external knowledge sources to reduce factual hallucinations. NoMIRACL is a human-annotated dataset for evaluating LLM robustness in RAG across 18 typologically diverse languages. We measure relevance assessment using: (i) hallucination rate, measuring model tendency to hallucinate, when the answer is not present in passages in the non-relevant subset, and (ii) error rate, measuring model inaccuracy to recognize relevant passages in the relevant subset.
arXiv Detail & Related papers (2023-12-18T17:18:04Z)
Evaluating Neural Language Models as Cognitive Models of Language Acquisition [4.779196219827507]
We argue that some of the most prominent benchmarks for evaluating the syntactic capacities of neural language models may not be sufficiently rigorous. When trained on small-scale data modeling child language acquisition, the LMs can be readily matched by simple baseline models. We conclude with suggestions for better connecting LMs with the empirical study of child language acquisition.
arXiv Detail & Related papers (2023-10-31T00:16:17Z)
Benchmarking Large Language Models in Retrieval-Augmented Generation [53.504471079548]
We systematically investigate the impact of Retrieval-Augmented Generation on large language models. We analyze the performance of different large language models in 4 fundamental abilities required for RAG. We establish Retrieval-Augmented Generation Benchmark (RGB), a new corpus for RAG evaluation in both English and Chinese.
arXiv Detail & Related papers (2023-09-04T08:28:44Z)
L-Eval: Instituting Standardized Evaluation for Long Context Language Models [91.05820785008527]
We propose L-Eval to institute a more standardized evaluation for long context language models (LCLMs) We build a new evaluation suite containing 20 sub-tasks, 508 long documents, and over 2,000 human-labeled query-response pairs. Results show that popular n-gram matching metrics generally can not correlate well with human judgment.
arXiv Detail & Related papers (2023-07-20T17:59:41Z)
CARE-MI: Chinese Benchmark for Misinformation Evaluation in Maternity and Infant Care [14.326936563564171]
We present a benchmark, CARE-MI, for evaluating misinformation in large language models (LLMs) Our proposed benchmark fills the gap between the extensive usage of LLMs and the lack of datasets for assessing the misinformation generated by these models. Using our benchmark, we conduct extensive experiments and found that current Chinese LLMs are far from perfect in the topic of maternity and infant care.
arXiv Detail & Related papers (2023-07-04T03:34:19Z)
Can Large Language Models Be an Alternative to Human Evaluations? [80.81532239566992]
Large language models (LLMs) have demonstrated exceptional performance on unseen tasks when only the task instructions are provided. We show that the result of LLM evaluation is consistent with the results obtained by expert human evaluation.
arXiv Detail & Related papers (2023-05-03T07:28:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.