Beyond Modality Limitations: A Unified MLLM Approach to Automated Speaking Assessment with Effective Curriculum Learning
- URL: http://arxiv.org/abs/2508.12591v1
- Date: Mon, 18 Aug 2025 02:57:43 GMT
- Title: Beyond Modality Limitations: A Unified MLLM Approach to Automated Speaking Assessment with Effective Curriculum Learning
- Authors: Yu-Hsuan Fang, Tien-Hong Lo, Yao-Ting Sung, Berlin Chen
- Abstract summary: Multimodal Large Language Models (MLLMs) offer unprecedented opportunities for comprehensive Automated Speaking Assessment (ASA). We propose Speech-First Multimodal Training (SFMT) to establish more robust modeling foundations of speech before cross-modal synergetic fusion. In particular, SFMT excels in the evaluation of the delivery aspect, achieving an absolute accuracy improvement of 4% over conventional training approaches.
- Score: 5.148672971653068
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Traditional Automated Speaking Assessment (ASA) systems exhibit inherent modality limitations: text-based approaches lack acoustic information, while audio-based methods miss semantic context. Multimodal Large Language Models (MLLMs) offer unprecedented opportunities for comprehensive ASA by simultaneously processing audio and text within unified frameworks. This paper presents the first systematic study of MLLMs for comprehensive ASA, demonstrating their superior performance on the content and language-use aspects. However, assessment of the delivery aspect reveals unique challenges that call for specialized training strategies. We thus propose Speech-First Multimodal Training (SFMT), which leverages a curriculum learning principle to establish more robust speech modeling foundations before cross-modal synergetic fusion. A series of experiments on a benchmark dataset shows that MLLM-based systems can elevate holistic assessment performance from a PCC value of 0.783 to 0.846. In particular, SFMT excels at evaluating the delivery aspect, achieving an absolute accuracy improvement of 4% over conventional training approaches, which also paves a new avenue for ASA.
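The holistic scores in the abstract are compared via the Pearson correlation coefficient (PCC). As a minimal, hypothetical sketch (the paper's actual evaluation pipeline is not shown here, and the score values below are invented for illustration), PCC between system predictions and human ratings can be computed in pure Python as:

```python
import math

def pearson_cc(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # Covariance numerator and the two standard-deviation factors.
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical example: human holistic scores vs. model predictions.
human = [3.0, 4.0, 2.0, 5.0, 3.5]
model = [3.2, 3.8, 2.1, 4.9, 3.4]
print(round(pearson_cc(human, model), 3))
```

A PCC of 1.0 indicates perfect linear agreement with the human raters; the paper reports an improvement from 0.783 to 0.846 on this scale.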
Related papers
- Lost in Modality: Evaluating the Effectiveness of Text-Based Membership Inference Attacks on Large Multimodal Models [3.9448289587779404]
Logit-based membership inference attacks (MIAs) have become a widely adopted approach for assessing data exposure in large language models (LLMs). We present the first comprehensive evaluation of extending these text-based MIA methods to multimodal settings.
arXiv Detail & Related papers (2025-12-02T14:11:51Z) - Can LLMs Reason Over Non-Text Modalities in a Training-Free Manner? A Case Study with In-Context Representation Learning [10.887230359930697]
We propose In-Context Representation Learning (ICRL) as a proof-of-concept to allow LLMs to adaptively utilize non-text modality representations with few-shot learning. Unlike traditional in-context learning, ICRL replaces text inputs with FM representations, enabling the LLM to perform multi-modal inference without fine-tuning.
arXiv Detail & Related papers (2025-09-22T09:16:34Z) - Discrete Tokenization for Multimodal LLMs: A Comprehensive Survey [69.45421620616486]
This work presents the first structured taxonomy and analysis of discrete tokenization methods designed for large language models (LLMs). We categorize 8 representative VQ variants that span classical and modern paradigms and analyze their algorithmic principles, training dynamics, and integration challenges with LLM pipelines. We identify key challenges, including codebook collapse, unstable gradient estimation, and modality-specific encoding constraints.
arXiv Detail & Related papers (2025-07-21T10:52:14Z) - MLLM-CL: Continual Learning for Multimodal Large Language Models [62.90736445575181]
We introduce MLLM-CL, a novel benchmark encompassing domain and ability continual learning. Our approach can integrate domain-specific knowledge and functional abilities with minimal forgetting, significantly outperforming existing methods.
arXiv Detail & Related papers (2025-06-05T17:58:13Z) - From Course to Skill: Evaluating LLM Performance in Curricular Analytics [2.5104969073405976]
Large language models (LLMs) are promising for handling large-scale, unstructured curriculum data. We systematically evaluate four text alignment strategies based on LLMs or traditional NLP methods for skill extraction. Our findings highlight the promise of LLMs in analyzing brief and abstract curriculum documents, but also reveal that their performance can vary significantly depending on model selection and prompting strategies.
arXiv Detail & Related papers (2025-05-05T02:46:23Z) - Evaluating LLM-based Agents for Multi-Turn Conversations: A Survey [64.08485471150486]
This survey examines evaluation methods for large language model (LLM)-based agents in multi-turn conversational settings. We systematically reviewed nearly 250 scholarly sources, capturing the state of the art from various venues of publication.
arXiv Detail & Related papers (2025-03-28T14:08:40Z) - Will Pre-Training Ever End? A First Step Toward Next-Generation Foundation MLLMs via Self-Improving Systematic Cognition [89.50068130832635]
Self-Improving Cognition (SIcog) is a self-learning framework for constructing next-generation foundation MLLMs via multimodal knowledge. We propose Chain-of-Description for step-by-step visual understanding and integrate structured Chain-of-Thought (CoT) reasoning to support in-depth multimodal reasoning. Experiments demonstrate SIcog's effectiveness in developing MLLMs with enhanced multimodal cognition.
arXiv Detail & Related papers (2025-03-16T00:25:13Z) - MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs [97.94579295913606]
Multimodal Large Language Models (MLLMs) have garnered increased attention from both industry and academia. In the development process, evaluation is critical since it provides intuitive feedback and guidance on improving models. This work aims to offer researchers an easy grasp of how to effectively evaluate MLLMs according to different needs and to inspire better evaluation methods.
arXiv Detail & Related papers (2024-11-22T18:59:54Z) - Align-SLM: Textless Spoken Language Models with Reinforcement Learning from AI Feedback [50.84142264245052]
This work introduces the Align-SLM framework to enhance the semantic understanding of textless Spoken Language Models (SLMs). Our approach generates multiple speech continuations from a given prompt and uses semantic metrics to create preference data for Direct Preference Optimization (DPO). We evaluate the framework using ZeroSpeech 2021 benchmarks for lexical and syntactic modeling, the spoken version of the StoryCloze dataset for semantic coherence, and other speech generation metrics, including the GPT4-o score and human evaluation.
arXiv Detail & Related papers (2024-11-04T06:07:53Z) - RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, which shows their emerging potential as general-purpose models for various vision-language tasks.
Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs.
In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z) - Evaluating Linguistic Capabilities of Multimodal LLMs in the Lens of Few-Shot Learning [15.919493497867567]
This study aims to evaluate the performance of Multimodal Large Language Models (MLLMs) on the VALSE benchmark.
We conducted a comprehensive assessment of state-of-the-art MLLMs, varying in model size and pretraining datasets.
arXiv Detail & Related papers (2024-07-17T11:26:47Z) - MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria [49.500322937449326]
Multimodal large language models (MLLMs) have broadened the scope of AI applications.
Existing automatic evaluation methodologies for MLLMs are mainly limited to evaluating queries without considering user experiences.
We propose a new evaluation paradigm for MLLMs: evaluating MLLMs with per-sample criteria, using a potent MLLM as the judge.
arXiv Detail & Related papers (2023-11-23T12:04:25Z) - End-to-End Speech Recognition Contextualization with Large Language Models [25.198480789044346]
We introduce a novel method for contextualizing speech recognition models by incorporating Large Language Models (LLMs).
We provide audio features, along with optional text tokens for context, to train the system to complete transcriptions in a decoder-only fashion.
Our empirical results demonstrate a significant improvement in performance, with a 6% WER reduction when additional textual context is provided.
arXiv Detail & Related papers (2023-09-19T20:28:57Z)
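The 6% WER reduction cited above refers to word error rate, the standard speech recognition metric. As a hedged illustration (this is textbook WER via word-level edit distance, not code from the paper), it can be computed with dynamic programming:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Hypothetical example: one dropped word out of six references.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

A lower WER is better; an "absolute 6% reduction" means the error rate itself drops by six percentage points when textual context is supplied.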
This list is automatically generated from the titles and abstracts of the papers in this site.