Scalable Frameworks for Real-World Audio-Visual Speech Recognition
- URL: http://arxiv.org/abs/2512.14083v1
- Date: Tue, 16 Dec 2025 04:50:13 GMT
- Title: Scalable Frameworks for Real-World Audio-Visual Speech Recognition
- Authors: Sungnyun Kim
- Abstract summary: By systematically providing solutions at the representation, architecture, and system levels, this dissertation aims to build a next-generation, robust, and scalable AVSR system with high reliability in real-world applications.
- Score: 9.825127075279822
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The practical deployment of Audio-Visual Speech Recognition (AVSR) systems is fundamentally challenged by significant performance degradation in real-world environments, characterized by unpredictable acoustic noise and visual interference. This dissertation posits that a systematic, hierarchical approach is essential to overcome these challenges, achieving robust scalability at the representation, architecture, and system levels. At the representation level, we investigate methods for building a unified model that learns audio-visual features inherently robust to diverse real-world corruptions, thereby enabling generalization to new environments without specialized modules. To address architectural scalability, we explore how to efficiently expand model capacity while ensuring the adaptive and reliable use of multimodal inputs, developing a framework that intelligently allocates computational resources based on the input characteristics. Finally, at the system level, we present methods to expand the system's functionality through modular integration with large-scale foundation models, leveraging their powerful cognitive and generative capabilities to maximize final recognition accuracy. By systematically providing solutions at each of these three levels, this dissertation aims to build a next-generation, robust, and scalable AVSR system with high reliability in real-world applications.
Related papers
- Continual learning and refinement of causal models through dynamic predicate invention [0.6198237241838559]
We propose a framework for constructing symbolic causal world models entirely online. We leverage the power of Meta-Interpretive Learning and predicate invention to find semantically meaningful and reusable abstractions.
arXiv Detail & Related papers (2026-02-19T10:08:31Z)
- Forging Spatial Intelligence: A Roadmap of Multi-Modal Data Pre-Training for Autonomous Systems [75.78934957242403]
Self-driving vehicles and drones require true Spatial Intelligence from multi-modal onboard sensor data. This paper presents a framework for multi-modal pre-training, identifying the core set of techniques driving progress toward this goal.
arXiv Detail & Related papers (2025-12-30T17:58:01Z)
- Multimodal Interpretation of Remote Sensing Images: Dynamic Resolution Input Strategy and Multi-scale Vision-Language Alignment Mechanism [10.17375002962432]
This study proposes a Vision-language Model (VLM) framework integrated with two key innovations. The DRIS adopts a coarse-to-fine approach to adaptively allocate computational resources according to the complexity of image content. The MS-VLAM constructs a three-tier alignment mechanism covering object, local-region and global levels.
arXiv Detail & Related papers (2025-12-29T06:51:20Z)
- From Word to World: Can Large Language Models be Implicit Text-based World Models? [82.47317196099907]
Agentic reinforcement learning increasingly relies on experience-driven scaling. World models offer a potential way to improve learning efficiency through simulated experience. We study whether large language models can reliably serve this role and under what conditions they meaningfully benefit agents.
arXiv Detail & Related papers (2025-12-21T17:28:42Z)
- Vision-Enhanced Large Language Models for High-Resolution Image Synthesis and Multimodal Data Interpretation [0.0]
This research introduces a transformative framework for integrating Vision-Enhanced Large Language Models (LLMs) with advanced transformer-based architectures. The proposed model incorporates a rectified flow mechanism that connects noise and data with linear paths, enabling efficient and high-quality generation. The framework achieves unparalleled fidelity in synthesized images and coherent multimodal representations.
arXiv Detail & Related papers (2025-12-14T08:28:50Z)
- Fun-ASR Technical Report [89.84148151617022]
We present Fun-ASR, a large-scale, LLM-based ASR system that combines massive data, large model capacity, LLM integration, and reinforcement learning. Fun-ASR is specifically optimized for practical deployment, with enhancements in streaming capability, noise robustness, code-switching, hotword customization, and satisfying other real-world application requirements. Thanks to production-oriented optimizations, Fun-ASR achieves state-of-the-art performance on real application datasets, demonstrating its effectiveness and robustness in practical settings.
arXiv Detail & Related papers (2025-09-15T23:19:36Z)
- Large-Scale Model Enabled Semantic Communication Based on Robust Knowledge Distillation [45.347078403677216]
Large-scale models (LSMs) can be an effective framework for semantic representation and understanding. However, their direct deployment is often hindered by high computational complexity and resource requirements. This paper proposes a novel knowledge distillation based semantic communication framework.
arXiv Detail & Related papers (2025-08-04T07:47:18Z)
- So-Fake: Benchmarking and Explaining Social Media Image Forgery Detection [75.79507634008631]
We introduce So-Fake-Set, a social media-oriented dataset with over 2 million high-quality images, diverse generative sources, and imagery synthesized using 35 state-of-the-art generative models. We present So-Fake-R1, an advanced vision-language framework that employs reinforcement learning for highly accurate forgery detection, precise localization, and explainable inference through interpretable visual rationales.
arXiv Detail & Related papers (2025-05-24T11:53:35Z)
- AVadCLIP: Audio-Visual Collaboration for Robust Video Anomaly Detection [57.649223695021114]
We present a novel weakly supervised framework that leverages audio-visual collaboration for robust video anomaly detection. Our framework demonstrates superior performance across multiple benchmarks, with audio integration significantly boosting anomaly detection accuracy.
arXiv Detail & Related papers (2025-04-06T13:59:16Z)
- LLM-Enhanced Holonic Architecture for Ad-Hoc Scalable SoS [3.591449065638895]
We propose a layered architecture for holons, which includes reasoning, communication, and capabilities layers. In addition, inspired by principles of intelligent manufacturing, we introduce specialised holons, namely supervisor, planner, task, and resource holons. These specialised holons utilise large language models within their reasoning layers to support decision making and ensure real-time adaptability.
arXiv Detail & Related papers (2025-01-14T10:35:54Z)
- The OCON model: an old but green solution for distributable supervised classification for acoustic monitoring in smart cities [0.28675177318965045]
This paper focuses on vowel phoneme classification and speaker recognition for the Automatic Speech Recognition (ASR) domain.
For our case study, the ASR model runs on a proprietary sensing and lightning system, used to monitor acoustic and air pollution on urban streets.
We formalize combinations of pseudo-Neural Architecture Search and Hyperparameter Tuning experiments, using an informed grid-search methodology, to achieve classification accuracy comparable to that of today's most complex architectures.
arXiv Detail & Related papers (2024-10-05T09:47:54Z)
- Efficient High-Resolution Visual Representation Learning with State Space Model for Human Pose Estimation [60.80423207808076]
Capturing long-range dependencies while preserving high-resolution visual representations is crucial for dense prediction tasks such as human pose estimation. We propose the Dynamic Visual State Space (DVSS) block, which augments visual state space models with multi-scale convolutional operations. We build HRVMamba, a novel model for efficient high-resolution representation learning.
arXiv Detail & Related papers (2024-10-04T06:19:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.