Windowed SummaryMixing: An Efficient Fine-Tuning of Self-Supervised Learning Models for Low-resource Speech Recognition
- URL: http://arxiv.org/abs/2602.09043v1
- Date: Wed, 04 Feb 2026 06:01:30 GMT
- Title: Windowed SummaryMixing: An Efficient Fine-Tuning of Self-Supervised Learning Models for Low-resource Speech Recognition
- Authors: Aditya Srinivas Menon, Kumud Tripathi, Raj Gohil, Pankaj Wasnik
- Abstract summary: We introduce Windowed SummaryMixing (WSM), which enhances SummaryMixing (SM). WSM integrates local neighborhood summaries alongside the global summary, maintaining efficiency while improving temporal dependencies. Our approach improves ASR performance while reducing peak VRAM usage by 40% in the SSL models.
- Score: 10.177623104133023
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised learning (SSL) has advanced speech processing but suffers from quadratic complexity due to self-attention. To address this, SummaryMixing (SM) has been proposed as a linear-time alternative that summarizes entire utterances using mean pooling but lacks sufficient local context. In this work, we introduce Windowed SummaryMixing (WSM), which enhances SM by integrating local neighborhood summaries alongside the global summary, maintaining efficiency while improving temporal dependencies. Additionally, we introduce a selective fine-tuning approach, replacing self-attention layers in SSL models with WSM blocks and fine-tuning only these blocks in low-resource settings. Our approach improves ASR performance while reducing peak VRAM usage by 40% in the SSL models. WSM blocks have linear-time complexity with enhanced context awareness. Selectively replacing some attention layers reduces compute, memory, and latency, making it ideal for low-resource speech recognition.
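The abstract's core idea, combining a global utterance summary (as in SummaryMixing) with per-frame local window summaries, can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's exact parameterization: the window size, the concatenate-then-project mixing, and the random weight initialization are all assumptions introduced here for illustration.

```python
import numpy as np

def windowed_summary_mixing(x, window=8, rng=None):
    """Sketch of Windowed SummaryMixing (WSM) for one utterance.

    x: (T, D) array of frame features.
    For each frame t we form:
      - a global summary: mean over all T frames (as in SummaryMixing),
      - a local summary: mean over a `window`-sized neighborhood of t,
    then mix both with the frame itself through a single linear projection.
    Each frame touches only its window plus precomputed global statistics,
    so the loop is linear in T rather than quadratic as in self-attention.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    T, D = x.shape
    W = rng.standard_normal((3 * D, D)) / np.sqrt(3 * D)  # illustrative mixing weights

    global_summary = x.mean(axis=0)                       # (D,), computed once
    half = window // 2
    out = np.empty_like(x)
    for t in range(T):
        lo, hi = max(0, t - half), min(T, t + half + 1)
        local_summary = x[lo:hi].mean(axis=0)             # (D,), local context
        mixed = np.concatenate([x[t], local_summary, global_summary])
        out[t] = mixed @ W                                # (3D,) @ (3D, D) -> (D,)
    return out
```

In the selective fine-tuning setup described above, blocks like this would replace some self-attention layers of a frozen SSL model, with only the replacement blocks trained.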
Related papers
- From Passive Perception to Active Memory: A Weakly Supervised Image Manipulation Localization Framework Driven by Coarse-Grained Annotations [14.0185129202898]
BoxPromptIML is a novel weakly-supervised IML framework that balances annotation cost and localization performance. Inspired by the human subconscious memory mechanism, our feature fusion module employs a dual-guidance strategy that actively contextualizes recalled patterns with real-time observational cues.
arXiv Detail & Related papers (2025-11-25T14:39:17Z)
- Omni-AVSR: Towards Unified Multimodal Speech Recognition with Large Language Models [34.15708407614003]
Large language models (LLMs) have recently achieved impressive results in speech recognition across multiple modalities. We present Omni-AVSR, a unified audio-visual LLM that combines efficient multi-granularity training with parameter-efficient adaptation. Experiments on LRS2 and LRS3 show that Omni-AVSR achieves comparable or superior accuracy to state-of-the-art baselines.
arXiv Detail & Related papers (2025-11-10T16:03:44Z)
- MergeMix: A Unified Augmentation Paradigm for Visual and Multi-Modal Understanding [23.96717124380285]
MergeMix is a training-time augmentation paradigm that bridges SFT and RL. It first applies attention-aware image mixing via token merging with richer cluster representation and spatial context. It then presents a preference-driven training paradigm for MLLMs by building preference pairs from mixed and raw images and optimizing via a SimPO loss.
arXiv Detail & Related papers (2025-10-27T16:12:40Z)
- Language Ranker: A Lightweight Ranking framework for LLM Decoding [70.01564145836129]
This paper conceptualizes the decoding process as analogous to the ranking stage in recommendation pipelines. Motivated by this insight, we propose Language Ranker, a novel framework that introduces a lightweight module to rerank candidate responses. Experiments show that Language Ranker achieves performance comparable to large-scale reward models while requiring only 0.5M additional parameters.
arXiv Detail & Related papers (2025-10-23T17:56:46Z)
- Federated Learning-Enabled Hybrid Language Models for Communication-Efficient Token Transmission [87.68447072141402]
Hybrid Language Models (HLMs) combine the low-latency efficiency of Small Language Models (SLMs) on edge devices with the high accuracy of Large Language Models (LLMs) on centralized servers. We propose FedHLM, a communication-efficient HLM framework that integrates uncertainty-aware inference with Federated Learning (FL).
arXiv Detail & Related papers (2025-06-30T02:56:11Z)
- LatentLLM: Attention-Aware Joint Tensor Compression [50.33925662486034]
Large language models (LLMs) and large multi-modal models (LMMs) require a massive amount of computational and memory resources. We propose a new framework to convert such LLMs/LMMs into a reduced-dimension latent structure.
arXiv Detail & Related papers (2025-05-23T22:39:54Z)
- Efficient Redundancy Reduction for Open-Vocabulary Semantic Segmentation [36.46163240168576]
Open-vocabulary semantic segmentation (OVSS) is an open-world task that aims to assign each pixel within an image to a specific class defined by arbitrary text descriptions. Recent advancements in large-scale vision-language models have demonstrated their open-vocabulary understanding capabilities. This study introduces ERR-Seg, a novel framework that effectively reduces redundancy to balance accuracy and efficiency.
arXiv Detail & Related papers (2025-01-29T13:24:53Z)
- How to Learn a New Language? An Efficient Solution for Self-Supervised Learning Models Unseen Languages Adaption in Low-Resource Scenario [72.02391485962127]
Speech Self-Supervised Learning (SSL) models achieve impressive performance on Automatic Speech Recognition (ASR). In low-resource language ASR, they encounter a domain mismatch problem between pre-trained and low-resource languages. We extend a conventional efficient fine-tuning scheme based on the adapter to handle these issues.
arXiv Detail & Related papers (2024-11-27T10:51:00Z)
- SMILE: Speech Meta In-Context Learning for Low-Resource Language Automatic Speech Recognition [55.2480439325792]
Speech Meta In-Context LEarning (SMILE) is an innovative framework that combines meta-learning with speech in-context learning (SICL). We show that SMILE consistently outperforms baseline methods in training-free few-shot multilingual ASR tasks.
arXiv Detail & Related papers (2024-09-16T16:04:16Z)
- R-SFLLM: Jamming Resilient Framework for Split Federated Learning with Large Language Models [65.04475956174959]
Split federated learning (SFL) is a compute-efficient paradigm in distributed machine learning (ML). A significant challenge in SFL, particularly when deployed over wireless channels, is the susceptibility of transmitted model parameters to adversarial jamming. This paper develops a physical layer framework for resilient SFL with large language models (LLMs) and vision language models (VLMs) over wireless networks.
arXiv Detail & Related papers (2024-07-16T12:21:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.