SightSound-R1: Cross-Modal Reasoning Distillation from Vision to Audio Language Models
- URL: http://arxiv.org/abs/2509.15661v1
- Date: Fri, 19 Sep 2025 06:39:39 GMT
- Title: SightSound-R1: Cross-Modal Reasoning Distillation from Vision to Audio Language Models
- Authors: Qiaolin Wang, Xilin Jiang, Linyang He, Junkai Wu, Nima Mesgarani
- Abstract summary: We present SightSound-R1, a cross-modal distillation framework that transfers advanced reasoning from a stronger LVLM teacher to a weaker LALM student. Results show that SightSound-R1 improves LALM reasoning performance both in the in-domain AVQA test set and in unseen auditory scenes and questions.
- Score: 18.802543558300044
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: While large audio-language models (LALMs) have demonstrated state-of-the-art audio understanding, their reasoning capability in complex soundscapes still falls behind that of large vision-language models (LVLMs). Compared to the visual domain, one bottleneck is the lack of large-scale chain-of-thought audio data to teach LALMs stepwise reasoning. To circumvent this data and modality gap, we present SightSound-R1, a cross-modal distillation framework that transfers advanced reasoning from a stronger LVLM teacher to a weaker LALM student on the same audio-visual question answering (AVQA) dataset. SightSound-R1 consists of three core steps: (i) test-time scaling to generate audio-focused chains of thought (CoT) from an LVLM teacher, (ii) audio-grounded validation to filter hallucinations, and (iii) a distillation pipeline with supervised fine-tuning (SFT) followed by Group Relative Policy Optimization (GRPO) for the LALM student. Results show that SightSound-R1 improves LALM reasoning performance both on the in-domain AVQA test set and on unseen auditory scenes and questions, outperforming both pretrained and label-only distilled baselines. We thus conclude that vision reasoning can be effectively transferred to audio models and scaled with abundant audio-visual data.
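The three steps map naturally onto a data pipeline. Below is a minimal sketch in Python, where `teacher_cot`, `detect_sound_events`, `sft_step`, and `grpo_step` are hypothetical stand-ins for the LVLM teacher, an audio tagger, and the two training phases; the abstract does not specify prompts, validation heuristics, or GRPO details.

```python
"""Minimal sketch of the SightSound-R1 recipe as described in the abstract.
All model calls are hypothetical stand-ins, not the authors' actual code."""

from dataclasses import dataclass

@dataclass
class AVQASample:
    video: str      # path to the video (visual stream for the teacher)
    audio: str      # path to the audio (input for the LALM student)
    question: str
    answer: str

def distill(dataset, teacher_cot, detect_sound_events, sft_step, grpo_step,
            n_samples=8):
    # Step (i): test-time scaling -- sample several audio-focused CoTs
    # from the LVLM teacher and keep only those reaching the gold answer.
    sft_data = []
    for ex in dataset:
        cots = [teacher_cot(ex.video, ex.question) for _ in range(n_samples)]
        correct = [c for c in cots if c.final_answer == ex.answer]

        # Step (ii): audio-grounded validation -- drop CoTs that mention
        # sound events an audio tagger cannot find in the waveform.
        present = detect_sound_events(ex.audio)
        grounded = [c for c in correct
                    if all(evt in present for evt in c.claimed_sounds)]
        if grounded:
            sft_data.append((ex.audio, ex.question, grounded[0].text))

    # Step (iii): SFT on validated CoTs, then GRPO with an answer reward.
    for audio, question, cot in sft_data:
        sft_step(audio, question, cot)
    for ex in dataset:
        grpo_step(ex.audio, ex.question, ex.answer)
```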
Related papers
- Echo: Towards Advanced Audio Comprehension via Audio-Interleaved Reasoning [39.264735719707154]
Current efforts replicate text-based reasoning by contextualizing audio content through a one-time encoding. We propose audio-interleaved reasoning to break through this bottleneck. We present Echo, a LALM capable of dynamically re-listening to audio on demand during reasoning.
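Dynamic re-listening can be pictured as a generation loop that pauses on a special token and splices fresh audio features into the context. A minimal sketch, with a hypothetical model interface (`encode_segment`, `generate_until`) and token conventions that Echo's paper may define differently:

```python
def audio_interleaved_reasoning(model, audio, question, max_rounds=4):
    """Sketch of audio-interleaved reasoning: instead of a one-time
    encoding, the model may emit a <listen t0 t1> request mid-reasoning,
    after which that audio segment is re-encoded and appended to the
    context. All interfaces here are hypothetical stand-ins."""
    context = model.encode_segment(audio, 0.0, audio.duration) + [question]
    for _ in range(max_rounds):
        step = model.generate_until(context, stop=["<listen", "<answer"])
        context.append(step.text)
        if step.stop_reason == "<answer":        # reasoning finished
            return step.text
        t0, t1 = step.requested_span             # model asked to re-listen
        context += model.encode_segment(audio, t0, t1)
    return model.generate_until(context, stop=["<answer"]).text
```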
arXiv Detail & Related papers (2026-02-12T13:06:34Z)
- When Eyes and Ears Disagree: Can MLLMs Discern Audio-Visual Confusion? [41.579901082251254]
Experimental results reveal that Multimodal Large Language Models (MLLMs) struggle to discriminate non-existent audio due to visually dominated reasoning. Motivated by this observation, we introduce RL-CoMM, a Reinforcement Learning-based Collaborative Multi-MLLM. RL-CoMM includes two stages: 1) To alleviate visually dominated ambiguities, we introduce an external model, a Large Audio Language Model (LALM), as the reference model to generate audio-only reasoning.
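Only stage 1 is summarized above, but its core mechanic, using an audio-only LALM answer as a reference signal, can be sketched as simple reward shaping. The scoring scheme below is our illustration, not the paper's reward:

```python
def audio_grounded_reward(mllm_answer, lalm_reference_answer, gold=None):
    """Illustrative stage-1 idea: an external LALM answers from audio
    alone, and its output serves as a reference to penalize visually
    dominated responses (e.g. asserting a sound the audio-only model
    does not hear). Weights and scoring are our assumptions."""
    reward = 1.0 if gold is not None and mllm_answer == gold else 0.0
    if mllm_answer == lalm_reference_answer:
        reward += 0.5      # agreement with the audio-only reference
    return reward
```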
arXiv Detail & Related papers (2025-11-13T07:59:41Z)
- PAL: Probing Audio Encoders via LLMs -- A Study of Information Transfer from Audio Encoders to LLMs [16.820927353576774]
The integration of audio perception capabilities into Large Language Models (LLMs) has enabled significant advances in Audio-LLMs. We conceptualize effective audio-LLM interaction as the LLM's ability to proficiently probe the audio encoder's representations to satisfy textual queries. This paper presents a systematic investigation of how architectural design choices affect this probing ability.
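The "probing" framing can be made concrete with a single cross-attention readout in which text-query states attend over frozen audio-encoder outputs. This generic probe (dimensions and module choices are ours) is illustrative, not PAL's architecture:

```python
import torch
import torch.nn as nn

class AudioProbe(nn.Module):
    """Illustrative probe: text query states attend over frozen audio
    encoder representations; only the probe is trained. A generic
    cross-attention readout, not the PAL architecture."""
    def __init__(self, d_text=768, d_audio=512, n_heads=8):
        super().__init__()
        self.proj = nn.Linear(d_audio, d_text)   # align dimensions
        self.attn = nn.MultiheadAttention(d_text, n_heads, batch_first=True)

    def forward(self, text_states, audio_states):
        a = self.proj(audio_states)              # (B, Ta, d_text)
        out, weights = self.attn(text_states, a, a)
        return out, weights                      # probed audio information

probe = AudioProbe()
txt = torch.randn(2, 12, 768)   # e.g. query token states
aud = torch.randn(2, 200, 512)  # e.g. audio encoder frames
fused, attn = probe(txt, aud)
print(fused.shape)              # torch.Size([2, 12, 768])
```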
arXiv Detail & Related papers (2025-06-12T07:23:07Z)
- From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data [55.2480439325792]
Audio-aware large language models (ALLMs) have recently made great strides in understanding and processing audio inputs. These models are typically adapted from text-based large language models (LLMs) through additional training on audio-related tasks. We propose a data generation framework that produces contrastive-like training data, designed to enhance ALLMs' ability to differentiate between present and absent sounds.
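One way to realize contrastive-like data is to mix a few labelled sound events into a clip and pair it with questions about both a present and an absent sound. The mixing and question templates below are our assumptions, not the paper's exact recipe:

```python
import random

def make_contrastive_pairs(event_clips, n_mix=3):
    """Sketch of contrastive-like data generation: mix a few labelled
    sound events into one clip, then ask about both a present and an
    absent sound so the ALLM must distinguish the two."""
    labels = list(event_clips)
    present = random.sample(labels, n_mix)
    absent = random.choice([l for l in labels if l not in present])
    mixture = [event_clips[l] for l in present]   # clips to overlay into audio
    qa = [
        (f"Is there a {present[0]} sound in the clip?", "Yes"),
        (f"Is there a {absent} sound in the clip?", "No"),
    ]
    return mixture, qa

clips = {"dog bark": "bark.wav", "siren": "siren.wav",
         "rain": "rain.wav", "piano": "piano.wav", "doorbell": "bell.wav"}
mixture, qa = make_contrastive_pairs(clips)
print(qa)
```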
arXiv Detail & Related papers (2025-05-26T16:08:41Z)
- FlowDubber: Movie Dubbing with LLM-based Semantic-aware Learning and Flow Matching based Voice Enhancing [81.3306413498174]
Movie dubbing aims to convert scripts into speech that aligns with the given movie clip in both temporal and emotional aspects. Existing methods focus primarily on reducing the word error rate while ignoring the importance of lip-sync and acoustic quality. We propose a large language model (LLM) based flow matching architecture for dubbing, named FlowDubber.
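Flow matching trains a velocity field to transport noise to data along straight interpolation paths. The generic conditional flow matching step below conveys the idea; FlowDubber's actual conditioning (lip motion, LLM-derived semantics) is richer than the placeholder `cond`:

```python
import torch
import torch.nn as nn

def cfm_loss(velocity_net, mel_target, cond):
    """Generic conditional flow matching step: sample t ~ U(0,1),
    interpolate x_t = (1-t)*noise + t*data, and regress the network
    onto the constant target velocity (data - noise). velocity_net
    is any conditional network; its design is not specified here."""
    noise = torch.randn_like(mel_target)
    t = torch.rand(mel_target.size(0), 1, 1)        # one t per sample
    x_t = (1 - t) * noise + t * mel_target
    v_target = mel_target - noise
    v_pred = velocity_net(x_t, t.squeeze(-1).squeeze(-1), cond)
    return nn.functional.mse_loss(v_pred, v_target)
```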
arXiv Detail & Related papers (2025-05-02T13:30:19Z)
- Adaptive Audio-Visual Speech Recognition via Matryoshka-Based Multimodal LLMs [33.12165044958361]
Recent advances in Large Language Models (LLMs) show strong performance in speech recognition, including Audio-Visual Speech Recognition (AVSR). To address this, we propose Llama-MTSK, the first Matryoshka-based Multimodal LLM for AVSR. Inspired by Matryoshka Representation Learning, our model encodes representations at multiple granularities with a single architecture. For efficient fine-tuning, we introduce three LoRA-based strategies using global and scale-specific modules.
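Matryoshka-style multi-granularity can be illustrated by pooling the same token sequence at several rates and supervising every scale with one shared model, so any single scale suffices at inference. Average pooling here is our assumption:

```python
import torch
import torch.nn.functional as F

def matryoshka_scales(av_tokens, rates=(1, 2, 4)):
    """Sketch of Matryoshka-style multi-granularity encoding: the same
    audio-visual token sequence is average-pooled at several rates, and
    every scale is fed to the (shared) LLM during training."""
    scales = []
    for r in rates:
        x = av_tokens.transpose(1, 2)           # (B, D, T) for pooling
        pooled = F.avg_pool1d(x, kernel_size=r, stride=r)
        scales.append(pooled.transpose(1, 2))   # (B, T//r, D)
    return scales

tokens = torch.randn(2, 64, 1024)               # e.g. fused AV tokens
for s in matryoshka_scales(tokens):
    print(s.shape)  # (2, 64, 1024), (2, 32, 1024), (2, 16, 1024)
```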
arXiv Detail & Related papers (2025-03-09T00:02:10Z)
- Large Language Models are Strong Audio-Visual Speech Recognition Learners [53.142635674428874]
Multimodal large language models (MLLMs) have recently become a focal point of research due to their formidable multimodal understanding capabilities. We propose Llama-AVSR, a new MLLM with strong audio-visual speech recognition capabilities. We evaluate our proposed approach on LRS3, the largest public AVSR benchmark, and achieve new state-of-the-art results for the tasks of ASR and AVSR with a WER of 0.79% and 0.77%, respectively.
arXiv Detail & Related papers (2024-09-18T21:17:27Z)
- Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models [25.660343393359565]
This paper proposes a fine-grained audio-visual joint representation (FAVOR) learning framework for multimodal large language models (LLMs).
FAVOR simultaneously perceives speech and audio events in the audio input stream and images or videos in the visual input stream, at the frame level.
An interactive demo of FAVOR is available at https://github.com/BriansIDP/AudioVisualLLM.git, and the training code and model checkpoints will be released soon.
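Frame-level perception implies time-aligning the two streams before fusion. The simplest frame-synchronous baseline (resample audio features to the video frame rate and concatenate per frame) is sketched below; FAVOR's actual fusion module is more elaborate:

```python
import torch
import torch.nn.functional as F

def frame_level_fuse(audio_feats, video_feats):
    """Sketch of frame-level audio-visual fusion: resample audio features
    to the video frame rate and concatenate per frame, yielding one joint
    token per time step. This is a baseline illustration, not FAVOR's
    fusion module."""
    T = video_feats.size(1)                          # use video frame rate
    a = F.interpolate(audio_feats.transpose(1, 2), size=T,
                      mode="linear", align_corners=False).transpose(1, 2)
    return torch.cat([a, video_feats], dim=-1)       # (B, T, Da + Dv)

aud = torch.randn(1, 500, 256)   # e.g. 100 Hz audio features
vid = torch.randn(1, 125, 512)   # e.g. 25 fps visual features
print(frame_level_fuse(aud, vid).shape)   # torch.Size([1, 125, 768])
```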
arXiv Detail & Related papers (2023-10-09T17:00:20Z)
- Lip2Vec: Efficient and Robust Visual Speech Recognition via Latent-to-Latent Visual to Audio Representation Mapping [4.271091833712731]
We propose a simple approach, named Lip2Vec, based on learning a prior model that maps latent representations of visual speech to those of audio.
The proposed model compares favorably with fully supervised methods on the LRS3 dataset, achieving 26% WER.
We believe that reprogramming VSR as an ASR task narrows the performance gap between the two and paves the way for more flexible formulations of lip reading.
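The latent-to-latent idea: train a prior network to map video-encoder latents into the representation space of a pretrained audio ASR model, whose frozen decoder then transcribes as if it had heard speech. The MLP mapper and MSE objective below are our simplifications:

```python
import torch
import torch.nn as nn

class LatentMapper(nn.Module):
    """Sketch of latent-to-latent mapping: a small prior network maps
    video-encoder latents into the latent space of a pretrained audio
    ASR model. The paper's prior model and objective may differ."""
    def __init__(self, d_video=512, d_audio=768, d_hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_video, d_hidden), nn.GELU(),
            nn.Linear(d_hidden, d_audio),
        )

    def forward(self, video_latents):        # (B, T, d_video)
        return self.net(video_latents)       # (B, T, d_audio)

def training_step(mapper, video_latents, audio_latents):
    # Regress mapped video latents onto the paired audio latents.
    pred = mapper(video_latents)
    return nn.functional.mse_loss(pred, audio_latents)
```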
arXiv Detail & Related papers (2023-08-11T12:59:02Z)
- Unraveling Instance Associations: A Closer Look for Audio-Visual Segmentation [18.001730255429347]
Audio-visual segmentation (AVS) is a challenging task that involves accurately segmenting sounding objects based on audio-visual cues.
We propose a new cost-effective strategy to build challenging and relatively unbiased high-quality audio-visual segmentation benchmarks.
Experiments conducted on existing AVS datasets and on our new benchmark show that our method achieves state-of-the-art (SOTA) segmentation accuracy.
arXiv Detail & Related papers (2023-04-06T09:54:06Z)
- AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR [79.21857972093332]
We present AVFormer, a method for augmenting audio-only models with visual information while simultaneously performing lightweight domain adaptation.
We show that the added components can be trained on a small amount of weakly labelled video data with minimal additional training time and parameters.
We also introduce a simple curriculum scheme during training which we show is crucial to enable the model to jointly process audio and visual information effectively.
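Injecting vision into a frozen speech model can be sketched as a trainable projection that maps visual features into the audio token space, with the speech backbone's weights frozen. Layer placement and adapter design below are our assumptions:

```python
import torch
import torch.nn as nn

class VisualInjection(nn.Module):
    """Sketch of injecting vision into a frozen speech model: visual
    features are projected into the audio token space and prepended,
    while only the projection is trained. Placement and adapter design
    are our assumptions, not AVFormer's exact architecture."""
    def __init__(self, frozen_asr_encoder, d_visual=768, d_audio=512):
        super().__init__()
        self.asr = frozen_asr_encoder
        for p in self.asr.parameters():
            p.requires_grad = False            # speech model stays frozen
        self.visual_proj = nn.Linear(d_visual, d_audio)  # lightweight, trained

    def forward(self, audio_tokens, visual_feats):
        v = self.visual_proj(visual_feats)               # (B, Tv, d_audio)
        return self.asr(torch.cat([v, audio_tokens], dim=1))
```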
arXiv Detail & Related papers (2023-03-29T07:24:28Z)
- MAViL: Masked Audio-Video Learners [68.61844803682145]
We present Masked Audio-Video learners (MAViL) to train audio-visual representations.
Pre-training with MAViL enables the model to perform well in audio-visual classification and retrieval tasks.
For the first time, a self-supervised audio-visual model outperforms ones that use external supervision on benchmarks.
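The masking step common to masked audio-video learners: keep a random subset of patch tokens per modality and reconstruct the rest. The sketch below shows only the token masking; MAViL combines masking with further objectives that this sketch omits:

```python
import torch

def mask_tokens(x, mask_ratio=0.75):
    """Sketch of the masking step used by masked audio-video learners:
    keep a random subset of patch tokens per sample; the model is then
    trained to reconstruct the rest."""
    B, T, D = x.shape
    keep = int(T * (1 - mask_ratio))
    idx = torch.rand(B, T).argsort(dim=1)[:, :keep]      # random keep set
    visible = torch.gather(x, 1, idx.unsqueeze(-1).expand(-1, -1, D))
    return visible, idx

audio_patches = torch.randn(2, 128, 768)   # spectrogram patches
video_patches = torch.randn(2, 196, 768)   # video patches
va, _ = mask_tokens(audio_patches)
vv, _ = mask_tokens(video_patches)
print(va.shape, vv.shape)                  # (2, 32, 768) (2, 49, 768)
```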
arXiv Detail & Related papers (2022-12-15T18:59:59Z)
- Jointly Learning Visual and Auditory Speech Representations from Raw Data [108.68531445641769]
RAVEn is a self-supervised multi-modal approach to jointly learn visual and auditory speech representations.
Our design is asymmetric between the two modalities, driven by the inherent differences between video and audio.
RAVEn surpasses all self-supervised methods on visual speech recognition.
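Cross-modal prediction with a momentum teacher, one direction of which is sketched below: a video student predicts representations produced by an audio teacher for the same utterance. RAVEn's asymmetry between the two streams and its exact losses differ; the cosine loss and EMA update here are our choices:

```python
import torch
import torch.nn.functional as F

def cross_modal_step(video_student, audio_teacher, video_frames, waveform,
                     predictor):
    """Sketch of cross-modal self-supervision in the spirit of RAVEn: a
    student encodes video and is trained to predict the representations
    of a momentum audio teacher for the same utterance. One direction
    only; the paper's full asymmetric recipe differs."""
    with torch.no_grad():
        targets = audio_teacher(waveform)            # (B, T, D), no grads
    preds = predictor(video_student(video_frames))   # (B, T, D)
    return 1 - F.cosine_similarity(preds, targets, dim=-1).mean()

@torch.no_grad()
def momentum_update(student, teacher, m=0.999):
    # EMA of student weights into the teacher (standard momentum teacher).
    for ps, pt in zip(student.parameters(), teacher.parameters()):
        pt.mul_(m).add_(ps, alpha=1 - m)
```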
arXiv Detail & Related papers (2022-12-12T21:04:06Z)