Related papers: EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning

EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning

URL: http://arxiv.org/abs/2505.04623v1
Date: Wed, 07 May 2025 17:59:49 GMT
Title: EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning
Authors: Zhenghao Xing, Xiaowei Hu, Chi-Wing Fu, Wenhai Wang, Jifeng Dai, Pheng-Ann Heng,
Abstract summary: Multimodal large language models (MLLMs) have advanced perception across text, vision, and audio, yet struggle with structured cross-modal reasoning.<n>We introduce EchoInk-R1, a reinforcement learning framework that enhances such reasoning in MLLMs.
Score: 108.73513190593232
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal large language models (MLLMs) have advanced perception across text, vision, and audio, yet they often struggle with structured cross-modal reasoning, particularly when integrating audio and visual signals. We introduce EchoInk-R1, a reinforcement learning framework that enhances such reasoning in MLLMs. Built upon the Qwen2.5-Omni-7B foundation and optimized with Group Relative Policy Optimization (GRPO), EchoInk-R1 tackles multiple-choice question answering over synchronized audio-image pairs. To enable this, we curate AVQA-R1-6K, a dataset pairing such audio-image inputs with multiple-choice questions derived from OmniInstruct-v1. EchoInk-R1-7B achieves 85.77% accuracy on the validation set, outperforming the base model, which scores 80.53%, using only 562 reinforcement learning steps. Beyond accuracy, EchoInk-R1 demonstrates reflective reasoning by revisiting initial interpretations and refining responses when facing ambiguous multimodal inputs. These results suggest that lightweight reinforcement learning fine-tuning enhances cross-modal reasoning in MLLMs. EchoInk-R1 is the first framework to unify audio, visual, and textual modalities for general open-world reasoning via reinforcement learning. Code and data are publicly released to facilitate further research.

Related papers

ViSS-R1: Self-Supervised Reinforcement Video Reasoning [84.1180294023835]
We introduce a novel self-supervised reinforcement learning GRPO algorithm (Pretext-GRPO) within the standard R1 pipeline.<n>We also propose the ViSS-R1 framework, which streamlines and integrates pretext-task-based self-supervised learning directly into the MLLM's R1 post-training paradigm.
arXiv Detail & Related papers (2025-11-17T07:00:42Z)
VTPerception-R1: Enhancing Multimodal Reasoning via Explicit Visual and Textual Perceptual Grounding [53.00016784065408]
Multimodal large language models (MLLMs) often struggle to ground reasoning in perceptual evidence.<n>We present a systematic study of perception strategies-explicit, implicit, visual, and textual-across four multimodal benchmarks and two MLLMs.<n>We propose VTPerception-R1, a unified two-stage framework that decouples perception from reasoning.
arXiv Detail & Related papers (2025-09-29T13:40:34Z)
SightSound-R1: Cross-Modal Reasoning Distillation from Vision to Audio Language Models [18.802543558300044]
We present SightSound-R1, a cross-modal distillation framework that transfers advanced reasoning from a stronger LVLM teacher to a weaker LALM student.<n>Results show that SightSound-R1 improves LALM reasoning performance both in the in-domain AVQA test set and in unseen auditory scenes and questions.
arXiv Detail & Related papers (2025-09-19T06:39:39Z)
SoundMind: RL-Incentivized Logic Reasoning for Audio-Language Models [25.143840124269193]
We introduce the Audio Logical Reasoning dataset, consisting of 6,446 text-audio annotated samples.<n>We then propose SoundMind, a rule-based reinforcement learning algorithm tailored to endow ALMs with deep bimodal reasoning abilities.<n>Our approach achieves state-of-the-art performance in audio logical reasoning.
arXiv Detail & Related papers (2025-06-15T18:26:08Z)
Perceptual Decoupling for Scalable Multi-modal Reasoning via Reward-Optimized Captioning [78.17782197231325]
We propose a reasoning-guided reinforcement learning strategy that aligns the extractor's captioning behavior with the reasoning objective.<n> Experiments on multi-modal math and science benchmarks show that the proposed RACRO method achieves state-of-the-art average performance.
arXiv Detail & Related papers (2025-06-05T02:28:07Z)
Fork-Merge Decoding: Enhancing Multimodal Understanding in Audio-Visual Large Language Models [13.887164304514101]
The goal of this work is to enhance balanced multimodal understanding in audio-visual large language models (AV-LLMs)<n>In current AV-LLMs, audio and video features are typically processed jointly in the decoder.<n>We propose Fork-Merge Decoding (FMD), a simple yet effective inference-time strategy that requires no additional training or architectural modifications.
arXiv Detail & Related papers (2025-05-27T08:22:56Z)
Reinforcement Learning Outperforms Supervised Fine-Tuning: A Case Study on Audio Question Answering [22.88876323500893]
reinforcement learning (RL) has been shown to greatly enhance the reasoning capabilities of large language models (LLMs)<n>We conduct a series of RL explorations in audio understanding and reasoning, specifically focusing on the audio question answering (AQA) task.<n>Our experiments demonstrated state-of-the-art performance on the MMAU Test-mini benchmark, achieving an accuracy rate of 64.5%.
arXiv Detail & Related papers (2025-03-14T08:43:53Z)
Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning [50.419872452397684]
Search-R1 is an extension of reinforcement learning for reasoning frameworks.<n>It generates search queries during step-by-step reasoning with real-time retrieval.<n>It improves performance by 41% (Qwen2.5-7B) and 20% (Qwen2.5-3B) over various RAG baselines.
arXiv Detail & Related papers (2025-03-12T16:26:39Z)
Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models [24.45348222168512]
We propose the reasoning MLLM, Vision-R1, to improve multimodal reasoning capability.<n>Our model achieves an average improvement of $sim$6% across various multimodal math reasoning benchmarks.<n>Vision-R1-7B achieves a 73.5% accuracy on the widely used MathVista benchmark, which is only 0.4% lower than the leading reasoning model, OpenAI O1.
arXiv Detail & Related papers (2025-03-09T20:06:45Z)
AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information? [65.49972312524724]
multimodal large language models (MLLMs) have expanded their capabilities to include vision and audio modalities.<n>Our proposed DeafTest reveals that MLLMs often struggle with simple tasks humans find trivial.<n>We introduce AV-Odyssey Bench, a comprehensive audio-visual benchmark designed to assess whether those MLLMs can truly understand the audio-visual information.
arXiv Detail & Related papers (2024-12-03T17:41:23Z)
Large Language Models are Strong Audio-Visual Speech Recognition Learners [53.142635674428874]
Multimodal large language models (MLLMs) have recently become a focal point of research due to their formidable multimodal understanding capabilities.<n>We propose Llama-AVSR, a new MLLM with strong audio-visual speech recognition capabilities.<n>We evaluate our proposed approach on LRS3, the largest public AVSR benchmark, and we achieve new state-of-the-art results for the tasks of ASR and AVSR with a WER of 0.79% and 0.77%, respectively.
arXiv Detail & Related papers (2024-09-18T21:17:27Z)
Cross-modal Audio-visual Co-learning for Text-independent Speaker Verification [55.624946113550195]
This paper proposes a cross-modal speech co-learning paradigm. Two cross-modal boosters are introduced based on an audio-visual pseudo-siamese structure to learn the modality-transformed correlation. Experimental results on the LRSLip3, GridLip, LomGridLip, and VoxLip datasets demonstrate that our proposed method achieves 60% and 20% average relative performance improvement.
arXiv Detail & Related papers (2023-02-22T10:06:37Z)

This list is automatically generated from the titles and abstracts of the papers in this site.