Beamforming-LLM: What, Where and When Did I Miss?
- URL: http://arxiv.org/abs/2509.06221v1
- Date: Sun, 07 Sep 2025 21:52:26 GMT
- Title: Beamforming-LLM: What, Where and When Did I Miss?
- Authors: Vishal Choudhari
- Abstract summary: We present Beamforming-LLM, a system that enables users to semantically recall conversations they may have missed in multi-speaker environments. The system combines spatial audio capture using a microphone array with retrieval-augmented generation (RAG) to support natural language queries.
- Score: 0.6655749439594806
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present Beamforming-LLM, a system that enables users to semantically recall conversations they may have missed in multi-speaker environments. The system combines spatial audio capture using a microphone array with retrieval-augmented generation (RAG) to support natural language queries such as, "What did I miss when I was following the conversation on dogs?" Directional audio streams are separated using beamforming, transcribed with Whisper, and embedded into a vector database using sentence encoders. Upon receiving a user query, semantically relevant segments are retrieved, temporally aligned with non-attended segments, and summarized using a lightweight large language model (GPT-4o-mini). The result is a user-friendly interface that provides contrastive summaries, spatial context, and timestamped audio playback. This work lays the foundation for intelligent auditory memory systems and has broad applications in assistive technology, meeting summarization, and context-aware personal spatial computing.
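The retrieval step described in the abstract can be sketched in a few lines: transcript segments (produced upstream by beamforming and Whisper) are embedded as vectors, and a query embedding is matched against them by cosine similarity. The sketch below uses toy 3-dimensional vectors and a hypothetical `retrieve` helper in place of a real sentence encoder and vector database; the segment tuples (start time, end time, direction in degrees, embedding) are illustrative assumptions, not the paper's actual data format.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy "embedded" transcript segments: (start_sec, end_sec, direction_deg, vector).
# In the real system these would come from Whisper transcripts of beamformed
# streams, embedded with a sentence encoder and stored in a vector database.
segments = [
    (0.0, 12.5, 45, [0.9, 0.1, 0.0]),    # attended stream, about dogs
    (12.5, 30.0, 45, [0.2, 0.8, 0.1]),   # attended stream, about travel
    (0.0, 15.0, -60, [0.85, 0.2, 0.1]),  # non-attended stream, also dogs
]

def retrieve(query_vec, segments, k=2):
    """Rank segments by cosine similarity to the query embedding."""
    ranked = sorted(segments, key=lambda s: cosine(query_vec, s[3]), reverse=True)
    return ranked[:k]

query = [1.0, 0.0, 0.0]  # toy embedding of "What did I miss about dogs?"
top = retrieve(query, segments)
```

The retrieved segments carry their timestamps and directions, which is what lets the system temporally align attended and non-attended streams and attach spatial context before summarization.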
Related papers
- LongAudio-RAG: Event-Grounded Question Answering over Multi-Hour Long Audio [6.935416517354558]
LongAudio-RAG (LARAG) is a framework that grounds Large Language Model (LLM) outputs in retrieved, timestamped acoustic event detections. We demonstrate the practicality of our approach by deploying it in a hybrid edge-cloud environment. Experiments show that structured, event-level retrieval significantly improves accuracy compared to vanilla retrieval-augmented generation.
arXiv Detail & Related papers (2026-02-16T10:15:22Z) - VIBEVOICE-ASR Technical Report [95.57263110940973]
VibeVoice-ASR addresses challenges of context fragmentation and multi-speaker complexity in long-form audio. It supports over 50 languages, requires no explicit language setting, and handles code-switching within and across utterances.
arXiv Detail & Related papers (2026-01-26T06:11:51Z) - Spatial Audio Motion Understanding and Reasoning [8.029049649310211]
Spatial audio reasoning enables machines to interpret auditory scenes by understanding events and their spatial attributes. First, we introduce a spatial audio encoder that processes spatial audio to detect multiple overlapping events and estimate their spatial attributes, Direction of Arrival (DoA) and source distance, at the frame level. Second, to answer complex queries about dynamic audio scenes involving moving sources, we condition a large language model (LLM) on structured spatial attributes extracted by our model.
arXiv Detail & Related papers (2025-09-18T06:53:22Z) - From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data [55.2480439325792]
Audio-aware large language models (ALLMs) have recently made great strides in understanding and processing audio inputs. These models are typically adapted from text-based large language models (LLMs) through additional training on audio-related tasks. We propose a data generation framework that produces contrastive-like training data, designed to enhance ALLMs' ability to differentiate between present and absent sounds.
arXiv Detail & Related papers (2025-05-26T16:08:41Z) - Spatial Audio Processing with Large Language Model on Wearable Devices [6.345647878712574]
We present a novel system architecture that incorporates spatial speech understanding into large language models (LLMs). SING supports spatially-aware automatic speech recognition (ASR), achieving a mean error of $25.72^\circ$, a substantial improvement over the $88.52^\circ$ median error in existing work, with a word error rate (WER) of 5.3. SING also supports soundscaping, for example inferring how many people were talking and their directions, with up to 5 people and a median DoA error of $16^\circ$.
arXiv Detail & Related papers (2025-04-11T18:19:59Z) - Large Language Model Can Transcribe Speech in Multi-Talker Scenarios with Versatile Instructions [68.98811048970963]
We present a pioneering effort to investigate the capability of large language models (LLMs) in transcribing speech in multi-talker environments. We use WavLM and Whisper encoders to extract multi-faceted speech representations that are sensitive to speaker characteristics and semantic context. Experiments reveal the promising performance of our proposed system, MT-LLM, in cocktail party scenarios.
arXiv Detail & Related papers (2024-09-13T07:28:28Z) - DiscreteSLU: A Large Language Model with Self-Supervised Discrete Speech Units for Spoken Language Understanding [51.32965203977845]
We propose the use of discrete speech units (DSU) instead of continuous-valued speech encoder outputs.
The proposed model shows robust performance on speech inputs from seen/unseen domains and instruction-following capability in spoken question answering.
Our findings suggest that the ASR task and datasets are not crucial in instruction-tuning for spoken question answering tasks.
arXiv Detail & Related papers (2024-06-13T17:28:13Z) - ASoBO: Attentive Beamformer Selection for Distant Speaker Diarization in Meetings [4.125756306660331]
Speaker Diarization (SD) aims at grouping speech segments that belong to the same speaker.
Beamforming, i.e., spatial filtering, is a common practice to process multi-microphone audio data.
This paper proposes a self-attention-based algorithm to select the output of a bank of fixed spatial filters.
arXiv Detail & Related papers (2024-06-05T13:28:28Z) - Multi-microphone Automatic Speech Segmentation in Meetings Based on Circular Harmonics Features [0.0]
We propose a new set of spatial features based on direction-of-arrival estimations in the circular harmonic domain (CH-DOA).
Experiments on the AMI meeting corpus show that CH-DOA can improve the segmentation while being robust in the case of deactivated microphones.
arXiv Detail & Related papers (2023-06-07T09:09:00Z) - ESB: A Benchmark For Multi-Domain End-to-End Speech Recognition [100.30565531246165]
Speech recognition systems require dataset-specific tuning.
This tuning requirement can lead to systems failing to generalise to other datasets and domains.
We introduce the End-to-end Speech Benchmark (ESB) for evaluating the performance of a single automatic speech recognition system.
arXiv Detail & Related papers (2022-10-24T15:58:48Z) - Audio-text Retrieval in Context [24.38055340045366]
In this work, we investigate several audio features as well as sequence aggregation methods for better audio-text alignment.
We build our contextual audio-text retrieval system using pre-trained audio features and a descriptor-based aggregation method.
With our proposed system, significant improvements are achieved on bidirectional audio-text retrieval across all metrics, including recall, median rank, and mean rank.
arXiv Detail & Related papers (2022-03-25T13:41:17Z) - Streaming Multi-talker Speech Recognition with Joint Speaker Identification [77.46617674133556]
SURIT employs the recurrent neural network transducer (RNN-T) as the backbone for both speech recognition and speaker identification.
We validate our idea on a multi-talker dataset derived from Librispeech, and present encouraging results.
arXiv Detail & Related papers (2021-04-05T18:37:33Z) - Continuous speech separation: dataset and analysis [52.10378896407332]
In natural conversations, a speech signal is continuous, containing both overlapped and overlap-free components.
This paper describes a dataset and protocols for evaluating continuous speech separation algorithms.
arXiv Detail & Related papers (2020-01-30T18:01:31Z)
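Several of the papers above (Beamforming-LLM itself, ASoBO's bank of fixed spatial filters, and the circular-harmonics segmentation work) build on beamforming as the directional front end. A minimal delay-and-sum beamformer for a far-field source on a linear microphone array illustrates the idea; the function name, integer-sample delays, and plain-Python signal representation are simplifications for clarity, not any paper's actual implementation.

```python
import math

def delay_and_sum(channels, sample_rate, mic_positions, angle_deg, c=343.0):
    """Steer a linear array toward angle_deg by delaying and averaging channels.

    channels:      list of per-microphone sample lists (equal length)
    mic_positions: microphone positions along the array axis, in meters
    angle_deg:     steering angle (0 = broadside, i.e. perpendicular to the array)
    c:             speed of sound in m/s

    Delays are rounded to whole samples for simplicity; real systems use
    fractional-delay filters or frequency-domain steering.
    """
    angle = math.radians(angle_deg)
    out_len = len(channels[0])
    out = [0.0] * out_len
    for pos, x in zip(mic_positions, channels):
        # Far-field arrival-time difference relative to the array origin.
        tau = pos * math.sin(angle) / c
        shift = round(tau * sample_rate)
        for n in range(out_len):
            m = n + shift
            if 0 <= m < out_len:
                out[n] += x[m]
    return [v / len(channels) for v in out]
```

Signals arriving from the steered direction add coherently while signals from other directions are attenuated by misaligned summation, which is what separates the directional streams before transcription.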
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.