LibriSQA: A Novel Dataset and Framework for Spoken Question Answering with Large Language Models
- URL: http://arxiv.org/abs/2308.10390v4
- Date: Thu, 18 Apr 2024 08:13:58 GMT
- Title: LibriSQA: A Novel Dataset and Framework for Spoken Question Answering with Large Language Models
- Authors: Zihan Zhao, Yiyang Jiang, Heyang Liu, Yanfeng Wang, Yu Wang
- Abstract summary: We propose a lightweight, end-to-end framework to execute the Spoken Question Answering (SQA) task on the LibriSQA dataset.
By reforming ASR into the SQA format, we further substantiate our framework's capability in handling ASR tasks.
Our empirical findings bolster the LLMs' aptitude for aligning and comprehending multimodal information, paving the way for the development of universal multimodal LLMs.
- Score: 21.95962189710859
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While Large Language Models (LLMs) have demonstrated commendable performance across a myriad of domains and tasks, existing LLMs still exhibit a palpable deficit in handling multimodal functionalities, especially for the Spoken Question Answering (SQA) task, which necessitates precise alignment and deep interaction between speech and text features. To address the SQA challenge on LLMs, we initially curated the free-form and open-ended LibriSQA dataset from Librispeech, comprising Part I with natural conversational formats and Part II encompassing multiple-choice questions followed by answers and analytical segments. Both parts collectively include 107k SQA pairs that cover various topics. Given the evident paucity of existing speech-text LLMs, we propose a lightweight, end-to-end framework to execute the SQA task on the LibriSQA dataset, achieving significant results. By reforming ASR into the SQA format, we further substantiate our framework's capability in handling ASR tasks. Our empirical findings bolster the LLMs' aptitude for aligning and comprehending multimodal information, paving the way for the development of universal multimodal LLMs. The dataset and demo can be found at https://github.com/ZihanZhaoSJTU/LibriSQA.
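A minimal illustrative sketch of the two answer styles described in the abstract and of recasting ASR as an SQA pair. The field names, file paths, and question wording below are assumptions for illustration, not the dataset's published schema.

```python
# Illustrative sketch only: field names and paths are assumptions, not the
# dataset's actual schema.

# Part I: free-form, conversational answer about a Librispeech utterance.
part1_example = {
    "speech": "librispeech/train-clean-100/19/198/19-198-0001.flac",  # hypothetical path
    "question": "What is the speaker mainly talking about?",
    "answer": "The speaker describes ...",
}

# Part II: multiple-choice question followed by an answer and an analysis segment.
part2_example = {
    "speech": "librispeech/train-clean-100/19/198/19-198-0002.flac",  # hypothetical path
    "question": "Which statement best matches the utterance?",
    "choices": {"A": "...", "B": "...", "C": "...", "D": "..."},
    "answer": "B",
    "analysis": "Option B is supported because ...",
}

def asr_as_sqa(speech_path: str, transcript: str) -> dict:
    """Recast an ASR example in SQA form, as the abstract describes for testing ASR."""
    return {
        "speech": speech_path,
        "question": "Please transcribe the speech into text.",
        "answer": transcript,
    }
```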
Related papers
- News Reporter: A Multi-lingual LLM Framework for Broadcast T.V News [3.4502293745974906]
Large Language Models (LLMs) have quickly become essential tools for many conversational chatbots due to their ability to provide coherent answers to varied queries.
We collect and share a large collection of QA pairs extracted from news recordings from various news-channels across the United States.
We propose a RAG method to improve the contextualization of answers and to point each answer to a verifiable news recording.
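A minimal retrieval-augmented generation sketch under stated assumptions, not the paper's implementation: `embed` and `generate` are hypothetical placeholders for an embedding model and an LLM call, and each transcript chunk is paired with its source recording so the answer can be traced back.

```python
import numpy as np

def retrieve(question: str, chunks: list[str], embed, k: int = 3) -> list[str]:
    """Return the k transcript chunks most similar to the question (cosine similarity)."""
    q = embed([question])[0]
    docs = embed(chunks)
    scores = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q) + 1e-8)
    return [chunks[i] for i in np.argsort(-scores)[:k]]

def answer_with_source(question, chunks, sources, embed, generate):
    """Ground the answer in retrieved chunks and cite their source recordings."""
    top = retrieve(question, chunks, embed)
    cited = [sources[chunks.index(c)] for c in top]
    prompt = (
        "Answer using only this transcript context:\n"
        + "\n".join(top)
        + f"\n\nQ: {question}\nA:"
    )
    return generate(prompt), cited
```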
arXiv Detail & Related papers (2024-10-10T01:21:48Z)
- Assessing SPARQL capabilities of Large Language Models [0.0]
We focus on measuring out-of-the-box capabilities of Large Language Models to work with SPARQL.
We implement benchmarking tasks in the LLM-KG-Bench framework for automated execution and evaluation.
Our findings indicate that working with SPARQL SELECT queries is still challenging for LLMs.
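A small evaluation sketch in the spirit of this benchmarking setup, not LLM-KG-Bench itself: it checks whether an LLM-generated SPARQL SELECT query parses and returns the expected bindings on a toy graph, using rdflib.

```python
from rdflib import Graph

TOY_GRAPH = """
@prefix ex: <http://example.org/> .
ex:alice ex:knows ex:bob .
"""

def run_select(query: str):
    """Run a SELECT query on the toy graph; return None on parse/evaluation failure."""
    g = Graph()
    g.parse(data=TOY_GRAPH, format="turtle")
    try:
        return {tuple(str(v) for v in row) for row in g.query(query)}
    except Exception:
        return None

generated = "SELECT ?x WHERE { ?x <http://example.org/knows> <http://example.org/bob> . }"
expected = {("http://example.org/alice",)}
print("pass" if run_select(generated) == expected else "fail")
```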
arXiv Detail & Related papers (2024-09-09T08:29:39Z)
- IDEAL: Leveraging Infinite and Dynamic Characterizations of Large Language Models for Query-focused Summarization [59.06663981902496]
Query-focused summarization (QFS) aims to produce summaries that answer particular questions of interest, enabling greater user control and personalization.
We investigate two indispensable characteristics that LLM-based QFS models should harness: Lengthy Document Summarization and Efficiently Fine-grained Query-LLM Alignment.
These innovations pave the way for broader application and accessibility in the field of QFS technology.
arXiv Detail & Related papers (2024-07-15T07:14:56Z)
- An End-to-End Speech Summarization Using Large Language Model [7.562198375754054]
Speech Summarization (SSum) aims to generate human-like text summaries from spoken content.
Research on large language models (LLMs) and multimodal information fusion has provided new insights.
We propose an end-to-end SSum model that utilizes Q-Former as a connector for the audio-text modality.
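A simplified connector sketch in the spirit of a Q-Former-style bridge, not the paper's exact architecture: a fixed set of learnable queries cross-attends to audio-encoder features and is projected into the LLM embedding space. The dimensions below are assumptions.

```python
import torch
import torch.nn as nn

class AudioTextConnector(nn.Module):
    def __init__(self, audio_dim=768, llm_dim=4096, n_queries=32, n_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, audio_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(audio_dim, n_heads, batch_first=True)
        self.proj = nn.Linear(audio_dim, llm_dim)

    def forward(self, audio_feats):  # audio_feats: (B, T, audio_dim)
        q = self.queries.unsqueeze(0).expand(audio_feats.size(0), -1, -1)
        out, _ = self.cross_attn(q, audio_feats, audio_feats)
        return self.proj(out)        # (B, n_queries, llm_dim), prepended to LLM inputs

feats = torch.randn(2, 300, 768)     # e.g. speech-encoder outputs
tokens = AudioTextConnector()(feats)
print(tokens.shape)                  # torch.Size([2, 32, 4096])
```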
arXiv Detail & Related papers (2024-07-02T07:22:57Z)
- Crafting Interpretable Embeddings by Asking LLMs Questions [89.49960984640363]
Large language models (LLMs) have rapidly improved text embeddings for a growing array of natural-language processing tasks.
We introduce question-answering embeddings (QA-Emb), embeddings where each feature represents an answer to a yes/no question asked to an LLM.
We use QA-Emb to flexibly generate interpretable models for predicting fMRI voxel responses to language stimuli.
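A sketch of the QA-Emb idea as summarized above: each embedding dimension is the binarized answer an LLM gives to a yes/no question about the text. The question list, prompt wording, and `ask_llm` helper are hypothetical placeholders.

```python
QUESTIONS = [
    "Does the text mention a person?",
    "Is the text about a physical action?",
    "Does the text express an emotion?",
]

def qa_embed(text: str, ask_llm) -> list[float]:
    """Build an interpretable embedding: one dimension per yes/no question."""
    prompt = "Text: {t}\nQuestion: {q}\nAnswer yes or no."
    return [
        1.0 if ask_llm(prompt.format(t=text, q=q)).strip().lower().startswith("yes") else 0.0
        for q in QUESTIONS
    ]

# Each feature is directly interpretable: it names the question it answers.
```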
arXiv Detail & Related papers (2024-05-26T22:30:29Z)
- What Large Language Models Bring to Text-rich VQA? [38.569505870771025]
Text-rich VQA, namely Visual Question Answering based on text recognition in the images, is a cross-modal task that requires both image comprehension and text recognition.
To address the above concern, we leverage external OCR models to recognize texts in the image and Large Language Models (LLMs) to answer the question given texts.
This pipeline achieved superior performance compared to the majority of existing Multimodal Large Language Models (MLLM) on four text-rich VQA datasets.
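A minimal pipeline sketch of the approach summarized above: an external OCR model reads the text in the image and an LLM answers the question given that text. `run_ocr` and `llm` are hypothetical placeholders, not the paper's components.

```python
def textvqa_answer(image_path: str, question: str, run_ocr, llm) -> str:
    """OCR-then-LLM pipeline for text-rich VQA."""
    ocr_tokens = run_ocr(image_path)  # e.g. ["SALE", "50%", "OFF"]
    prompt = (
        "Text detected in the image: " + ", ".join(ocr_tokens) + "\n"
        f"Question: {question}\n"
        "Answer briefly using the detected text."
    )
    return llm(prompt)
```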
arXiv Detail & Related papers (2023-11-13T12:52:29Z)
- SEMQA: Semi-Extractive Multi-Source Question Answering [94.04430035121136]
We introduce a new QA task for answering multi-answer questions by summarizing multiple diverse sources in a semi-extractive fashion.
We create the first dataset of this kind, QuoteSum, with human-written semi-extractive answers to natural and generated questions.
arXiv Detail & Related papers (2023-11-08T18:46:32Z)
- MMHQA-ICL: Multimodal In-context Learning for Hybrid Question Answering over Text, Tables and Images [24.17147521556083]
In-context learning has become the most popular way to solve QA problems.
We propose the MMHQA-ICL framework to address this problem.
We are the first to use an end-to-end prompting method for this task.
arXiv Detail & Related papers (2023-09-09T13:35:01Z)
- RET-LLM: Towards a General Read-Write Memory for Large Language Models [53.288356721954514]
RET-LLM is a novel framework that equips large language models with a general write-read memory unit.
Inspired by Davidsonian semantics theory, we extract and save knowledge in the form of triplets.
Our framework exhibits robust performance in handling temporal-based question answering tasks.
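A sketch of a triplet read-write memory in the spirit of the summary above, not the authors' code: knowledge is stored as (subject, relation, object) triplets that can be written during a conversation and read back when answering.

```python
class TripletMemory:
    """Toy write/read memory holding (subject, relation, object) triplets."""

    def __init__(self):
        self.store: list[tuple[str, str, str]] = []

    def write(self, subj: str, rel: str, obj: str) -> None:
        self.store.append((subj, rel, obj))

    def read(self, subj: str | None = None, rel: str | None = None):
        return [
            t for t in self.store
            if (subj is None or t[0] == subj) and (rel is None or t[1] == rel)
        ]

mem = TripletMemory()
mem.write("Alice", "works_at", "Acme in 2021")
print(mem.read(subj="Alice", rel="works_at"))  # [('Alice', 'works_at', 'Acme in 2021')]
```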
arXiv Detail & Related papers (2023-05-23T17:53:38Z)
- Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback [127.75419038610455]
Large language models (LLMs) are able to generate human-like, fluent responses for many downstream tasks.
This paper proposes an LLM-Augmenter system, which augments a black-box LLM with a set of plug-and-play modules.
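A minimal sketch of the augment-check-revise loop suggested by the summary above, wrapped around a black-box LLM. `llm`, `retrieve_evidence`, and `fact_score` are hypothetical placeholders, and the loop structure is an assumption rather than the paper's exact module design.

```python
def augmented_answer(question, llm, retrieve_evidence, fact_score,
                     threshold=0.8, max_rounds=3):
    """Generate an answer grounded in retrieved evidence; revise it if it scores poorly."""
    evidence = retrieve_evidence(question)
    feedback = ""
    for _ in range(max_rounds):
        prompt = (f"Evidence:\n{evidence}\n\nQuestion: {question}\n"
                  f"{feedback}\nAnswer using only the evidence.")
        answer = llm(prompt)
        score = fact_score(answer, evidence)  # automated consistency check
        if score >= threshold:
            return answer
        feedback = f"Previous answer scored {score:.2f}; revise it to match the evidence."
    return answer
```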
arXiv Detail & Related papers (2023-02-24T18:48:43Z)
- From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models [111.42052290293965]
Large language models (LLMs) have demonstrated excellent zero-shot generalization to new language tasks.
End-to-end training on vision and language data may bridge the disconnections, but is inflexible and computationally expensive.
We propose Img2Prompt, a plug-and-play module that provides prompts to bridge the aforementioned modality and task disconnections.
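A rough sketch of the Img2Prompt idea: turn the image into text (a caption plus a few synthetic QA exemplars) so a frozen, text-only LLM can answer zero-shot. `caption_model` and `exemplar_qas` are hypothetical placeholders, and the prompt layout is an assumption.

```python
def img2prompt(image, question, caption_model, exemplar_qas):
    """Assemble a text-only prompt describing the image for a frozen LLM."""
    caption = caption_model(image)  # e.g. "a dog running on a beach"
    exemplars = "\n".join(f"Question: {q} Answer: {a}" for q, a in exemplar_qas(image))
    return (
        f"Context: {caption}\n{exemplars}\n"
        f"Question: {question} Answer:"
    )

# The resulting string can be fed to any frozen LLM; no vision-language training is required.
```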
arXiv Detail & Related papers (2022-12-21T08:39:36Z)