TextlessRAG: End-to-End Visual Document RAG by Speech Without Text
- URL: http://arxiv.org/abs/2509.07538v2
- Date: Wed, 10 Sep 2025 09:41:48 GMT
- Title: TextlessRAG: End-to-End Visual Document RAG by Speech Without Text
- Authors: Peijin Xie, Shun Qian, Bingquan Liu, Dexin Wang, Lin Sun, Xiangzheng Zhang
- Abstract summary: We propose TextlessRAG, the first end-to-end framework for speech-based question answering over large-scale document images. Unlike prior methods, TextlessRAG eliminates ASR, TTS, and OCR, directly interpreting speech, retrieving relevant visual knowledge, and generating answers in a fully textless pipeline. We release the first bilingual speech–document RAG dataset, featuring Chinese and English voice queries paired with multimodal document content.
- Score: 11.507219997350155
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Document images encapsulate a wealth of knowledge, while the portability of spoken queries enables broader and flexible application scenarios. Yet, no prior work has explored knowledge base question answering over visual document images with queries provided directly in speech. We propose TextlessRAG, the first end-to-end framework for speech-based question answering over large-scale document images. Unlike prior methods, TextlessRAG eliminates ASR, TTS, and OCR, directly interpreting speech, retrieving relevant visual knowledge, and generating answers in a fully textless pipeline. To further boost performance, we integrate a layout-aware reranking mechanism to refine retrieval. Experiments demonstrate substantial improvements in both efficiency and accuracy. To advance research in this direction, we also release the first bilingual speech–document RAG dataset, featuring Chinese and English voice queries paired with multimodal document content. Both the dataset and our pipeline will be made available at: https://github.com/xiepeijinhit-hue/textlessrag
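At its core, the textless pipeline described in the abstract embeds the spoken query and the document-page images into a shared space and retrieves by similarity. The sketch below is illustrative only: the random vectors stand in for real speech and visual-document encoders, which the abstract does not specify.

```python
import numpy as np

def cosine_top_k(query_emb, page_embs, k=3):
    """Return indices of the k document pages most similar to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    p = page_embs / np.linalg.norm(page_embs, axis=1, keepdims=True)
    scores = p @ q                  # cosine similarity per page
    return np.argsort(-scores)[:k]  # highest-scoring pages first

# Dummy stand-ins for a speech encoder and a visual document encoder;
# in a textless pipeline both would map into one shared embedding space.
rng = np.random.default_rng(0)
page_embeddings = rng.normal(size=(100, 512))                     # 100 page images
speech_query = page_embeddings[42] + 0.01 * rng.normal(size=512)  # query near page 42

top_pages = cosine_top_k(speech_query, page_embeddings, k=3)
```

The layout-aware reranker the abstract mentions would then reorder `top_pages` before a multimodal model generates the answer from the retrieved pages.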
Related papers
- An Effective Data Augmentation Method by Asking Questions about Scene Text Images [5.189562992500781]
We propose a VQA-inspired data augmentation framework that strengthens OCR training through structured question-answering tasks. For each image-text pair, we generate natural-language questions probing character-level attributes such as presence, position, and frequency. These auxiliary tasks encourage finer-grained reasoning, and the OCR model aligns visual features with textual queries to jointly reason over images and questions.
arXiv Detail & Related papers (2026-03-03T23:18:53Z)
- VISTA-Bench: Do Vision-Language Models Really Understand Visualized Text as Well as Pure Text? [51.02924254085878]
Vision-Language Models (VLMs) have achieved impressive performance in cross-modal understanding across textual and visual inputs. We introduce VISTA-Bench, a benchmark spanning domains from multimodal perception and reasoning to unimodal understanding.
arXiv Detail & Related papers (2026-02-04T17:48:55Z)
- When Vision Meets Texts in Listwise Reranking [1.2691047660244335]
Rank-Nexus is a multimodal image-text document reranker that performs listwise qualitative reranking on retrieved lists incorporating both images and texts. We first train modalities separately: leveraging abundant text reranking data, we distill knowledge into the text branch. For images, where data is scarce, we construct distilled pairs from multimodal large language model (MLLM) captions on image retrieval benchmarks.
arXiv Detail & Related papers (2026-01-28T13:57:14Z)
- CMRAG: Co-modality-based document retrieval and visual question answering [7.9679870806757185]
Retrieval-Augmented Generation (RAG) has become a core paradigm in document question answering tasks. This paper proposes co-modality-based RAG, which can simultaneously leverage text and images for efficient retrieval and generation. Experiments demonstrate that our method significantly outperforms pure-vision-based RAG in visual document question answering tasks.
arXiv Detail & Related papers (2025-09-02T09:17:57Z)
- BRIT: Bidirectional Retrieval over Unified Image-Text Graph [0.0]
Retrieval-Augmented Generation (RAG) has emerged as a promising technique to enhance the quality and relevance of responses generated by large language models. This paper proposes BRIT, a novel multi-modal RAG framework that unifies various text-image connections in the document into a multi-modal graph. By traversing both image-to-text and text-to-image paths in the graph, BRIT retrieves not only directly query-relevant images and texts but also further relevant contents.
arXiv Detail & Related papers (2025-05-24T01:20:51Z)
- Speech Retrieval-Augmented Generation without Automatic Speech Recognition [4.731446054087683]
SpeechRAG is a novel framework designed for open-question answering over spoken data. Our proposed approach fine-tunes a pre-trained speech encoder into a speech adapter fed into a frozen large language model. By aligning the embedding spaces of text and speech, our speech retriever directly retrieves audio passages from text-based queries.
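The text–speech embedding alignment this abstract describes is commonly trained with a CLIP-style symmetric contrastive objective. The function below is a generic InfoNCE loss in NumPy, offered as an assumption about how such alignment could be trained, not as the paper's actual recipe.

```python
import numpy as np

def clip_style_loss(speech_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss pulling paired speech/text embeddings together."""
    s = speech_embs / np.linalg.norm(speech_embs, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = (s @ t.T) / temperature  # (batch, batch) pairwise similarities

    def cross_entropy(l):
        # softmax cross-entropy where row i's correct class is column i
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.diag(log_probs).mean()

    # average the speech-to-text and text-to-speech directions
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2
```

Correctly paired batches drive the loss toward zero, so at retrieval time a text query embedding lands near its matching audio passage.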
arXiv Detail & Related papers (2024-12-21T06:16:04Z)
- VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents [66.42579289213941]
Retrieval-augmented generation (RAG) is an effective technique that enables large language models to utilize external knowledge sources for generation. We introduce VisRAG, which tackles this issue by establishing a vision-language model (VLM)-based RAG pipeline. In this pipeline, instead of first parsing the document to obtain text, the document is directly embedded using a VLM as an image and then retrieved to enhance the generation of a VLM.
arXiv Detail & Related papers (2024-10-14T15:04:18Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- Text-based Person Search without Parallel Image-Text Data [52.63433741872629]
Text-based person search (TBPS) aims to retrieve the images of the target person from a large image gallery based on a given natural language description.
Existing methods are dominated by training models with parallel image-text pairs, which are very costly to collect.
In this paper, we make the first attempt to explore TBPS without parallel image-text data.
arXiv Detail & Related papers (2023-05-22T12:13:08Z)
- Generate rather than Retrieve: Large Language Models are Strong Context Generators [74.87021992611672]
We present a novel perspective for solving knowledge-intensive tasks by replacing document retrievers with large language model generators.
We call our method generate-then-read (GenRead), which first prompts a large language model to generate contextutal documents based on a given question, and then reads the generated documents to produce the final answer.
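The generate-then-read recipe reduces to two model calls: one to write a contextual document, one to answer from it. In the sketch below, the `llm` callable and the prompt wording are placeholders for any text-generation API, not the paper's specific model or prompts.

```python
def generate_then_read(question, llm):
    """Two-step GenRead-style answering: generate a context, then read it."""
    # Step 1: prompt the model to write a contextual document for the question.
    context = llm(f"Generate a background document to answer the question: {question}")
    # Step 2: read the generated document to produce the final answer.
    answer = llm(f"Refer to the passage below and answer the question.\n"
                 f"Passage: {context}\nQuestion: {question}")
    return answer

# Toy stand-in LLM so the sketch runs end to end.
def toy_llm(prompt):
    if prompt.startswith("Generate"):
        return "Paris is the capital of France."
    return "Paris"

print(generate_then_read("What is the capital of France?", toy_llm))  # prints Paris
```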
arXiv Detail & Related papers (2022-09-21T01:30:59Z)
- Text is NOT Enough: Integrating Visual Impressions into Open-domain Dialogue Generation [14.104415187890773]
Open-domain dialogue generation in natural language processing (NLP) is by default a pure-language task.
Hidden images, referred to as visual impressions (VIs), can be explored from text-only data to enhance dialogue understanding.
We propose a framework to explicitly construct VIs based on pure-language dialogue datasets.
arXiv Detail & Related papers (2021-09-13T08:57:13Z)
- Abstractive Summarization of Spoken and Written Instructions with BERT [66.14755043607776]
We present the first application of the BERTSum model to conversational language.
We generate abstractive summaries of narrated instructional videos across a wide variety of topics.
We envision this being integrated as a feature in intelligent virtual assistants, enabling them to summarize both written and spoken instructional content upon request.
arXiv Detail & Related papers (2020-08-21T20:59:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.