Text is NOT Enough: Integrating Visual Impressions into Open-domain
Dialogue Generation
- URL: http://arxiv.org/abs/2109.05778v1
- Date: Mon, 13 Sep 2021 08:57:13 GMT
- Title: Text is NOT Enough: Integrating Visual Impressions into Open-domain
Dialogue Generation
- Authors: Lei Shen, Haolan Zhan, Xin Shen, Yonghao Song and Xiaofang Zhao
- Abstract summary: Open-domain dialogue generation in natural language processing (NLP) is by default a pure-language task.
Hidden images, called visual impressions (VIs), can be extracted from text-only data to enhance dialogue understanding.
We propose a framework to explicitly construct VIs based on pure-language dialogue datasets.
- Score: 14.104415187890773
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Open-domain dialogue generation in natural language processing (NLP) is by
default a pure-language task, which aims to satisfy human need for daily
communication on open-ended topics by producing related and informative
responses. In this paper, we point out that hidden images, called visual
impressions (VIs), can be extracted from text-only data to enhance dialogue
understanding and help generate better responses. Moreover, the semantic
dependency between a dialogue post and its response is complicated, e.g., few
word alignments and some topic transitions. Their visual impressions are
therefore not shared, and it is more reasonable to integrate the response
visual impressions (RVIs), rather than the post visual impressions (PVIs),
into the decoder. However, neither the response nor its RVIs is available at
test time. To handle these issues, we propose a
framework to explicitly construct VIs based on pure-language dialogue datasets
and utilize them for better dialogue understanding and generation.
Specifically, we obtain a group of images (PVIs) for each post based on a
pre-trained word-image mapping model. These PVIs are used in a co-attention
encoder to get a post representation with both visual and textual information.
Since the RVIs are not provided directly during testing, we design a cascade
decoder that consists of two sub-decoders. The first sub-decoder predicts the
content words of the response and applies the word-image mapping model to
obtain their RVIs. The second sub-decoder then generates the response based on the
post and RVIs. Experimental results on two open-domain dialogue datasets show
that our proposed approach achieves superior performance over competitive
baselines.
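The abstract's pipeline (word-image mapping, co-attention encoding of the post with its PVIs, then a two-stage cascade decoder that first predicts content words and maps them to RVIs) can be sketched as a toy program. Everything below is a hypothetical illustration of the data flow, not the authors' implementation: the mapping table, the averaging stand-in for co-attention, and the rule-based content-word predictor are all invented for clarity.

```python
# Toy word -> image-feature table standing in for the pre-trained
# word-image mapping model (real VIs would be retrieved image embeddings).
WORD_IMAGE_MAP = {
    "coffee": [0.9, 0.1],
    "rain": [0.2, 0.8],
    "umbrella": [0.3, 0.7],
}

def word_image_mapping(words):
    """Look up visual impressions for the words that have a mapped image."""
    return [WORD_IMAGE_MAP[w] for w in words if w in WORD_IMAGE_MAP]

def co_attention_encode(post_tokens, pvis):
    """Fuse the post tokens with its PVIs.

    A real co-attention encoder attends between token and image features;
    here we simply average the PVI features as a placeholder.
    """
    if not pvis:
        return post_tokens, [0.0, 0.0]
    dim = len(pvis[0])
    fused = [sum(v[i] for v in pvis) / len(pvis) for i in range(dim)]
    return post_tokens, fused

def predict_content_words(post_tokens):
    """First sub-decoder: predict response content words (toy rule)."""
    return ["umbrella"] if "rain" in post_tokens else ["coffee"]

def cascade_decode(post_tokens):
    pvis = word_image_mapping(post_tokens)          # PVIs for the post
    _encoded = co_attention_encode(post_tokens, pvis)
    content_words = predict_content_words(post_tokens)  # sub-decoder 1
    rvis = word_image_mapping(content_words)        # content words -> RVIs
    # Sub-decoder 2: generate the response conditioned on post and RVIs.
    response = "maybe bring " + content_words[0] if rvis else "okay"
    return response, rvis

response, rvis = cascade_decode(["looks", "like", "rain", "today"])
print(response)  # -> maybe bring umbrella
```

The point of the cascade is visible in `cascade_decode`: RVIs cannot be looked up from a response that does not exist yet at test time, so the first sub-decoder supplies content words from which the RVIs are derived before the full response is generated.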
Related papers
- An Effective Data Augmentation Method by Asking Questions about Scene Text Images [5.189562992500781]
We propose a VQA-inspired data augmentation framework that strengthens OCR training through structured question-answering tasks. For each image-text pair, we generate natural-language questions probing character-level attributes such as presence, position, and frequency. These auxiliary tasks encourage finer-grained reasoning, and the OCR model aligns visual features with textual queries to jointly reason over images and questions.
arXiv Detail & Related papers (2026-03-03T23:18:53Z) - UniModel: A Visual-Only Framework for Unified Multimodal Understanding and Generation [51.31795451147935]
We present a unified generative model that supports visual understanding and visual generation within a single pixel-to-pixel diffusion framework. Our goal is to achieve unification along three axes: the model, the tasks, and the representations. Experiments on text-to-image synthesis and image-to-text understanding demonstrate strong cross-modal alignment.
arXiv Detail & Related papers (2025-11-21T03:02:10Z) - TextlessRAG: End-to-End Visual Document RAG by Speech Without Text [11.507219997350155]
We propose TextlessRAG, the first end-to-end framework for speech-based question answering over large-scale document images. Unlike prior methods, TextlessRAG eliminates ASR, TTS and OCR, directly interpreting speech, retrieving relevant visual knowledge, and generating answers in a fully textless pipeline. We release the first bilingual speech-document RAG dataset, featuring Chinese and English voice queries paired with multimodal document content.
arXiv Detail & Related papers (2025-09-09T09:16:25Z) - ConText: Driving In-context Learning for Text Removal and Segmentation [59.6299939669307]
This paper presents the first study on adapting the visual in-context learning paradigm to optical character recognition tasks. We propose a task-chaining compositor in the form of image-removal-segmentation. We also introduce context-aware aggregation, integrating the chained prompt pattern into the latent query representation.
arXiv Detail & Related papers (2025-06-04T10:06:32Z) - Decoder Pre-Training with only Text for Scene Text Recognition [54.93037783663204]
Scene text recognition (STR) pre-training methods have achieved remarkable progress, primarily relying on synthetic datasets.
We introduce a novel method named Decoder Pre-training with only text for STR (DPTR).
DPTR treats text embeddings produced by the CLIP text encoder as pseudo visual embeddings and uses them to pre-train the decoder.
arXiv Detail & Related papers (2024-08-11T06:36:42Z) - VCR: Visual Caption Restoration [80.24176572093512]
We introduce Visual Caption Restoration (VCR), a vision-language task that challenges models to accurately restore partially obscured texts using pixel-level hints within images.
This task stems from the observation that text embedded in images is intrinsically different from common visual elements and natural language due to the need to align the modalities of vision, text, and text embedded in images.
arXiv Detail & Related papers (2024-06-10T16:58:48Z) - Learning from Mistakes: Iterative Prompt Relabeling for Text-to-Image Diffusion Model Training [33.51524424536508]
Iterative Prompt Relabeling (IPR) is a novel algorithm that aligns images to text through iterative image sampling and prompt relabeling with feedback.
We conduct thorough experiments on SDv2 and SDXL, testing their capability to follow instructions on spatial relations.
arXiv Detail & Related papers (2023-12-23T11:10:43Z) - Uncovering Hidden Connections: Iterative Search and Reasoning for Video-grounded Dialog [83.63849872250651]
Video-grounded dialog requires profound understanding of both dialog history and video content for accurate response generation.
We present an iterative search and reasoning framework, which consists of a textual encoder, a visual encoder, and a generator.
arXiv Detail & Related papers (2023-10-11T07:37:13Z) - Advancing Visual Grounding with Scene Knowledge: Benchmark and Method [74.72663425217522]
Visual grounding (VG) aims to establish fine-grained alignment between vision and language.
Most existing VG datasets are constructed using simple description texts.
We propose a novel benchmark of Scene Knowledge-guided Visual Grounding.
arXiv Detail & Related papers (2023-07-21T13:06:02Z) - Top-Down Framework for Weakly-supervised Grounded Image Captioning [19.00510117145054]
Weakly-supervised grounded image captioning aims to generate the caption and ground (localize) predicted object words in the input image without using bounding box supervision.
We propose a one-stage weakly-supervised grounded captioner that directly takes the RGB image as input to perform captioning and grounding at the top-down image level.
arXiv Detail & Related papers (2023-06-13T01:42:18Z) - Text-based Person Search without Parallel Image-Text Data [52.63433741872629]
Text-based person search (TBPS) aims to retrieve the images of the target person from a large image gallery based on a given natural language description.
Existing methods are dominated by training models with parallel image-text pairs, which are very costly to collect.
In this paper, we make the first attempt to explore TBPS without parallel image-text data.
arXiv Detail & Related papers (2023-05-22T12:13:08Z) - Locate Then Generate: Bridging Vision and Language with Bounding Box for
Scene-Text VQA [15.74007067413724]
We propose a novel framework for Scene Text Visual Question Answering (STVQA).
It requires models to read scene text in images for question answering.
arXiv Detail & Related papers (2023-04-04T07:46:40Z) - Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection
to Image-Text Pre-Training [70.83385449872495]
The correlation between vision and text is essential for video moment retrieval (VMR).
Existing methods rely on separate pre-training feature extractors for visual and textual understanding.
We propose a generic method, referred to as Visual-Dynamic Injection (VDI), to empower the model's understanding of video moments.
arXiv Detail & Related papers (2023-02-28T19:29:05Z) - SceneGATE: Scene-Graph based co-Attention networks for TExt visual
question answering [2.8974040580489198]
The paper proposes a Scene Graph based co-Attention Network (SceneGATE) for TextVQA.
It reveals the semantic relations among the objects, Optical Character Recognition (OCR) tokens and the question words.
It is achieved by a TextVQA-based scene graph that discovers the underlying semantics of an image.
arXiv Detail & Related papers (2022-12-16T05:10:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.